[PDF] Codon Bias Patterns of E.coli 's Interacting Proteins

Abstract

Synonymous codons, i.e., DNA nucleotide triplets coding for the same amino acid, are used differently across the variety of living organisms. The biological meaning of this phenomenon, known as codon usage bias, is still controversial. In order to shed light on this point, we propose a new codon bias index, CompAI , that is based on the competition between cognate and near-cognate tRNAs during translation, without being tuned to the usage bias of highly expressed genes. We perform a genome-wide evaluation of codon bias for E.coli , comparing CompAI with other widely used indices: tAI , CAI , and Nc . We show that CompAI and tAI capture similar information by being positively correlated with gene conservation, measured by ERI, and essentiality, whereas, CAI and Nc appear to be less sensitive to evolutionary-functional parameters. Notably, the rate of variation of tAI and CompAI with ERI allows to obtain sets of genes that consistently belong to specific clusters of orthologous genes (COGs). We also investigate the correlation of codon bias at the genomic level with the network features of protein-protein interactions in E.coli . We find that the most densely connected communities of the network share a similar level of codon bias (as measured by CompAI and tAI ). Conversely, a small difference in codon bias between two genes is, statistically, a prerequisite for the corresponding proteins to interact. Importantly, among all codon bias indices, CompAI turns out to have the most coherent distribution over the communities of the interactome, pointing to the significance of competition among cognate and near-cognate tRNAs for explaining codon usage adaptation.

Full PDF

aa r X i v : . [ q - b i o . GN ] J u l Codon Bias Patterns of

E.coli ’s Interacting Proteins

Maddalena Dilucca, ∗ Giulio Cimini, Andrea Semmoloni, Antonio Deiana, and Andrea Giansanti

1, 3 Dipartimento di Fisica, Universit`a “Sapienza”, Rome, Italy Istituto dei Sistemi Complessi (ISC)-CNR, UoS Universit`a “Sapienza”, Rome, Italy INFN Roma1 unit, Rome, Italy (Dated: September 4, 2018)Synonymous codons, i.e. , DNA nucleotide triplets coding for the same amino acid, are useddiﬀerently across the variety of living organisms. The biological meaning of this phenomenon,known as codon usage bias , is still controversial. In order to shed light on this point, we propose anew codon bias index,

CompAI , that is based on the competition between cognate and near-cognatetRNAs during translation, without being tuned to the usage bias of highly expressed genes. Weperform a genome-wide evaluation of codon bias for

E.coli , comparing

CompAI with other widelyused indices: tAI , CAI , and Nc . We show that CompAI and tAI capture similar informationby being positively correlated with gene conservation, measured by ERI, and essentiality, whereas,

CAI and Nc appear to be less sensitive to evolutionary-functional parameters. Notably, the rateof variation of tAI and CompAI with ERI allows to obtain sets of genes that consistently belongto speciﬁc clusters of orthologous genes (COGs). We also investigate the correlation of codon biasat the genomic level with the network features of protein-protein interactions in

E.coli . We ﬁndthat the most densely connected communities of the network share a similar level of codon bias (asmeasured by

CompAI and tAI ). Conversely, a small diﬀerence in codon bias between two genes is,statistically, a prerequisite for the corresponding proteins to interact. Importantly, among all codonbias indices,

CompAI turns out to have the most coherent distribution over the communities of theinteractome, pointing to the signiﬁcance of competition among cognate and near-cognate tRNAs forexplaining codon usage adaptation.

INTRODUCTION

The genetic information carried by the mRNA and then translated into proteins is encoded into nucleotide tripletscalled codons . Four alternate nucleotidic bases (A,U,C,G) compose the mRNA, so that there are 4 = 64 possiblecodons that have to code for only 20 naturally occurring amino acids. The genetic code is therefore redundant: while afew amino acids correspond to a single codon, most amino acids can be encoded by diﬀerent codons. Diﬀerent codonscoding for the same amino acid are known as synonymous codons, and in a wide variety of organisms synonymouscodons are used with diﬀerent frequencies—a phenomenon known as codon bias . With the advent of whole-genomesequencing of numerous species, genome-wide patterns of codon bias are emerging in the diﬀerent organisms. Variousfactors such as expression level, GC content, recombination rates, RNA stability, codon position, gene length, environ-mental stress and population size, can inﬂuence codon usage bias within and among species [1]. While the biologicalmeaning and origin of codon bias is not yet fully understood, there is a large consensus that the degeneration ofthe genetic code might provide an additional degree of freedom to modulate accuracy and eﬃciency of translation[2]. Indeed, population genetic studies [3] have shown that synonymous sites are under weak selection, and thatcodon bias is maintained by a balance between mutation-selection (random variability in genetic sequences followedby ﬁxation of the optimal codons) and genetic drift (allowing for the occurrence of non-optimal codons). In fact,highly expressed genes feature an extreme bias by using a small subset of codons, optimized by translational selection[4–6]. On the other hand, the persistence of non-optimal codons in less-expressed sequences causes long breaks duringprotein synthesis; this could be the result of genetic drift and have a key role in the protein folding process [7, 8].In addition, codon usage appears to be structured along the genome, with neighboring genes having similar codoncompositions [9], and codon bias seems positively correlated to gene length (as a result of selection for accuracy inthe costly production of long proteins) [10]. In the last years there has been a wide eﬀort in developing eﬀective waysto measure codon bias [11]. The most widely used indices include the Codon Adaptation Index ( CAI ) [12], the tRNAAdaptation Index ( tAI ) [13], and the Eﬀective Number of Codons ( N c ) [14], each of them having speciﬁc advantagesand drawbacks. For instance,

CAI and tAI correlate well with gene expression levels, however such correlation is anatural consequence of their deﬁnition: they are tuned on a reference set of highly expressed genes.

N c is insteadbasically a measure of the entropy of the codon usage distribution, and thus shows a lower correlation with expressionlevels. ∗ [email protected] In this work we propose a novel codon bias index named

Competition Adaptation Index ( CompAI ), which does notrely on information about gene expression levels, but instead has a self-consistent biological meaning—based on tRNAavailability and competition between cognate and near-cognate tRNAs. In other words,

CompAI is a parameter-freeindex that does not require a set of reference genes for its calibration, a fact that constitutes its main advantage withrespect to

CAI and tAI . Moreover,

CompAI is designed to extract genetic signals that could be directly correlated toexperimental measures for translation speeds, an emerging and challenging issue still to be explored. In order to showthe advantage of the novel codon bias index, we perform a genome-wide comparison of

CAI , tAI , N c and

CompAI for

Escherichia Coli (E.coli). Our analysis reveals that the information on gene conservation across species and geneessentiality is better captured by codon bias metrics that build on tRNA availability ( tAI and

CompAI ). We alsostudy codon bias in relation to the connectivity patterns of the protein-protein interaction network (PIN) [15] of E.coli.We thus show that translational selection systematically favors proteins with the highest number of interactions andbelonging to the most densely connected community of the network, at least when the bias is measured by

CompAI and, to a smaller extent, by tAI . Additionally, we address the issue of how much a similarity in the codon usage biasof a set of genes is reﬂected on the interactions among the corresponding proteins. A principal component analysisfor the variability of codon bias indices indeed reveals that closeness of a set of genes in the space of the two principalcomponents likely results in the corresponding proteins to interact—in comparison with an appropriate null model.Overall, our study reveals that

CompAI captures more information than the other indices about the connectionbetween codon bias and the topology of the interactome. Besides, we recall that

CompAI does not require calibrationon gene expression levels and has a consistent biological meaning based on the competition between cognate andnear-cognate tRNAs. These observations stress the potential of the new index to both measure and explain codonusage bias.

MATERIALS AND METHODSSequences

In this work we investigate the genome of E.coli K-12 substr. MG1655, whose 4005 coding mRNA sequences havebeen collected from GenBank [16]. The gene copy numbers coding for each tRNA (tGCN) were derived from theGenomic tRNA database [17].

Conservation and Essentiality of E.coli genes

In order to have an index for gene “conservation”, we use the normalized Evolutionary Retention Index (ERI) [18]:for each gene in E.coli’s genome, its ERI measures how much that gene is shared among other 32 bacterial species(having at least an ortholog of the given gene). A low ERI value thus denotes that a gene is speciﬁc to E.coli, whereas,high ERI is characteristic of highly shared (and therefore conserved) genes. Concerning gene “essentiality”, we usethe classiﬁcation of Gerdes et al. [18] for the E.coli genome into 606 essential and 2940 non-essential genes, based onexperimental measures of gene resistance against transposon insertion.

Codon Bias Indices

Codon usage bias can be assessed, for each gene in a given genome, by various indices that can be classiﬁed intobroad groups based on: (i) codon frequencies; (ii) reference gene sets; (iii) deviation from a postulated distribution;(iv) information theory; (v) interactions among tRNAs (see [11] for an overview). We focus here on the most widelyused indices: tAI [13], that belongs to groups (ii) and (v) by requiring calibration on a set of highly expressed genes;

CAI [12], a group (i) and (ii) index built on local statistics of codon usage and on a reference list of optimally expressedgenes;

N c [14], a group (i) index based on the number of diﬀerent codons used in a coding sequence. The novel codonbias index we propose in this work,

CompAI , is instead based on the competition of cognate and near-cognate tRNAsto bind to the A-site on the ribosome during translation, and is thus a group (v) index that does not need tuning ona reference set of highly expressed genes. While the formal deﬁnition for

CompAI and the rationale behind are givenbelow, we refer to the Supporting Information for the operative deﬁnition of

CAI , tAI and N c . Competition Adaptation Index ( CompAI ). It is generally accepted that translation speed depends on theeﬃciency of the codon/anticodon pairing in the A site of the ribosome [19]. Hence, for a given codon, the rate ofamino acid synthesis is essentially inﬂuenced by two dominant processes: the number of collisions of the correspondingtRNA with the ribosomal A site (which strongly depends on tRNA concentration in the cell) and the speciﬁcity ofthe codon-anticodon pairing. Such a pairing process satisﬁes the Watson-Crick base-pairing rules (G-C and A-U, andvice versa) for the ﬁrst two bases, whereas, the rule on the third (or wobble ) base is more relaxed and non-standardpairing is allowed in some cases (wobble complementarity)[20]. Hence, there are cases in which several tRNAs pairwith the same codon (provided that these are identical in the ﬁrst two bases) and are called isoacceptor or cognatetRNAs. Codon-anticodon interactions are thus characterized by competition between cognate tRNA (with perfect orwobble complementarity between mRNA codon and tRNA anticodon), near-cognate tRNA (with a mismatch in onlyone of the ﬁrst two bases) and non-cognate tRNA (with at least two mismatches). Discrimination between correctand wrong tRNA according to base pairing features very high ﬁdelity (error rate f ∼ − ÷ − ). Rejection of thewrong tRNA can occur in two distinct phases [19, 21]: initial selection of the ternary complex EF-Tu-GTP-aa-tRNAand subsequent proofreading of aa-tRNA after GTP hydrolyzation. The ﬁrst interaction is fast and does not dependon the choice of codons, in order to allow the ribosome to quickly screen for the available tRNAs. The second stepis instead sensitive to base complementarity, featuring the ﬁrst selection between cognate and near-cognate tRNAs:non-cognate are excluded almost immediately with f ∼ − , and then a more strict and eﬃcient proofreading takesplace, excluding near-cognate with f ∼ − . This means that near-cognate tRNA (unlike non-cognate tRNA) canenter into the interaction process between the ternary complex and the site A of the ribosome, and (when not accepted,in very few cases) can be rejected at the stage of initial recognition or during proofreading. In any event, this processresults in a time delay of translation, because near-cognate rejection brings the ribosome back to the initial state ofwaiting for the correct tRNA.The rationale behind the deﬁnition of CompAI is precisely that of building an index which is based both on tRNAavailability and on competition between cognate and near-cognate tRNAs that could modulate the speed of translationof mRNAs into proteins. Note that, since in vivo experimental determinations of tRNA concentrations are availableonly for few organisms, we will implement

CompAI using the number of tRNA gene copies (tGCN) which, at least insimple organisms, has a high and positive correlation with tRNA abundance [22–25] (a similar approach is adoptedin the deﬁnition of tAI [13]). For each codon i we deﬁne its absolute adaptiveness value ( W i ) as: W i =  m i X j =1 tGCN ij   m i P j =1 tGCN ijm i P j =1 tGCN ij + m nci P j =1 tGCN ncij  . (1)Here m i is the number of isoacceptor tRNA sequences (anticodons) that recognize codon i ( i.e. , containing either theanti-codon i or all its cognates that are read by i ) and tGCN ij is the gene copy number of the j -th of such tRNAs,whereas m nci is the number of tRNA sequences that are near-cognate of i and tGCN ncij is the gene copy number ofthe j -th of such tRNAs (see also Fig. S1 of Supporting Figures). The amount in square brackets represents a penaltyintroduced by the competition with near-cognate tRNAs, assuming unit or zero values in the cases of smaller andhigher competition, respectively. This term thus assumes the role of selective constraint on the eﬃciency of the codon-anticodon coupling. Importantly, and at odds with tAI , these terms do not result from optimization on expressionlevels, but have a biological justiﬁcation based on cognate/near-cognate competition. Note that, in the computationof W i for a given codon, we count as isoacceptor tRNAs those with perfect or wobble base pairing that also carrythe same amino acid of i ’s anticodon. Computation of CompAI continues by deﬁning for each codon i its relativeadaptiveness value w i = W i /W max , where W max is the maximum value between all the W i of codons. CompAI ofgene g is ﬁnally deﬁned as the harmonic mean of the relative adaptiveness of its codons: CompAI g = l gl g P i =1 w − i (2)The choice of the harmonic mean (rather than geometric as for CAI and tAI ) is consistent with the association of

CompAI with the rate of protein synthesis. Indeed, the translation speed of codon i can be deﬁned as the reciprocalof the concentration of the corresponding tRNA isoacceptors [26]. Therefore, if codon i is read at a speed proportionalto w i , then the average translation speed of a gene is given by the harmonic mean of the { w i } associated to its codons. CompAI takes values between 0 and 1, where values close to 0 (1) indicate highest (lowest) competition, and thereforea low (high) translation rate.

Protein-Protein Network Analysis

In this study we use protein interaction data collected in STRING (Known and Predicted Protein-Protein Interac-tions) [15]. In such database, each predicted interaction is assigned with a conﬁdence level or probability w , evaluatedby comparing predictions obtained by the diﬀerent techniques [27–29] with a set of reference associations, namely thefunctional groups of KEGG (Kyoto Encyclopedia of Genes and Genomes) [30]. In this way, interactions with high w are likely to be true positives, whereas, a low w likely corresponds to a false positive. Since the percentage of falsepositives can be very high [31], we select a stringent cut-oﬀ Θ = 0 . P ( w ) in the left panel of Fig. S2 of Supporting Figures). Wethus build the protein-protein interaction network (PIN) of E.coli by placing a link between each pair of proteins(nodes) i, j provided that w ij > Θ. The resulting number of connections or degree for a given protein i is denoted as k i .To detect communities of PIN we resort to Molecular Complex Detection (MCODE) [32]. In a nutshell, MCODEiteratively groups together neighboring nodes with similar values of the core-clustering coeﬃcient, which for eachnode is deﬁned as the density of the highest k -core of its immediate neighborhood times k [50]. MCODE detectsthe densest regions of the network and assigns to each found community a score that is its internal link densitytimes the number of nodes belonging to it [51]. We also characterize each found community c with the mean value¯ x c and standard deviation σ c of codon bias values within the community, and use them to compute a Z -score as Z c = (¯ x c − ¯ x n ) / p σ c + σ n (where ¯ x n and σ n are, respectively, the mean value and standard deviation of codon biasvalues computed on the whole network). In this way, a value of Z c > Z c < −

1) indicates that community c featuressigniﬁcantly higher (lower) codon bias than the population mean.Finally note that each node of PIN is photogenically classiﬁed according to the Clusters of Orthologous Groups [52](COGs) of proteins [34]. COGs are generated by comparing predicted and known protein sequences in all completelysequenced genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous acrossat least three lineages and likely corresponds to an ancient conserved domain [34]. Principal Component Analysis

Principal Component Analysis (PCA) [35] is a multivariate statistical method to transform a set of observationsof possibly correlated variables into a set of linearly uncorrelated variables (called principal components) spanning aspace of lower dimensionality. The transformation is deﬁned so that the ﬁrst principal component accounts for thelargest possible variance of the data, and each succeeding component in turn has the highest variance possible underthe constraint that it is orthogonal to ( i.e. , uncorrelated with) the preceding components.We use this technique on the space of codon bias indices, so that each gene of E.coli is represented as a 4-dimensionalvector with coordinates (

CompAI , CAI , tAI , N c ). Such coordinates are separately normalized to zero mean andunit variance over the whole genome. We then obtain the associated covariance matrix between the four dimensionsof codon bias and diagonalize it. The eigenvectors of the covariance matrix, ordered according to the magnitude ofthe corresponding eigenvalues, are the principal components of the original data.

Conﬁguration Model

In order to assess how signiﬁcant are the codon usage patterns observed for the PIN, we need to compare theE.coli interactome with a suitable null model for it, i.e. , an appropriate randomization of the network. Here we followthe most common approach in statistical mechanics of networks of using the

Conﬁguration Model (CM) [36]. Thebasic idea is to build the null model as an ensemble Ω of graphs with maximum entropy, except that the ensembleaverage of the node degrees are constrained to the values observed for the real network: h k i i Ω ≡ k i ∀ i . This leads to aprobability distribution over Ω which is deﬁned via a set of Lagrange multipliers { x i } (one for each node) associated tothe constraints [37]. Once all { x i } are found, the CM reduces to having a link between nodes i and j with probability p ij = x i x j x i x j , independently on all other links. Then, the null hypothesis is that any given network property χ varies in the range h χ i Ω ± σ Ω [ χ ], where both average and standard deviation of χ over the ensemble can be obtainedeither analytically or numerically (by drawing sample networks from Ω) [37]. The number of standard deviations bywhich the empirical and expected values of χ diﬀer is given by the Z -score Z [ χ ] = ( χ − h χ i Ω ) /σ Ω [ χ ]: large positive(negative) values of Z [ χ ] indicate that X is substantially larger (smaller) than expected, whereas, small values signalno signiﬁcant deviation from the null model. RESULTS AND DISCUSSIONSpeciﬁcity, Essentiality and Codon Bias of E.coli genes

Correlations between Codon Bias indices.

As the starting point of our analysis, we ﬁrst check how the diﬀerent codon bias indices correlate over E.coli’sgenome. Fig. 1 shows that, interestingly,

CompAI is strongly (and positively) correlated with tAI , whereas it doesnot show any signiﬁcant correlation with

CAI nor with

N c . This result can be easily explained as

CompAI and tAI elaborate on the same genetic information, that is the abundance of tRNAs, whereas

CAI and

N c are based on codonusage statistics (see the Supporting Information).

CAI C o m p A I tAI Nc c=0.04 c=0.74 c=-0.04 FIG. 1.

Correlation between codon bias indices.

Values of Pearson’s correlation coeﬃcients show that

CompAI is stronglyand positively correlated with tAI ( c = 0 . CAI nor Nc ( c ≃ Codon Bias and ERI.

We move further and analyze the correlation between the various codon bias indices and the evolutionary retentionindex (ERI) [18] for E.coli genes (we recall that a gene with a low ERI value is peculiar to E.coli, whereas a gene withhigh ERI is shared among diﬀerent species). Fig. 2 reports the average values and standard deviations of the codonbias indices for every group of genes having the same ERI value. Interestingly, the evolutionary codon adaptationmeasured by

CompAI and tAI tends to increase for genes that are less speciﬁc to E.coli. Fig. 2 also suggests that itis possible to make a threefold separation of genes by looking at the rate of variation of tAI and

CompAI with ERI.We thus identify group A (ERI < < ERI < > CAI and

N c are instead less structured with respect to ERI, asshown by the very small correlation coeﬃcients (and by the impossibility to identify gene groups).

Codon Bias and Gene Essentiality.

We now study the patterns of codon usage bias in essential and non-essential genes, according to the classiﬁcationscheme of Gerdes et al. [18] (see Materials and Methods). As a preliminary result, Fig. 3 reports the percentage ofessential genes in each set of genes sharing the same ERI. We see that the three groups A, B, C of genes identiﬁed asin the previous paragraph feature diﬀerent percentages of essential genes: approximately, 10% for group A, 15% forgroup B and above 30% for group C. Essentiality and ERI thus seems to capture similar genetic features. Fig. 4 showsinstead that

CompAI and tAI are more sensitive than

CAI and

N c in distinguishing essential from non-essentialgenes. Overall, Figs. 2 and 4 provide a clear indication that codon bias, as measured by tAI and

CompAI , is morepronounced for genes that are highly conserved ( i.e. , with high ERI) and essential, on the other hand

CAI and

N c are less sensitive to these quantities.

COGs.

We now perform a kind of gene ontology to check how the three gene groups A, B, C are projected over the clustersof orthologous genes (COGs) and their functional annotations [34]. To this end, for each group we evaluate the C o m p A I C A I t A I N c c =0.93, c =0.96, c =0.96 c=0.27c =0.93, c =0.93, c =0.96 c=-0.30 FIG. 2.

Correlation between the various codon bias indices and ERI . Codon bias average values and standard deviation(error bars) are determined for each set of E.coli genes having the same ERI value. In each panel, the solid lines are linearregression ﬁts, with c denoting the corresponding correlation coeﬃcients. In the left panels, the ﬁts are performed separatelyfor the three groups of genes A (ERI < < ERI < > CompAI and tAI monotonouslyincrease with ERI, whereas

CAI and Nc show a low correlation with ERI. ERI E ss e n ti a l g e n e s ( % o f E / ( E + N E )) FIG. 3.

Essentiality for E.coli genes . The percentage of essential genes is reported for each set of genes sharing the sameERI. Horizontal solid lines represent average values of essentiality percentage for each group A, B, C of genes (deﬁned by amaximum correlation between

CompAI -ERI and tAI -ERI). The groups have diﬀerent incidences of essential genes: 10% forgroup A (ERI < < ERI < > Bayesian probability that its genes belong to a given COG:

P r (COG | group) = P r (group | COG)

P r (COG) /P r (group),where

P r (group) is estimated as the fraction of the genome belonging to the group,

P r (COG) as the fraction of thegenome belonging to the COG and

P r (group | COG) is the fraction of genes in the COG that belong to a particulargroup. Fig. 5 shows the histogram of

P r (COG | group) over the 17 COGs, for the three groups A, B, C deﬁned above.Assuming an arbitrary discriminating threshold of 10%, we observe that each group is prevalently projected over alimited set of COGs (reported in the legend box of Fig. 5). Group A genes (those with low ERI values) mostly insistover COGs K and G (transcription, carbohydrate metabolism); group B (genes with intermediate ERI) is enriched inCOGs G and E (again, carbohydrate metabolism, amino acid metabolism and transport); ﬁnally, group C (genes withthe highest ERI) is dominated by the functional annotations associated with COGs J and L (translation, ribosomestructure and biogenesis, replication, recombination and repair). Indeed, group C, composed of the highly adapted,essential, and conserved genes of E.coli, is the set of genes that code for ribosomal proteins. NE E0.0840.0860.0880.09 C o m p A I NE E0.7440.7460.748 C A I NE E t A I NE E N c NE = not essential E = essential

FIG. 4.

Codon bias indices for essential and non-essential genes.

Error bars are standard deviations within each group.Then mean value of codon bias is systematically higher for essential genes, however, only

CompAI and tAI can eﬀectivelyseparate essential from non-essential genes. In fact, in the left panels the average codon bias values for essential and non-essential genes have a relative variation of about 5%, whereas, in the right panels such values are almost coincident and theerrors overlap.

L J K D O M N P T C G E F H I R S P r ( C OG | g r oup ) Group A (K, G)Group B (G, E)Group C (L, J)

FIG. 5.

Histogram of

P r (COG | group) over the COGs for the three gene groups A, B, C . Each group is characterizedby one or a few predominant COGs, indicated within parenthesis in the legend (assuming a threshold of 0 . Codon Bias and the Connectivity Patterns of E.coli’s Protein Interaction Network

Communities.

We now turn our attention to the network of interacting proteins in E.coli. We start by studying codon bias inrelation with the connectivity patterns of the network. First, note that the degree distribution of proteins is scale-free(see the right panel of Fig. S2 of Supporting Figures), meaning that the network features a large number of poorlyconnected proteins and a relatively small number of highly connected hubs. Fig. 6 notably shows that these hubproteins are systematically characterized by higher values of codon bias of the corresponding genes—when this ismeasured by tAI and

CompAI . CAI and

N c are instead clearly less sensitive to protein connectivity. C o m p A I C A I t A I N c FIG. 6.

Relation between the various codon bias indices of genes and the degree k of the corresponding proteinsin the PIN of E.coli . Solid lines are linear ﬁts. CompAI and tAI of a gene deﬁnitely increase with the connectivity of thecorresponding protein in the PIN, whereas the other two indices are less sensitive to this parameter.

We move further and consider codon bias in relation with the community structure of the PIN. We recall that acommunity is a group of proteins that are more densely connected within each other than with the rest of the network.Table I shows the features of the communities that are assigned by MCODE a score higher than 10, together withtheir COG composition, average degree and, for ERI and the various codon bias indices, the internal average value ¯ x c and the Z -scores (comparing the distribution of bias inside the community with that of the whole network). We seethat such topologically determined communities, ordered by score, are evolutionarily and functionally characterizedby a dominant COG, shared by the majority of the proteins in the community. This suggests that the identiﬁedcommunities can be associated with speciﬁc metabolic functions: they correspond to functional modules, essential forthe life-cycle of the organism.Let us focus on the ﬁrst community, that includes only 60 proteins (4.5% of the whole network) but as much as32.6% of the total number of links in the network, and that basically overlaps with the main core of the PIN ( i.e. , the k -core with the highest possible degree). Notably, proteins belonging to this community have on average a codon biasindex (as measured by tAI and, even more, by CompAI ) that is signiﬁcantly higher than the average of the rest ofthe network (the Z -score is bigger than 1). As noticed above, this core is essentially composed of ribosomal proteins,that are usually highly expressed, have the highest codon usage bias, and are broadly conserved and essential acrossdiﬀerent taxa [38]. TABLE I.

Features of top-scoring communities . Number of nodes ( n ), community score ( n times the internal density),predominant COG label and percentage; then, for ERI and the various codon bias indices, mean values ¯ x c internal to thecommunity and Z scores (between square brackets). Values of Z > n score < k > COG ERI

CompAI CAI tAI Nc ] [ ] [0.05] [ ] [-0.06]2 31 28.6 35.03 N (74.2%) 0.38 0.08 0.75 0.32 49.88[0.21] [-0.35] [0.07] [0.14] [0.12]3 21 19.1 25.85 C (97.6%) 0.53 0.09 0.74 0.34 50.18[0.65] [0.38] [-0.13] [0.72] [0.2]4 15 13.9 18.40 M (66.7%) 0.82 0.09 0.75 0.31 49.32[ ] [0.07] [0.15] [0.07] [-0.02]5 13 11.7 10.77 P (76.9%) 0.20 0.08 0.77 0.33 48.57[-0.29] [-0.26] [0.40] [0.54] [-0.22]6 12 11.5 11.50 U (48.9%) 0.20 0.07 0.76 0.28 48.92[-0.29] [-0.82] [0.26] [-0.63] [-0.12]7 11 10.6 19.82 P (63.6%) 0.56 0.09 0.76 0.34 48.74[0.70] [0.44] [0.26] [0.72] [-0.14]8 10 10.0 11.60 C (75.0%) 0.04 0.07 0.76 0.29 47.78[-0.86] [-0.66] [0.28] [-0.45] [-0.30]

Principal Component Analysis.

Finally we perform PCA over the space of the four codon bias indices (

CompAI , CAI , tAI , N c ) measured for eachE.coli gene. The two ﬁrst principal components (PC1 and PC2) turn out to represent for as much as 85% of thetotal variance of codon bias over the genome (left plot of Fig. 7). Projection of the ﬁrst two principal componentson the individual codon bias indices (loadings) shows that none of the four indices predominantly contributes to thedata variability (right plot of Fig. 7). Thus, the placement of a gene in the PC1-PC2 plane depends on a weightedcontribution of all the indices. Interestingly, the genes encoding for the proteins of the eight top MCODE communitiesare well localized and separated in this reduced space (Fig. 8). In particular, the ﬁrst community ( i.e. , the core ofribosomal proteins characterized by high values of both

CompAI and tAI ) is located in the upper left part of thegraph, isolated from the others. This represents an important evidence: proteins that belong to the densest connectedcores of the interactome are well-localized in the space of the two principal components. In other words, if a set of

PCA components no r m a li ze d e i g e nv a l u e -0.6-0.300.30.6 P C C o m p A I C A I t A I N c -0.6-0.300.30.6 P C FIG. 7. Left plot:

Eigenvalues of the correlation matrix between the codon bias indices on expressed sequences.

Right plot:

Projection of the ﬁrst two PCA components on the individual codon bias indices.

Recalling that Nc is anticorrelated with the other codon bias indices, PC1 results from a weighted and coherent contribution of all the indices,whereas, for PC2 the contribution of CompAI and tAI is opposite to that of

CAI and Nc . proteins are physically and functionally connected in a module, then their corresponding genes should share commoncodon bias features. Conversely, we can obtain an estimate for the conditional probability P r (link | d ) of a functionalinteraction between proteins, provided that their relative genes fall within a distance d in the plane of the two principalcomponents PC1 and PC2. Reasonably, we compare P r (link | d ) estimated on the real interactome with h P r ( link | d ) i Ω -3 -1,5 0 1,5 PC1 -2-1,5-1-0,500,511,522,53 P C FIG. 8.

Centroids of the top MCODE communities in the space of the ﬁrst two PCA components.

The errorbars denote the variance of the centroids. estimated on the Conﬁguration Model (CM) which, we recall, is a degree-conserving randomization (re-wiring) ofthe network. Fig. 9 shows the Z -score for P r (link | d ) as a function of d , and reveals a peculiar behavior: for smalldistances ( d ≤

2) the probability of ﬁnding a connection between two proteins is much higher than what could havebeen expected from a (degree-conserving) random link placement. Conversely, for medium distances (3 ≤ d ≤ d -20-1001020 Z s c o r e FIG. 9.

Histogram of the Z -score for P r (link | d ) for each pair of genes and their respectively encoded proteins. d is the Euclidean distance between pairs of genes in the space of the ﬁrst two PCA components, and P r (link | d ) is the conditionalprobability of having a link in the PIN between two proteins given that their encoding genes are localized within a distances d in the PC1-PC2 plane. The Z -score is obtained as Z [ P r (link | d )] = [ P r (link | d ) − h P r (link | d ) i Ω ] /σ Ω [ P r (link | d )]. The graydashed lines mark the signiﬁcance interval of ± σ . CONCLUSION

In this work we have introduced

CompAI , a novel codon bias index that is inspired by tAI , though conceptuallydistinct. In fact,

CompAI does not make reference to lists of highly expressed genes, and is thus unsupervised andbased on intrinsic information about co-evolution of genes that code for proteins and tRNAs. Conceptually, thedeﬁnition of

CompAI is based on a model that postulates a competition between cognate and near-cognate tRNAsfor the same codon, exposed on the ribosome at each step of protein synthesis. Competitive mechanisms in themachinery of ribosomal translation of genes into proteins have been repeatedly suggested and studied in the literature[39] and deserve further attention in order to understand their role for translation eﬃciency. In particular,

CompAI was designed in order to provide information about the speed of protein synthesis, being based on proofreading delaymechanisms.1Our genome-wide analysis of codon bias in E.coli using

CompAI as well as other commonly used indices revealedthat codon usage metrics resting on counting tRNA genes (

CompAI and tAI ) are strongly and positively correlatedamong themselves—in spite of their conceptually diﬀerent deﬁnition. It would then be quite interesting to check in thefuture whether this correlation is speciﬁc to E.coli or it is universally observed in the genomes of bacterial species thatare either ecologically and evolutionarily close or, by contrast, very far from E.coli. We also found that both

CompAI and tAI correlate with ERI, the degree of conservation for a gene among similar species, and gene essentiality, whereas,

CAI and

N c are less sensitive to these quantities.

CompAI and tAI values also allow to distinguishing three groups ofgenes, that are diﬀerently characterized by both codon choice adaptation, ERI and degree of essentiality, and that alsofeature speciﬁc predominant COG signatures. In particular the third group (C), composed of the few genes that arehighly conserved and with the strongest codon bias adaptation, consists for 30% of essential genes with predominantCOGs J and L—that refer to translation, ribosome structure and biogenesis, replication, recombination and repair.These represent house-keeping and control functions that must be continuously executed by the cell, meaning that thegenes responsible for them have to be expressed most of the time during the cell cycle. These observations stronglysupport the idea that an increasing selection of codons and, in parallel, a correlated modulation of tRNA availabilityco-evolved along the evolutionary history of a species.Finally, we have addressed a theme as relevant as the connection between codon usage bias and protein functionalor physical interactions. Our main result indicates that, in the course of the evolution of a genome, the functionalstructuring of the complex of interactions between proteins has interfered with the peculiar codon-coding formulationof the corresponding genes. In particular we have shown here, for the ﬁrst time to our knowledge, that communitiesof highly connected proteins in the interactome of E.coli correspond to encoding genes that share the same degree ofevolutionary adaptation, as expressed by codon bias indices that synthetically represent genetic information encodedin the tRNAs sector of the genome. Indeed,

CompAI , that is based on a simple representation of tRNA competition,seems to detect the codon bias signal behind communities more consistently than the other indices here considered.Conversely, we have provided evidence that if two genes have similar codon usage patterns then the correspondingproteins have a signiﬁcant probability of being functionally connected and interacting. This result points out thatcodon bias should be a relevant parameter in the fundamental problem of predicting unknown protein-protein in-teractions from genomic information. This study is a ﬁrst exploratory step towards a more complete investigationon how communities within protein-protein interaction networks rest on a consistent but still to be decoded codonbias signal. Indeed, the connection of the topology of a network with an underneath semantics is far from trivial, asrecently pointed out in the specialized literature [40]. Biological PINs and codon bias oﬀer an interesting case studyworth to be investigated in a wider perspective [41].

SUPPORTING INFORMATION

Here we deﬁne and explain the rationale behind the various codon bias indices that have been proposed in theliterature.

Codon Adaptation Index ( CAI ) [12]. The pattern of codon usage is ruled by two simultaneous eﬀects [42]:translational selection towards optimal codons for each amino acid, and genetic drift that allows the persistence ofnon-optimal codons. It is natural to assume that selection is stronger for codons of highly expressed genes, whichthus feature a more pronounced bias in the use of codons. The principle behind

CAI is exactly that codon usage inhighly expressed genes can reveal the optimal ( i.e. , most eﬃcient for translation) codons for each amino acid. Hence,

CAI builds on a reference set of highly expressed genes to assess, for each codon i , the relative synonymous codonusages ( RSCU i ) and the relative codon adaptiveness ( w i ): RSCU i = X i n i n i P j =1 X j ; w i = RSCU i max j =1 ,...,n i { RSCU j } ; (S1)In the RSCU i , X i is the number of occurrences of codon i in the genome, and the sum in the denominator runsover the n i synonyms of i ; RSCU s thus measure codon usage bias within a family of synonymous codons. w i is thendeﬁned as the usage frequency of codon i compared to that of the optimal codon for the same amino acid encoded by i — i.e. , the one which is mostly used in a reference set of highly expressed genes.The CAI for a given gene g is calculated as the geometric mean of the usage frequencies of codons in that gene,2normalized to the maximum CAI value possible for a gene with the same composition of amino acid:

CAI g =  l g Y i =1 w i  /l g , (S2)where the product runs over the l g codons belonging to that gene (except the stop codon). The critical aspect inthe deﬁnition of CAI is that it requires to deﬁne a priori reference set of highly expressed genes that is speciﬁc for agiven organism.

CAI is then not always transferable; yet, since it is tuned on highly expressed genes, it is generallywell correlated with gene expression levels in genomes for which reference gene sets are available. In this work,

CAI for E.coli genes was computed using the DAMBE 5.0 package [43] tRNA Adaptation Index ( tAI ) [13]. The speed of protein synthesis is bound to the waiting time for the correcttRNA to enter the ribosomal A site [44], and thus depends on tRNA concentrations [45] (and, indirectly, on gene copynumbers). The consequent adaptation of codon usage to tRNA availability [22, 46] is at the basis of tAI , which followsthe same mathematical model of CAI —deﬁning for each codon i its absolute ( W i ) and relative ( w i ) adaptivenessvalue: W i = m i X j =1 (1 − s ij ) tGCN ij ; w i = ( W i /W max if W i = 0 w mean otherwise ; (S3)where m i is the number of isoacceptor tRNAs that recognize codon i ( i.e. , tRNAs that carry the same aminoacidthat is encoded by i ), tGCN ij is the gene copy number of the j -th tRNA that recognizes the i -th codon, s ij is aselective constraint on the eﬃciency of the codon-anticodon coupling, W max is the maximum W i value and w mean isthe geometric mean of all w i with W i = 0.The tAI of gene g is eventually deﬁned as the geometric mean of the relative adaptiveness values of its codons, thusestimating the amount of adaptation of gene g to its genomic tRNA pool: tAI g =  l g Y i =1 w i  /l g . (S4)The critical issue for tAI is the selection of a meaningful set of s ij values, i.e. , weights that represent wobble inter-actions between codons and tRNAs. Assuming that tRNA usage is maximal for highly expressed genes, these valuesare chosen in order to optimize the correlation of tAI values with expression levels—exactly as CAI . Besides, whilethe eﬃciencies of the diﬀerent codon-tRNA interactions are expected to vary among diﬀerent organisms, s ij valuesare based on the gene expression in Saccharomyces cerevisiae [13]—thus lacking universality [47]. In this work wehave evaluated tAI values of E.coli genes using the CodonR package [48].

Eﬀective Number of Codons ( N c ) [14].

N c is a measure that quantiﬁes the departure of a gene from therandom usage of synonymous codons. Given a sequence of interest, the computation of

N c [49] starts from thequantity—deﬁned for each family α of synonymous codons: F CF α = m α X k =1 (cid:18) n k α n α (cid:19) (S5)where m α is the number of codons in α (each appearing n α , n α , . . . , n m α times in the sequence) and n α = P m α k =1 n k α . N c then weights these quantities in order to measure amount of entropy in the codon usage of the sequence:

N c = N S + K K P α =1 n αK P α =1 ( n α F CF α ) + K K P α =1 n αK P α =1 ( n α F CF α ) + K K P α =1 n αK P α =1 ( n α F CF α ) (S6)where N S is the number of families with one codon only and K m is the number of families with degeneration m (families with degeneration 6 are divided into two families of degeneration 2 and 4, as they often are subject todiﬀerent selective forces). Note that N c reaches is maximal value (61) when all codons are used equally and itsminimal value (23) when only one codon is used per amino acid (extreme bias). Diﬀerently from both

CAI and3 tAI , N c is a more immediate measure of codon usage that does not require any a priori information nor makes anybiological hypothesis (which constitute its weakness and, and the same time, its strength). Yet, since the eﬀect ofselection is a reduction of entropy for codon usage in a sequence,

N c provides a reliable measure for this eﬀect. Inthis paper we have obtained

N c values through DAMBE 5.0 [43].

SUPPORTING FIGURES

FIG. S1.

Abundance of tGCN cognate and near cognate (according to Watson-Crick base pairing) for eachanti-codon in E.coli.

Data taken from [17]

10 100 100010 -3 -2 k -3 -2 -1 P ( k ) w -4 -3 -2 Q ( w ) FIG. S2. Left plot:

Distribution of conﬁdence levels Q ( w ), with the vertical line indicating the cut-oﬀ we use to separatetrue from false positives. Right plot: Distribution of degrees P ( k ) when Θ = 0 .

9, with the insets showing the samedistribution for the original network (Θ = 0).[1] Hershberg R, Petrov DA. Selection on codon bias. Ann Rev Genet. 2008;42:287–99.[2] Shabalina SA, Spiridonov SA, Kashina A. Sounds of silence: synonimous nucleotides as a key to biological regulation andcomplexity. Nucleic Acids Res. 2013;41(4):2073–2094. [3] Tuller T. Challenges and obstacles related to solving the codon bias riddles. Biochem Soc Trans. 2014;42(1):155–159.[4] Bennetzen JL, Hall BD. Codon selection in yeast. J Biol Chem. 1982;257(6):3026–3031.[5] Gouy M, Gautier C. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 1982;10(22):7055–7074.[6] Jansen R, Bussemaker HJ, Gerstein M. Revisiting the codon adaptation index from a whole-genome perspective: analyzingthe relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res.2003;31(8):2242–2251.[7] Li GW, Burkhardt D, Gross C, Weissmann JS. Quantifying absolute protein synthesis rates reveals principles underlyingallocation of cellular resources. Cell 2014;157(3):624–635.[8] Pop C, Rouskin S, Ingolia NT, Han L, Phizicky EM, Weissman JS. Causal signals between codon bias, mRNA structure,and the eﬃciency of translation and elongation. Mol Syst Biol. 2014;10:770.[9] Daubin V, Perri`ere G. G+C3 structuring along the genome: a common feature in prokaryotes. Mol Biol Evol.2003;20(4):471–483.[10] Eyre-Walker A. Synonymous codon bias is related to gene length in Escherichia coli: selection for translational accuracy?Mol Biol Evol. 1996;13(6):864–872.[11] Alexander Roth MA, Cannarozzi GM. Measuring codon usage bias. In: Cannarozzi GM, Schneider A, editors. CodonEvolution Mechanisms and Models. Oxford University Press; 2012. p. 189–217.[12] Sharp PM, Li WH. The codon Adaptation Index—a measure of directional synonymous codon usage bias, and its potentialapplications. Nucleic Acids Res. 1987;15(3):1281–1295.[13] dos Reis M, Savva R, Wernisch L. Solving the riddle of codon usage preferences: a test for translational selection. NucleicAcids Res. 2004;32(17):5036–5044.[14] Wright F. The “eﬀective number of codons” used in a gene. Gene. 1990;87(1):23–29.[15] Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, et al. The STRING database in 2011: functionalinteraction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39(Database issue):D561–D568.[16] Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2012;40(Databaseissue):D48–D53.[17] Chan PP, Lowe TM. GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res.2009;37(Database issue):D93–D97.[18] Gerdes SY, Scholle MD, Campbell JW, Bal´zsi G, Ravasz E, Daugherty MD, et al. Experimental determination and systemlevel analysis of essential genes in Escherichia coli MG1655. J Bacteriol. 2003;185(19):5673–5684.[19] Rodnina MV, Wintermeyer W. Fidelity of aminoacyl-tRNA selection on the ribosome: kinetic and structural mechanisms.Annu Rev Biochem. 2001;70:415–435.[20] Crick FH. Codon–anticodon pairing: the wobble hypothesis. J Mol Biol. 1966;19(2):548–555.[21] Gromadski KB, Rodnina MV. Kinetic determinants of high-ﬁdelity tRNA discrimination on the ribosome. Mol Cell.2004;13(2):191–200.[22] Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respectivecodons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system.J Mol Biol. 1981;151(3):389–409.[23] Dong H, Nilsson L, Kurland CG. Co-variation of tRNA abundance and codon usage in Escherichia coli at diﬀerent growthrates. J Mol Biol. 1996;260(5):649–663.[24] Percudani R, Pavesi A, Ottonello S. Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae.J Mol Biol. 1997;268(2):322–330.[25] Kanaya S, Yamada Y, Kudo Y, Ikemura T. Studies of codon usage and tRNA genes of 18 unicellular organisms andquantiﬁcation of Bacillus subtilis tRNAs: gene expression level and species-speciﬁc diversity of codon usage based onmultivariate analysis. Gene. 1999;238(1):143–155.[26] Zhang G, Ignatova Z. Generic algorithm to predict the speed of translational elongation: implications for protein biogenesis.PLoS One. 2009;4(4):e5036.[27] Chien CT, Bartel PL, Sternglanz R, Fields S. The two-hybrid system: a method to identify and clone genes for proteinsthat interact with a protein of interest. Proc Natl Acad Sci USA. 1991;88(21):9578–9582.[28] Phizicky EM, Fields S. Protein-protein interactions: methods for detection and analysis. Microbiol Rev. 1995;59(1):94–123.[29] Puig O, Caspary F, Rigaut G, Rutz B, Bouveret E, Bragado-Nilsson E, et al. The tandem aﬃnity puriﬁcation (TAP)method: a general procedure of protein complex puriﬁcation. Methods. 2001;24(3):218–229.[30] Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.[31] Huang H, Jedynak BM, Bader JS. Where have all the interactions gone? Estimating the coverage of two-hybrid proteininteraction maps. PLoS Comput Biol. 2007;3(11):e214.[32] Bader GD, Hogue CWV. An automated method for ﬁnding molecular complexes in large protein interaction networks.BMC Bioinformatics. 2003;4:2.[33] Newman MEJ. Modularity and community structure in networks. Proc Natl Acad Sci USA. 2006;103(23):8577–8696.[34] Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, et al. The COG database: newdevelopments in phylogenetic classiﬁcation of proteins from complete genomes. Nucleic Acids Res. 2001;29(1):22–8.[35] Jolliﬀe IT. Principal Component Analysis. Series: Springer Series in Statistics. Springer; 2002. 487: p. 28.[36] Park J, Newman MEJ. Statistical mechanics of networks. Phys Rev E 2004;70(6): 066117.[37] Squartini T, Garlaschelli D. Analytical maximum-likelihood method to detect patterns in real networks. New J Phys.2011;13:083001.[38] Butland G, Peregr´ın-Alvarez JM, Li J, Yang W, Yang X, Canadien V, et al. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature. 2005;433(7025):531–537.[39] Fluitt A, Pienaar E, Viljoen H. Ribosome kinetics and aa-tRNA competition determine rate and ﬁdelity of peptidesynthesis. Comput Biol Chem. 2007;31(5-6):335–346.[40] Hric D, Darst RK, Fortunato S. Community detection in networks: Structural communities versus ground truth. PhysRev E. 2014;90(6):062805.[41] Boccaletti S, Bianconi G, Criado R, del Genio CI, G´omez-Garde˜nes J, Romance M, et al. The structure and dynamics ofmultilayer networks. Physics Reports. 2014;544(1):1–122.[42] Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991;129(3):897–907.[43] Xia X. DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol.2013;30(7):1720–8.[44] Varenne S, Buc J, Lloubes R, Lazdunski C. Translation is a non-uniform process. Eﬀect of tRNA availability on the rateof elongation of nascent polypeptide chains. J Mol Biol. 1984;180(3):549–576.[45] Sørensen MA, Kurland CG, Pedersen S. Codon usage determines translation rate in Escherichia coli. J Mol Biol.1989;207(2):365–377.[46] Ikemura T. Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol. 1985;2(1):13–34.[47] Sabi R, Tuller T. Modelling the eﬃciency of codon-tRNA interactions based on codon usage bias. DNA Res. 2014;21(5):511–26.[48] http://people.cryst.bbk.ac.uk/~fdosr01/tAI/index.html [49] Sun X, Yang Q, Xia X. An improved implementation of Eﬀective Number of Codons (Nc). Mol Biol Evol. 2012;30(1):191–196.[50] The density of a graph G with n nodes and l links is the ratio between l and the maximum number of possible links,namely n ( n − /

2, whereas, a k -core is a graph G of minimal degree k , meaning that each node belonging to G has degreegreater or equal than kk