A machine learning approach to investigate regulatory control circuits in bacterial metabolic pathways
Francesco Bardozzo (1), Pietro Liò (2), Roberto Tagliaferri (1)
(1) NeuRoNe Lab, DISA-MIS, University of Salerno, Via Giovanni Paolo II, 132, Fisciano, Italy, [email protected]
(2) Computer Laboratory, University of Cambridge, 15 JJ Thomson Ave, CB3 0FD, Cambridge, United Kingdom, [email protected]
Keywords: metabolic pathways, alternating sequences, multi-omic circuits, metabolic functional data, bacteria.
Abstract.
This work describes a machine learning approach for identifying the multi-omic metabolic regulatory control circuits inside pathways. The analysis of these circuits makes it possible to identify the bacterial metabolic pathways that are more regulated than others in terms of their multi-omics. This follows from the alternation of the omic values of codon usage and protein abundance along the circuits. The Glycolysis of E. coli and its multi-omic circuit features are shown as an example.
1 Background
In bacterial metabolic pathways it is possible to identify different small circuits that lead from one intermediate compound to another. Each bacterial pathway can be considered as a highly specific directed graph that contains more than one multi-omic circuit (MOC). In standard conditions it is possible to identify which pathways are more regulated than others in terms of their alternating multi-omic contribution. The MOCs that belong to a specific pathway could be discovered through flux-balance analysis [1]. In this work we instead propose a machine learning approach to study the MOCs. This method takes multi-omic values into account and looks for the information carried by alternating sequences of multi-omic values. The proteins are ordered in a sequence according to their position on the circuit, and omic values of protein abundance and codon usage are associated with them. An upper bound corresponding to the ideal sequence of alternating omic values is given, and the observed sequence of omic values is compared with this ideal alternation. The presence of alternating omic values may reflect the metabolic regulatory control behaviour of the specific circuit inside the metabolic pathway: the more the observed sequence differs from the upper bound, the less the circuit is regulated, and vice versa [2]. Another important consequence of identifying the MOCs concerns the intermediate compound at the circuit output. For example, the intermediate compound at the end of a circuit can be considered the result of a regulatory control path and, for this reason, may have strategic importance in the design of a metabolic network. In this setting, this work may be useful for the reconstruction of metabolic networks based on metabolic functional data [3, 4].
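The comparison with an ideal alternation can be illustrated with a small sketch. It is not the similarity measure used in this work, whose scoring is described in [12]; it is a simplified stand-in that assumes the ideal sequence changes direction (rise, fall, rise, ...) at every interior position, and the function name `alternation_score` is hypothetical.

```python
from typing import Sequence


def alternation_score(v: Sequence[int]) -> float:
    """Fraction of interior positions where the sequence changes direction.

    Assumption (not from the paper): the ideal alternating sequence flips
    direction at every interior position and therefore scores 1.0.
    """
    # Sign of each consecutive difference: +1 rise, -1 fall, 0 flat.
    steps = [(b > a) - (b < a) for a, b in zip(v, v[1:])]
    if len(steps) < 2:
        return 0.0  # fewer than three values: no interior position to test
    # A flip is a rise followed by a fall, or a fall followed by a rise.
    flips = sum(1 for s, t in zip(steps, steps[1:]) if s * t == -1)
    return flips / (len(steps) - 1)
```

Under this assumption a perfectly alternating vector such as `[1, 5, 2, 6, 3]` reaches the upper bound, while a monotone vector such as `[1, 2, 3, 4, 5]` scores zero.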
2 Materials and Methods
Our analyses are based on the Glycolysis of Escherichia coli K-12 MG1655. The codon adaptation index (CAI) is a codon usage index computed as described by Sharp and Li [5]. The second omic value considered is the protein abundance (PA) in standard [...]

Proceedings of CIBB 2016
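As an illustration of the CAI mentioned above, the following minimal sketch computes the index as the geometric mean of the relative adaptiveness of the codons of a gene, where the relative adaptiveness of a codon is its frequency in a reference gene set divided by the frequency of the most used synonymous codon. The two-amino-acid codon table, the function names and the choice to skip codons with zero weight are simplifying assumptions made for this sketch, not part of the analysis pipeline of this work.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical, deliberately tiny synonymous-codon table (two amino acids);
# a real analysis would cover the full genetic code.
SYNONYMS: Dict[str, List[str]] = {
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}


def relative_adaptiveness(reference_codons: List[str]) -> Dict[str, float]:
    """w(codon) = count(codon) / max count among its synonymous codons."""
    counts = Counter(reference_codons)
    w: Dict[str, float] = {}
    for codons in SYNONYMS.values():
        top = max(counts[c] for c in codons)
        if top > 0:  # amino acid observed in the reference set
            for c in codons:
                w[c] = counts[c] / top
    return w


def cai(gene_codons: List[str], w: Dict[str, float]) -> float:
    """Geometric mean of w over the codons of a gene.

    Simplification: codons with no weight or a zero weight are skipped.
    """
    logs = [math.log(w[c]) for c in gene_codons if w.get(c, 0.0) > 0.0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

A gene made only of the most frequent codons of the reference set obtains a CAI of 1, and the value decreases as rarer synonymous codons are used.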
Machine Learning Approach
An MOC has a starting point, which we call the starting protein (SP), and an end point, which we call the ending protein (EP). Between these two points there can be a variable number of proteins, each with its own PA and CAI.

Figure 1: (a) In this metabolic pathway 2 MOCs of length 5 are found. (b) The position of the genes on the double strand and the proteins encoded at the network level are shown. (c) The two MOCs are selected with 2 different criteria. The former selects the SP and EP by merging the double strand and considering the positions of the genes as merged. The latter selects the SP and EP from the strand (5’-3’) or from the strand (3’-5’). With this last criterion it is impossible to have a circuit that has an SP from the 5’-3’ strand and an EP from the 3’-5’ strand.

Figure 1 (a) shows two circuits that start from two different SPs (s) (1** and 1*) and lead to a single EP (5 (t)). The EP is chosen as the end of the circuit because there is a directed path from the SPs to this protein. While identifying the SP and EP we maintain the order of the genes (related to these proteins) on the double strand. For example, in Figure 1 (b) both the 1* and the 1** genes are positioned before the gene labelled as 5 (t). The paths found (violet and orange) are the shortest paths of length 5 but, in this case, there is a relevant difference between the positions of the SPs (s). The first SP is positioned in the same direction (5’-3’) as the EP, whereas the second one is positioned in the direction 3’-5’ with respect to the (5’-3’) EP. This change of direction between genes may also concern genes located inside the circuit and not only at its extremes. Therefore, it [...] Ψ to compute the number of classes for the normalised multi-omic values of the MOCs.
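The selection of a circuit as a shortest directed path from an SP to an EP can be sketched with a breadth-first search. The toy graph, the strand labels and the function name `shortest_circuit` are hypothetical and only mirror the situation of Figure 1 (a), with two SPs leading to one EP; the check on the gene order along the strand is omitted for brevity. Passing the strand map enables the second criterion, which here is assumed to require only that SP and EP lie on the same strand.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical toy data: directed protein graph and the strand of each gene.
GRAPH: Dict[str, List[str]] = {
    "p1": ["p2"], "p2": ["p3"], "p3": ["p4"], "p4": ["p5"], "p5": [],
    "q1": ["p2"],
}
STRAND: Dict[str, str] = {
    "p1": "+", "p2": "+", "p3": "+", "p4": "+", "p5": "+", "q1": "-",
}


def shortest_circuit(
    graph: Dict[str, List[str]],
    sp: str,
    ep: str,
    strand: Optional[Dict[str, str]] = None,
) -> Optional[List[str]]:
    """Return a shortest directed path from sp to ep, or None.

    If `strand` is given, sp and ep must lie on the same strand
    (single-strand criterion); otherwise the strands are treated as merged.
    """
    if strand is not None and strand[sp] != strand[ep]:
        return None
    parent: Dict[str, Optional[str]] = {sp: None}
    queue = deque([sp])
    while queue:
        node = queue.popleft()
        if node == ep:
            # Walk back through the parents to rebuild the path.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

With the merged criterion both `p1` and `q1` reach `p5` through circuits of 5 proteins, while with the strand map the circuit starting at `q1` is rejected because its SP and EP lie on opposite strands. Each path returned in this way is a candidate MOC whose proteins carry the omic values that Ψ discretises.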
In this way, as illustrated in Figure 2, we can transform the multi-omic values into a vector of integers v.

Figure 2: The multi-omic values of the orange MOC are transformed into integers by the function Ψ. This circuit presents alternating multi-omic values (v).

For each v we can obtain a score of the relative distances between the omics in their incremental position, as described in [12]. Then, it is possible to obtain a similarity measure σ between the scored v of the MOC and the ideal sequence. Figure 3 plots the similarity measures against the circuit lengths; similarity values from zero to one correspond to dots of increasing size. Moreover, in Figure 3 the proteins produced by a single DNA strand are coloured in yellow, and those produced by double strands are in violet.
3 Results
We have extracted all the possible MOCs from the Glycolysis. Ψ returns a number of classes equal to 7. This metabolic pathway contains 40 proteins. For all the MOC sequences we computed the score for the multi-omic contributions of the CAI and PA. Before being transformed by Ψ, all the values are normalised with a standard normalisation, summed and averaged. The metabolic pathway of Glycolysis presents 134 potential MOCs, in the direction from the SP to the EP.

Figure 3: The plot shows the circuits extracted from the Glycolysis metabolic pathway. The similarity σ is displayed on the x axis, while the circuit length is on the y axis. The plot highlights the relationship between the circuit length and its similarity σ to the ideal alternating sequence of multi-omic values. The similarity increases with the size of the dots. The proteins produced by a single DNA strand (SS) are coloured in yellow, those produced by double strands (DS) are in violet.

In particular, in Figure 3 the MOCs of Glycolysis have a length varying from 2 to 6 proteins. Of the 134 MOCs of the Glycolysis, 47 are single-strand MOCs. The yellow dots in Figure 3 appear only on MOCs of length 2 and 3. In the Glycolysis there are only 4 operons, and only two of them form two MOCs that cover the proteins aceE, aceF, lpd and ascF, ascB, with a σ ≤ . . In this example there is no operon that represents an MOC with a high σ in the metabolic network.
4 Conclusion
We presented a machine learning approach for the identification of metabolic regulatory control circuits inside bacterial metabolic pathways. The MOCs were investigated in relation to multi-omic data. We have shown that operons do not play a key role when their proteins are present in the MOCs. Moreover, in this pathway the single-strand and double-strand models show different distributions with respect to the length of the MOCs. It is necessary to extend the analyses to the whole genome and to study the variations caused by external factors.

References
[1] Oyarzún, D. A., Stan, G.-B. V. "Synthetic gene circuits for metabolic control: design trade-offs and constraints". Journal of The Royal Society Interface, 10(78), 20120671, 2013.
[2] Orth, J. D., Thiele, I., Palsson, B. "What is flux balance analysis?". Nature Biotechnology, 28(3), pp. 245-248, 2010.
[3] Raman, K., Chandra, N. "Flux balance analysis of biological systems: applications and challenges". Briefings in Bioinformatics, 10(4), pp. 435-449, 2014.
[4] Feist, A. M., et al. "Reconstruction of biochemical networks in microorganisms". Nature Reviews Microbiology, 7(2), pp. 129-143, 2009.
[5] Sharp, P. M., Li, W. H. "The codon Adaptation Index, a measure of directional synonymous codon usage bias, and its potential applications". Nucleic Acids Research, 15, pp. 1281-1295, 1987.
[6] Wang, M., Herrmann, C. J., Simonovic, M., Szklarczyk, D., Mering, C. "Version 4.0 of PaxDb: Protein abundance data, integrated across model organisms, tissues, and cell lines". Proteomics, 2015.
[7] Kanehisa, M., Goto, S. "KEGG: Kyoto encyclopedia of genes and genomes". Nucleic Acids Research, 28(1), pp. 27-30, 2000.
[8] Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Wheeler, D. L. "GenBank". Nucleic Acids Research, 33(1), pp. 34-38, 2005.
[9] Maier, T., Guell, M., Serrano, L. "Correlation of mRNA and protein in complex biological samples". FEBS Letters, 583(24), pp. 3966-3973, 2009.
[10] Newbury, S. F., Smith, N. H., Higgins, C. F. "Differential mRNA stability controls relative gene expression within a polycistronic operon". Cell, 51(6), pp. 1131-1143, 1987.
[11] Bardozzo, F., Liò, P., Tagliaferri, R. "Multi omic oscillations in bacterial pathways". International Joint Conference on Neural Networks IJCNN, IEEE, pp. 1-8, 2015.
[12] Bardozzo, F., Liò, P., Tagliaferri, R. "Novel algorithms to detect oscillatory patterns in multi omic metabolic networks".