Abstract

We consider the problem of inferring the probability distribution of flux configurations in metabolic network models from empirical flux data. For the simple case in which experimental averages are to be retrieved, data are described by a Boltzmann-like distribution (\propto e^{F/T}) where F is a linear combination of fluxes and the `temperature' parameter T\geq 0 allows for fluctuations. The zero-temperature limit corresponds to a Flux Balance Analysis scenario, where an objective function (F) is maximized. As a test, we have inverse modeled, by means of Boltzmann learning, the catabolic core of Escherichia coli in glucose-limited aerobic stationary growth conditions. Empirical means are best reproduced when F is a simple combination of biomass production and glucose uptake and the temperature is finite, implying the presence of fluctuations. The scheme presented here has the potential to deliver new quantitative insight on cellular metabolism. Our implementation is however computationally intensive, and highlights the major role that effective algorithms to sample the high-dimensional solution space of metabolic networks can play in this field.

Full PDF

CConstraint-based inverse modeling of metabolic networks: a proof of concept

Daniele De Martino and Andrea De Martino

2, 3 Institute of Science and Technology Austria (IST Austria), Klosterneuburg, Austria Soft and Living Matter Lab, Institute of Nanotechnology (CNR-NANOTEC), Consiglio Nazionale delle Ricerche, Rome, Italy Human Genetics Foundation, Turin, Italy

We consider the problem of inferring the probability distribution of ﬂux conﬁgurations in metabolic networkmodels from empirical ﬂux data. For the simple case in which experimental averages are to be retrieved, dataare described by a Boltzmann-like distribution ( ∝ e F/T ) where F is a linear combination of ﬂuxes and the‘temperature’ parameter T ≥ allows for ﬂuctuations. The zero-temperature limit corresponds to a FluxBalance Analysis scenario, where an objective function ( F ) is maximized. As a test, we have inverse modeled,by means of Boltzmann learning, the catabolic core of Escherichia coli in glucose-limited aerobic stationarygrowth conditions. Empirical means are best reproduced when F is a simple combination of biomass productionand glucose uptake and the temperature is ﬁnite, implying the presence of ﬂuctuations. The scheme presentedhere has the potential to deliver new quantitative insight on cellular metabolism. Our implementation is howevercomputationally intensive, and highlights the major role that effective algorithms to sample the high-dimensionalsolution space of metabolic networks can play in this ﬁeld. Building a system-level understanding of metabolism, thehighly conserved set of chemical processes devoted to en-ergy transduction, growth and maintanance in living cells, isa major challenge for systems biology. Constraint-based insilico methods like Flux Balance Analysis (FBA) and its re-ﬁnements play the central role in this endeavour [1]. Suchschemes provide a coherent theoretical framework to analyzethe capabilities of large (genome-scale) metabolic networksstarting from minimal genomic input and physico-chemicalconstraints. Optimal ﬂux conﬁgurations are usually hypothe-sized to optimize a function of the ﬂuxes, like biomass pro-duction, leading to results that can be quantitatively testedagainst experimental data obtained in controlled conditions[2–4]. On the other hand, the choice of an objective func-tion is in many situations not straightforward, and living cellsoften appear to be multi-objective (Pareto) optimal [5–7]. Inaddition, while bulk properties can usually be well describedby optimal ﬂux patterns, experiments performed at single-cellresolution display a considerable degree of cell-to-cell vari-ability [8], which has been linked to the inevitable presence ofrandomness in metabolic processes [9]. At the simplest level,this suggests that ﬂux patterns observed empirically can bethought to be sampled from a stationary probability distribu-tion deﬁned over the space of feasible metabolic states. Thisidea is validated by the fact that empirical growth rate distribu-tions are reproduced by a Maximum Entropy principle at ﬁxedaverage growth rate, according to which ﬂux conﬁgurations v occur with a Boltzmann-like probability distribution P ( v ) ∝ e βλ ( v ) , (1)with λ ( v ) the growth rate and where β > controls the mag-nitude of ﬂuctuations [10]. While more comprehensive stud-ies will hopefully reﬁne this picture, recent work has providedfurther support to and insight into the MaxEnt scenario [11].A tightly related question is whether one can infer the prob-ability distribution of metabolic states from empirical datarather than postulating it. Inverse problems and related tech-niques have a long history in applied ﬁelds [12] and have at-tracted much attention from more theoretical areas in recent years as they directly connect machine learning, artiﬁcial neu-ral networks and statistical mechanics [13]. In the biologicalcontext, they have lead to new insights in domains as apartas protein science [14], neural dynamics [15] and immunol-ogy [16]. In this short note we discuss a method to obtainthe probability distribution of ﬂuxes in a metabolic networkfrom the experimental characterization of a subset of ﬂuxes,and present a proof-of-concept validation based on ﬂux datafor Escherichia coli steady state growth. Taking experimen-tal means as the key features to be captured, we look for ﬂuxdistribution of the form P ( v ) ∝ e F ( v ) /T , (2)where F is a function of ﬂuxes (to be inferred) and T ≥ is anadjustable parameter. In order to learn F , we combine a Boltz-mann learning scheme with a Monte Carlo sampling method.The F thus obtained is found to involve a linear combinationof the biomass output and of the glucose intake rate, and dataare best described by setting T to a ﬁnite (non-zero) value,suggesting, in agreement with previous studies, that empiricalﬂuctuations reﬂect at least in part some unavoidable (and pos-sibly functionally relevant) noise in the organization of ﬂuxconﬁgurations.We shall denote by K the number of different samples inwhich a subset X of ﬂuxes has been experimentally quanti-ﬁed, and by v ( k ) j the value of ﬂux v j in the k -th sample. Theempirical mean of ﬂux v j is given by (cid:104) v j (cid:105) emp = 1 K K (cid:88) k =1 v ( k ) j ( j ∈ X ) . (3)To deﬁne the space of a priori feasible ﬂux conﬁgurationsfor the entire metabolic network, we shall follow the stan-dard route of assuming that viable ﬂux vectors v are non-equilibrium steady states of the underlying system of reac-tions. The feasible space F then corresponds to the solutionsof Sv = , where S stands for the M × N stoichiometric ma-trix ( M denoting the number of chemical species, and N thatof reactions) and where each ﬂux v i is constrained to lie within a r X i v : . [ q - b i o . M N ] A p r an interval [ v min i , v max i ] , whose bounds encode for the physio-logically relevant regulatory, kinetic and thermodynamic con-straints. Geometrically, such an F is a convex polytope. Inprinciple, every point v ∈ F is a feasible ﬂux conﬁgurationand conﬁgurations are assumed to be a priori equiprobable.However, the empirical means (3) represent information thatcan reﬁne this assumption. In particular, one may expect thatﬂux vectors should occur with a probability distribution P ( v ) such that (cid:90) F v j P ( v ) d v = (cid:104) v j (cid:105) emp ( j ∈ X ) . (4)Following the Maximum Entropy idea, we focus on the leastconstrained distribution satisfying (4), which maximizes theentropy S [ P ] = − (cid:90) F P ( v ) log P ( v ) d v (5)subject to (4). This is given by P ( v ) ≡ P ( v | c ) = e (cid:80) j ∈X c j v j Z ( c ) ( v ∈ F ) , (6)where c = { c j } j ∈X denotes the vector of Lagrange multipli-ers enforcing (4) for each v j , while Z ( c ) = (cid:90) F e (cid:80) j ∈X c j v j d v (7)ensures proper normalization.A key question at this point concerns the values of the con-stants c j . More precisely, can we set them so as to repro-duce empirical means most accurately via (4)? By straightfor-wardly maximizing the log-likelihood of the parameters giventhe empirical data, i.e. L ( c | data) = 1 K K (cid:88) k =1 log P (data | c ) , (8)one sees that ∂ L ∂c j = (cid:104) v j (cid:105) emp − (cid:104) v j (cid:105) c , (9)where (cid:104) v j (cid:105) c = (cid:90) F v j P ( v | c ) d v . (10)This suggests that the optimal vector c can be found by anupdating dynamics driven by the difference between the em-pirical mean and the mean computed using the current vector c , i.e. via a Boltzmann learning such as c j ( τ + δτ ) − c j ( τ ) = (cid:104) (cid:104) v j (cid:105) emp − (cid:104) v j (cid:105) c ( τ ) (cid:105) δτ . (11)Ideally, the vector c (cid:63) obtained as the asymptotic ﬁxed pointof (11) ensures the best agreement between empirical and the-oretical means (i.e. between (3) and (4)), while P ( v | c (cid:63) ) pro-vides our best guess for the (stationary) probability distribu-tion compatible with empirical means that has generated our dataset. In this scenario, the quantity E = (cid:80) j ∈X c (cid:63)j v j wouldrepresent the key physical parameter regulating the probabil-ity of occurrence of feasible ﬂux vectors. Note that such an E plays the role of F/T in (2).We have implemented the above scheme using data re-trieved from [17], where

E. coli ’s central carbon metabolismis characterized in minimal glucose-limited aerobic condi-tions, at growth/dilution rates below . / h. We collected K = 35 control experiments and the corresponding valuesfor |X | = 24 reactions ﬂuxes (see [11] for more details).We have then studied the inference problem on the feasiblespace F deﬁned by the E. coli ’s core metabolic network [18].The dimension of F when a minimal aerobic glucose-limitedmedium is used to constrain the exchanges between the celland its surroundings is dim( F ) = 23 .To compute the optimal values of the coefﬁcients c j from(11) we have used the following procedure:1. initialize c j (0) = 0 for all j ∈ X

2. at each time step τ : compute (cid:104) v j (cid:105) c ( τ ) from (10) by sam-pling the distribution P ( v | c ( τ )) via Hit and Run MonteCarlo [19]; then3. ﬁnd the j ∈ X for which the difference (cid:104) v j (cid:105) emp −(cid:104) v j (cid:105) c ( τ ) is largest, update its value according to (11),and iterate.Finally, we set δτ = 10 − .By studying numerically the dynamics of the c j ’s one ﬁndsthat, at sufﬁciently long times τ , coefﬁcients generically be-have as c j ( τ ) (cid:39) L j τ ( j ∈ X ) , (12)where L j are ﬂux-speciﬁc constants. For most ﬂuxes, though, L j = 0 , i.e. the corresponding c j ’s converge in time to ﬁnite(small) values, while L j (cid:54) = 0 for a small number of ﬂuxes (seeFig. 1). This in turn implies that the function E (cid:39) τ (cid:88) j ∈X L j v j (13)is asymptotically dominated by terms with L j (cid:54) = 0 . Exploit-ing the linear dependencies between variables, one can ex-press E in terms of biologically signiﬁcant ﬂuxes other thanthose in X . Quite remarkably, one ﬁnds that the dominantcontribution to E has the form E (cid:39) τ ( L λ λ + L u u ) , (14)where λ and u denote, respectively, the biomass output rateand the glucose in-take ﬂux while L λ and L u are numericalcoefﬁcients given by approximately by L λ (cid:39) / and L u (cid:39) / .The potentially high-dimensional function (13) thereforereduces, after Boltzmann learning, to a simple form that com-bines two variables of the highest biological signiﬁcance. Inparticular, the solution of the inverse problem suggests a sce-nario very similar to that derived in [10]. In addition, though, FIG. 1. Values of c j as a function of the Boltzmann learning time τ for the ﬂuxes j ∈ X . Only a representative sample of the compo-nents with L j (cid:39) is shown, while diverging components are dis-played in the inset.FIG. 2. Values of χ (left) and MSE (right) as a function of theBoltzmann learning time τ obtained by comparing experimental datawith the inferred distribution (15) (grey markers) and the guessed one(16) (red markers). the network’s state also appears to be sensitive to the glucoseimport rate u .We have validated these results by comparing empirical av-erages against the means obtained from the inferred distribu-tion P inf ( v ) ∝ e τ ( L λ λ + L u u ) , (15)as well as against means computed from the simpler form (1),i.e. P ( v ) ∝ e τλ , (16)which was obtained by focusing on growth rate distribu-tions (rather than individual ﬂuxes). In both cases, we haveperformed an asymptotic extrapolation by studying how χ and the mean squared error (MSE) between data and modelschange upon varying τ , which –comparing (2) and (15)– thusplays the role of /T (while L λ λ + L u u plays the role of F ).Results are shown in Fig. 2. One sees that both probabilitydistributions generate clear minima in χ and MSE as func-tions of τ , where data are optimally reproduced. However, theinferred distribution (15) outperforms (16) (deeper minimum, -30 0 30 60 90 120 150 180experimental value-300306090120150180 t h e o r e ti ca l p r e d i c ti on FBA (optimal)Inferred (means) rpitkt, tala gapdenopdhglc_ptsicl,malsg6pdh,gndrpe icdhakgdh,acont,mdh,cs,pgi,tpi,fba,pfk

FIG. 3. Flux-by-ﬂux comparison between experimental values andtheoretical predictions obtained from FBA (blue markers) and fromthe inferred distribution (grey markers). Flux intensities are in per-centage with respect to the glucose uptake. Reaction acronyms aretaken from [18]. albeit slightly) both in terms of χ and in terms of MSE. Inter-estingly, the inferred function provides better results at a largervalue of τ compared to P , which suggests that metabolic con-ﬁgurations are closer to optima of F = L λ λ + L u u thanto optima of λ . In order to get a more precise idea of theimprovement obtained via inference, we have displayed inFig. 3 a detailed ﬂux-by-ﬂux comparison between inferredmeans (with errors) and experimental means, also showing forsakes of completeness the predictions obtained from standardbiomass-maximizing FBA.It is important to note that the Boltzmann learning dynamicsdoes not converge in general to a vector c such that theoreticalmeans match empirical ones perfectly as τ → ∞ . A possiblereason is that empirical means lie outside the feasible space F . Flux values are in fact tightly constrained by the mass bal-ance equations Sv = , and small numerical errors that mayarise e.g. from experimental noise can lead to violations ofthese conditions. While not surprising per se , this representsa further complication for the inference problem, which formetabolic networks relies on the deﬁnition of F , and furtherinvestigations are needed.To summarize, previous work has shown that maximum en-tropy distributions at ﬁxed average growth rate in the space offeasible metabolic states reproduce empirical growth rate dis-tributions measured in exponentially growing populations atsingle-cell resolution [10] and outperform standard FBA (re-trieved in the limit β → ∞ ) in reproducing experimental dataon ﬂuxes [11]. The approach discussed here generalizes theabove results by extending the input dataset to a subset ofmetabolic ﬂuxes. Following standard inference schemes, wehave computed the function F ( v ) that best describes the ﬂuxdataset as being generated from a Boltzmann-like distribution ∝ e F ( v ) /T . Quite remarkably, the biomass output and theglucose intake emerge from the inference process as the keyﬂuxes that govern the statistics of ﬂuxes. The type of inversemodeling discussed here represents a novel and potentiallypowerful tool to analyze metabolic networks and characterizetheir large-scale organization in a way that is fully data-driven.‘Energy’ functions F obtained in this way may provide a newperspective on metabolic functions and objectives in complexsettings where biomass growth optimization alone does notsufﬁce to explain observations. However, the predictive ca-pacity of inverse methods is intrinsically limited by the qualityof available data. In addition, the scheme employed here has ahigh computational cost. In fact, as a Monte Carlo sampling ofthe feasible space F is required at every Boltzmann learning step, performing the same analysis on larger (genome-scale)networks, with a feasible space F whose dimension can bean order of magnitude larger than that addressed here, is verydifﬁcult. This problem may however be effectively solved bythe use of approximate representations of the feasible spaceand/or of more efﬁcient computational heuristics [20, 21]. Inthis sense, the present note represents merely a proof of con-cept and more work is needed to make these ideas viable atgenome resolution.We thank A. Braunstein, A. P. Muntoni and A. Pagnani fora critical revision of these ideas and for important commentsand suggestions. We acknowledge the support of the PeopleProgramme (Marie Curie Actions) of the European Union’sSeventh Framework Programme (FP7/2007-2013) under REAgrant agreement no. [291734] (D.D.M). [1] Bordbar A, et al. Constraint-based models predict metabolicand associated cellular functions. Nature Rev Genet (2014)15:107-120[2] Schuetz R, Kuepfer L, Sauer U. Systematic evaluation of objec-tive functions for predicting intracellular ﬂuxes in Escherichiacoli. Molec Sys Biol (2007) 3: 119[3] Knorr AL, Jain R, Srivastava R. Bayesian-based selection ofmetabolic objective functions. Bioinformatics (2007) 23:351-357[4] Feist A and Palsson B. The biomass objective function. CurrOpin Microbiol (2010) 13:344-349[5] Schuetz R, et al. Multidimensional optimality of microbialmetabolism. Science (2012) 336:601[6] Hart Y, et al. Inferring biological tasks using Pareto analysis ofhigh-dimensional data. Nature Methods (2015) 12:233[7] Mori M, Marinari E, De Martino A. A yield-cost trade-off governs Escherichia coli’s decision between fermentationand respiration in carbon-limited growth. BiorXiv preprint,https://doi.org/10.1101/113183 (2017)[8] Wallden M, et al. The synchronization of replication and divi-sion cycles in individual E. coli cells. Cell (2016) 166:729-739[9] Kiviet, DJ, et al. Stochasticity of metabolism and growth at thesingle-cell level. Nature (2014) 514: 376-379[10] De Martino D, Capuani F and De Martino A. Growth againstentropy in bacterial metabolism: the phenotypic trade-off be-hind empirical growth rate distributions in E. coli. Phys Biol(2016) 13:036005[11] De Martino D, Andersson A, Bergmiller T, Guet C, TkacikG, Statistical mechanics for metabolic networks during steady-state growth. ArXiv preprint, https://arxiv.org/abs/1703.01818(2017) [12] Tarantola, A. Inverse problem theory and methods for modelparameter estimation (Society for Industrial and Applied Math-ematics, 2005)[13] Chau Nguyen H, Zecchina R, Berg J. Inverse statistical prob-lems: from the inverse Ising problem to data science. ArXivpreprint https://arxiv.org/abs/1702.01522 (2017)[14] Cocco S, et al. Inverse statistical physics of pro-tein sequences: a key issues review. ArXiv preprint,https://arxiv.org/abs/1703.01222 (2017)[15] Roudi Y, Aurell E, Hertz J. Statistical physics of pairwise prob-ability models. Front Comput Neurosci (2009) 3:22[16] Kaplinsky J and Arnaout R. Robust estimates of overallimmune-repertoire diversity from high-throughput measure-ments on samples. Nature Comm (2016) 7:11881[17] Zhang Z et al. CeCaFDB: a curated database for the documen-tation, visualization and comparative analysis of central carbonmetabolic ﬂux distributions explored by13