Constraint-based inverse modeling of metabolic networks: a proof of concept
CConstraint-based inverse modeling of metabolic networks: a proof of concept
Daniele De Martino and Andrea De Martino
2, 3 Institute of Science and Technology Austria (IST Austria), Klosterneuburg, Austria Soft and Living Matter Lab, Institute of Nanotechnology (CNR-NANOTEC), Consiglio Nazionale delle Ricerche, Rome, Italy Human Genetics Foundation, Turin, Italy
We consider the problem of inferring the probability distribution of flux configurations in metabolic networkmodels from empirical flux data. For the simple case in which experimental averages are to be retrieved, dataare described by a Boltzmann-like distribution ( ∝ e F/T ) where F is a linear combination of fluxes and the‘temperature’ parameter T ≥ allows for fluctuations. The zero-temperature limit corresponds to a FluxBalance Analysis scenario, where an objective function ( F ) is maximized. As a test, we have inverse modeled,by means of Boltzmann learning, the catabolic core of Escherichia coli in glucose-limited aerobic stationarygrowth conditions. Empirical means are best reproduced when F is a simple combination of biomass productionand glucose uptake and the temperature is finite, implying the presence of fluctuations. The scheme presentedhere has the potential to deliver new quantitative insight on cellular metabolism. Our implementation is howevercomputationally intensive, and highlights the major role that effective algorithms to sample the high-dimensionalsolution space of metabolic networks can play in this field. Building a system-level understanding of metabolism, thehighly conserved set of chemical processes devoted to en-ergy transduction, growth and maintanance in living cells, isa major challenge for systems biology. Constraint-based insilico methods like Flux Balance Analysis (FBA) and its re-finements play the central role in this endeavour [1]. Suchschemes provide a coherent theoretical framework to analyzethe capabilities of large (genome-scale) metabolic networksstarting from minimal genomic input and physico-chemicalconstraints. Optimal flux configurations are usually hypothe-sized to optimize a function of the fluxes, like biomass pro-duction, leading to results that can be quantitatively testedagainst experimental data obtained in controlled conditions[2–4]. On the other hand, the choice of an objective func-tion is in many situations not straightforward, and living cellsoften appear to be multi-objective (Pareto) optimal [5–7]. Inaddition, while bulk properties can usually be well describedby optimal flux patterns, experiments performed at single-cellresolution display a considerable degree of cell-to-cell vari-ability [8], which has been linked to the inevitable presence ofrandomness in metabolic processes [9]. At the simplest level,this suggests that flux patterns observed empirically can bethought to be sampled from a stationary probability distribu-tion defined over the space of feasible metabolic states. Thisidea is validated by the fact that empirical growth rate distribu-tions are reproduced by a Maximum Entropy principle at fixedaverage growth rate, according to which flux configurations v occur with a Boltzmann-like probability distribution P ( v ) ∝ e βλ ( v ) , (1)with λ ( v ) the growth rate and where β > controls the mag-nitude of fluctuations [10]. While more comprehensive stud-ies will hopefully refine this picture, recent work has providedfurther support to and insight into the MaxEnt scenario [11].A tightly related question is whether one can infer the prob-ability distribution of metabolic states from empirical datarather than postulating it. Inverse problems and related tech-niques have a long history in applied fields [12] and have at-tracted much attention from more theoretical areas in recent years as they directly connect machine learning, artificial neu-ral networks and statistical mechanics [13]. In the biologicalcontext, they have lead to new insights in domains as apartas protein science [14], neural dynamics [15] and immunol-ogy [16]. In this short note we discuss a method to obtainthe probability distribution of fluxes in a metabolic networkfrom the experimental characterization of a subset of fluxes,and present a proof-of-concept validation based on flux datafor Escherichia coli steady state growth. Taking experimen-tal means as the key features to be captured, we look for fluxdistribution of the form P ( v ) ∝ e F ( v ) /T , (2)where F is a function of fluxes (to be inferred) and T ≥ is anadjustable parameter. In order to learn F , we combine a Boltz-mann learning scheme with a Monte Carlo sampling method.The F thus obtained is found to involve a linear combinationof the biomass output and of the glucose intake rate, and dataare best described by setting T to a finite (non-zero) value,suggesting, in agreement with previous studies, that empiricalfluctuations reflect at least in part some unavoidable (and pos-sibly functionally relevant) noise in the organization of fluxconfigurations.We shall denote by K the number of different samples inwhich a subset X of fluxes has been experimentally quanti-fied, and by v ( k ) j the value of flux v j in the k -th sample. Theempirical mean of flux v j is given by (cid:104) v j (cid:105) emp = 1 K K (cid:88) k =1 v ( k ) j ( j ∈ X ) . (3)To define the space of a priori feasible flux configurationsfor the entire metabolic network, we shall follow the stan-dard route of assuming that viable flux vectors v are non-equilibrium steady states of the underlying system of reac-tions. The feasible space F then corresponds to the solutionsof Sv = , where S stands for the M × N stoichiometric ma-trix ( M denoting the number of chemical species, and N thatof reactions) and where each flux v i is constrained to lie within a r X i v : . [ q - b i o . M N ] A p r an interval [ v min i , v max i ] , whose bounds encode for the physio-logically relevant regulatory, kinetic and thermodynamic con-straints. Geometrically, such an F is a convex polytope. Inprinciple, every point v ∈ F is a feasible flux configurationand configurations are assumed to be a priori equiprobable.However, the empirical means (3) represent information thatcan refine this assumption. In particular, one may expect thatflux vectors should occur with a probability distribution P ( v ) such that (cid:90) F v j P ( v ) d v = (cid:104) v j (cid:105) emp ( j ∈ X ) . (4)Following the Maximum Entropy idea, we focus on the leastconstrained distribution satisfying (4), which maximizes theentropy S [ P ] = − (cid:90) F P ( v ) log P ( v ) d v (5)subject to (4). This is given by P ( v ) ≡ P ( v | c ) = e (cid:80) j ∈X c j v j Z ( c ) ( v ∈ F ) , (6)where c = { c j } j ∈X denotes the vector of Lagrange multipli-ers enforcing (4) for each v j , while Z ( c ) = (cid:90) F e (cid:80) j ∈X c j v j d v (7)ensures proper normalization.A key question at this point concerns the values of the con-stants c j . More precisely, can we set them so as to repro-duce empirical means most accurately via (4)? By straightfor-wardly maximizing the log-likelihood of the parameters giventhe empirical data, i.e. L ( c | data) = 1 K K (cid:88) k =1 log P (data | c ) , (8)one sees that ∂ L ∂c j = (cid:104) v j (cid:105) emp − (cid:104) v j (cid:105) c , (9)where (cid:104) v j (cid:105) c = (cid:90) F v j P ( v | c ) d v . (10)This suggests that the optimal vector c can be found by anupdating dynamics driven by the difference between the em-pirical mean and the mean computed using the current vector c , i.e. via a Boltzmann learning such as c j ( τ + δτ ) − c j ( τ ) = (cid:104) (cid:104) v j (cid:105) emp − (cid:104) v j (cid:105) c ( τ ) (cid:105) δτ . (11)Ideally, the vector c (cid:63) obtained as the asymptotic fixed pointof (11) ensures the best agreement between empirical and the-oretical means (i.e. between (3) and (4)), while P ( v | c (cid:63) ) pro-vides our best guess for the (stationary) probability distribu-tion compatible with empirical means that has generated our dataset. In this scenario, the quantity E = (cid:80) j ∈X c (cid:63)j v j wouldrepresent the key physical parameter regulating the probabil-ity of occurrence of feasible flux vectors. Note that such an E plays the role of F/T in (2).We have implemented the above scheme using data re-trieved from [17], where
E. coli ’s central carbon metabolismis characterized in minimal glucose-limited aerobic condi-tions, at growth/dilution rates below . / h. We collected K = 35 control experiments and the corresponding valuesfor |X | = 24 reactions fluxes (see [11] for more details).We have then studied the inference problem on the feasiblespace F defined by the E. coli ’s core metabolic network [18].The dimension of F when a minimal aerobic glucose-limitedmedium is used to constrain the exchanges between the celland its surroundings is dim( F ) = 23 .To compute the optimal values of the coefficients c j from(11) we have used the following procedure:1. initialize c j (0) = 0 for all j ∈ X
2. at each time step τ : compute (cid:104) v j (cid:105) c ( τ ) from (10) by sam-pling the distribution P ( v | c ( τ )) via Hit and Run MonteCarlo [19]; then3. find the j ∈ X for which the difference (cid:104) v j (cid:105) emp −(cid:104) v j (cid:105) c ( τ ) is largest, update its value according to (11),and iterate.Finally, we set δτ = 10 − .By studying numerically the dynamics of the c j ’s one findsthat, at sufficiently long times τ , coefficients generically be-have as c j ( τ ) (cid:39) L j τ ( j ∈ X ) , (12)where L j are flux-specific constants. For most fluxes, though, L j = 0 , i.e. the corresponding c j ’s converge in time to finite(small) values, while L j (cid:54) = 0 for a small number of fluxes (seeFig. 1). This in turn implies that the function E (cid:39) τ (cid:88) j ∈X L j v j (13)is asymptotically dominated by terms with L j (cid:54) = 0 . Exploit-ing the linear dependencies between variables, one can ex-press E in terms of biologically significant fluxes other thanthose in X . Quite remarkably, one finds that the dominantcontribution to E has the form E (cid:39) τ ( L λ λ + L u u ) , (14)where λ and u denote, respectively, the biomass output rateand the glucose in-take flux while L λ and L u are numericalcoefficients given by approximately by L λ (cid:39) / and L u (cid:39) / .The potentially high-dimensional function (13) thereforereduces, after Boltzmann learning, to a simple form that com-bines two variables of the highest biological significance. Inparticular, the solution of the inverse problem suggests a sce-nario very similar to that derived in [10]. In addition, though, FIG. 1. Values of c j as a function of the Boltzmann learning time τ for the fluxes j ∈ X . Only a representative sample of the compo-nents with L j (cid:39) is shown, while diverging components are dis-played in the inset.FIG. 2. Values of χ (left) and MSE (right) as a function of theBoltzmann learning time τ obtained by comparing experimental datawith the inferred distribution (15) (grey markers) and the guessed one(16) (red markers). the network’s state also appears to be sensitive to the glucoseimport rate u .We have validated these results by comparing empirical av-erages against the means obtained from the inferred distribu-tion P inf ( v ) ∝ e τ ( L λ λ + L u u ) , (15)as well as against means computed from the simpler form (1),i.e. P ( v ) ∝ e τλ , (16)which was obtained by focusing on growth rate distribu-tions (rather than individual fluxes). In both cases, we haveperformed an asymptotic extrapolation by studying how χ and the mean squared error (MSE) between data and modelschange upon varying τ , which –comparing (2) and (15)– thusplays the role of /T (while L λ λ + L u u plays the role of F ).Results are shown in Fig. 2. One sees that both probabilitydistributions generate clear minima in χ and MSE as func-tions of τ , where data are optimally reproduced. However, theinferred distribution (15) outperforms (16) (deeper minimum, -30 0 30 60 90 120 150 180experimental value-300306090120150180 t h e o r e ti ca l p r e d i c ti on FBA (optimal)Inferred (means) rpitkt, tala gapdenopdhglc_ptsicl,malsg6pdh,gndrpe icdhakgdh,acont,mdh,cs,pgi,tpi,fba,pfk
FIG. 3. Flux-by-flux comparison between experimental values andtheoretical predictions obtained from FBA (blue markers) and fromthe inferred distribution (grey markers). Flux intensities are in per-centage with respect to the glucose uptake. Reaction acronyms aretaken from [18]. albeit slightly) both in terms of χ and in terms of MSE. Inter-estingly, the inferred function provides better results at a largervalue of τ compared to P , which suggests that metabolic con-figurations are closer to optima of F = L λ λ + L u u thanto optima of λ . In order to get a more precise idea of theimprovement obtained via inference, we have displayed inFig. 3 a detailed flux-by-flux comparison between inferredmeans (with errors) and experimental means, also showing forsakes of completeness the predictions obtained from standardbiomass-maximizing FBA.It is important to note that the Boltzmann learning dynamicsdoes not converge in general to a vector c such that theoreticalmeans match empirical ones perfectly as τ → ∞ . A possiblereason is that empirical means lie outside the feasible space F . Flux values are in fact tightly constrained by the mass bal-ance equations Sv = , and small numerical errors that mayarise e.g. from experimental noise can lead to violations ofthese conditions. While not surprising per se , this representsa further complication for the inference problem, which formetabolic networks relies on the definition of F , and furtherinvestigations are needed.To summarize, previous work has shown that maximum en-tropy distributions at fixed average growth rate in the space offeasible metabolic states reproduce empirical growth rate dis-tributions measured in exponentially growing populations atsingle-cell resolution [10] and outperform standard FBA (re-trieved in the limit β → ∞ ) in reproducing experimental dataon fluxes [11]. The approach discussed here generalizes theabove results by extending the input dataset to a subset ofmetabolic fluxes. Following standard inference schemes, wehave computed the function F ( v ) that best describes the fluxdataset as being generated from a Boltzmann-like distribution ∝ e F ( v ) /T . Quite remarkably, the biomass output and theglucose intake emerge from the inference process as the keyfluxes that govern the statistics of fluxes. The type of inversemodeling discussed here represents a novel and potentiallypowerful tool to analyze metabolic networks and characterizetheir large-scale organization in a way that is fully data-driven.‘Energy’ functions F obtained in this way may provide a newperspective on metabolic functions and objectives in complexsettings where biomass growth optimization alone does notsuffice to explain observations. However, the predictive ca-pacity of inverse methods is intrinsically limited by the qualityof available data. In addition, the scheme employed here has ahigh computational cost. In fact, as a Monte Carlo sampling ofthe feasible space F is required at every Boltzmann learning step, performing the same analysis on larger (genome-scale)networks, with a feasible space F whose dimension can bean order of magnitude larger than that addressed here, is verydifficult. This problem may however be effectively solved bythe use of approximate representations of the feasible spaceand/or of more efficient computational heuristics [20, 21]. Inthis sense, the present note represents merely a proof of con-cept and more work is needed to make these ideas viable atgenome resolution.We thank A. Braunstein, A. P. Muntoni and A. Pagnani fora critical revision of these ideas and for important commentsand suggestions. We acknowledge the support of the PeopleProgramme (Marie Curie Actions) of the European Union’sSeventh Framework Programme (FP7/2007-2013) under REAgrant agreement no. [291734] (D.D.M). [1] Bordbar A, et al. Constraint-based models predict metabolicand associated cellular functions. Nature Rev Genet (2014)15:107-120[2] Schuetz R, Kuepfer L, Sauer U. Systematic evaluation of objec-tive functions for predicting intracellular fluxes in Escherichiacoli. Molec Sys Biol (2007) 3: 119[3] Knorr AL, Jain R, Srivastava R. Bayesian-based selection ofmetabolic objective functions. Bioinformatics (2007) 23:351-357[4] Feist A and Palsson B. The biomass objective function. CurrOpin Microbiol (2010) 13:344-349[5] Schuetz R, et al. Multidimensional optimality of microbialmetabolism. Science (2012) 336:601[6] Hart Y, et al. Inferring biological tasks using Pareto analysis ofhigh-dimensional data. Nature Methods (2015) 12:233[7] Mori M, Marinari E, De Martino A. A yield-cost trade-off governs Escherichia coli’s decision between fermentationand respiration in carbon-limited growth. BiorXiv preprint,https://doi.org/10.1101/113183 (2017)[8] Wallden M, et al. The synchronization of replication and divi-sion cycles in individual E. coli cells. Cell (2016) 166:729-739[9] Kiviet, DJ, et al. Stochasticity of metabolism and growth at thesingle-cell level. Nature (2014) 514: 376-379[10] De Martino D, Capuani F and De Martino A. Growth againstentropy in bacterial metabolism: the phenotypic trade-off be-hind empirical growth rate distributions in E. coli. Phys Biol(2016) 13:036005[11] De Martino D, Andersson A, Bergmiller T, Guet C, TkacikG, Statistical mechanics for metabolic networks during steady-state growth. ArXiv preprint, https://arxiv.org/abs/1703.01818(2017) [12] Tarantola, A. Inverse problem theory and methods for modelparameter estimation (Society for Industrial and Applied Math-ematics, 2005)[13] Chau Nguyen H, Zecchina R, Berg J. Inverse statistical prob-lems: from the inverse Ising problem to data science. ArXivpreprint https://arxiv.org/abs/1702.01522 (2017)[14] Cocco S, et al. Inverse statistical physics of pro-tein sequences: a key issues review. ArXiv preprint,https://arxiv.org/abs/1703.01222 (2017)[15] Roudi Y, Aurell E, Hertz J. Statistical physics of pairwise prob-ability models. Front Comput Neurosci (2009) 3:22[16] Kaplinsky J and Arnaout R. Robust estimates of overallimmune-repertoire diversity from high-throughput measure-ments on samples. Nature Comm (2016) 7:11881[17] Zhang Z et al. CeCaFDB: a curated database for the documen-tation, visualization and comparative analysis of central carbonmetabolic flux distributions explored by13