Evaluation of Parallel Tempering to Accelerate Bayesian Parameter Estimation in Systems Biology
Sanjana Gupta, Liam Hainsworth, Justin S. Hogg, Robin E. C. Lee, James R. Faeder
Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA 15260, USA
Email: {sag134,robinlee,faeder}@pitt.edu
Abstract—Models of biological systems often have many unknown parameters that must be determined in order for model behavior to match experimental observations. Commonly-used methods for parameter estimation that return point estimates of the best-fit parameters are insufficient when models are high dimensional and under-constrained. As a result, Bayesian methods, which treat model parameters as random variables and attempt to estimate their probability distributions given data, have become popular in systems biology. Bayesian parameter estimation often relies on Markov Chain Monte Carlo (MCMC) methods to sample model parameter distributions, but the slow convergence of MCMC sampling can be a major bottleneck. One approach to improving performance is parallel tempering (PT), a physics-based method that uses swapping between multiple Markov chains run in parallel at different temperatures to accelerate sampling. The temperature of a Markov chain determines the probability of accepting an unfavorable move, so swapping with higher-temperature chains enables the sampling chain to escape from local minima. In this work we compared the MCMC performance of PT and the commonly-used Metropolis-Hastings (MH) algorithm on six biological models of varying complexity. We found that for simpler models PT accelerated convergence and sampling, and that for more complex models, PT often converged in cases where MH became trapped in non-optimal local minima. We also developed a freely-available MATLAB package for Bayesian parameter estimation called PTEMPEST (http://github.com/RuleWorld/ptempest), which is closely integrated with the popular BioNetGen software for rule-based modeling of biological systems.
Index Terms—Bayesian parameter estimation; systems biology; parallel tempering; rule-based modeling
I. INTRODUCTION
Mathematical and computational models have been gaining widespread use as tools to summarize our understanding of biological systems and to make novel predictions that can be tested experimentally [12], [21]. Doing this requires a model to be correctly parameterized. Parameter estimation, the process of inferring model parameters from experimental data, typically involves defining a cost function that quantifies the discrepancy between the model output and the data, and then performing a search for parameterizations that minimize the cost [2], [22].

There are many commonly-used methods for finding parameter sets that minimize the model cost. These can broadly be divided into gradient-based and gradient-free methods. Gradient-based methods are local optimization methods that iteratively use the gradient of the cost function to compute a search direction and step length, followed by updating the parameters and checking for convergence [2]. Popular gradient-based methods in systems biology include gradient descent, Newton's method, the Gauss-Newton algorithm, and the Levenberg-Marquardt algorithm [1]. However, these methods can fail to find the global minimum when landscapes are discontinuous or multimodal, as is frequently the case for large biological models, which can have many more parameters than independent data points to constrain the model [25].

Gradient-free methods have the advantage that the landscape need not be smooth, but local search methods, such as the Nelder-Mead simplex, become inefficient for high-dimensional problems [11]. Gradient-free global optimization methods such as genetic algorithms and particle swarm optimization can be effective at finding optimal solutions in high-dimensional spaces [19].
However, the combination of high-dimensional parameter spaces and the limited amount of data available from typical biological experiments often means that multiple parameter combinations equivalently describe the experimental data, which is referred to as the parameter identifiability problem [13]. When parameters are non-identifiable, a single parameter set is insufficient to describe the feasible space of parameters associated with a model.

Bayesian methods solve this problem naturally by attempting to estimate the probability distribution of the model parameters given the experimental data [22], which allows simultaneous determination of best-fit parameters and parameter sensitivities, while also providing a framework to introduce prior information that the modeler may have about the parameters. Bayesian methods include likelihood-based approaches, such as Markov Chain Monte Carlo (MCMC) methods [9], and likelihood-free approaches, such as Approximate Bayesian Computation (ABC) [22]. MCMC is commonly used in systems biology, but slow convergence is often a major bottleneck for standard sampling algorithms, such as Metropolis-Hastings (MH) [9]. The development of modular and rule-based software for model construction and simulation [16], [23], [36] allows for the construction of increasingly complex models (e.g., [7]), which combined with the increasing availability of single-cell data [35] motivates the need for accelerated methods for Bayesian parameter estimation. Parallel tempering (PT) is a physics-based MCMC method that efficiently samples a probability distribution and can accelerate convergence over conventional MCMC methods [8].
This method has been widely used for molecular dynamics simulations to sample the conformational space of biomolecules [15], [29], but is less common in systems biology [10], [24], [25]. Here, we describe key algorithmic elements of the method, provide a software implementation, and evaluate its performance on a series of biological models of increasing complexity.

The remainder of this paper is organized as follows: In Sec. II, we describe the MH and PT algorithms as well as the ABC and ABC-SMC methods used by the software ABC-SysBio [22], which we will later use for comparison. We also include a brief description of the PTEMPEST software for Bayesian parameter estimation. In Sec. III we present a series of examples of increasing complexity to test the performance of PT relative to MH with regards to quality of fit, convergence speed, and sampling efficiency. We include a comparison with ABC-SysBio and further show an application of using Bayesian methods with Laplace priors to achieve model reduction. Finally, in Sec. IV we discuss our main findings, limitations, and areas for future work.

II. METHODS
Bayesian parameter estimation methods infer the posterior distribution that describes the uncertainty in the parameter values that remains even after the data is known [22]. The probability of the parameter set θ given the data Y is given by Bayes' rule, p(θ|Y) ∝ p(Y|θ)p(θ), where p(Y|θ) is the conditional probability of Y given θ, which is described by a likelihood model, and p(θ) is the independent probability of θ, often referred to as the prior distribution on model parameters. This distribution represents our prior beliefs about the model parameters, and can be used to restrict parameters to a range of values or even to limit the number of nonzero parameters, as discussed further below.

A. MCMC Methods
MCMC methods for parameter estimation sample from the posterior distribution, p(θ|Y), by constructing a Markov chain with p(θ|Y) as its stationary distribution. The key required elements are:

• A likelihood model that gives p(Y|θ). Assuming the model is continuous (e.g., an ordinary differential equation (ODE) model) with Gaussian experimental measurement error, the likelihood function is given by

L = exp( −Σ_{s∈S} Σ_{t∈T} (Y_sim(s,t) − Y_expt(s,t))² / 2σ² ),

where S is a list of the observed species and T is a list of the time points at which observations are made. PTEMPEST allows other likelihood models, such as the built-in t-distribution [34], or any user-supplied function.

• Prior distributions on the parameters to be estimated. Uniform priors are a common choice when little is known about the parameters except for upper and lower limits. Priors can also be introduced to simplify a model by reducing some of its parameters to zero, a process called regularization. For example, Lasso regularization [27] penalizes the sum of absolute values of parameters (the L1 norm), and Ridge regression [37] penalizes the sum of the squared parameter values (the L2 norm).

• A proposal function to define the probability distribution for the next parameter set to sample given the current set. A common choice is a normal distribution centered at the current value with a user-specified variance, which determines the effective step size. PTEMPEST uses a single adaptive step size to determine the change in all parameters, but there are other MCMC implementations that permit different step sizes to govern changes in different directions in parameter space [9].

Following Metropolis et al. [26], we define the energy of a parameter set θ as

E(θ) = −log L(θ) − log p(θ),

where L and p are the likelihood and prior distribution functions defined above.
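Under these definitions, the energy of a parameter set is straightforward to compute for the Gaussian likelihood above. The following is a minimal Python sketch (PTEMPEST itself is MATLAB-based; the function name and interface here are illustrative only):

```python
import numpy as np

def gaussian_energy(y_sim, y_expt, sigma, log_prior=0.0):
    """E(theta) = -log L(theta) - log p(theta) for the Gaussian likelihood
    defined above, dropping the constant normalization term."""
    neg_log_like = np.sum((y_sim - y_expt) ** 2) / (2.0 * sigma ** 2)
    return neg_log_like - log_prior
```

With uniform priors, log p(θ) is constant inside the prior bounds, so the energy reduces to the negative log likelihood, which is the convention used in the convergence thresholds reported below.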
1) Metropolis-Hastings algorithm:
The Metropolis-Hastings (MH) algorithm is one of the most popular MCMC methods [6]. If we assume a symmetric proposal function, i.e., the probability of moving from a parameter set θ_i to θ_j equals that of moving from θ_j to θ_i, then the algorithm to sample from p(θ|Y) is as follows:

1) Select an initial parameter vector θ_0 that has energy E(θ_0) and set i = 1.
2) Repeat until i = N:
   a) Propose a new parameter vector θ_new and calculate E(θ_new).
   b) Set θ_i = θ_new with probability min(1, e^(−∆E)), where ∆E = E(θ_new) − E(θ_{i−1}) (acceptance). Otherwise, set θ_i = θ_{i−1} (rejection).
   c) Increment i by 1.
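The loop above condenses to a few lines of code. This is an illustrative Python sketch, not the PTEMPEST implementation, assuming a symmetric Gaussian random-walk proposal:

```python
import numpy as np

def metropolis_hastings(energy, theta0, n_steps, step_size, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.
    `energy(theta)` is the negative log posterior up to a constant."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    e = energy(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(theta.size)
        e_new = energy(proposal)
        # Accept with probability min(1, exp(-dE)); downhill moves always pass
        if rng.uniform() < np.exp(-(e_new - e)):
            theta, e = proposal, e_new
        chain[i] = theta
    return chain
```

For example, with energy(θ) = θ²/2 the chain samples a standard normal distribution, which is a convenient sanity check for any MH implementation.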
2) Parallel Tempering:
One of the key differences between MH and PT is the existence of a temperature parameter, β, that scales the effective "shallowness" of the energy landscape. Several Markov chains are constructed in parallel, each with a different β. A Markov chain with a β value of 1 samples the true energy landscape, while higher temperature chains have lower values of β and sample shallower landscapes, with the acceptance probability now given by min(1, e^(−β∆E)). Higher temperature chains accept unfavorable moves with a higher probability and therefore sample parameter space more broadly. Tempering refers to periodic attempts to swap configurations between high and low temperature chains. These moves allow the low temperature chain to escape from local minima and improve both convergence and sampling efficiency [8]. The PT algorithm is as follows:

1) For each of N_swap swap attempts (called "swaps" for short):
   a) For each of N_c chains (these can be run in parallel):
      i) Run N_MCMC MCMC steps.
      ii) Record the values of the parameters and energy on the final step.
   b) For each consecutive pair in the set of chains in decreasing order of temperature, accept swaps with probability min(1, e^(∆β∆E)), where ∆E = E_j − E_{j−1}, ∆β = β_j − β_{j−1}, and E_j and β_j are the energy and temperature parameter, respectively, of the j-th chain.

Adapting the step size and the temperature parameter can further increase the efficiency of sampling [8]. However, varying these parameters during the construction of the chain violates the assumption of a symmetric proposal function (also referred to as "detailed balance"), and it is advisable to do this during a "burn-in" phase prior to sampling.
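A toy Python version of this scheme follows, assuming β values supplied in decreasing order starting at β = 1 and serial (rather than parallel) execution of the chains; it is a sketch of the algorithm, not the PTEMPEST implementation:

```python
import numpy as np

def parallel_tempering(energy, theta0, betas, n_swaps, n_steps, step, seed=1):
    """Toy PT driver: each chain runs n_steps Metropolis updates at its own
    beta, then neighboring temperature pairs attempt a configuration swap
    with probability min(1, exp(d_beta * d_E))."""
    rng = np.random.default_rng(seed)
    thetas = [np.atleast_1d(np.array(theta0, dtype=float)) for _ in betas]
    energies = [energy(t) for t in thetas]
    for _ in range(n_swaps):
        for c, beta in enumerate(betas):  # chains are independent; serial here
            for _ in range(n_steps):
                prop = thetas[c] + step * rng.standard_normal(thetas[c].size)
                e_new = energy(prop)
                if rng.uniform() < np.exp(-beta * (e_new - energies[c])):
                    thetas[c], energies[c] = prop, e_new
        # Swap attempts between consecutive pairs, hottest (last) to coldest
        for j in range(len(betas) - 1, 0, -1):
            d_beta = betas[j] - betas[j - 1]
            d_e = energies[j] - energies[j - 1]
            if rng.uniform() < np.exp(d_beta * d_e):
                thetas[j], thetas[j - 1] = thetas[j - 1], thetas[j]
                energies[j], energies[j - 1] = energies[j - 1], energies[j]
    return thetas[0], energies[0]  # the beta = 1 chain samples p(theta|Y)
```

Note that a swap is automatically accepted when a hotter chain (smaller β) holds a lower-energy configuration than its colder neighbor, which is exactly the mechanism that lets the β = 1 chain escape local minima.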
3) Implementation:
In this work we present PTEMPEST, a MATLAB-based tool for parameter estimation using PT that is integrated with the rule-based modeling software BioNetGen [16]. Models specified in the BioNetGen language (BNGL) can be exported as ODE models that are called as MATLAB functions by PTEMPEST. The BioNetGen commands writeMfile or writeMexfile are used to export models in MATLAB's M-file format, which uses MATLAB's built-in integrators, or as a MATLAB MEX-file, which encodes the model in C and invokes the CVODE library [17]; the latter is usually much more efficient in our experience. For additional compatibility, models can be imported into BioNetGen in the Systems Biology Markup Language (SBML) [18], or the user can write their own cost function in MATLAB. The Bayesian parameter estimation capabilities of PTEMPEST complement those of another tool for performing parameter estimation on rule-based models, BioNetFit [30].

PTEMPEST uses adaptive step sizes and temperatures. The user provides the following hyper-parameters to control sampling: initial step size, initial temperature, and adaptation intervals and target acceptance probabilities for steps and swaps. At given intervals, the step acceptance probabilities and swap acceptance probabilities are calculated, and the step sizes and chain temperatures are adjusted to bring the step and swap acceptance probabilities closer to their respective target values. For example, if the step acceptance rates are too high, the step size will be increased, and vice versa. Similarly, if the swap acceptance rates are too high, the chain temperatures will be increased, and vice versa. Although there are a considerable number of hyper-parameters associated with this method, we have found that the default values provided in PTEMPEST generally work well in practice.

The MATLAB source code for PTEMPEST along with model and data files used in the experiments described below are available at http://github.com/RuleWorld/ptempest.
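The adaptation rule described above can be sketched as a simple multiplicative controller. The target acceptance rate of 0.23 and the gain below are common illustrative defaults, not the values used in PTEMPEST:

```python
def adapt_step_size(step, accept_rate, target=0.23, gain=1.1):
    """Multiplicative controller sketch: widen the proposal when moves are
    accepted more often than the target rate, shrink it when they are
    rejected too often. The same logic applies to chain temperatures."""
    return step * gain if accept_rate > target else step / gain
```

Because such adaptation changes the proposal distribution over time, it should be confined to the burn-in phase, as discussed in the previous subsection.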
B. Approximate Bayesian Computation methods

1) ABC rejection:
The simplest ABC algorithm is a rejection algorithm [32], which involves repeatedly sampling a parameter vector θ_i from the prior distribution, simulating the model with the sampled parameters, and calculating the discrepancy (often in the form of a distance function) between the simulated data Y_sim and the experimental data Y_expt. If the discrepancy is below a threshold, ε, θ_i is accepted as a member of the posterior distribution; otherwise, it is discarded and another θ_i is drawn. This process continues until the number of samples reaches a specified number, resulting in an approximation of the distribution p(θ | d(Y_sim, Y_expt) < ε), which in the limit ε → 0 approaches p(θ|Y_expt).
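A Python sketch of the rejection loop, with `simulate` and `sample_prior` as user-supplied callables (names illustrative):

```python
import numpy as np

def abc_rejection(simulate, sample_prior, y_expt, eps, n_samples, seed=2):
    """ABC rejection sketch: keep prior draws whose simulated output lies
    within Euclidean distance eps of the data."""
    rng = np.random.default_rng(seed)
    accepted = []
    while len(accepted) < n_samples:
        theta = sample_prior(rng)
        if np.linalg.norm(simulate(theta) - y_expt) < eps:
            accepted.append(theta)
    return np.array(accepted)
```

The loop makes the cost structure clear: every candidate draw requires one model integration whether or not it is accepted, which is why low acceptance rates translate directly into wasted computation.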
2) Approximate Bayesian Computation-Sequential Monte Carlo (ABC-SMC):
The ABC rejection algorithm can suffer from low acceptance rates [32]. The ABC-SMC algorithm uses a tolerance schedule to decrease ε, and sequentially constructs approximate posterior distributions of increasing accuracy, which eventually converge to the true posterior distribution [22], [32]. We use ABC-SMC to generate the results shown in Sec. III-B.
C. Metrics for algorithm performance comparisons
In our analyses we fit ODE models to synthetic data generated using fixed parameter values. For the comparison to ABC presented in Sec. III-B we used synthetic data with additional noise, as was provided in the ABC-SysBio example files.

For models containing 3–6 parameters, both the MH and PT algorithms find the global minimum, and we compared the performance using convergence time and sampling efficiency. The convergence time is defined as the number of MCMC steps before the energy drops below a specified threshold, determined empirically [4]. For PT, convergence time is based on the number of MCMC steps in the lowest temperature chain. With uniform priors and data simulated without noise, the negative log likelihood approaches zero when the chain converges to the global minimum.

The sampling efficiency is defined as the ratio of the range of the posterior distribution to the range of the prior distribution, either for a model parameter that is known to be uniformly distributed, or for an added control parameter that does not contribute to the model output and therefore should be uniformly distributed.

For more complex models (11–25 parameters), we do not always obtain parameter sets that fit the data. In this case we compare the algorithms in terms of the negative log likelihood of the best-fit parameter sets. In the case of uniform priors, this directly corresponds to the minimum energy attained by the Markov chain.

To compare disparate algorithms in terms of the total amount of computational resource used, we allowed each to perform a specified number of model integrations. For MH the number of model integrations is the number of MCMC steps, while for PT it is the number of MCMC steps times the number of chains run in parallel. For ABC algorithms, which use rejection sampling, the number of model integrations equals the total number of parameter sets evaluated to generate the desired number of samples.
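The sampling-efficiency metric reduces to a one-line ratio. A sketch for a single parameter with a uniform prior (function name illustrative):

```python
import numpy as np

def sampling_efficiency(posterior_samples, prior_lo, prior_hi):
    """Ratio of the sampled range of one parameter to its prior range;
    values near 1 indicate the chain explored the full prior support."""
    samples = np.asarray(posterior_samples, dtype=float)
    return (samples.max() - samples.min()) / (prior_hi - prior_lo)
```

Applied to a control parameter that should be uniform over its prior, a value well below 1 indicates that the chain has not yet mixed across the prior support.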
III. RESULTS
A. Michaelis-Menten Kinetics
We start with a simple model to demonstrate how Bayesian methods can identify constrained parameter relationships even when individual parameters are unidentifiable. The Michaelis-Menten model describes enzyme-substrate kinetics using the following scheme:

E + S ⇌ ES → E + P,

where the binding step has forward and reverse rate constants k_f and k_r and the catalytic step has rate constant k_cat. When the total enzyme concentration, [E]_T, is much smaller than that of the substrate, the rate of product formation is given by

d[P]/dt = k_cat [E]_T [S] / (K_M + [S]),

where K_M = (k_cat + k_r)/k_f is the Michaelis constant. The product trajectory only constrains k_cat and K_M, while the individual forward and backward rates k_f and k_r are unidentifiable. We generated a synthetic product trajectory using fixed values of k_f, k_r, and k_cat, and constructed a likelihood function assuming 1% Gaussian error. The 3 model parameters are sampled in log-space with uniform priors on k_f, k_r, and k_cat. The fit is repeated 100 times using MH, and PT with 4 chains, starting from the same initial parameter set. PT converges faster than MH (Figure 1C), consistent with the observation that M chains of length N can be more efficient than a single-chain Monte Carlo search of length MN. PT also has higher sampling efficiency for k_r and k_f compared to MH (Figure 1D).

As we would expect from the non-identifiability of k_f and k_r, the posterior distributions of log(k_r) and log(k_f) are uniform across the prior (Figure 1E), but their ratio is constrained (Figure 1F). log(k_cat) is an identifiable parameter and has a constrained distribution centered at its true value.

Fig. 1. Parameter estimation for the Michaelis-Menten model. (A) Distribution of minimum energy values obtained via MH and PT. (B) Example of a fitted ensemble (colored lines) obtained for the synthetic data (black lines with error bars) using PT. (C) Distribution of convergence times for PT vs. MH with an energy threshold of 1.
(D) Sampling efficiency for parameters k_r and k_f over 100 repeats using MH (light blue) and PT (dark blue). (E) Estimated posterior distributions for each of the model parameters. The x-axis limits are the uniform prior boundaries. (F) Scatter plots of sampled parameter sets for each pair of model parameters. Axis limits reflect prior boundaries.
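The non-identifiability of k_f and k_r can be verified directly from the quasi-steady-state rate law: any (k_f, k_r) pair yielding the same K_M produces an identical product trajectory. A short Python check (the parameter values here are illustrative, not those used in the fits):

```python
import numpy as np

def product_rate(s, e_total, kf, kr, kcat):
    """Quasi-steady-state Michaelis-Menten rate:
    d[P]/dt = kcat * [E]_T * [S] / (K_M + [S]), with K_M = (kcat + kr)/kf."""
    km = (kcat + kr) / kf
    return kcat * e_total * s / (km + s)

s = np.linspace(0.1, 10.0, 50)
# Two different (kf, kr) pairs chosen so that K_M = 1.0 in both cases
r1 = product_rate(s, 1.0, kf=1.0, kr=0.9, kcat=0.1)
r2 = product_rate(s, 1.0, kf=2.0, kr=1.9, kcat=0.1)
```

Since r1 and r2 coincide at every substrate concentration, no amount of product data can distinguish the two parameterizations, which is exactly the ridge visible in Figure 1F.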
B. mRNA self-regulation

In this section we compare the efficiency of ABC-SMC, PT, and MH for parameter estimation on a simple model of mRNA self-regulation (Figure 2A). The ABC-SysBio software is distributed with example files to estimate the parameters of this model assuming uniform priors using the ABC-SMC algorithm. The model has 5 parameters, one of which is fixed [22]. The quality of fit is defined as the Euclidean distance between the fitted trajectory and the data. For the ABC-SMC algorithm we extended the default 18-step tolerance schedule provided in ABC-SysBio, which runs from 50 to 15, to a 23-step schedule from 50 to 5, and set the ensemble size to 100. We ran ABC-SMC 50 times, and found that each run used an average of 6.7 × 10⁴ model integrations. We then ran 50 repeats of 4-chain PT for 16750 MCMC steps, and of MH for 6.7 × 10⁴ MCMC steps, using a likelihood function with 1% Gaussian error. The sampling efficiency
Fig. 2. Parameter estimation for the model of mRNA self-regulation. (A) Reaction network diagram of the mRNA self-regulation model from [22]. (B) Quality of fit of the final ensemble obtained from ABC-SysBio, PT, and MH. The box plots show the distribution of the Euclidean distances of the 100 members of each of 50 fitted ensembles from the synthetic data. Example of a typical fitted ensemble obtained from (C) PT and (D) ABC-SysBio. Black lines show the synthetic data, and the colored lines show the fitted trajectories. [mRNA] refers to the number of mRNA molecules. (E) Distribution of convergence times for PT vs. MH with an energy threshold of 20. PT takes on average ~2-fold fewer steps to reach convergence. (F) Comparison of sampling efficiency of MH vs. PT.

of PT and MH was compared using a control parameter as described above. The quality of fit produced by the MCMC-based algorithms is substantially higher than that obtained from ABC-SMC (Figure 2B-D). PT takes fewer steps to reach convergence (Figure 2E), and has higher sampling efficiency than MH given the same number of model integrations (Figure 2F).
C. Model reduction with Lasso
In this section we demonstrate the use of MCMC approaches to perform model reduction, by coupling parameter estimation with regularization. Lasso regularization penalizes the L1-norm of the parameter vector while minimizing the cost function during parameter estimation. This performs variable selection by finding the minimum number of non-zero parameters required to fit the data [31]. The Lasso penalty is equivalent to assuming a Laplace prior on the parameters [27], and the width of the prior is inversely related to the regularization parameter that governs the strength of the penalty. Here, we present an example of using the Bayesian Lasso for model reduction, and compare the use of PT and MH for this problem.
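In the energy formulation used here, the Laplace prior simply contributes an L1 penalty to E(θ). A Python sketch of the log-prior term (function name illustrative; b plays the role described in the text):

```python
import numpy as np

def laplace_log_prior(theta, b):
    """Log density of an i.i.d. Laplace prior of width b. The -|theta|/b
    term is the L1 (Lasso) penalty, so smaller b regularizes more
    strongly; -log(2b) is the per-parameter normalization."""
    theta = np.asarray(theta, dtype=float)
    return float(np.sum(-np.abs(theta) / b - np.log(2.0 * b)))
```

Adding the negative of this quantity to the energy E(θ) = −log L(θ) − log p(θ) makes parameter sets with many near-zero entries energetically favorable, which is what drives the variable selection demonstrated below.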
Fig. 3. Model reduction with Lasso. (A) Reaction network diagram of a toy negative feedback model. The core model used to obtain the synthetic data for the fit is in blue, and extraneous elements are in red. (B) Tuning the regularization parameter, i.e., the width of the Laplace prior, with respect to the negative log likelihood of the fitted ensembles. (C) Examples of fitted ensembles corresponding to different regularization strengths. Error bars show synthetic data. Solid lines show the simulated fits. (D) Posterior distributions with lasso (top row) show extraneous parameters peaking at 0. Red lines indicate true parameter values, and the blue lines show the Laplace prior.

A core model of negative feedback regulation with three processes (blue arrows in Figure 3A) was simulated to obtain a synthetic trajectory for species A using a value of 10 for all three rate constants (Figure 3C). Three extraneous processes were added to the model (red arrows in Figure 3A), so that only a subset of the reactions in the reaction scheme are required to fit the data. We constructed a likelihood function assuming 2% Gaussian error, and assumed Laplace priors of width b on each of the 6 model parameters, where b is the regularization parameter that needs to be tuned. High values of b, i.e., wide priors, will not impact the log likelihood but will not achieve much variable selection. Conversely, for low values of b most of the parameters will go to 0 at the cost of degrading the log likelihood.

Here, we tested a range of b values. For each we ran PT with 500,000 MCMC steps 50 times to obtain a distribution of negative log-likelihood values (Figure 3B). Figure 3C shows examples of fitted ensembles obtained with different regularization strengths. We chose the smallest value of b, 0.5, that does not significantly increase the negative log likelihood (Figure 3B), and used this for further analysis.

The posterior distributions for the model parameters obtained with regularization show the extraneous parameters peaking at 0, while the essential parameters have well-defined distributions that peak close to their true values (Figure 3D, top row). Without regularization the extraneous parameters (red boxes in Figure 3D) take on non-zero values and make the other parameters unidentifiable (Figure 3D, bottom row). PT converges faster than MH (Figure 3E), but the sampling efficiencies calculated over 200,000 MCMC steps are comparable (Figure 3F).

Fig. 4. Parameter estimation for the model of calcium signaling. (A) Examples of convergence to a local minimum (top) and to the global minimum (bottom). Error bars show synthetic data. Solid lines show the simulated fits. The right column shows the energy chains corresponding to the fits on the left. (B) Distributions of the minimum energy from MH, PT-4, and PT-6 over 100 repeats.

D. Calcium signaling
The models considered so far have a relatively small number of parameters, and both MH and PT achieve convergence readily. For models with more parameters and more complex dynamics, convergence becomes difficult to achieve. As an example, we consider a four-species model of calcium oscillations that has 12 free parameters [20]. The model describes the dynamics of Gα subunits of the G-protein, active PLC, free cytosolic calcium, and calcium in the endoplasmic reticulum. We generated synthetic data for free cytosolic calcium (Figure 4A), and constructed a likelihood function assuming 20% Gaussian error. The 12 free parameters were sampled in log-space with uniform priors, 6 units wide and centered at the true values. We generated 100 random initial parameter sets, and from each starting point sampled using MH, PT with 4 chains (PT-4), and PT with 6 chains (PT-6). Only a fraction of the chains converged to the global minimum in 500,000 MCMC steps. Figure 4A shows an example of an MH chain that has converged to a local minimum with high energy, and another of a PT chain that has converged to the global minimum. The distributions of minimum energy for chains obtained from each algorithm (Figure 4B) show that PT-6 found better fits than PT-4, which in turn did better than MH, which returned highly variable results and frequently did not reach the global minimum.

E. Negative feedback oscillator
We also considered the three-species negative feedback oscillator from Tyson et al. [33] to evaluate the more difficult case of fitting a model to complex dynamics of multiple species. We generated synthetic data for all three model species under conditions where all three species undergo sustained oscillations. 11 model parameters are sampled in log-space, with uniform priors that are 10 units wide and centered at the true values. The likelihood function is a t-distribution with 10% error. We generated 15 random initial parameter sets, and from each starting set ran MH, PT with 4 chains (PT-4), and PT with 6 chains (PT-6) for 500,000 MCMC steps.

Figure 5A shows examples of chains converging to different minima. The top row shows an example of convergence to a high energy. As in the case of calcium signaling, PT with 6 chains outperforms PT with 4 chains, which in turn outperforms the MH algorithm in finding the global minimum (Figure 5B). Interestingly, the data that we generated did not sufficiently constrain the frequency of oscillations exhibited by the model, and we find parameter sets corresponding to different frequencies that all fit the data. Figure 5C shows the posterior distributions of the 11 model parameters corresponding to the fit shown in the middle panel of Figure 5A, obtained using PT with 6 chains. The first parameter shows 3 clear peaks, one of which is centered at the true value. Separating the parameter sets corresponding to these peaks shows that they correspond to specific differences in oscillation frequencies that are all part of the fitted ensemble (Figure 5D), reinforcing the need to use Bayesian methods with such problems.
F. Growth factor signaling model
Finally, we apply MH and PT to a substantially larger model that has 24 parameters: a rule-based model of Shp2 regulation in growth factor signaling [3] that generates 149 species and 1032 reactions. We generated synthetic data for the micromolar concentration of phosphorylated receptors (pYR), an observable that combines the time courses of 136 model species, and constructed a likelihood function assuming 2% Gaussian error. The parameters were sampled in log-space with uniform priors. As with the previous two models, PT reached lower minimum energies than MH (Figure 6).

IV. DISCUSSION
In this study we have shown that even with relatively simple biochemical models, there are significant benefits to using PT over MH in terms of convergence speed and sampling efficiency. With more complex models we found that given a fixed budget of MCMC steps MH often fails to find the global minimum, whereas PT consistently
Fig. 5. Parameter estimation for the negative feedback oscillator. (A) Examples of convergence to different minima. Error bars show synthetic data. Solid lines show the simulated fits. The three colors correspond to the three different model species. The right column shows energy chains corresponding to the fits shown on the left. (B) Distributions of the minimum energy by MH, PT-4, and PT-6 over 15 repeats. (C) Posterior distributions corresponding to the middle panel in (A). (D) Simulated fits corresponding to each of the three peaks in the posterior distribution of the first parameter shown in (C).

succeeds. We also showed an example in which Bayesian parameter estimation can effectively perform model reduction through the introduction of a regularizing prior. While ABC methods constitute a popular class of alternative methods for Bayesian parameter estimation in cases where likelihood models are expensive or not available (such as for stochastic models), we found that PT outperformed ABC for parameter sampling on a relatively simple ODE model. Our direct performance comparison supports the previous observation [22] that likelihood-based methods are preferable to likelihood-free methods when likelihood models are feasible to compute, such as with ODE models.

One limitation of our evaluation procedure is that we
have not attempted to compare wall clock times for the different algorithms. Instead, as a performance metric we have used the number of MCMC steps or the number of model integrations required, which are independent of the implementation. In practice, the fits reported in Sec. III can all be performed on a typical workstation computer in times ranging from a few minutes for the smallest model (Michaelis-Menten) to a few hours for the largest model (growth factor signaling). However, we found that despite requiring the same number of model integrations per processor per step, the single-chain MH sampling ran significantly faster per step (20–30%) in terms of wall clock time than PT with four or six chains. Preliminary tests showed that these differences likely arise from the requirement in our current implementation for each chain to complete a fixed number of steps before a swap is attempted. Parallel efficiency decreases when trajectories on different processors take different amounts of time to complete. We found that the difference in wall clock time decreased when the PT chains are all run at the same temperature, but so do the algorithmic benefits. When chains are run at different temperatures, the high temperature chains tend to sample parameter space more broadly, which results in greater variability in the model integration time [5] and causes slowdown due to the synchronization requirement.

Fig. 6. Parameter estimation for the growth factor signaling model. (A) Examples of convergence to different minima. Error bars show synthetic data. Solid lines show the simulated fits. The right column shows energy chains corresponding to the fits shown on the left. (B) Distributions of the minimum energy obtained by MH and PT with 4 chains over 25 repeats.
We plan to investigate asynchronous swapping between chains in order to alleviate this problem.

Another limitation of the current work is that the comparisons were made using specific choices for the hyperparameters that control the PT algorithm, such as those that control step sizes for the moves and temperatures. Adjustment of these may result in further improvements to sampling efficiency and convergence rates. We would also like to investigate the effect of using different proposal functions, such as Hessian-guided MCMC [9], as well as different likelihood models.

While we have restricted our MCMC comparisons to MH, there has been considerable work toward improving the efficiency of MCMC methods, such as Differential Evolution Adaptive Metropolis (DREAM) [28] and Delayed Rejection Adaptive Metropolis (DRAM) [14]. It would be interesting to investigate whether parallel tempering could be fruitfully combined with these approaches.

Finally, PT, as it has been presented and used to this point both here and in the molecular simulation literature, is only a moderately parallel algorithm because it uses just a handful of chains. It remains to be seen whether using a much larger number of chains would retain the advantages of sampling simultaneously at multiple temperatures and result in further acceleration.

ACKNOWLEDGMENT
The authors would like to thank Cihan Kaya for technical assistance and helpful discussions. This work was funded by NIH grant R35-GM119462 to RECL, and by JRF via the NIGMS-funded (P41-GM103712) National Center for Multiscale Modeling of Biological Systems (MMBioS).
REFERENCES

[1] T. S. Ahearn, R. T. Staff, T. W. Redpath, and S. I. K. Semple. The use of the Levenberg–Marquardt curve-fitting algorithm in pharmacokinetic modelling of DCE-MRI data. Physics in Medicine and Biology, 50:N85–N92, 2005.
[2] M. Ashyraliyev, Y. Fomekong-Nanfack, J. A. Kaandorp, and J. G. Blom. Systems biology: Parameter estimation for biochemical models. FEBS Journal, 276:886–902, 2009.
[3] D. Barua, J. R. Faeder, and J. M. Haugh. Structure-based kinetic models of modular signaling protein function: focus on Shp2. Biophysical Journal, 92:2290–300, 2007.
[4] N. Bhatnagar, A. Bogdanov, and E. Mossel. The computational complexity of estimating MCMC convergence time. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, Lecture Notes in Computer Science, vol. 6845. Springer, Berlin, Heidelberg, 2011.
[5] K. S. Brown and J. P. Sethna. Statistical mechanical approaches to models with many poorly known parameters. Phys. Rev. E, 68:021904, 2003.
[6] S. Chib and E. Greenberg. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335, 1995.
[7] L. A. Chylek, V. Akimov, J. Dengjel, K. T. G. Rigbolt, B. Hu, W. S. Hlavacek, and B. Blagoev. Phosphorylation site dynamics of early T-cell receptor signaling. PLoS ONE, 9(8):e104240, 2014.
[8] D. J. Earl and M. W. Deem. Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910, 2005.
[9] H. Eydgahi, W. W. Chen, J. L. Muhlich, D. Vitkup, J. N. Tsitsiklis, and P. K. Sorger. Properties of cell death models calibrated and compared using Bayesian approaches. Molecular Systems Biology, 9(1):644, 2014.
[10] D. Fernández Slezak, C. Suárez, G. A. Cecchi, G. Marshall, and G. Stolovitzky. When the optimal is not the best: Parameter estimation in complex biological models. PLoS ONE, 5(10), 2010.
[11] F. Gao and L. Han. Implementing the Nelder-Mead simplex algorithm with adaptive parameters. Computational Optimization and Applications, 51(1), 2012.
[12] B. Goldstein, J. R. Faeder, and W. S. Hlavacek. Mathematical and computational models of immune-receptor signalling. Nature Reviews Immunology, 4:445–456, 2004.
[13] R. N. Gutenkunst, J. J. Waterfall, F. P. Casey, K. S. Brown, C. R. Myers, and J. P. Sethna. Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3(10):1871–1878, 2007.
[14] H. Haario, M. Laine, A. Mira, and E. Saksman. DRAM: Efficient adaptive MCMC. Statistics and Computing, 16:339–354, 2006.
[15] U. H. Hansmann. Parallel tempering algorithm for conformational studies of biological molecules. Chemical Physics Letters, 281:140–150, 1997.
[16] L. A. Harris, J. S. Hogg, J. J. Tapia, J. A. Sekar, S. Gupta, I. Korsunsky, A. Arora, D. Barua, R. P. Sheehan, and J. R. Faeder. BioNetGen 2.2: Advances in rule-based modeling. Bioinformatics, 32(21):3366–3368, 2016.
[17] A. C. Hindmarsh, P. N. Brown, K. E. Grant, S. L. Lee, R. Serban, D. E. Shumaker, and C. S. Woodward. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software, 31(3):363–396, 2005.
[18] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, et al. The systems biology markup language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, 2003.
[19] S. Iadevaia, Y. Lu, F. C. Morales, G. B. Mills, and P. T. Ram. Identification of optimal drug combinations targeting cellular networks: Integrating phospho-proteomics and computational network analysis. Cancer Research, 70(17):6704–6714, 2010.
[20] U. Kummer, L. F. Olsen, C. J. Dixon, A. K. Green, E. Bornberg-Bauer, and G. Baier. Switching from simple to complex oscillations in calcium signaling. Biophysical Journal, 79(3):1188–1195, 2000.
[21] R. E. C. Lee, S. R. Walker, K. Savery, D. A. Frank, and S. Gaudet. Fold change of nuclear NF-κB determines TNF-induced transcription in single cells. Molecular Cell, 53(6):867–879, 2014.
[22] J. Liepe, P. Kirk, S. Filippi, T. Toni, C. P. Barnes, and M. P. H. Stumpf. A framework for parameter estimation and model selection from experimental data in systems biology using approximate Bayesian computation. Nature Protocols, 9(2):439–456, 2014.
[23] C. F. Lopez, J. L. Muhlich, J. A. Bachman, and P. K. Sorger. Programming biological models in Python using PySB. Molecular Systems Biology, 9(1):1–19, 2013.
[24] S. Lukens, J. DePasse, R. Rosenfeld, E. Ghedin, E. Mochan, S. T. Brown, J. Grefenstette, D. S. Burke, D. Swigon, and G. Clermont. A large-scale immuno-epidemiological simulation of influenza A epidemics. BMC Public Health, 14(1):1019, 2014.
[25] A. D. Malkin, R. P. Sheehan, S. Mathew, W. J. Federspiel, H. Redl, and G. Clermont. A neutrophil phenotype model for extracorporeal treatment of sepsis. PLoS Computational Biology, 11(10):1–30, 2015.
[26] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[27] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[28] E. M. Shockley, J. A. Vrugt, and C. F. Lopez. PyDREAM: High-dimensional parameter inference for biological models in Python. Bioinformatics, doi:10.1093/bioinformatics/btx626, 2017.
[29] Y. Sugita and Y. Okamoto. Replica-exchange molecular dynamics method for protein folding. Chemical Physics Letters, 314:141–151, 1999.
[30] B. R. Thomas, L. A. Chylek, J. Colvin, S. Sirimulla, A. H. Clayton, W. S. Hlavacek, and R. G. Posner. BioNetFit: A fitting tool compatible with BioNetGen, NFsim and distributed computing environments. Bioinformatics, 32(5):798–800, 2015.
[31] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.
[32] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6:187–202, 2009.
[33] J. J. Tyson, K. C. Chen, and B. Novak. Sniffers, buzzers, toggles and blinkers: Dynamics of regulatory and signaling pathways in the cell. Current Opinion in Cell Biology, 15(2):221–231, 2003.
[34] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2005.
[35] J. Yao, A. Pilko, and R. Wollman. Distinct cellular states determine calcium signaling response. Molecular Systems Biology, 12(894):1–12, 2016.
[36] F. Zhang, B. R. Angermann, and M. Meier-Schellersheim. The Simmune Modeler visual interface for creating signaling networks based on bi-molecular interactions. Bioinformatics, 29(9):1229–30, 2013.
[37] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, 67:301–320, 2005.