[PDF] Approximating nonequilibrium processes using a collection of surrogate diffusion models

Abstract

The surrogate process approximation (SPA) is applied to model the nonequilibrium dynamics of a reaction coordinate (RC) associated with the unfolding and refolding processes of a deca-alanine peptide at 300 K. The RC dynamics, which correspond to the evolution of the end-to-end distance of the polypeptide, are produced by steered molecular dynamics (SMD) simulations and approximated using overdamped diffusion models. We show that the collection of (estimated) SPA models contain structural information "orthogonal" to the RC monitored in this study. Functional data analysis ideas are used to correlate functions associated with the fitted SPA models with the work done on the system in SMD simulations. It is demonstrated that the shape of the nonequilibrium work distributions for the unfolding and refolding processes of deca-alanine can be predicted with functional data analysis ideas using a relatively small number of simulated SMD paths for calibrating the SPA diffusion models.

Full PDF

aa r X i v : . [ c ond - m a t . s t a t - m ec h ] J a n Approximating nonequilibrium processes using a collection of surrogate diﬀusionmodels

Christopher P. Calderon

Department of Statistics and Department of Computational and Applied Mathematics,Rice University, Houston, TX 77005-1892, USA.

Riccardo Chelli

Dipartimento di Chimica, Universit`a di Firenze, Via della Lastruccia 3,I-50019 Sesto Fiorentino, Italy, and European Laboratory for Non-linear Spectroscopy (LENS),Via Nello Carrara 1, I-50019 Sesto Fiorentino, Italy. (Dated: October 25, 2018)The surrogate process approximation (SPA) is applied to model the nonequilibrium dynamics ofa reaction coordinate (RC) associated with the unfolding and refolding processes of a deca-alaninepeptide at 300 K. The RC dynamics, which correspond to the evolution of the end-to-end distance ofthe polypeptide, are produced by steered molecular dynamics (SMD) simulations and approximatedusing overdamped diﬀusion models. We show that the collection of (estimated) SPA models containstructural information “orthogonal” to the RC monitored in this study. Functional data analysisideas are used to correlate functions associated with the ﬁtted SPA models with the work done onthe system in SMD simulations. It is demonstrated that the shape of the nonequilibrium work dis-tributions for the unfolding and refolding processes of deca-alanine can be predicted with functionaldata analysis ideas using a relatively small number of simulated SMD paths for calibrating the SPAdiﬀusion models.

I. INTRODUCTION

Single-molecule experiments [1] oﬀer the possibility of exploring the dynamical responses in systems without havingto resort to ensemble measurements [2, 3, 4, 5, 6]. In recent years these types of investigations have received increas-ing attention, due partially to developments in nonequilibrium statistical mechanics [7, 8]. In the context of proteinfolding, vastly diﬀerent dynamical trajectories can be observed when monitoring the nonequilibrium time evolution ofa reaction coordinate (RC) obtained from independent single-molecule measurements carried out at identical experi-mental conditions. The diﬀerences in dynamical response cannot usually be attributed only to classical thermal noise,but also to the diﬀerent atomistic “collective” coordinates [9] (e.g., radius of gyration) consistent with the conditionsof the system at the initial time of the various nonequilibrium experiments. In many cases, these types of collectivevariables strongly inﬂuence the nature of observed dynamical response. This type of phenomenon has been quantiﬁedby modeling with “multiple-states” [10, 11, 12] or using the so-called “dynamic disorder” description [13]. Often, bothdescriptions can reasonably be thought of as artifacts resulting from an imperfect RC [12, 14].One concrete example that exhibits the concepts expressed above is the extensively studied [5, 6] process of unfold-ing and refolding of deca-alanine (a deca-peptide consisting of alanine residues). In Fig. 1 we report some snapshots ofdeca-alanine taken from two forced refolding realizations along the end-to-end distance (the RC) produced by steeredmolecular dynamics (SMD) simulations [6]. By comparing the time evolution of the RC in the two computationalexperiments (see panels in the ﬁgure), one would be inclined to classify the two nonequilibrium trajectories as (statis-tically) identical, or at least very similar. However, when the snapshots of the entire molecule are observed, it is easyto distinguish between the dynamical responses (“proper folding” and “misfolding”). Another instance of this typeof phenomenon is documented in Ref. 15. This situation is relevant to a variety of diﬀerent systems because, whenall-atom descriptions are used, it is usually diﬃcult to select RCs which accurately reﬂect all of the relevant detailsof the system [5, 16, 17]. Even if good RCs are known and easily manipulatable, a priori quantitative knowledge ofhow the system responds to external stimuli that vary these RCs is usually lacking.This paper aims at providing this type of information by applying pathwise time series techniques [18, 19] to theoutput of nonequilibrium single-molecule computer simulations. The numerical procedures presented here also haveapplicability to wet-lab times series data [2, 3, 20]. Nonequilibrium simulations are attractive because they sometimesallow a researcher to obtain detailed equilibrium information about portions of phase space that, if the system wereunforced, would not be explored in the typical simulation timescale [5, 16, 17, 21]. The system studied throughoutis a SMD simulation of deca-alanine in vacuo maintained at constant temperature by a Nos´e-Hoover thermostat [22].We systematically manipulate the end-to-end distance RC of the deca-alanine peptide, referring to it as the “steered”RC [5].The purpose of this study is two-fold. The ﬁrst goal is to present a computational approach which attempts tolearn about conformational dynamics not directly contained in a single trajectory of the steered RC (but is indirectlycontained in a collection of trajectories). Several conﬁgurational degrees of freedom, such as dihedral angles, may beindeed important in determining the dynamics of the studied system. These degrees of freedom can be thought ofas “other” RCs and are not directly included in the dynamical models calibrated from the observed data. However,these RCs can be relatively slowly evolving (compared to the timescale of the simulation) and may therefore eﬀectivelymodulate the dynamics of the steered RC. Because of these other RCs, independent trajectories may be characterizedby signiﬁcantly diﬀerent dynamics yielding a collection of diﬀusion models [43] . In this study, ideas from functionaldata analysis (FDA) [23] are used in our models to include this important source of “dynamical heterogeneity”.In particular we develop a procedure, that we call “FDA bootstrapping” (which is a speciﬁc instance of “type IIbootstrapping” discussed in Ref. 19). This bootstrapping is able to correlate the work done on the system during anunfolding or refolding realization to appropriate functions associated with a surrogate process approximation (SPA)model [19]. This type of treatment is attractive because it does not require the identiﬁcation or manipulation of theslowly evolving collective coordinates (this is particularly relevant to wet-lab studies).The second aim of the paper is to show how the proposed approach (coupled with ideas from Ref. 19) can be appliedto recover the non-Gaussian work distributions associated with nonequilibrium SMD folding/unfolding of deca-alanine.This may possibly assist experiments or simulations where path samples are so expensive or time consuming thatobtaining an adequate statistical sample is diﬃcult. The statistical validity of approximating deterministic (chaotic)dynamics using diﬀusion approximations has also been addressed here given its relevancy to a variety of atomisticsystems [22, 24, 25]. However, in order to keep the presentation streamlined, we have relegated these results to theSupplementary Material [44].The outline of the article follows. In Section II we review the basic concepts underlying the SPA model and theFDA methodology and provide the computational details regarding the SMD simulations of deca-alanine. In SectionIII we report the results regarding the determination of the work distributions using diﬀusion models. In this sectionwe also report data and discuss issues regarding the connection between the dynamical information from SPA modelsand other possible RCs “orthogonal” [45] to the steered one. A summary and some potential future directions ofresearch are ﬁnally discussed in Section IV.

II. THEORY AND METHODSA. SMD simulations of deca-alanine

As stated in the Introduction, the dynamical system considered in the present study is the process of unfolding(and refolding) of one deca-alanine peptide at ﬁnite temperature. As in a previous work [6], SMD simulations wereused as a device for producing the forced nonequilibrium dynamics of the steered RC, Z , which is taken to be theend-to-end distance of deca-alanine. More speciﬁcally, Z corresponds to the distance between the N atom of theN-terminus amino-acid (constrained to a ﬁxed position) and the N atom of the C-terminus amino-acid (constrainedto move along a ﬁxed direction). The values of Z in the completely folded and unfolded states of deca-alanine areassumed to be 15.5 and 31.5 ˚A , respectively. In the former state, deca-alanine is found in a α -helix form, while anelongated “random coil” [5] conﬁguration characterizes the latter state. We have arbitrarily assumed the unfoldingand refolding processes as forward ( F ) and reverse ( R ), respectively.It should be noted that in general the end-to-end distance does not determine uniquely the conﬁgurational state ofpolypeptides. However, in the speciﬁc case of deca-alanine, the equilibrium distribution at Z = 15 . α -helix form, as for this end-to-end distance alternativestructures are highly unlikely under the associated stationary distribution [46]. The same holds true for the statecorresponding to Z = 31 . α -helix (and elongated random coil) to be used as initial microstates of the SMD simulations canreadily be obtained by sampling from the stationary distribution associated with a constrained RC molecular dynamicssimulation. However, as discussed in the Introduction, subtle conﬁgurational diﬀerences in these initial microstatesmay yield vastly diﬀerent dynamical behaviors when the system is driven out of equilibrium. The initial microstatesof deca-alanine for the F and R realizations were randomly drawn from two “equilibrated” biased molecular dynamicssimulations constraining Z to 15.5 and 31.5 ˚A , respectively, by means of a stiﬀ harmonic potential (force constant k pull equal to 800 kcal mol − ˚A − ). In both equilibrium simulations and in the subsequent SMD simulations, constanttemperature (300 K) has been enforced using a Nos´e-Hoover thermostat[22]. Force ﬁeld parameters have been takenfrom Ref. 26. For each type of process, F and R , we have generated 10 realizations guiding Z from 15.5 to 31.5 ˚A ( F paths) or from 31.5 to 15.5 ˚A ( R paths) using a time-dependent harmonic potential with the force constant reportedabove. The parameter driving the reaction coordinate, that we denote with λ ( t ), evolves in time with an externallydetermined constant rate of 0.032 ˚A ps − (the same rate is used for both F and R processes). The harmonic guidingpotential V ( t ) is the following: V ( t ) = 0 . k pull [ Z ( t ) − λ ( t )] . We stress that λ ( t ) is a deterministic function commonto all SMD simulations and Z ( t ) represents a stochastic process. Moreover, we note that, in order to generate thelarge amount of temporal data needed to reliably calibrate diﬀusion models and still have the time spacing betweenobservations large relative to the faster timescales [27, 28, 29, 30, 31], we have performed longer SMD simulationsthan those reported in Ref. 6.The ORAC program was used in all simulations[32]. It uses a multiple timestep technique, r-RESPA, to integratethe equations of motion. In particular forces between atoms whose distance is greater than 12 ˚A have been updatedevery 5 fs; forces between atoms whose distance falls in the range 6-12 ˚A have been updated every ∼ ∼ ∼ Z ( t ), have been stored every 50 fs. B. Surrogate Process Approximation and Bootstrapping

If we assume that only a relatively small number of paths or realizations are available from an experi-ment/calculation, it is natural to try to use the rich amount of dynamical data contained in the paths in orderto calibrate a surrogate model and then turn to this model to augment the limited data set for improving the qualityof the sample. This is the basic idea behind the SPA method used in the present work and introduced in Ref. 19.We attempt to ﬁt time series data coming from constant velocity SMD simulations of deca-alanine to time dependentoverdamped diﬀusion models using locally parametric maximum likelihood methods. The underlying complex systemused to generate the time series is a system of deterministic chaotic ODEs, as opposed to a Langevin type processwhich was originally used in Ref. 19.Speciﬁcally, a single RC trajectory obtained from a SMD simulation is used to calibrate a collection of local diﬀusionmodels, each model being characterized by a diﬀusion and a drift coeﬃcient [19]. For the transition density expansions,we utilize the Simulated Maximum Likelihood [33] estimator. Local overdamped diﬀusion models are estimated atthe “center points” { , , . . . , } ˚A of the RC (for a total of 18 local models) using a neighborhood radius of 0.6˚A for each local model [19]. The diﬀusion and the drift coeﬃcients of the local diﬀusion models are then stitchedtogether to form two continuos functions, i.e., the diﬀusion and the drift functions representing the global diﬀusionmodel of the RC trajectory. These functions are obtained using a piecewise polynomial smoothing routine availablefrom Matlab’s spline toolbox (csaps, with smoothing parameter equal to 0.4). In Fig. 2 we show the drift anddiﬀusion functions calibrated using 15 forward and reverse SMD trajectories of deca-alanine. It should be noted thatthe diﬀusion model ﬁt to the data is identical to that used in Ref. 19. Section II of the aforementioned referenceprovides a more comprehensive discussion on the diﬀusion models.The resulting global diﬀusion model can then be used to generate a collection of new paths taking the same initialcondition as the SMD trajectories, but using a batch of random number sequences. The diﬀerent random numbersequences used to create Brownian paths are intended to incorporate variability that can be attributed to physicallyunimportant fast degrees of freedom, such as vibrational degrees of freedom associated with covalent bonds, presentin the full atomistic system. The term “type I bootstrapping” is used here to describe the generation of new pathsfrom the procedure described above.There is interest in understanding the consequences of approximating highly structured chaotic “noise” with a fairlycrude diﬀusion model [29, 30, 31] when the data are sampled frequently in time (see the Supplementary Material for anextensive discussion along these lines). In principle, any stochastic process could be ﬁt to the time series data and serveas the surrogate, e.g., underdamped diﬀusions [28], non-Markovian processes [27, 34], etc. However, the motivationfor focusing on estimation of overdamped models stems from two facts. (1) In macromolecules large collective motionstypically [9] cause an eﬀectively overdamped system even when the traditional hydrodynamic “damping” (or viscosity)is very low [35]. (2) Typically, in wet-lab experiments, only simple position-like coordinates are available, which forcethe researcher into a framework where overly simpliﬁed dynamical models are used because the information to calibratemore sophisticated models is simply not accessible (or perhaps deemed not worth pursuing) [47].Characterizing the errors introduced by ignoring degrees of freedom in a low-dimensional RC calculated fromobservations of a detailed atomistic system is a diﬃcult and open problem [27, 29, 34]. The multiscale nature ofthe signals that come from atomistic systems makes the legitimacy of using a single low dimensional diﬀusion modelquestionable. The approach described in the next section aims at remedying this problem by enhancing the samplingof models that a batch of fully atomistic paths sometimes yields. A physics-based description of the presence of acollection of models would assume that the eﬀective free energy [30] is a function of two or more RCs, but we onlyobserve and model the dynamics of one. Unlike the RC explicitly considered in the model, the “other” RCs are notcontrolled either at the beginning of the nonequilibrium experiment or during it. This implies that each RC path maygive rise to a dynamical model which depends heavily on unobserved RCs. Therefore, many RC paths may result ina collection of models, whose “heterogeneous dynamical response” can be attributed to the unconstrained other RCsin the nonequilibrium experiment. In this paper, we present a technique for using the information contained in thisdistribution of models to make reﬁned predictions of the work distribution from a small data set. We also demonstratethat this distribution of models can provide indirect information about other important RCs. This is made possibleby FDA ideas. C. Using FDA Ideas to Bootstrap Work Paths

The basic goal of FDA [23, 36] is to characterize the distribution of a family of curves. Instead of statisticallyanalyzing ﬁnite dimensional random vectors (e.g., Euclidean vectors), one focuses instead on the distribution ofobjects living in an inﬁnite dimensional space. In the present case we deal with the space of continuous functions thatdescribe the dynamics of the various SMD trajectories. One may associate three functions to each SMD trajectory:the work function corresponding to the time sequence of the external work performed on the system during thesimulation and the two SPA coeﬃcient functions (see Fig. 2). Here we assume that the observed work functions areindependent and identically distributed (iid).Unfortunately one usually does not have the luxury of continuous time observation in many applications [18, 23].However, when the underlying objects are believed to be relatively smooth functions (as it seems to be the case ofthe deca-alanine system) it makes sense to discretely sample the external work at a moderate sampling frequency.Then one can attempt dimension reduction by using principal components analysis (PCA) tools [23]. To determinethe correlation between the SPA model functions and the observed work functions, one could create a raw observationmatrix of the form:  W (1) ( τ ) W (2) ( τ ) · · · W ( N ) ( τ ) W (1) ( τ ) W (2) ( τ ) · · · W ( N ) ( τ )... W (1) ( τ D ) W (2) ( τ D ) · · · W ( N ) ( τ D )˜ A (1) ( Z ( τ )) ˜ A (2) ( Z ( τ )) · · · ˜ A ( N ) ( Z ( τ ))˜ A (1) ( Z ( τ )) ˜ A (2) ( Z ( τ )) · · · ˜ A ( N ) ( Z ( τ ))...˜ A (1) ( Z ( τ D )) ˜ A (2) ( Z ( τ D )) · · · ˜ A ( N ) ( Z ( τ D ))˜ C (1) ( Z ( τ )) ˜ C (2) ( Z ( τ )) · · · ˜ C ( N ) ( Z ( τ ))˜ C (1) ( Z ( τ )) ˜ C (2) ( Z ( τ )) · · · ˜ C ( N ) ( Z ( τ ))...˜ C (1) ( Z ( τ D )) ˜ C (2) ( Z ( τ D )) · · · ˜ C ( N ) ( Z ( τ D ))  ≡  W ˜A˜C  , where W ( i ) ( τ j ), ˜ A ( i ) ( Z ( τ j )) and ˜ C ( i ) ( Z ( τ j )) are the work, drift and diﬀusion functions at time τ j correspondingto the i th SMD trajectory. The N matrix columns correspond to the function “samples” observed discretely at thetimes τ , τ , . . . , τ D [48]. After projecting these functions onto a basis, the weights are then “stacked” upon eachother (this simply means that the weights of the diﬀerent projections are concatenated) to form a new matrix withblocks W ′ , ˜A ′ and ˜C ′ [49]. The speciﬁc weight that should be assigned to the diﬀerent blocks is not easy question toanswer because introducing an appropriate metric (or semimetric) is diﬃcult [36] without a priori information on thesmoothness of the curves. In this work we use a fairly simple weighting procedure made of the following steps: (1)within each block, the average population (across the N samples) of the coeﬃcients (a vector) was subtracted fromthe individual coeﬃcient vectors; (2) within each block, the coeﬃcients resulting from step 1 were “normalized” bythe corresponding block standard deviation [50]. The intention is to give each block the same relative importance ina PCA type decomposition. After the PCA was carried out on the data matrix, each stacked observation (of basisweights) was then projected onto the constructed principal components basis (three PCA modes were retained in allcases reported) and the PCA weights of each projection were then recorded. In order to get the distribution of theweights two methods were used:1. Under the ad hoc assumption that the PCA weights are statistically independent [51] (by construction of thePCA decomposition, the weights are linearly uncorrelated), the empirical distributions (across the functionalcurve population) were recorded for each PCA projection and new samples were created by resampling the PCAweight distributions. This new set of weights augments the original small set of weights (the basic idea beingto “interpolate” between function curves by resampling from the empirical cumulative distribution functions).2. In an eﬀort to computationally remove any dependence between the weights, and still under the assumptiongiven in Footnote 49, we employed tools such as independent components analysis (ICA) [37, 38] or projection-pursuit methods to transform the weights into roughly “decoupled” coordinates (we used the FastICA package[38]). We then empirically measured the cumulative distribution function of the weights and used the resultingdistributions to generate a new set of weights. The resulting weights were ﬁnally transformed back to the originalcoordinates and the resulting work paths were recorded [52].The type of methods described above are referred to as “FDA bootstrapping”. The basic steps of the overall methodusing one of the above methods are summarized below: • Let X ≡ [ W ′ , ˜A ′ , ˜C ′ ] T and denote the corresponding linear basis by { u , u , . . . } . Each column of X is thenprojected onto each column of the linear basis to obtain an ensemble of weights ≡ { w , w , . . . } . • The empirical cumulative distribution function associated with each (raw or processed) row of the w i vectorsis determined (i.e. across the N samples, a cumulative distribution function of each component of the weightvector is measured). • Using the above cumulative distribution function, a large number of “bootstrapped” weights are then resampled. • Synthetic work paths and diﬀusion models are ﬁnally produced using “FDA bootstrapped models”. The cost ofthis operation is usually marginal in relation to the overall data production procedure.

III. RESULTSA. Failure of Direct Work Process Approximation

SPA models were calibrated by observing 15 randomly selected SMD paths. The time dependent distribution of Z was accurately modeled with our stochastic dynamical models (see Figs. 1-2 in the Supplementary Material). Theaccuracy in the Z distribution approximation gives us partial hope for using the estimated SPA models to generate acollection of surrogate processes from which work can directly be measured by type I bootstrapping (as was done inRef. 19). 10 such paths were generated and the work done on the system (pulling in the R direction) was recordedas a function of time. The “pulling” parameters used in these type I bootstrapping SPA simulations were identical tothose used in the corresponding SMD paths. For comparison, the work done on the system as a function of time fromthe 1000 SMD paths has also been computed. Both SPA and SMD average work proﬁles, along with their standarddeviations, are shown in Fig. 3. The standard deviation of the SPA work paths is far too large compared to the SMDprocess it aims at approximating. This can be very problematic if one wants to use the approximate work distributionto compute free energy diﬀerences using a nonequilibrium method like the Jarzynski equality[7].A few representative work paths are also plotted for both the SPA and SMD cases. When individual paths arecompared, one clearly observes that the SPA work paths are too “rough” in relation to the relatively smooth SMDwork paths. This smoothness is likely caused by structured fast degrees of freedom not explicitly included in oursurrogate models. Accurately approximating this structure would be a formidable time series modeling problem andan interesting future research direction. However here we turn instead to the tools of FDA to model the work process.The estimated SPA models are still useful (despite the fact that they cannot directly simulate useful work paths)because we demonstrate that the information in the collection of SPA models indirectly summarizes some features ofthe complex SMD process. B. Applying FDA Bootstrapping to Approximate Work Distributions

We recall that one of the motivations of this study is to use the information contained in the estimated SPAmodels for approximating the shape of the work distributions associated with SMD simulations. In several earlycomputational studies[5, 21, 39], Gaussian work distributions were observed in pulling nonequilibrium processes. Mostof the arguments in the literature that attempt to justify this fact are either heuristic [39] or make the assumptionthat a single overdamped diﬀusion can be used to model the time series coming from nonequilibrium SMD simulationdata [5]. However it has been also shown that, even seemingly Gaussian work distributions, may not be Gaussian.This is just the case of the work distribution for the unfolding process of deca-alanine whose shape, though verysimilar to a Gaussian proﬁle, is inherently asymmetric[6]. This brings up a subtle point we want to stress. In thissystem a single overdamped model (calibrated at this timescale) is not adequate to describe the dynamics of eachnonequilibrium response. This is likely caused by slowly evolving degrees of freedom outside of the selected reactioncoordinate that have not been suﬃciently “averaged out” [28] [53]. In our system, this causes one to estimate multiple overdamped diﬀusion models. The approximation we introduced takes this model variability into account and shouldbe viewed as a new type of stochastic process that is diﬀerent than a single “simple” overdamped diﬀusion. Thisnew stochastic process violates the conditions that guarantee a Gaussian work distribution stated in Ref. 5. Becauseguaranteeing (or enforcing) a Gaussian work distribution can be diﬃcult with certain reaction coordinates, it wouldbe useful to have a tool which can be used to check the validity of the Gaussian work assumption without requiringa large number of nonequilibrium realizations. In the deca-alanine system, the FDA motivated techniques presentedhere are shown to be useful in this context (see below).The work distributions recovered from the 1000 SMD simulations and from the various FDA bootstrapping ap-proaches (see Section II C) for the R realizations are shown in Fig. 4. 15 SMD trajectories, taken randomly from theoverall ensemble of SMD trajectories, were used to calibrate 15 SPA models used for the various FDA bootstrappingapproaches. The work values corresponding to these SMD trajectories are also reported in Fig. 4. The FDA boot-strapping types are classiﬁed as follows: (1a) the FDA bootstrapping used diﬀusion and drift coeﬃcient functions fromthe SPA models along with the corresponding work function from SMD trajectories ( WAC bootstrapping); (1b) sameas previous, but ICA decoupling has been used on weights (

WAC

ICA bootstrapping); (2a) only the work functionfrom the SMD trajectories has been used in the FDA bootstrapping ( W bootstrapping); (2b) same as previous, butICA decoupling has been used on weights ( W ICA bootstrapping). Three PCA modes were used in all cases reported.The number of surrogate paths extracted from each FDA bootstrapping type is 10 . In addition the same randomnumber stream was used for each case in order to minimize sampling errors and attribute diﬀerences in the resultssolely to the systematic diﬀerences in the computational methods.We note that a naive bootstrapping ( W ) falsely predicts multiple modes, whereas the other bootstrapping schemesfairly accurately capture the shape of the SMD work distribution. This is indeed quite surprising given that only 15SMD paths underly the SPA models. The fact that a structured tube of functions is observed in the SPA models(sse Fig. 2), and that using this information in the FDA bootstrapping improves the reproduction of the workdistribution, suggests that the work and the diﬀusion dynamics of the RC (accounted for by the SPA models) areindeed signiﬁcantly correlated.Increasing the number of SMD paths underlying the SPA models improves the estimation of the low order moments(see Table I), but the overall shape of the predicted work distributions does not change substantially. We point outthat W ICA and

WAC

ICA methods seem to work better than W and WAC methods because there are multipleclusters (discussed in detail in Section III C) of observed dynamical responses ( i.e., SPA models) that a linear methodlike PCA cannot directly account for [37, 38]. The SPA model information aids in detecting these clusters , thusallowing for an appropriate assignment of the weights to the work paths. Though the weighting induced by the SPAfunctions resulting in an improved work distribution estimation is just a fortuitous coincidence, in the future, wehope to develop schemes which systematically correlate these SPA “function clusters” with slowly evolving collectivedegrees of freedom and assign appropriate weights to the paths using empirically determined correlations.In the

WAC and W bootstrapping methods, small batches of 15 SMD paths were drawn (without replacement) andused to calibrate 15 SPA models and then these models are used to approximate the nonequilibrium work distribution.When this procedure is repeated (with new batches of iid SMD sample paths) the mode variability in the resultingwork distribution is signiﬁcant (see Table I). The mean and mode in the WAC

ICA approach is relatively stable andwhen independent batches of 15 SMD paths are used in FDA bootstrapping, the resulting work distributions havenon-Gaussian shape with roughly the same mode. This is demonstrated in Fig. 5, where

WAC

ICA bootstrappingis applied for calculating the R and F work distributions using independent batches of SMD paths (3 batches of 15paths for the R direction and 2 batches of 15 paths for the F direction). For comparison, in the inset of Fig. 5we report the F and R work distributions recovered from the 1000 SMD paths. The free energy diﬀerence betweenfolded and unfolded states of deca-alanine can be predicted using a random pair of F and R work distributionscontained in Fig. 5 (free energy diﬀerence actually corresponds to the intersection point of the P R ( − W ) and P F ( W )distributions). Such free energy diﬀerences are roughly consistent with the free energy diﬀerence obtained from theSMD work distributions (inset of Fig. 5) and those reported in Ref. 6. A more detailed analysis about free energydetermination using the various FDA models is provided in the Supplementary Material.Given that the underlying decoupling ICA methods aim at separating signals into non-Gaussian sources [38], it isreassuring to note that the WAC

ICA approach is able to recover also Gausian-like work distributions, such as thosein the F direction (Fig. 5). This is quantiﬁed by the skew and by the diﬀerence between mean and median of the two F work distributions which are very small (skews: 0.0862 and -0.0142; diﬀerence between mean and median: -0.0979and 0.0249). It should also be noted that the ICA methods can give rise to convergence problems for small samplessizes (this computational decoupling method usually requires larger “training sets” [37, 38]). We notice however thatin the present case good results are obtained for both W ICA and

WAC

ICA bootstrapping even using small samplesizes. The positive performance of the small sample

WAC case is reassuring because this method can be used evenwhen the sample size in hand is not large enough for reliable use of ICA methods.

C. Physical Interpretation and collective variables orthogonal to the RC

Previous works have already made some connections between classical statistical mechanics and nonequilibriumwork relations [40]. The time series studied here suggest that paths exhibiting extreme work values, i.e., paths whoserelated work falls in one of the tails of the measured work distribution, are associated with stochastic dynamicswhich more strongly deviate from the “average SPA model”. These types of relationships were ﬁrst discovered by thedata driven methods presented earlier (using only the end-to-end distance RC). Physics-based intuitive explanationswere then tested by analyzing the sequence of SMD all-atom snapshots. The FDA methods revealed that there areclusters (sub-populations) in our data sets. The fact that the clusters appear to have physical relevance (see discussionbelow) suggests that SPA output may be useful for a functional classiﬁcation tool[36] in systems where the underlyingmolecular details are not well understood. We expand on this speculative idea below, but ﬁrst some establishedphysical facts are presented.When a process takes place far from equilibrium, classical statistical mechanics predicts that paths on average dissi-pate more work than they would if the process were carried out reversibly. We label these paths as the “uninterestingpaths”. We label those that microscopically violate the second law of thermodynamics, i.e. with negative dissipatedwork, as the “interesting paths”[40]. The labels are only used here to distinguish between the two cases in terms of thecontribution each of them gives to the exponential average used for recovering free energy diﬀerences[7]. We noticethat in our system it is diﬃcult to distinguish between interesting and uninteresting paths by visually inspecting thedynamics of the steered RC alone (see bottom of Fig. 1). However there are other conﬁgurational coordinates that wecould use to make the diﬀerences more apparent. Figure 6 shows some of these possible conﬁgurational coordinates,namely the radius of gyration and the number of hydrogen bonds in deca-alanine, as a function of the steered RC.Speciﬁcally, in Fig. 6 we report the path population mean µ and the curves that indicate the standard deviation, µ + σ and µ − σ , of these two quantities. Moreover the blue lines denote a set of uninteresting paths, while the redlines denote a set of interesting paths. Using the number of hydrogen bonds it is diﬃcult to distinguish betweeninteresting and uninteresting paths, but by using the radius of gyration, one observes that the interesting paths forma tube that is systematically lower than the mean of the entire population. It is noteworthy that the radius of gyrationassociated, in turn, with interesting and uninteresting paths has a signiﬁcantly diﬀerent behavior when 22 < Z < R workdistribution (note that in the ﬁgure the mean function calculated over the 60 SPA model coeﬃcient functions hasbeen subtracted from the individual functions). Sampling the paths in the tails of the work distribution allows us toselect interesting (smallest work values) and uninteresting paths (largest work values). In the interesting paths, thedrift function shows a faster contraction at the beginning portion of the path. A reasonable explanation for this wouldbe that, in the interesting paths, the elongated “random coil” is aligned in such a way that it experiences signiﬁcantlymore attractive forces from adjacent peptides (compared to typical conﬁguration consistent with the initial Z ). Theconverse applies to the uninteresting paths. Perhaps a more intriguing trend is observed in the diﬀusion functionsassociated with interesting paths. Initially they are low, but then increase rapidly in a sort of “transition” that occursaround Z ≈

26 ˚A. It is interesting to note that the function describing the noise magnitude undergoes this transitionroughly at the same Z value at which the radius of gyration paths “diverge” (compare Figs. 6 and 7). Some physicistsmight attribute this fact to a meta-stable state that exists for Z near 26 ˚A [15].The possible connections existing between diﬀusion models and structural features in nonequilibrium processes,as we have shown here, suggests that incorporating dynamic response information (such as that contained in thiscollection of SPA models) along with other structural information may provide a useful tool in classifying conﬁgu-rational trajectories where there are important unresolved (due to lack of direct observation) conﬁgurational detailsthat are believed to be important [10, 34]. Identifying clusters of dynamical responses could potentially assist incorrelating SPA model functions to both work and slowly evolving conﬁgurational details using more sophisticatedFDA techniques (e.g. functional partial least squares or mixed linear models [36]). This is attractive in situationswhere small and large samples of nonequilibrium paths are available. Physically, the clusters of curves in Fig. 7suggest that, in this system, the microstates associated with the slowly evolving degrees of freedom can be modeledeﬀectively as a continuum. For example, the radius of gyration appears to take a continuous range of values for a giventemperature and ﬁxed value of the RC. As a result of this, a continuous tube of SPA models is likely observed becausethe (roughly) continuously distributed radius of gyration reaction coordinate modulates the dynamical response [54].In more complex macromolecules, one should be warned that the SPA models estimated will likely depend on theconﬁgurational degrees of freedom in a more complicated fashion, but if they do measurably modulate the dynamics(e.g. see Ref. 4), then our technique may be helpful (but the nonequilirium work will probably not be approximatedas accurately as it was here). IV. CONCLUSIONS AND PERSPECTIVES

We have presented a surrogate process approximation (SPA) method[19] based on a diﬀusion process to modelnonequilibrium paths of a given reaction coordinate. We have applied the methodology to the nonequilibrium dynamicsof the end-to-end distance of a deca-alanine peptide at ﬁnite temperature produced by steered molecular dynamics(SMD) simulations. Special attention has been devoted to the refolding process of deca-alanine, that was previouslydemonstrated to yield markedly non-Gaussian work distributions[6]. For modeling the dynamical behavior of thereaction coordinate of deca-alanine, two strategies have been proposed. One is based on the modleling of one SMD paththrough a series of local diﬀusion models, each describing a limited interval of the path. These local models are thenstitched to give the global diﬀusion model (SPA model) for the given SMD path. A collection of global SPA models,modeled on various statistically independent SMD paths, have been then used to increase the statistical sampling,that is to produce a large amount of surrogate paths. Unluckily, these surrogate paths have been demonstrated tofurnish excessively broad work distributions. In order to correlate the dynamics of the reaction coordinate resultingfrom the SPA models with the work paths observed in the underlying SMD paths, we have turned to functional dataanalysis[23]. This procedure has allowed us to recover good quality work distributions considering the relatively smallnumber of SMD paths used for the modeling. This is indeed one of the goals of our study: when the number of directobservations (from experiments or calculations) is small, our approach allows one to obtain reliable estimation of thework distribution shape. Given the fact that several recent methods for estimating free energy diﬀerences[7, 8] dependon approximations of the work distributions obtained from a ﬁnite collection of nonequilibrium realizations along agiven reaction coordinate, any method for improving the quality of the work distributions may provide considerablehelp. Moreover, the determination of the work distribution shape can also furnish signiﬁcant insights on the underlyingdynamics of nonequilibrium processes. Taking advantage of this possibility, we are currently investigating strategiesfor supplying work distribution reweighting algorithms (see, e.g., Eq. 6 of Ref. 41) with SPA protocols.A further outcome of this study is to reveal the possibility of using SPA models to detect stochastic dynamicalfeatures of a system, e.g. trends in the drift or diﬀusion coeﬃcient of an approximating SDE, from noisy reactioncoordinate time series (obtained from nonequilibrium simulations). In fact, the dynamical behavior of slowly evolvingdegrees of freedom (or of collective variables orthogonal to the chosen reaction coordinate), has been shown to beemphasized by the SPA models. Evidence of this can be found in the diverse “population” of drift and diﬀusionfunctions associated with the collection of estimated SPA models (see Figs. 2, 6 and 7). It should be carefully notedthat the drift and diﬀusion functions of a single SPA model alone (estimated using a single SMD path) can hardlyreveal the occurrence of peculiar heterogeneities in the dynamics of the system. On the other hand, such features canbe detected by analyzing clusters in a collection of (estimated) SPA drift and diﬀusion coeﬃcient functions. At leastthis is the case observed in deca-alanine, where the dynamics of the radius of gyration appears to be signiﬁcantlycorrelated to both drift and diﬀusion functions. The radius of gyration was shown to readily identify work pathswhich transiently violated the second law of thermodynamics. It should be noted that, in a real experiment, a carefulanalysis of additional degrees of freedom, as we have done for deca-alanine, may be diﬃcult or simply not feasible.The methods presented here provide an indirect way of quantifying the eﬀect of diﬃcult to measure degrees of freedomutilizing dynamical information contained in a collection of time series of a simple reaction coordinate.In summary we have demonstrated that the measured diﬀusion functions oﬀer an alternative method for using“noise” (estimated on a pathwise basis) in nonequilibrium simulations/experiments to detect subtle small scale tran-sitions. One possible future application of this observation would be to develop a sampling method which constructsa basis (e.g. functional PCA) from a few “entire” pullings (i.e. span the entire RC range of interest). If after this stepis completed, transitions regions become apparent (or the location of transitions is known from established physicaltheories), one could exploit this ﬁnding and run a larger batch of “short” SMD simulations/experiments. By this wemean that the SMD simulations/experiments would only need to be carried out until just after the transition region ofthe RC. In this way, one could project onto the previously determined (empirical) basis and use the larger set of newtrajectories to get a reﬁned estimate of the frequency of certain important small scale molecular events by analyzingthe distribution of the projections onto this basis.

Acknowledgments

Acknowledgements The work of C.P.C. was supported by an NSF grants [1] C. Bustamante, J. Liphardt, and F. Ritort, Physics Today , 43 (2005).[2] D. Collin, F. Ritort, C. Jarzynski, S. Smith, I. Tinoco, Jr., and C. Bustamante, Nature , 231 (2005).[3] N. Harris, Y. Song, and C. Kiang, Phys. Rev. Lett. , 068101 (2007).[4] S. Paramore, G. Ayton, and G. Voth, J. Chem. Phys. , 105105 (2007).[5] S. Park and K. Schulten, J. Chem. Phys. , 5946 (2004).[6] P. Procacci, M. S., A. Barducci, G. Signorini, and R. Chelli, J. Chem. Phys. , 164101 (2006).[7] C. Jarzynski, Phys. Rev. Lett. , 2690 (1997).[8] G. E. Crooks, J. Stat. Phys. , 1481 (1998).[9] W. Wriggers, Z. Zhang, M. Shah, and D. C. Sorensen, Mol. Sim. , 803 (2006).[10] X. Zhuang, H. Kim, M. Pereira, H. Babcock, N. Walter, and S. Chu, Science , 1473 (2002).[11] K. Walther, F. Gr¨ater, L. Dougan, C. Badilla, B. Berne, and J. Fernandez, PNAS , 7916 (2007).[12] J. Wang, Chemical Physics Letters , 544 (2006).[13] S. Kim, P. Blainey, C. Schroeder, and X. Xie, Nature Methods , 397 (2007).[14] M. Vendruscolo and C. Dobson, Science , 1586 (2006).[15] M. Forney, L. Janosi, and I. Kosztin, (submitted?) (2007).[16] G. Hummer and A. Szabo, PNAS , 3658 (2001).[17] F. Gr¨ater and H. Grubm¨uller, J. Struct. Biology , 557 (2007).[18] B. Prakasa Rao, Statistical Inference for Diﬀusion Type Processes (Hodder Arnold, 1999).[19] C. Calderon, J. Chem. Phys. , 084106 (2007).[20] M. Carrion-Vazquez, A. Oberhauser, T. Fisher, P. Marszalek, H. Li, and J. Fernandez, Progress in Biophysics and MolecularBiology , 63 (2000).[21] I. Kosztin, B. Barz, and L. Janosi, J. Chem. Phys. , 064106 (2006).[22] W. Hoover, Phys. Rev. A , 1695 (1985).[23] J. Ramsay and B. Silverman, Functional Data Analysis (Springer, 2005).[24] D. Searles, L. Rondoni, and D. Evans, J. Stat. Phys. (to appear).[25] H. Kushner,

Weak convergence methods and singularly perturbed stochastic control and ﬁltering problems (Birkhauser,1990).[26] A. Mackerell, D. Bashford, M. Bellot, R. Dunbrack, J. Evanseck, M. Field, J. Gao, H. guo, S. Ha, D. Joseph-Mcarthy,et al., J. Phys. Chem. B , 3586 (1998).[27] A. Mamonov, M. Kurnikova, and R. Coalson, Biophys. Chem. , 268 (2006).[28] R. Zwanzig,

Nonequilibrium Statistical Mechanics (Oxford Univeristy Press, New York, 2001).[29] G. A. Pavliotis, A. M. Stuart, and K. Zygalakis, Comm. Math. Sci. p. (to appear) (2007).[30] G. A. Pavliotis and A. M. Stuart, J. Stat. Phys. , 741 (2007).[31] C. Calderon, Mutliscale Modeling and Simulation , 656 (2007).[32] P. Procacci, T. A. Darden, E. Paci, and M. Marchi, J. Comp. Chem. , 1848 (1997).[33] M. Bradnt and P. Santa-Clara, J. Financial Economics , 161 (2002).[34] S. Kou and X. Xie, Phys. Rev. Lett. , 18 (2004).[35] A. Ansari, J. Chem. Phys. , 1774 (1999).[36] F. Ferraty and P. Vieu, Nonparametric Functional Data Analysis (Springer Series in Statistics, 2006).[37] J. Stone, J. Porrill, N. Porter, and I. Wilkinson, NeuroImage , 407 (2002).[38] A. Hyvarinen, IEEE Transactions on Neural Networks , 626 (1999).[39] F. Ritort, Seminaire Poincar´e , 195 (2004).[40] C. Jarzynski, Phys. Rev. E. , 046105 (2006).[41] R. Chelli, S. Marsili, and P. Procacci, (2007).[42] W. Humphrey, A. Dalke, and K. Schulten, Journal of Molecular Graphics Z value.[47] Besides the statistical validity of the overdamped diﬀusion approximation can be readily tested using information containedin our models (see Supplementary Material).[48] It should be noted that the actual reaction coordinate value ( Z ( τ )) is very close to the deterministic target value ( λ ( τ ))because of the high spring constant used in the SMD simulations. In the FDA analysis employed, there was a negligiblediﬀerence between results obtained using Z and those obtained using λ .[49] The matrix we work with in this study is actually a block of the weights numerically obtained when our discretely observedfunctions are projected onto a Haar wavelet basis set. [50] Within each block, the variance at a grid point was ﬁrst computed across the function population. The variance at all gridpoints within the block was then summed and the square root of this quantity is what we call the “standard deviation”.[51] Since the initial conﬁgurations are iid samples from a stationary distribution, there is no statistical correlation betweenthe work, diﬀusion and drift functions for two diﬀerent SMD simulations.[52] Unfortunately ICA typically requires very large samples sizes to decouple and separate signals into (roughly) independentcomponents. However if one has a large collection of paths and is interested in using ICA to classify trajectory clusters,then ICA might be more useful (and reliable).[53] We do stress that the SPA models do accurately capture many statistical features of the true data (see SupplementaryMaterial).[54] Of course, many other factors are relevant in determining the dynamical response but the basic idea remains the same. TABLE I: Work Statistics. The case labelled “SMD” corresponds to using SMD trajectories to evaluate work at the terminalendpoint. In all of the FDA bootstrapping cases, 10 bootstrapped processes have been used. The number in parenthesisindicates the number of SMD paths used to calibrate each collection of SPA models (in the SMD cases this corresponds to thenumber of paths where work was measured directly from). In this Table only, the larger sample cases build incrementally fromthe smaller sample cases, e.g. W (30) contains the 15 paths from W (15).Case ˆ µ ˆ µ -Median Mode ˆ σ Skew KurtosisSMD (15) -77.785 2.478 -75.270 13.390 1.491 4.915SMD (30) -79.362 1.915 -82.014 11.827 1.210 5.254SMD (60) -77.481 1.424 -82.795 12.166 1.235 5.382SMD (100) -78.123 2.642 -87.845 12.154 0.993 4.583SMD (1000) -77.661 1.301 -84.889 11.844 1.127 5.385

WAC

ICA (15) -80.591 2.677 -88.129 13.408 0.856 3.228

WAC

ICA (30) -80.100 0.789 -82.127 11.634 0.358 2.994

WAC

ICA (60) -80.151 1.366 -86.195 12.165 0.510 2.985 W ICA (15) -80.736 2.105 -85.370 13.428 1.032 4.233 W ICA (30) -81.688 1.300 -84.595 11.538 0.948 4.900 W ICA (60) -80.042 1.119 -82.647 12.161 0.736 4.279

WAC (15) -79.560 3.027 -83.593 12.693 1.024 3.726

WAC (30) -78.427 0.563 -80.560 11.351 0.110 3.417

WAC (60) -75.564 0.980 -82.097 11.561 0.306 2.859 W (15) -73.196 2.376 -86.172 14.201 0.240 1.832 W (30) -76.847 0.426 -74.422 10.621 0.338 3.488 W (60) -75.353 1.416 -83.797 12.163 0.623 3.305 FIG. 1: (Color online) Snapshots of deca-alanine during two independent refolding processes obtained from steered moleculardynamics simulations published in Ref. 6. In the right side we report snapshots from a process where misfolding occurs andthe work done on the system is greater than that obtained from a reversible process. In the left side we report snapshots froma process that instead microscopically violates the second law of thermodynamics [40]. The center panel contains the reactioncoordinate Z ( τ ) for both simulations as a function of time. In the bottom panel we plot Z ( τ ) − λ ( τ ) , where λ ( τ ) is thedeterministic protocol used for steering the reaction coordinate. The snapshots were generated with the VMD program [42].FIG. 2: (Color online) Drift and diﬀusion functions (panels A and B, respectively) estimated using 15 SMD paths in the F and R directions (black and red curves, respectively). The sampling step of the SMD paths is 150 fs.FIG. 3: (Color online) Mean µ and standard deviation σ for R work paths using the SMD and SPA processes. The curveslabelled with W i correspond to a few randomly chosen work paths (see legend).FIG. 4: (Color online) Work distributions in the R direction obtained from 1000 SMD simulations and from the FDAbootstrapping methods (see legend). The circles indicate the work values corresponding to the 15 SMD paths used for calibratingthe SPA models used in FDA bootstrapping.FIG. 5: (Color online) WAC

ICA work distributions in the R and F directions (3 solid curves and 2 dashed curves, respectively).In the inset the R and F work distributions obtained from the SMD simulations are reported (black and green histograms,respectively). The work value at the intersection point of these last work distributions corresponds to the free energy diﬀerence(91.7 kJ mol − ). We stress that each FDA bootstrapping (approximate) work density used diﬀerent iid SMD paths as theunderlying data. For example in the R direction 45 total SPA models were estimated and this collection was split into 3 setsof 15. These batches were used to create the approximate R work densities shown here.FIG. 6: (Color online) Alternative RCs, i.e. radius of gyration (panel A) and number of hydrogen bonds in deca-alanine (panelB), as a function of the steered RC (from R SMD simulations). The black dashed line denotes the mean population curve µ calculated using 100 SMD paths. The solid black lines denote the standard deviation curves µ + σ and µ − σ . 10 “interesting”and 50 “uninteresting” paths are plotted in red and blue, respectively. Both quantities were calculated using the VMD program[42]. The radius of gyration was computed using all atoms of deca-alanine. The hydrogen bonds were computed using theVMD subroutine Hbonds using a 3 ˚A HN · · · OC bond distance and 20 o O · · · N-H angle cutoﬀs.FIG. 7: (Color online) Drift (panel A) and diﬀusion (panel B) functions (with population mean subtracted) of SPA modelscalibrated on interesting and uninteresting R SMD paths (solid and dashed lines, respectively).

100 200 300 400 500−200−150−100−50050 τ [ps] W o r k [ kJ / m o l ] µ (SPA) µ ± σ (SPA)W i (SPA) µ (SMD) µ ± σ (SMD)W i (SMD)

110 −100 −90 −80 −70 −60 −50 −40 −30 −2000.010.020.030.040.050.06

W [kJ/mol] P R ( W ))