Optimal estimators and asymptotic variances for nonequilibrium path-ensemble averages
aa r X i v : . [ c ond - m a t . s t a t - m ec h ] J u l Optimal estimators and asymptotic variances for nonequilibrium path-ensembleaverages
David D. L. Minh ∗ Laboratory of Chemical Physics, NIDDK, National Institutes of Health, Bethesda, Maryland 20892, USA
John D. Chodera † Research Fellow, California Institute of Quantitative Biomedical Research (QB3),University of California, Berkeley, 260J Stanley Hall, Berkeley, California 94720, USA (Dated: May 28, 2018)Existing optimal estimators of nonequilibrium path-ensemble averages are shown to fall within theframework of extended bridge sampling. Using this framework, we derive a general minimal-varianceestimator that can combine nonequilibrium trajectory data sampled from multiple path-ensemblesto estimate arbitrary functions of nonequilibrium expectations. The framework is also applied toobtaining asymptotic variance estimates, which are a useful measure of statistical uncertainty. Inparticular, we develop asymptotic variance estimates pertaining to Jarzynski’s equality for free en-ergies and the Hummer-Szabo expressions for the potential of mean force, calculated from uni- orbidirectional path samples. Lastly, they are demonstrated on a model single-molecule pulling exper-iment. In these simulations, the asymptotic variance expression is found to accurately characterizethe confidence intervals around estimators when the bias is small. Hence, it does not work wellfor unidirectional estimates with large bias, but for this model it largely reflects the true error in abidirectional estimator derived by Minh and Adib.
I. INTRODUCTION
Path-ensemble averages play a central role in nonequi-librium statistical mechanics, akin to the role of con-figurational ensemble averages in equilibrium statisti-cal mechanics. Expectations of various functionalsover processes where a system is driven out of equi-librium by a time-dependent external potential havebeen shown to be related to equilibrium properties, in-cluding free energy differences and thermodynamicexpectations.
The latter relationship, between equilib-rium and nonequilibrium expectations, has been appliedto several specific cases, such as: the potential of meanforce (PMF) along the pulling coordinate (or otherobserved coordinates ) in single-molecule pulling experi-ments; RNA folding free energies as a function of a con-trol parameter; the root mean square deviation from areference structure; the potential energy distribution and average; and the thermodynamic length. Compared to equilibrium sampling, nonequilibriumprocesses may be advantageous for traversing energeticbarriers and accessing larger regions of phase space perunit time. This is useful, for example, in reducing the ef-fects of experimental apparatus drift or increasing thesampling of barrier-crossing events. Thus, there hasbeen interest in calculating equilibrium properties fromnonequilibrium trajectories collected in simulations orlaboratory experiments. Indeed, single-molecule pullingdata has been used to experimentally verify relationshipsbetween equilibrium and nonequilibrium quantities.
While many estimators for free energydifferences and equilibrium ensemble averagescan be constructed from nonequilibrium relationships,they will differ in the efficiency with which they utilizefinite data sets, leading to varying amounts of statistical bias and uncertainty. Characterization of this biasand uncertainty is helpful for comparing the quality ofdifferent estimators and assessing the accuracy of aparticular estimate. The statistical uncertainty of anestimator is usually quantified by its variance in theasymptotic, or large sample, limit, where estimatesfrom independent repetitions of the experiment oftenapproach a normal distribution about the true valuedue to the central limit theorem. It is an importantgoal to find an optimal estimator which minimizes thisasymptotic variance.Although numerical estimates of the asymptotic vari-ance may be provided by bootstrapping (e.g. Ref. ),closed-form expressions can provide computational ad-vantages in the computation of confidence intervals, al-low comparison of asymptotic efficiency, and facili-tate the design of adaptive sampling strategies to targetdata collection in a manner that most rapidly reducesstatistical error. In the asymptotic limit, the sta-tistical error in functions of the estimated parameters canbe estimated by propagating this variance estimate via afirst-order Taylor series expansion. While this procedureis relatively straightforward for simple estimators, it canbe difficult for estimators that involve arbitrary functions(e.g. nonlinear or implicit equations) of nonequilibriumpath-ensemble averages.Fortunately, the extended bridge sampling (EBS)estimators, a class of equations for estimatingthe ratios of normalizing constants, are known to haveboth minimal-variance forms and associated asymptoticvariance expressions. Recently, Shirts and Chodera ap-plied the EBS formalism to generalize the Bennett ac-ceptance ratio, producing an optimal estimator com-bining data from multiple equilibrium states to com-pute free energy differences, thermodynamic expecta-tions, and their associated uncertainties. Here, we ap-ply the EBS formalism to estimators utilizing nonequilib-rium trajectories. We first construct a general minimal-variance path-average estimator that can use samplescollected from multiple nonequilibrium path-ensembles.We then show that some existing path-average estima-tors using uni- and bidirectional data are special casesof this general estimator, proving their optimality. Thisalso allows us to develop asymptotic variance expressionsfor estimators based on Jarzynski’s equality and theHummer-Szabo expressions for the PMF. We thendemonstrate them on simulation data from a simple one-dimensional system and comment on their applicability.
II. EXTENDED BRIDGE SAMPLING
Suppose that we sample N i paths (trajectories) fromeach of K path-ensembles indexed by i = 1 , , ..., K . Thepath-ensemble average of an arbitrary functional F [ X ] inpath-ensemble i is defined by hFi i ≡ Z dX F [ X ] ρ i [ X ] , (1)where ρ i [ X ] is a probability density over trajectories, ρ i [ X ] = c − i q i [ X ] ; c i = Z dX q i [ X ] , (2)with unnormalized density q i [ X ] > c i (a path partition function). The aboveintegrals, in which dX is an infinitesimal path element,are taken over all possible paths, X . Extended bridgesampling estimators provide a way of estimating ratiosof normalization constants c i /c j , which will prove usefulin estimating free energies and thermodynamic expecta-tions.To construct these estimators, we first note the impor-tance sampling identity, c i h α ij q j i i = (cid:20)Z dX q i [ X ] (cid:21) R dX α ij [ X ] q i [ X ] q j [ X ] R dX q i [ X ]= (cid:20)Z dX q j [ X ] (cid:21) R dX α ij [ X ] q i [ X ] q j [ X ] R dX q j [ X ]= c j h α ij q i i j , (3)where j is another path-ensemble index, α ij [ X ] is an ar-bitrary functional of X , and all normalization constantsare nonzero.Summing over the index j in Eq. 3 and using the sam-ple mean, N − i P N i n =1 F [ X in ], as an estimator for hFi i ,we obtain a set of K estimating equations, K X j =1 ˆ c i N i N i X n =1 α ij [ X in ] q j [ X in ] = K X j =1 ˆ c j N i N j X n =1 α ij [ X jn ] q i [ X jn ] , (4)whose solutions yield estimates ˆ c i for the normalizationconstants c i , up to an irrelevant scalar multiple. Each path, X in , is indexed by the ensemble i from which it issampled, and the sample number n = 1 , , ..., N i . Thiscoupled set of nonlinear equations defines a family ofestimators parameterized by the choice of α ij [ X ], all ofwhich are asymptotically consistent, but whose statisticalefficiencies will vary. With the choice, α ij [ X ] = N j ˆ c − jK P k =1 N k ˆ c − k q k [ X ] , (5)Eq. 4 simplifies to the optimal EBS estimator,ˆ c i = K X j =1 N j X n =1 " K X k =1 N k ˆ c k q k [ X jn ] q i [ X jn ] − . (6)This choice for α ij [ X ] is optimal in that the asymptoticvariance of the ratios ˆ c i / ˆ c j is minimal. These equa-tions may be solved by any appropriate algorithm, in-cluding a number of efficient and stable methods sug-gested by Shirts and Chodera. The asymptotic covariance of Eq. 6 is estimated by,ˆ Θ = M T ( I N − M N M T ) + M (7)where the elements of Θ are the covariances of the log-arithms of the estimated normalization constants, Θ ij =cov (ˆ γ i , ˆ γ j ), and ˆ γ i = ln ˆ c i . The superscript ( ... ) + de-notes an appropriate generalized inverse, such as theMoore-Penrose pseudoinverse, I N is the N × N identitymatrix (where N = P Ki =1 N i is the total number of sam-ples), N = diag ( N , N , ..., N K ) is the diagonal matrixof sample sizes, and M is the N × K weight matrix withelements, M ni = ˆ c − i q i [ X n ] K P k =1 N k ˆ c − k q k [ X n ] . (8)In this matrix, the distribution from which samples aredrawn from is irrelevant, and X is only indexed by n = 1 , . . . , N . We note that the sum over each column, P Nn =1 M ni , is one.For arbitrary functions of the logarithms of the nor-malization constants, φ (ˆ γ , ..., ˆ γ K ) and ψ (ˆ γ , ..., ˆ γ K ), theasymptotic covariance cov( ˆ φ, ˆ ψ ) can be estimated fromˆ Θ according to,cov( ˆ φ, ˆ ψ ) ≈ K X i,j =1 ∂φ∂ ˆ γ i ˆΘ ij ∂ψ∂ ˆ γ j , (9)through first-order Taylor series expansion of φ and ψ . III. GENERAL PATH-ENSEMBLE AVERAGES
Following previous work, we estimate nonequilib-rium expectations by defining additional path-ensembleswith “unnormalized densities” q F i [ X ] = F [ X ] q i [ X ] ; c F i = Z dX q F i [ X ] . (10)Using Eqs. (1), (2), and (10), we can express nonequilib-rium expectations as a ratio of the appropriate normal-ization constants, hFi i = c F i /c i . Notably, this can beestimated without actually sampling path-ensembles bi-ased by some function of F [ X ] (although it is sometimespossible to do so in computer simulations via transi-tion path sampling ). If no paths are drawn from thepath-ensemble corresponding to q F i [ X ], then N F i = 0and it is no longer required that q F i [ X ] > For each defined path-ensemble, the weight matrix M is augmented by one column with elements, M n F i = ˆ c − F i F [ X n ] q i [ X n ] K P k =1 N k ˆ c − k q k [ X n ] . (11)The estimator for the path-ensemble average, ¯ F i ≈ hFi i ,can be expressed in terms of weight matrix elements,¯ F i = N X n =1 M ni F [ X n ] , (12)and its uncertainty estimated by σ ( ¯ F i ) ≈ ¯ F i ( ˆΘ F i F i − F i i + ˆΘ i i ) . (13) IV. EXPERIMENTALLY RELEVANTPATH-ENSEMBLES
The above formalism is fully general, and may be ap-plied to any situation where the ratio q i [ X ] /q j [ X ] can becomputed. For arbitrary path-ensembles, unfortunately,calculating this ratio is only possible in computer sim-ulations unless certain assumptions are made about thedynamics. In a few special path-ensembles, however,we can use the Crooks fluctuation theorem to esti-mate this ratio, allowing us to apply the EBS estimatorto laboratory experiments. We examine these here.First, consider a forward process , in which a system,initially in equilibrium, is propagated under some time-dependent dynamics for a time τ which may cause it tobe driven out of equilibrium. The time-dependence ofthe evolution law (e.g. Hamiltonian dynamics in a time-dependent potential) is the same for all paths sampledfrom this ensemble.For a sample of paths only drawn from this ensemble,the optimal EBS estimator of hFi f reduces to the samplemean estimator, which we call the unidirectional path-ensemble average estimator¯ F f = 1 N f N f X n =1 F [ X fn ] , (14) and the associated asymptotic variance from Eq. 9 re-duces to the variance of the sample mean (see AppendixA) σ ( ¯ F f ) ≈ N f N f N f X n =1 (cid:0) F [ X fn ] − ¯ F f (cid:1) (15)The forward process has a unique counterpart knownas the reverse process . Here, the system moves via theopposite protocol in thermodynamic state space; afterinitial configurations are drawn from the final thermo-dynamic state of the forward path-ensemble, they aredriven towards the initial state. If the dynamical lawsatisfies detailed balance when the control parameters areheld constant at each fixed time t , the path probabilitiesin the conjugate forward and reverse path-ensembles arerelated according to the Crooks fluctuation theorem: ρ f [ X ] ρ r [ ˜ X ] = q f [ X ] q r [ ˜ X ] c r c f = e w τ [ X ] − ∆ f τ ≡ e Ω[ X ] , (16)in which ˜ X is the time-reversal, or conjugate twin , of X , ∆ f t = − ln( c t /c ) is the dimensionless free energydifference between thermodynamic states at times 0 and t (with τ being the fixed total trajectory length) and w t [ X ] is the appropriate dimensionless work. In Hamil-tonian dynamics, for example, this work is w t [ X ] = β R t dt ′ ( ∂H/∂t ′ ). For convenience, we define the total dissipative work as Ω[ X ] ≡ w τ [ X ] − ∆ f τ .We will refer to data sets which only include realiza-tions from the forward path-ensemble as ‘unidirectional’,and those with paths from both path-ensembles as ‘bidi-rectional’. Notably, sampling paths from these conjugateensembles and calculating the associated work w t [ X ] ispossible in single-molecule pulling experiments as well ascomputer simulations (c.f. Refs. ). To combine bidi-rectional data to estimate hFi f , we apply the Crooksfluctuation theorem to Eq. 6 and divide by ˆ c f , lead-ing to,¯ F f = N f X n =1 F [ X fn ] N f + N r e − ˆΩ[ X fn ] + N r X n =1 F [ X rn ] N f + N r e − ˆΩ[ X rn ] (17)which is bidirectional path-average estimator of Minhand Adib, derived here by a different route whichdemonstrates its optimality. (The asymptotic varianceestimator for this equation is written in a closed form inAppendix B.) In these bidirectional expressions, samplesdrawn from the reverse path-ensemble are time-reversedto obtain the paths X rn . The dissipated work estimate,ˆΩ[ X ] ≡ w τ [ X ] − ∆ ˆ f τ , requires an estimate of ∆ f τ . Amethod for obtaining this estimate will be described next. V. FREE ENERGY
Jarzynski’s equality, e − ∆ f t = (cid:10) e − w t (cid:11) f , (18)relates nonequilibrium work and free energy differences.To facilitate the use of EBS in Jarzynski’s equality, wedefine a path-ensemble by choosing F [ X ] = e − w t [ X ] inEq. 10, leading to q w t [ X ] = e − w t [ X ] q f [ X ] ; c w t = Z dX e − w t [ X ] q f [ X ] . (19)When only unidirectional data is available, the optimalEBS estimator for Jarzynski’s equality is e − ∆ ˆ f t = 1 N f N f X n =1 e − w t [ X fn ] (20)and its asymptotic variance is straightforwardly given byerror propagation. Estimators and asymptoticvariances have also been developed for unidirectional importance sampling forms of the equality.When bidirectional data is available, the same choiceof F [ X ] in Eq. 17 gives the estimator e − ∆ ˆ f t = N f X n =1 e − w t [ X fn ] N f + N r e − ˆΩ[ X fn ] + N r X n =1 e − w t [ X rn ] N f + N r e − ˆΩ[ X rn ] (21)In this equation, choosing t = 0 or t = τ leads to an im-plicit function mathematically equivalent to the Bennettacceptance ratio method, as previously explained. The asymptotic variance of ∆ ˆ f t is calculated by augment-ing the matrices M and ˆ Θ and using φ = ψ = ∆ f t = − ln( c w t /c f ) in Eq. 9, such that, σ (∆ ˆ f t ) = ˆΘ w t w t − w t f + ˆΘ f f . (22) VI. POTENTIAL OF MEAN FORCE
Building on Jarzynski’s equality, Hummer and Sz-abo developed expressions for the PMF, the free en-ergy as a function of a reaction coordinate rather thana thermodynamic state, that may be used to interpretsingle-molecule pulling experiments. In these experi-ments, a molecule is mechanically stretched by a force-transducing apparatus, such as an laser optical trap oratomic force microscope tip (c.f. ). The Hamiltonian gov-erning the time evolution in these experiments, H ( x ; t ) = H ( x ) + V ( z ( x ); t ), is assumed to contain both a termcorresponding to the unperturbed system, H ( x ), and atime-dependent (typically harmonic) external bias poten-tial imposed by the apparatus, V ( z ; t ), which acts along apulling coordinate, z ( x ). As the coordinate z t ≡ z ( x ( t ))is observed at fixed intervals ∆ t over the course of theexperiment, we will henceforth use t = 0 , , ..., T as aninteger time index. We calculate the work with a dis-crete sum as w t = P tn =1 [ V n ( z n ) − V n − ( z n )], where V n ( z ) ≡ V ( z ; n ∆ t ).While the expressions in Section V provide an esti-mate of relative free energies of the equilibrium thermo-dynamic states defined by H ( x ; t ), they are not imme-diately useful as an estimate for the PMF along z . By applying the nonequilibrium estimator for thermody-namic expectations, it was shown that the PMF in theabsence of an external potential is given by e − g ( z ) = (cid:10) δ ( z − z t ) e − w t (cid:11) f e V ( z t ; t ) , (23)where the dimensionless PMF, g ( z ), is defined in rela-tion to the normalized density as g ( z ) = − ln p ( z ) − δg .In this equation, δg is a time-independent constant, e − δg = R dx e − H ( x ;0) / R dx e − H ( x ) . This theorem can be used to develop estimators forthe PMF by replacing the delta function using a kernelfunction of finite width, such as, h ( z − z t ) = ( z , if | z − z t | < ∆ z , else . (24)The width ∆ z must be small so that e V ( z ; t ) does not varysubstantially across it.As this theorem is valid at all times, it is possible toobtain an asymptotically unbiased density estimate ˆ p t from each time slice. It is far more efficient, however, toestimate the PMF using all recorded time slices. Whileany linear combination of time slices will lead to a validestimate, certain choices will be more statistically effi-cient (leading to lower variance) than others. One wayto combine time slices is to use the asymptotic covari-ance matrix in the method of control variates, lead-ing to a generalized least-squares optimal estimate of thePMF. Unfortunately, we empirically found this approachto be numerically unstable. A more numerically stableapproach, which was proposed by Hummer and Szabo, is based on the weighted histogram analysis method, ˆ p ( z ) = P t µ t ( z ) ˆ p t ( z ) P t µ t ( z ) ; µ t ( z ) ≡ e − V ( z ; t )+∆ ˆ f t . (25)While this weighting scheme is optimal, in a minimal-variance sense, for independent samples from multiple equilibrium distributions, these assumptions do not holdfor time slices from nonequilibrium trajectories. How-ever, Oberhofer and Dellago did not observe substantialimprovement in PMF estimates when using other time-slice weighting schemes. By defining the path-ensemble, q z t [ X ] = δ ( z − z t ) e − w t [ X ] q f [ X ]; c z t = Z dX q z t [ X ] (26)and making use of Jarzynski’s equality (Eq. 18) for e − ∆ ˆ f t ,we can write Hummer and Szabo’s PMF estimator as e − ˆ g ( z ) = P t (ˆ c z t / ˆ c w t ) P t e − V ( z ; t ) (ˆ c f / ˆ c w t ) , (27)which can be readily analyzed in terms of EBS. WhileHummer and Szabo proposed using the unidirectionalpath average estimator (Eq. 14) to estimate the expec-tations in Eq. 27, Minh and Adib later applied a bidi-rectional estimator (Eq. 17), leading to significantly im-proved statistical properties. The asymptotic variance of these estimators can bedetermined by choosing φ = ψ = p ( z ) in Eq. 9. Forthe bidirectional estimator, the matrices M and ˆ Θ willcontain one column each for the f and r path-ensembles,and T +1 columns each for the path-ensembles associatedwith { w t } T t =0 and { z t } T t =0 . The relevant partial deriva-tives are, ∂p ( z ) ∂γ f = − p ( z ) (28) ∂p ( z ) ∂γ w t = − D c z t c w t + ND (cid:18) e − V ( z ; t ) c f c w t (cid:19) (29) ∂p ( z ) ∂γ z t = 1 D c z t c w t , (30)where γ i = ln c i , N = P t ( c z t /c w t ) is the numerator ofEq. 27, and D = P t e − V ( z ; t ) ( c f /c w t ) is its denominator.These lead to an estimate for σ (ˆ p ( z )). Finally, theasymptotic variance in the PMF is given by the errorpropagation formula, σ (ˆ g ( z )) ≈ σ (ˆ p ( z )) / ˆ p ( z ) . VII. ILLUSTRATIVE EXAMPLE
We demonstrate these results with Brownian dynam-ics simulations on a one-dimensional potential with U ( z ) = (5 z − z + 3) z , which were run as previ-ously described. A time-dependent external perturba-tion, V ( z ; t ) = k s ( z − ¯ z ( t )) /
2, with k s = 15 is applied,such that the total potential is U ( z ; t ) = U ( z ) + V ( z ; t ).After 100 steps of equilibration at the initial ¯ z ( t ), ¯ z ( t )is linearly moved over 750 steps from − . . . − . z t = z t − − dU ( x t − ) dx D ∆ t + (2 D ∆ t ) / R t , where the dif-fusion coefficient is D = 1, the time step is ∆ t = 0 . R t ∼ N (0 ,
1) is a random number from the standardnormal distribution.As previously noted, unidirectional samplingleads to significant apparent bias in estimates of ∆ f t (Fig.1). In addition to the increased bias as the system isdriven further from equilibrium, we further observe thatthe estimated variance also increases. Bidirectional sam-pling, on the other hand, leads to a significant reductionin bias and variance, such that free energy estimateis within error bars of the actual free energy. Because∆ ˆ f t represents the estimated free energy difference withrespect to t , the estimated σ (∆ ˆ F t ) increases with t , be-coming equal to the well-known Bennett acceptance ratioasymptotic variance estimate when t = τ .Similar trends are observed with the Hummer-SzaboPMF estimates (Fig. 2). For unidirectional sampling,the finite-sampling bias and estimated variance increaseswhen the PMF is far from the region sampled by theinitial state. With bidirectional sampling, the bias is sig-nificantly reduced; the PMF estimate is largely withinerror bars of the actual PMF. ∆ f t FIG. 1: Comparison of estimators for ∆ f t : This figure issimilar to Fig. 1 of Ref. , except that error bars are nowincluded and the sample size is halved. The unidirectionalestimator (Eq. 20) is applied to 250 forward (rightward tri-angles) or reverse (leftward triangles, time reversed) sampledpaths, and the bidirectional estimator (Eq. 21) to 125 pathsin each direction (upward triangles). The exact ∆ f t is shownas a solid line. Error bars (sometimes smaller than the mark-ers) denote one standard deviation of ∆ ˆ f t , estimated usingthe expressions presented here. The vertical dashed lines areat the times considered in Fig. (4). −1 0 1051015 g ( z ) (a) Unidirectional −1 0 1Position(b) Bidirectional FIG. 2: Comparison of PMF estimators: This figure is similarto Fig 2 of Ref. , except that error bars are now included andthe sample size is halved. In the left panel, the unidirectionalHummer and Szabo estimator is applied to (a) 250 forward(rightward triangles) or 250 reverse (leftward triangles) sam-pled paths. In the right panel, the bidirectional estimator isapplied to 125 sampled paths in each direction (upward trian-gles). The exact PMF is shown as a solid line in both panels.Error bars (sometimes smaller than the markers) denote onestandard deviation of ∆ˆ g ( z ), estimated using the expressionspresented here. The vertical dashed lines are at the positionsconsidered in Fig. (4). To analyze these trends more quantitatively, we re-peated the experiment 1000 times. For both ∆ f t and g ( z ), we calculated the bias as ¯ B ( ¯ F f ) = S P Ss =1 ( ¯ F f,s − hFi ) and the standard deviation as¯ σ ( ¯ F f ) = q S P Ss =1 ( ¯ F f,s − hFi ) , where S = 1000 is thenumber of replicates. The results from these more exten-sive simulations support our described trends (Fig. 3).For unidirectional sampling, the bias in both ∆ f t and g ( x ) appear to significantly increase around the barriercrossing. In the bidirectional free energy estimate, how- B / σ (a) ∆ f t −1 0 1Position(b) g (z) FIG. 3: Ratio of estimator bias to standard deviation: Thisratio is calculated for the (a) free energy and (b) PMF, using1000 independent estimates. Each estimate is obtained andthe type of path sample is indicated as in Figs. (1) and (2).The vertical dashed lines are at the times/positions consideredin Fig. (4). ever, the bias is small relative to the variance at all times.Notably, in the bidirectional PMF estimate, there is asmall spike in the bias near the barrier, potentially dueto reduced sampling in the region.While in the large sample limit, the bias in the unidi-rectional estimate is expected to be small compared tothe variance, our distribution of unidirectional e − ∆ f t estimates is significantly skewed and does not resemble aGaussian distribution expected by the central limit the-orem (data not shown). Hence, the asymptotic limit hasnot been reached and the large relative bias is caused byinsufficient sampling of rare events with low work valuesthat dominate the exponential average. Larger samplesizes would be necessary for the distribution of estimatesto be normally distributed and for the error to be dom-inated by the variance (which we estimate here) ratherthan the bias.The accuracy of variance estimates may be assessed bycomparing predicted and observed confidence intervals.If the estimates are indeed normally distributed aboutthe true value, about 68% of estimates from many inde-pendent replicates of the experiment should be withinone standard deviation of the true value, 95% withintwo, and so forth. Fig. (4) compares confidence inter-vals predicted using the described asymptotic varianceestimators and the actual fraction of estimates withinthe interval.We observe that the accuracy of our asymptotic vari-ance estimate in characterizing the confidence intervallargely depends on the presence of bias. In the bidirec-tional ∆ ˆ f t estimate, where there is little bias, the asymp-totic variance estimate works very well. For the uni-directional ∆ ˆ f t estimates, it works well near the initialstate but underestimates the error as the system is drivenfurther away from equilibrium, concurring with the biastrend. In the bidirectional PMF estimate, the asymptoticvariance estimate accurately describes the confidence in-terval except near the barrier, where it slightly underes- t (a) t = 0.12 0 0.5 100.51 g (z)(d) z = −1.050 0.5 100.51 O b s e r v ed f r a c t i on (b) t = 0.43 0 0.5 100.51 (e) z = 0.150 0.5 100.51 (c) t = 0.60 0 0.5 100.51Predicted fraction(f) z = 0.95 FIG. 4: Validation of asymptotic variance estimators: Pre-dicted vs. observed fraction of 1000 independent estimatesthat are within an interval of the true value for (a)-(c) ∆ f t and (d)-(f) g ( z ) at the indicated times or positions. Eachestimate is obtained and the type of path sample is indicatedas in Figs. (1) and (2). Error bars on these fractions are 95%confidence intervals calculated using a Bayesian scheme de-scribed in Appendix B of Chodera et. al., except that,for numerical reasons, the confidence interval was estimatedfrom the variance of the Beta distribution assuming approxi-mate normality, rather than from the inverse Beta cumulativedistribution function. timates the uncertainty, probably due to the small spikein bias.In the regime where the bias is much smaller than thevariance, ¯ B ≪ ¯ σ , the asymptotic variance estimate pro-vides a good estimate of the actual statistical error inthe estimate. This also permits us to model the posteriordistribution of quantity being estimated as a multivari-ate normal distribution with mean ¯ F and covariance ˆ Θ .Doing so provides a route to combining estimates fromindependent datasets collected from different path en-sembles — such as different pulling speeds or from equi-librium and nonequilibrium path ensembles — withoutknowledge of path probability ratios. This is achievedby maximizing the product of these posterior distribu-tions in a manner similar to the Bayesian approach forestimating ∆ f τ described in Ref. . VIII. ACKNOWLEDGEMENTS
We thank Attila Szabo and Zhiqiang Tan for helpfuldiscussions, and Christopher Calderon for useful com-ments on the manuscript. D.M. thanks Artur Adib forsupporting a postdoctoral fellowship. This research wassupported by the Intramural Research Program of the NIH, NIDDK.
APPENDIX A: CLOSED-FORM EXPRESSION FOR THE ASYMPTOTIC VARIANCE, GIVENUNIDIRECTIONAL DATA
In this appendix, we show that given unidirectional data, the optimal EBS estimate is the sample mean and itsvariance simplifies to the variance of a sample mean. First, consider the application of the optimal EBS estimator,Eq. 6, to estimating a nonequilibrium path-ensemble average from a unidirectional data set,ˆ c F f = N f X n =1 (cid:20) N f ˆ c f q f [ X fn ] q F f [ X fn ] (cid:21) − (A1)= N f X n =1 F [ X fn ]ˆ c f N f . (A2)Dividing both sides by ˆ c f , we obtain the sample mean estimator,¯ F f = 1 N f N f X n =1 F [ X fn ] . (A3)We shall now simplify the asymptotic variance estimate by closely following the procedure of Shirts and Chodera. When M has full column rank, ˆ Θ can be written as (Eq. D7 of Ref. ),ˆ Θ = [( M T M ) − − N + b K T K ] − , (A4)where b is an arbitrary multiplicative factor and K is a 1 X K matrix of ones.The weight matrix M consists of two columns, M nf = ˆ c − f q f [ X fn ] N f ˆ c − f q f [ X fn ] = 1 N f (A5) M n F f = ˆ c − F f q F f [ X fn ] N f ˆ c − f q f [ X fn ] = F [ X fn ] N f ¯ F f , (A6)obtained by applying Eqs. 8 and 11. This leads to, M T M = " N − f N − f N − f P N f n =1 M n F f ≡ (cid:20) a a a a (cid:21) . (A7)The matrix M T M has the determinant, D = 1 N f N f X n =1 M n F f − N f . (A8)which allows us to write the inverse covariance matrix as,ˆ Θ − = (cid:20) a D − N f + b − a D + b − a D + b a D + b (cid:21) . (A9)By applying the same steps as Appendix E of Shirts and Chodera, we then obtain the determinant | ˆ Θ − | = 4 abD , (A10)where a = a = a . We then obtain the asymptotic covariance estimate,ˆ Θ = D ab (cid:20) a D + b aD − b aD − b a D − N f + b (cid:21) (A11)To estimate the variance, we apply Eq. 13, leading to σ ( ¯ F f ) ≈ ¯ F f (Θ F f F f − F f f + Θ f f ) (A12)= ¯ F f N f X n =1 M n F f − N f (A13)= N f X n =1 F [ X fn ] N f − ¯ F f N f (A14)= 1 N f N f N f X n =1 F [ X fn ] − N f N f X n =1 F [ X fn ] (A15)= 1 N f N f N f X n =1 (cid:0) F [ X fn ] − ¯ F f (cid:1) , (A16)which is the variance of a sample mean estimate. APPENDIX B: CLOSED-FORM EXPRESSION FOR THE ASYMPTOTIC VARIANCE, GIVENBIDIRECTIONAL DATA
In this appendix, we obtain a closed-form expression for the asymptotic variance of the optimal EBS estimate for¯ F f , given bidirectional data. We will follow a similar procedure as in Appendix A. For the bidirectional case, theweight matrix M consists of three columns, M = [ m f m r m F f ], where m i is a column matrix of weights from Eqs.8 and 11 corresponding to path-ensemble i . The elements of M are, M nf = ˆ c − f q f [ X n ] N f ˆ c − f q f [ X n ] + N r ˆ c − r q r [ ˜ X n ] = 1 N f + N r e − ˆΩ[ X n ] = N − f ǫ ( L n ) (B1) M nr = ˆ c − r q r [ ˜ X fn ] N f ˆ c − f q f [ X n ] + N r ˆ c − r q r [ ˜ X n ] = 1 N f e ˆΩ[ X n ] + N r = N − r ǫ ( − L n ) (B2) M n F f = (cid:18) F [ X ]¯ F f (cid:19) N f + N r e − ˆΩ[ X n ] = (cid:18) F [ X ]¯ F f (cid:19) N − f ǫ ( L n ) , (B3)where ǫ is defined as the Fermi function, ǫ ( L n ) = e − Ln , and we define L n = W [ X n ] − ∆ ˆ f t + ln (cid:16) N f N r (cid:17) . This allowsus to write M T M as, M T M = N X n =1 N f ǫ ( L n ) N f N r ǫ ( L n ) ǫ ( − L n ) N f (cid:16) F [ X ]¯ F f (cid:17) ǫ ( L n ) N f N r ǫ ( L n ) ǫ ( − L n ) N r ǫ ( − L n ) N f N r (cid:16) F [ X ]¯ F f (cid:17) ǫ ( L n ) ǫ ( − L n ) N f (cid:16) F [ X ]¯ F f (cid:17) ǫ ( L n ) N f N r (cid:16) F [ X ]¯ F f (cid:17) ǫ ( L n ) ǫ ( − L n ) N f (cid:16) F [ X ]¯ F f (cid:17) ǫ ( L n ) ≡ a ff a fr a f F f a fr a rr a r F f a F f F a r F f a F f F f . (B4)Using the determinant, D = − a F f F f a fr + 2 a f F f a fr a r F f − a ff a r F f − a f F f a rr + a F f F f a ff a rr , (B5)we write the inverse covariance matrix estimator as,ˆ Θ − = − a r F f + a F f F f a rr D − N f + b − a F f F f a fr + a f F f a r F f D + b a fr a r F f − a f F f a rr D + b − a F f F f a fr + a f F f a r F f D + b − a f F f + a F f F f a ff D − N r + b a f F f a fr − a ff a r F f D + b a fr a r F f − a f F f a rr D + b a f F f a fr − a ff a r F f D + b − a fr + a ff a rr D + b . (B6)By applying the same steps as Appendix E of Shirts and Chodera, we obtain the determinant | ˆ Θ − | = 9 b ( a fr N f + a fr a rr N f ) D . (7)Applying Eq. 13 to ˆ Θ and simplifying, it can be shown that the variance estimate is, σ ( ¯ F f ) ≈ ¯ F f (Θ F f F f − F f f + Θ f f ) (8)= a F f F f a fr − a f F f a fr − a f F f a r F f + a ff a r F f a fr ( a fr N f + a rr N r ) (9) ∗ Electronic Address: [email protected] † Electronic Address: [email protected] C. Jarzynski, Phys. Rev. Lett. , 2690 (1997). C. Jarzynski, Phys. Rev. E , 5018 (1997). G. E. Crooks, Phys. Rev. E , 2361 (2000). R. M. Neal, Statistics and Computing , 125 (2001). G. Hummer and A. Szabo, Proc. Natl. Acad. Sci. U.S.A. , 3658 (2001). G. Hummer and A. Szabo, Acc. Chem. Res. , 504 (2005). D. D. L. Minh, Phys. Rev. E , 061120 (2006). D. D. L. Minh, J. Phys. Chem. B , 4137 (2007). I. Junier, A. Mossa, M. Manosas, and F. Ritort, Phys. Rev.Lett. , 070602 (2009). E. Lyman and D. M. Zuckerman, J. Chem. Phys. ,065101 (pages 6) (2007). J. Nummela, F. Yassin, and I. Andricioaei, J. Chem. Phys. , 024104 (2008). E. H. Feng and G. E. Crooks, Phys. Rev. E , 012104(2009). D. Collin, F. Ritort, C. Jarzynski, S. B. Smith, I. Tinoco,and C. Bustamante, Nature , 231 (2005). J. Liphardt, S. Dumont, S. B. Smith, I. Tinoco Jr., andC. Bustamante, Science , 1832 (2002). C. H. Bennett, J. Comput. Phys. , 245 (1976). P. Maragakis, M. Spichty, and M. Karplus, Phys. Rev.Lett. , 100602 (2006). P. Maragakis, F. Ritort, C. Bustamante, M. Karplus, andG. E. Crooks, J. Chem. Phys. , 024102 (2008). M. R. Shirts and V. S. Pande, J. Chem. Phys. , 144107(2005). C. P. Calderon, L. Janosi, and I. Kosztin, J. Chem. Phys. , 144908 (pages 13) (2009). Z. Tan, J. Am. Stat. Assoc. , 1027 (2004). N. Singhal and V. S. Pande, J. Chem. Phys. , 204909(2005). N. S. Hinrichs and V. S. Pande, J. Chem. Phys. ,244101 (2007). A. M. Hahn and H. Then,
A dynamic sampling strat- egy for two-sided free-energy estimation (2009), cond-mat/0904.0625v2. Y. Vardi, Ann. Stat. , 178 (1985). R. D. Gill, Y. Vardi, and J. A. Wellner, Ann. Stat. ,1069 (1988). A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae, andZ. Tan, J. R. Stat. Soc. Ser. B (Stat. Methodol.) , 585(2003). M. R. Shirts and J. D. Chodera, J. Chem. Phys. ,124105 (2008). D. D. L. Minh and A. B. Adib, Phys. Rev. Lett. ,180602 (2008). Honi Doss makes this suggestion in the conference discus-sion of . S. Sun, J. Chem. Phys. , 5769 (2003). F. M. Ytreberg and D. M. Zuckerman, J. Chem. Phys. , 10876 (2004). L. Pratt, J. Chem. Phys. , 5045 (1986). C. Dellago, P. G. Bolhuis, F. S. Csajka, and D. Chandler,J. Chem. Phys. , 1964 (1998). J. Nummela and I. Andricioaei, Biophys. J. , 3373(2007). G. E. Crooks, J. Stat. Phys. , 1481 (1998). G. E. Crooks, Phys. Rev. E , 2721 (1999). C. Jarzynski, Phys. Rev. E , 046105 (2006). L. Lu and T. B. Woolf, in
Free Energy Calculations , editedby C. Chipot and A. Pohorille (Springer, Berlin, 2007),vol. 86. D. D. L. Minh, J. Chem. Phys. , 204102 (2009). H. Oberhofer, C. Dellago, and P. Geissler, J. Phys. Chem.B , 6902 (2005). M. R. Shirts, E. Bair, G. Hooker, and V. S. Pande, Phys.Rev. Lett. , 140601 (2003). D. D. L. Minh and J. A. McCammon, J. Phys. Chem. B , 5892 (2008). A. M. Ferrenberg and R. H. Swendsen, Phys. Rev. Lett. , 1195 (1989). S. Kumar, D. Bouzida, R. H. Swendsen, P. A. Kollman, and J. M. Rosenberg, J. Comput. Chem. , 1011 (1992). H. Oberhofer and C. Dellago, J. Comput. Chem. , 1726(2009). D. M. Zuckerman and T. B. Woolf, Phys. Rev. Lett. ,180602 (2002). J. Gore, F. Ritort, and C. Bustamante, Proc. Natl. Acad. Sci. U.S.A. , 12564 (2003). D. M. Zuckerman and T. B. Woolf, J. Stat. Phys. ,1303 (2004). J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, andK. A. Dill, J. Chem. Theory Comput.3