[PDF] Statistically Unbiased Free Energy Estimates from Biased Simulations

Abstract

Estimating the free energy in molecular simulation requires, implicitly or explicitly, counting how many times the system is observed in a finite region. If the simulation is biased by an external potential, the weight of the configurations within the region can vary significantly, and this can make the estimate numerically unstable. We introduce an approach to estimate the free energy as a simultaneous function of several collective variables starting from data generated in a statically-biased simulation. The approach exploits the property of a free energy estimator recently introduced by us which provides by construction the estimate in a region of infinitely small size. We show that this property allows removing the effect of the external bias in a simple and rigorous manner. The approach is validated on model systems for which the free energy is known analytically and on a small peptide for which the ground truth free energy is estimated in an independent unbiased run. In both cases the free energy obtained with our approach is an unbiased estimator of the ground-truth free energy, with an error whose magnitude is also predicted by the model.

Full PDF

SStatistically Unbiased Free Energy Estimatesfrom Biased Simulations

Matteo Carli and Alessandro Laio ∗ SISSA, Via Bonomea 265, 34136 Trieste, Italy

E-mail: [email protected] a r X i v : . [ phy s i c s . c h e m - ph ] F e b bstract Estimating the free energy in molecular simulation requires, implicitly or explicitly,counting how many times the system is observed in a ﬁnite region. If the simulation isbiased by an external potential, the weight of the conﬁgurations within the region canvary signiﬁcantly, and this can make the estimate numerically unstable. We introducean approach to estimate the free energy as a simultaneous function of several collectivevariables starting from data generated in a statically-biased simulation. The approachexploits the property of a free energy estimator recently introduced by us, whichprovides by construction the estimate in a region of inﬁnitely small size. We show thatthis property allows removing the eﬀect of the external bias in a simple and rigorousmanner. The approach is validated on model systems for which the free energy is knownanalytically and on a small peptide for which the ground truth free energy is estimated inan independent unbiased run. In both cases the free energy obtained with our approachis an unbiased estimator of the ground-truth free energy, with an error whose magnitudeis also predicted by the model. Graphical TOC EntryKeywords

Free energy landscapes; reweighting; umbrella sampling; unbiased estimators2

Introduction

A central problem in atomistic simulations is the study of systems exhibiting a complex andmultidimensional free energy landscape. This landscape is typically estimated as a functionof a set of collective variables (CVs), which are explicit functions of all the coordinates of thesystem. In many practical applications, the sampling of all the relevant states of the systemis artiﬁcially enhanced using a wide range of methods (see for a review). The prototype ofmany of these methods is Umbrella Sampling, in which an external bias potential is addedto the potential energy with the aim of accelerating the transitions between the local freeenergy minima and explore all the relevant metastable states of the system. We shall focushere on this speciﬁc approach.During the simulation, the bias B ( x ) is added to the potential energy U ( x ) . This bias, isa function of a (possibly multidimensional) CV s ( x ) , namely B ( x ) ≡ B ( s ( x )) . In the mostgeneral case, one is willing to estimate the free energy as a function of another (also possiblymultidimensional) CV σ ( x ) .The unbiased free energy for a speciﬁc value ˜ σ can be estimatedas: F (˜ σ ) = − β − log (cid:90) ρ B ( x ) e βB ( s ( x )) δ (˜ σ − σ ( x )) d x + f B (1)where ρ B ( x ) = Z − B e − β ( V ( x )+ B ( s ( x ))) is the biased probability distribution, Z B is the biasedcanonical partition function and f B is an additive constant. In ordinary Umbrella Samplingthe biasing CV is an explicit function of the the variables σ , namely s ( x ) ≡ s ( σ ( x )) . Inthis case eq 1 takes the simple known form F (˜ σ ) = F B (˜ σ ) − B ( s (˜ σ )) where F B (˜ σ ) = − β − log (cid:82) ρ B ( x ) δ (˜ σ − σ ( x )) d x . If s ( x ) is a generic function of the coordinates, instead, theexponential factor cannot be brought out of the integral and there is no easy manner toestimate the unbiased probability density ρ ( σ ) from ρ B ( σ ) .Both in the case in which s is a function σ or not, what is generally done in literature is to relax the delta function in eq 1 and instead use a kernel K ∆ that converges to it onlywhen the limit to zero is taken on its scale parameter ∆ (also called smoothing parameter ).3n other words, instead of F in eq 1, the following quantity F K is computed: F K (˜ σ ) := − β − log (cid:90) ρ B ( x ) e βB ( s ( x )) K ∆ (˜ σ − σ ( x ))d x (2)so that, by replacing the true biased density ρ B ( x ) by its sample estimator ˆ ρ B ( x ) = N (cid:80) Nj =1 δ ( x − x j ) one obtains ˆ F K (˜ σ ) ∼ − β − log N (cid:88) j =1 e βB ( s ( x i )) K ∆ (˜ σ − σ ( x j )) (3)where ˆ F K denotes the estimator of F K in eq 2 and is deﬁned up to an additive constant.One simple example of such kernels K is the characteristic function χ I ( | ˜ σ b − σ | ) of theinterval I = [0 , ∆ / , which is the case of a histogram of bin size ∆ centered around thebin centers { ˜ σ b } b . The expression in eq 2 however also applies to other kernel methods or,with a point-dependent value of ∆ , to the k -Nearest Neighbour estimator ( k -NN). Thecommon feature of these estimators is that they are not punctual, namely they provide anestimate of the free energy on a ﬁnite-size region. Thus, in the limit n → ∞ they convergeto F K in eq 2 but not to F in eq 1. Over this neighbourhood the value of e βB ( s ) can belargely ﬂuctuating, making the estimators ill-behaved. The estimators ˆ F K only convergeto F asymptotically, namely in the limit when both ∆ → and n → ∞ . However, evenwhen big, n will always be ﬁnite, so a ﬁnite parameter ∆ may be required in order for theestimators ˆ F K to be statistically meaningful. This problem becomes more and more severe inhigh dimension, due to the curse of dimensionality and can happen even in the trivial case σ = s , if s is multidimensional or the sample is too small.We here show that this problem can be circumvented by exploiting a punctual free energyestimator, i.e. one in which the limit for ∆ → is implicitly taken. Under these conditions, it isnot anymore necessary to take the average of the bias factors e βB ( s ( x )) over the neighbourhoodof σ ( x i ) set by a ﬁnite ∆ : the reweighting involves only the punctual value of the bias appliedwhen generating a datapoint i . 4n estimator with such punctual property was introduced by us in ref. This procedure,which we called Point-Adaptive k-nearest neighbour free energy estimator (PA k ) allowsestimating the free energy in a very large dimensional space, ideally including all the relevantdegrees of freedom of the system (for example, the coordinates of all the heavy atoms of asolute). The advantage of PA k is that, for all points { σ i } i in a sample, it provides an unbiasedestimate of the free energy F ( σ i ) in eq 1 rather than the one in eq 2.As we will show, this allows removing the eﬀect of the bias in a simple, numerically robustand theoretically well-founded manner, also in the case in which the CV on which the bias isapplied is not an explicit function of the variables σ . k approach PA k is a generalization of the kNN density estimator in which the optimal value of k isdetermined independently for each datapoint by an unsupervised, non-parametric procedure.The approach works also if the dimensionality D of the space Σ of the coordinates σ is verylarge (for example the position of all the C α carbons in a protein ). What makes theestimate computationally feasible is the key assumption that, due to chemical restraints, ρ ( σ ) has a support on a manifold Ω ⊂ Σ , whose intrinsic dimension d is signiﬁcantly smaller than D . We here assume for simplicity that d is constant over the whole dataset and estimate itwith the TWO-NN approach . Another important assumption of PA k is that, at least in aneighbourhood where the density of points is approximately constant, Ω is well approximatedby its tangent hyperplane T Ω ( x ) . Therefore, around every point x i one can measure distancesin the Euclidean R D metric, which for points lying on T Ω ( x i ) is equivalent to the corresponding R d restriction. We indicate by r i,k the Euclidean distance between the x i and its k -th-nearest Notice the PA k approach is valid also in the case of no explicit dimensional reduction σ ( x ) := x , i.e. when Σ ≡ R N . In this case the free energy F i is equivalent to the potential energy of the conﬁguration x i enteringthe canonical Boltzmann factor. Even in this case, however, it is still appropriate to refer to this quantity as"free energy", since ρ ( σ ) is estimated in a space of dimension d ≤ D . l − and l of a point i is given by ν i,l = ω d ( r di,l − r di,l − ) , where ω d is the volume of the d -sphere of unitary radius and r i, isconventionally set to (hence, the volume of the d -sphere enclosed between a point i and its k -nearest neighbour is (cid:80) kl =1 ν i,l = ω d r di,k ). Using these data, one ﬁrst ﬁnds, by a statisticaltest performed independently for each point i , the maximum value of k for which the volumes ν can be considered as drawn from a probability distribution with an approximately constantdensity. We call this value ˆ k i (see ref for an explicit deﬁnition of the test). For each point i one then estimates the free energy F i = F ( σ ( x i )) by maximising the likelihood of a model inwhich the free energy is assumed to depend linearly on the the neighbourhood order l : ˆ F i := argmax F max a ˆ k i (cid:88) l =1 log( e − F + al e − exp( − F + al ) ν i,l ) (4)For each i , this maximisation is equivalent to the solution of a log-linear regression model inwhich the l observed responses are the random variables Y i = { ν i,l } l distributed exponentiallywith expected value (cid:104) Y il (cid:105) = e − F i + al . Therefore, F i is the intercept of such model and its value isequivalent to taking the limit l → or, analogously, ∆ → . The maximum likelihood approachalso allows estimating the uncertainty of ˆ F i , which is given by ε i = (cid:113) k i +2(ˆ k i − k i . Tested on varioussystems, PA k has been proven an unbiased punctual free energy estimator, outperformingother non-parametric and unsupervised methods, especially in high dimensionality. Thepoint-adaptive selection of the neighbourhood applied by PA k surely plays a role in achievingthis task. However, we believe the extrapolation feature of the model, described above, iscrucial for obtaining a punctual estimate of the free energy. Looking at eq 1 we see that if there exists a map σ (cid:55)→ ˆ s ( σ ) associating to each point σ ( x ) aunique value s ( x ) = ˆ s ( σ ( x )) then B ( s ( x )) can be formally expressed as an explicit function6 d potential surface A −1 −0.5 0 0.5 1 1.5 2 2.5 3−2−101234 00.10.20.30.40.50.6 EBias potential -

𝐵(𝑥)

Figure 1: In the ﬁrst column: (A) two-dimensional double-well potential surface used insimulations; (E) bias potential used along x coordinate in the biased run for both analyticpotentials. In the second and third column comparison of PA k and bPA k estimates againstthe analytic free energy showing correlation plots and pull distribution; (B),(C) respectivelyunbiased and biased case for 2d potential; (F),(G) respectively unbiased and biased case for6d potential. (D),(H) Statistical test comparing bPA k to PA k on the points of the biasedsamples for 2d and 6d potentials respectively. All free energies and the bias potentials are inunits of k B T .of σ ( x ) and the exponential factor e βB (ˆ s ( σ )) can be brought out of the integral. Hence eq 1for ˜ σ = σ ( x i ) takes the form F ( σ ( x i )) = F B ( σ ( x i )) − B ( s ( x i )) . (5)For future reference, we call the existence of such map the map-existence condition (MEC).A consequence of the MEC is that if for two conﬁgurations x , x we have σ ( x ) = σ ( x ) then one cannot have s ( x ) (cid:54) = s ( x ) . Again, we stress the MEC is required only for theconﬁgurations x in the thermal ensemble. In fact, in molecular systems the interactions amongatoms strongly reduce the independent directions in which the system can move. For thisreason the ensemble density ρ ( x ) is almost vanishing on a big portion of R N . The simplestcase in which the MEC is veriﬁed is for s ( x ) = s ( σ ( x )) , namely when s is an explicit functions7f the coordinates σ . In this case ˆ s ≡ s . However, eq 5 can also be valid if ρ B is estimated onthe σ but s is an explicit function of diﬀerent coordinates σ (cid:48) , as long as these can be expressedas function of σ . This is true if all relevant σ (cid:48) can be parametrised by an explicit function σ (cid:48) = ϕ ( σ ) . In this case s ( x ) ≡ s ( σ (cid:48) ( x )) ≡ s ( ϕ ( σ ( x ))) and ˆ s ≡ s ◦ ϕ . In the cases where eq 5 holds, if one is able to estimate F B ( σ ( x i )) via an unbiased punctualestimator ˆ F Bi , the unbiased free energy at point i can be estimated as: ˆ F i := ˆ F Bi − B i (6)where B i := B ( s ( x i )) is simply the numerical value of the bias applied when generatingdatapoint x i . Eq 6 applies in principle to any punctual estimator of the biased free energy. Bychoosing a suitable ˆ F Bi , the meaning of ˆ F i becomes operatively clear. Due to its properties,we propose to estimate ˆ F Bi with PA k , because it is the only non-parametric free energyestimator method to our knowledge that is punctual without taking the limit ∆ → . Withthis speciﬁcation, eq 6 deﬁnes a simple punctual reweighting scheme to estimate point bypoint the unbiased free energy of a set of data generated in a biased simulation. From now onwe shall for brevity refer to this procedure as bPA k .We will show that the procedure deﬁned in eq 6 gives consistent result even when theMEC is slightly violated, i.e. when s is not an explicit function of the σ , but there exists aparametrisation ϕ that, given some σ , is able to capture most relevant features in the spaceof the σ (cid:48) s. We now discuss how we validate the robustness of the punctual reweighting procedure thatwe called bPA k . We compare its performance to that of PA k on unbiased samples, since PA k

8s already established as a good free energy estimator on multidimensional data. Thus, wechoose our test systems such that we are able to generate both an unbiased and a biasedequilibrium sample. In order to assess the statistical compatibility of the results obtained withPA k and bPA k , we compute the correlation plot between estimated and ground truth freeenergies and the distribution of the quantity known as pull . The value of the pull betweentwo observables F a and F b at point i is given by: χ i := ( F ai − F bi ) (cid:113) ε F ai + ε F bi (7)where ε F indicates the uncertainty on the quantity F . If F a and F b are compatible, thedistribution of the pull over a full statistical sample { x i } i is expected to be a Gaussian withzero average and unitary variance: χ i ∼ N (0 , . Using these tools we ﬁrst of all compare theestimates of free energy in the case of PA k and bPA k directly to the true known value in thecase of multidimensional analytic potentials that we sample numerically. Secondly, still in theanalytic case, we compare directly the two estimators. Finally, we consider as realistic caseMD simulations of a -peptide; in this case there is no known ground truth free energy forthe system, therefore the only sensible test is to directly compare the unbiased and the biasedestimators. k on data sampled from multidimensional an-alytic potential surfaces As a ﬁrst step, we test bPA k on systems for which the ground truth potential is knownanalytically. We consider two functional forms: a double well potential in 2d, shown in Figure1A, and a 6d potential which is exactly as the previous one in the ﬁrst 2 dimensions plus 4other convex (harmonic) directions. For both potential we sample 10.000 points from a biased9nd an unbiased simulation. In order to bias the dynamics we apply as bias potential theinverse of the analytic free energy along the x axis, cutoﬀed such that it is non-zero only on aﬁnite interval (Figure 1E).We apply PA k to the unbiased sample and bPA k to the biased sample, obtaining twoestimates of the free energies, which in these two cases should coincide with the analyticpotential energy. We test the two estimators directly against the analytically known groundtruth with the statistical test described in section 2.4Looking ﬁrst at the 2-dimensional case, Figures 1B and 1F, the normal distribution withunitary variance and zero mean seems in both cases well approximated; the correlation isgood, with the diﬀerence that the biased simulation explores regions at higher free energy, asexpected. Also in the 6-dimensional case, Figures 1C and 1G, the agreement with the groundtruth potentials is good. While for PA k (unbiased case) this had previously been shown, alsofor bPA k we can conclude that the method performs excellently, at least in these two simplecases, and gives a statistically unbiased estimate of the correct free energies.We now consider the case in which the ground truth unbiased free energy is not knownexplicitly, but one is able to generate a sample of data from the unbiased distribution. In thiscase, the unbiased and biased simulations sample diﬀerent sets of points. Thus, since PA k only gives a punctual estimate of the free energy, we need to deﬁne an interpolation schemein order to directly compare PA k and bPA k results. For every point in the biased sample,we apply the PA k procedure described in section 2.1 adding only that point to the unbiasedsample. To take into account the fact that this point does not actually belong to the samplewe simply substitute l → l + 1 and set ν i, = 0 in eq 4.Figure 1 (D),(H) shows the results of this test for the 2d and the 6d potentials; all samplesgenerated have 10.000 points. Excellent compatibility between PA k and bPA k for bothpotentials is displayed. The example illustrates that the compatibility between a free energyestimated with and without the bias can be demonstrated also if the free energy is not knownanalytically. This will be used to analyse the simulation of the peptide.10 .1.1 Comparison with standard reweighting on a ﬁnite neighbourhood Figure 2: Comparison of diﬀerent biased free energy estimators and reweighting protocols ondatasets sampled from two distributions: p and p . (E) Heatmap representation of unbiased p . (F) Bias applied to p , which is exactly ten times the one applied to p . In both cased d = D = 2 , so what we call unbiased free energy corresponds identically to the potentialenergy entering the Boltzmann factor if p and p are interpreted as canonical p.d.f.’s. (A-D). All estimates are performed on data points. The values represented on the verticalaxes are averages over batches of points. In solid transparent bars the sample standarddeviation of the batch. (A),(B) refer to p ; (C),(D) refer to p . The chosen values of and for the k -NN estimators are chosen as the average optimal value of k predicted by PA k for the datasets. (A),(C) bPA k compared to k -NN with standard reweighting. (B),(D) bPA k compared to k -NN, both with punctual reweightingWe compare the performance of bPA k with that of other kernel methods in the estimateof the free energy from biased data. Firstly, we compare the results of bPA k to those of k -NNreweighted in the standard way illustrated in eq 3, i.e. by subtracting to the estimated biasedfree energy the quantity log (cid:104) e βB ( s j ) (cid:105) j . Secondly, we apply the punctual reweighting also to k -NN, for which this ansatz is not justiﬁed by a demonstration of punctuality of the densityestimator, but becomes correct only in the limit ∆ → .We use two datasets in dimensions: one sampled from the probability density function p represented in Figure 2E; the other one sampled from p re-normalised to , which for11urther reference we simply indicate by p . The two systems display metastability betweenthe two main basins. By construction, the potential barrier in p is exactly times as highas the one in p . The simulations are biased along the x coordinate, with a bias obtained fromthe histogram along x of the unbiased sampling (see Figure 2F).For both distributions, bPA k drastically outperforms the k -NN method reweighted in astandard manner (Figures 2A,2C). With punctual reweighting, the quality of k -NN estimatesof the unbiased free energy improve. However, bPA k still shows a better performance alongthe whole range of free energy values, especially for high values (Figures 2C,2D). Furthermore,bPA k yields much better relults also looking at the pull distributions between the estimatedand the ground truth free energies in all reweighting schemes, a sign that bPA k estimator ispreferable to standard k -NN also in terms of error estimate. Figure 3: (A) Free energy of the CLN025 peptide computed as histogram of the collectivevariable s . (B),(C) Statistical test comparing bPA k to PA k on the points of the biased samplein the case of the ψ -dihedral angles and of the C α distances respectively. In (C) the errorbars have been omitted from the correlation plot for a better readability; green dots representall the points in the biased dataset; the red dots neglect all points with ˆ k i ≤ .In order to test our method on a realistic system we choose a β -hairpin called CLN025 . This molecule is a small protein of 10 residues and 166 atoms and is one of the smallest12eptides that display a stable secondary structure, in this case a β -sheet. Thanks to therelatively small size of the molecule we are able to produce both a long unbiased MD trajectoryand a biased one. In this way we are able to estimate the ground truth free energy of thesystem computing PA k on the unbiased trajectory and to compare it with bPA k results.We simulate the protein in Gromacs in explicit solvent. Since we are not interested in theprecise phisical chemistry of the system, we use quite a small box with 931 water molecules.To enhance the sampling of conﬁguration space, we run a Replica Exchange MD simulationwith 16 replicas using equispaced temperatures from 340K to 470K as done previously inref. In order to bias the trajectory, then, we choose as collective variable the ψ -dihedraldistance from an equilibrium conﬁguration, deﬁned as: s = (cid:88) n =1 − cos ( ψ n − ψ refn )2 (8)where ψ n denotes the n -th backbone ψ -dihedral angle of the peptide in the present conﬁgurationand ψ refn is the value of the same dihedral angle in a chosen reference equilibrium conﬁguration(in our case we chose the crystal structure ); this CV takes values from s = 0 , in the referenceconﬁguration, to s = 9 . We evaluate s along the trajectory and compute the free energy F ( s ) from a histogram (Figure 3A). We ﬁt the lowest part ( ∼ kJ mol − ) of the free energy witha sum of Gaussians ˜ F ( s ) + c and use the negative of such sum as bias potential B ( s ) = − ˜ F ( s ) .Using PLUMED we run a umbrella sampling biased REMD simulation in the same settingof the unbiased one.Once the two trajectories are produced, we analyse them in the -dimensional ψ -backbone-dihedra space. This choice implies of course a drastic dimensional reduction on the over- -dimensional original atomic conﬁguration space; still, even after this huge projectionthe system will show complex features and a reasonably high dimensionality, so that we areentitled to consider it a realistic case. Thus D = 9 is the embedding space dimension. The13istance between two conﬁgurations X a and X b in this space is: θ ( X a , X b ) = (cid:115)(cid:88) n (( ψ an − ψ bn )) (9)where (( • )) stands for π -periodicity within the brackets.We extract 9.500 points from the unbiased trajectory to use them as reference samplesample for PA k Interpolator and 3.700 points from the biased one to be used as test sample(i.e. input of bPA k and virtual points for PA k Interpolator). The intrinsic dimension ofthe dataset is d ∼ . The comparison between bPA k estimate and our ground truth freeenergy (output of PA k Interpolator), in Figure 3B, shows excellent agreement, despite thehigh dimensionality. k under change of metric We ﬁnally test the robustness of bPAk using a diﬀerent coordinate system, in which s isnot anymore an explicit function of the σ . We choose as coordinates the distances amongalpha carbons ( C α ) distances. Since we have 10 residues there are alpha carbons, thus × / pairwise distances between them; therefore the embedding dimension of thechosen space is now D = 45 . The metric on this space is the RMSD: d RMS ( X a , X b ) = (cid:115) N ( N − (cid:88) i>j ( d aij − d bij ) (10)where X a and X b are two conﬁgurations in the -dimensional space and d qij is the distancebetween the i -th and the j -th C α in conﬁguration X q .This time we take 37.000 reference points from the biased sample and 38.000 from theunbiased one. The estimated intrinsic dimension is d ∼ . We carry out our protocol andshow the result of the statistical test in Figure 3C, green dots. Looking at the correlationplot we see a divergence from linearity especially at higher values of the free energy. Weknow that such values are associated to low densities in conﬁguration space, corresponding14ither to low ˆ k i or to high values for the distance of the ˆ k i -th neighbour. We notice (Figure3C, red dots) that neglecting all points with ˆ k i ≤ , corresponding to ∼ of the data,the correlation plot improves sensibly. Our explaination is that C α -distance and ψ -dihedrametrics are equivalent in the sense deﬁned in section 2.2 for low free energy conﬁgurations,while the MEC is partly violated at high F. Figure 4: Comparison between the two main clusters of the unbiased and of the biasedtrajectories. (A),(D) Average dihedral angle for each of the backbone dihedra for each cluster.(B),(C),(E),(F) backbone visualisation of the conﬁgurations closest in dihedral distance tothe cluster average.As a last test to assess whether bPA k captures the correct features of the free energylandscape we compare a cluster analysis in PA k and bPA k . As a clustering method we useDensity Peak, . In both cases we found eight clusters, but the two main clusters containtogether more than of all the conﬁgurations. These clusters contain the most structuredconﬁgurations, closest to the native state; they correspond in fact to the leftmost basin in thefree energy of Figure 3A. Looking at Figure 4 we see both quantitatively and qualitativelythat cluster u almost perfectly matches cluster b and the same happens with u and b .Both couples of clusters present a structured α -helix turn in the central region of the peptide(dihedra - ); in the case of u and b the tails (dihedra - and - ) lie in the β -domain, hencethe β -hairpin is fully folded; in clusters u and b the tails are unstructured; as expected, the15rst couple is also energetically slightly favoured. As for our initial purpose, we can concludethat bPA k preserves also the cluster structure of the free energy. We have presented bPA k , a procedure to estimate the free energy in high-dimensional spacesstarting from a sample of points generated in a biased simulation. The approach is based ona recently-introduced free-energy estimator called PA k .PA k is an optimised point-adaptive k -NN algorithm, non-parametric and unsupervised. Ittakes as input a set of data expressed in terms of D coordinates σ and produces punctualestimates of the free energy in these points. These coordinates can even coincide with thecoordinates of the full conﬁguration space. Importantly, we assume that these points lie on amanifold, embedded in the space of the σ , of intrinsic dimensionality d ≤ D . The densitiesin each point are computed in this low-dimensional manifold. However, the metric on themanifold is locally approximated as Euclidean, so volumes are measured in R d without everexplicitly parametrising the embedded manifold. Hence, there is no explicit dimensionalreduction in the space of the σ , although this is done implicitly working in d dimensions.At this level of description, the competitive advantage of PA k is twofold: on one hand itoptimally selects for every point in the dataset the size of the neighbourhood considered inthe free energy estimate; on the other hand, the likelihood maximisation peculiar of PA k extrapolates the value of the free energy in the limit of neighbourhood size going to zero,which makes the estimate more punctual than in other kernel-based methods.The bPA k protocol consists of computing the biased free energy at all points in thedataset using PA k and then reweighting this quantity point by point simply subtracting thenumerical value of the applied bias to retrieve an unbiased estimate. The simple additiveform of this reweighting procedure is a nontrivial result. First of all, it crucially relies onthe punctuality of PA k . Secondly, since it involves integration over degrees of freedom which16re not necessarily explicitly transversal to those involved in the computation of the biaspotential, some reasonable but necessary requirements must be satisﬁed. We have describedthe condition (MEC) under which it is possible to reweight in an Umbrella-Sampling fashionthe biased free energy over some coordinates σ ( x ) when the applied bias potential is a functionof some possibly diﬀerent CVs s ( x ) . In short, this is possible if all the information necessaryto deﬁne the biasing CVs s is encoded in the coordinates σ over which the free energy iscomputed. In other words, if the intrinsic manifold the { s i } i lie on can be mapped to asubmanifold of the manifold the { σ i } i lie on. A case in which the MEC is violated on a smallsubset of conﬁgurations has been presented in the case of the CLN025 peptide. The derivationof the MEC and the consequent punctual form of the reweighting is valid in general for allpunctual estimators, i.e. those for which the kernel introduced in eq 2 approaches a deltafunction. In order to apply the same reasoning to other ﬁnite-size-kernel methods, such as k -NN, multidimensional histograms, gaussian KDE or other than involve the partitioning ofthe space of the CVs, one should require that the value of the bias potential does not varymuch over the neighbourhoods that such density estimates consider for each point.We have tested bPA k comparing its results to various unbiased ground truth free energies,some known analytically, some estimated with PA k from an unbiased simulation. In all testedcases the pull distribution proved bPA k to be an unbiased estimator of the ground truth values.In the case of two analytically-known distributions, we have also compared the performanceof bPA k to that of other estimators, both with standard and with punctual reweighting.While bPA k is conﬁrmed to be our best choice, punctual reweighting visibly improved theestimates also in the case of ﬁnite-size-kernel estimators. The opportunity of adoptingpunctual reweighting even with non-punctual estimators could be further investigated, butthis is beyond the scope of this work. 17 cknowledgements We thank Alex Rodriguez and Aldo Glielmo for several important suggestions and fruitfuldiscussions.

References (1) Alex Rodriguez, Maria D’Errico, Elena Facco, and Alessandro Laio. Computing the FreeEnergy without Collective Variables.

Journal of Chemical Theory and Computation ,14(3):1206–1215, 2018.(2) Giacomo Fiorin, Michael L Klein, and Jérôme Hénin. Using collective variables to drivemolecular dynamics simulations.

Molecular Physics , 111(22-23):3345–3362, 2013.(3) Rafael C Bernardi, Marcelo CR Melo, and Klaus Schulten. Enhanced sampling techniquesin molecular dynamics simulations of biological systems.

Biochimica et Biophysica Acta(BBA)-General Subjects , 1850(5):872–877, 2015.(4) J.P Torrie, G. M Valleau. Nonphysical Sampling Distributions in Monte Carlo Free-Energy Estimation: Umbrella Sampling.

J. Comput. Phys. , 23:187, 1977.(5) Murray Rosenblatt. Remarks on Some Nonparametric Estimates of a Density Function.

The Annals of Mathematical Statistics , pages 832–837, 1956.(6) E Parzen. On the Estimation of Probability Density Functions and Mode.

Ann. Math.Statist , 33:1065–1076, 1962.(7) Bernard W Silverman.

Density estimation for statistics and data analysis , volume 26.CRC press, 1986.(8) N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression.

The American Statistician , 46(3):175–185, 1992.189) Shankar Kumar, John M. Rosenberg, Djamal Bouzida, Robert H. Swendsen, andPeter A. Kollman. The weighted histogram analysis method for free-energy calculationson biomolecules. i. the method.

Journal of Computational Chemistry , 13(8):1011–1021,1992.(10) Christian Bartels and Martin Karplus. Multidimensional adaptive umbrella sampling:Applications to main chain and side chain peptide conformations.

Journal of Computa-tional Chemistry , 18(12):1450–1462, 1997.(11) Michael R. Shirts and John D. Chodera. Statistically optimal analysis of samples frommultiple equilibrium states.

The Journal of Chemical Physics , 129(12):124105, 2008.(12) Zhiqiang Tan, Emilio Gallicchio, Mauro Lapelosa, and Ronald M. Levy. Theory ofbinless multi-state free energy estimation with applications to protein-ligand binding.

The Journal of Chemical Physics , 136(14):144102, 2012.(13) Bin W Zhang, Shima Arasteh, and Ronald M Levy. the uwham and swham softwarepackage.

Scientiﬁc reports , 9(1):1–9, 2019.(14) Byeong U. Park and J. S. Marron. Comparison of data-driven bandwidth selectors.

Journal of the American Statistical Association , 85(409):66–72, 1990.(15) E Fix and JL Hodges. Discriminatory Analysis. Nonparametric Discrimination: Consis-tency Properties.

USAF School of Aviation Medicine, Randolph Field, Texas , Report4(Project Number 21-49-004), 1951.(16) C. P. Loftsgaarden, D. O.; Quesenberry. A Nonparametric Estimate of a MultivariateDensity Function.

Annals of Mathematical Statistics , 36(3):1049–1051, 1965.(17) Y. P. Mack and M. Rosenblatt. Multivariate k-nearest neighbor density estimates.

Journal of Multivariate Analysis , 9(1):1–15, 1979.1918) Mark N. Kobrak. Systematic and statistical error in histogram-based free energycalculations.

Journal of Computational Chemistry , 24(12):1437–1446, 2003.(19) Michael R. Shirts and Vijay S. Pande. Comparison of eﬃciency and bias of free energiescomputed by exponential averaging, the bennett acceptance ratio, and thermodynamicintegration.

The Journal of Chemical Physics , 122(14):144107, 2005.(20) Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearestneighbor” meaningful? In

International conference on database theory , pages 217–235.Springer, 1999.(21) Giulia Sormani, Alex Rodriguez, and Alessandro Laio. Explicit Characterization of theFree-Energy Landscape of a Protein in the Space of All Its C α Carbons.

Journal ofChemical Theory and Computation , 16(1):80–87, 2020.(22) Matteo Carli, Giulia Sormani, Alex Rodriguez, and Alessandro Laio. Candidate BindingSites for Allosteric Inhibition of the SARS-CoV-2 Main Protease from the Analysisof Large-Scale Molecular Dynamics Simulations.

The Journal of Physical ChemistryLetters , pages 65–72, 2020.(23) Elena Facco, Maria D’Errico, Alex Rodriguez, and Alessandro Laio. Estimating theintrinsic dimension of datasets by a minimal neighborhood information.

ScientiﬁcReports , 7(1):1–11, 2017.(24) J.A. Nelder. Generalized Linear Models Author.

J. R. Statist. Soc. A , 3(135):370–384,1972.(25) Luc Demortier and Louis Lyons. Everything you always wanted to know about pulls.

CDF note , 43, 2002.(26) Shinya Honda, Toshihiko Akiba, Yusuke S Kato, Yoshito Sawada, Masakazu Sekijima,Miyuki Ishimura, Ayako Ooishi, Hideki Watanabe, Takayuki Odahara, and Kazuaki20arata. Crystal structure of a ten-amino acid protein.

Journal of the American ChemicalSociety , 130(46):15327–15331, 2008.(27) Yuji Sugita and Yuko Okamoto. Replica-exchange molecular dynamics method forprotein folding.

Chemical physics letters , 314(1-2):141–151, 1999.(28) Alex Rodriguez, Pol Mokoema, Francesc Corcho, Khrisna Bisetty, and Juan J. Perez.Computational study of the free energy landscape of the miniprotein CLN025 in explicitand implicit solvent.

Journal of Physical Chemistry B , 115(6):1440–1449, 2011.(29) Massimiliano Bonomi, Davide Branduardi, Giovanni Bussi, Carlo Camilloni, DavideProvasi, Paolo Raiteri, Davide Donadio, Fabrizio Marinelli, Fabio Pietrucci, Ricardo ABroglia, et al. Plumed: A portable plugin for free-energy calculations with moleculardynamics.

Computer Physics Communications , 180(10):1961–1972, 2009.(30) Alex Rodriguez and Alessandro Laio. Clustering by fast search and ﬁnd of density peaks.

Science , 344(6191):1492–1496, 2014.(31) Maria d’Errico, Elena Facco, Alessandro Laio, and Alex Rodriguez. Automatic topogra-phy of high-dimensional data sets by non-parametric Density Peak clustering. arXivpreprintarXivpreprint