[PDF] Building intuition for binding free energy calculations: bound state definition, restraints, and symmetry

Abstract

The theory behind computation of absolute binding free energies using explicit-solvent molecular simulations is well-established, yet somewhat complex, with counter-intuitive aspects. This leads to frequent frustration, common misconceptions, and sometimes, erroneous numerical treatment. To improve this, we present the main practically relevant segments of the theory with constant reference to physical intuition. We pinpoint the role of the implicit or explicit definition of the bound state (or the binding site), to make a robust link between an experimental measurement and a computational result. We clarify the role of symmetry, and discuss cases where symmetry number corrections have been misinterpreted. In particular, we argue that symmetry corrections as classically presented are a source of confusion, and could be advantageously replaced by restraint free energy contributions. We establish that contrary to a common intuition, partial or missing sampling of some modes of symmetric bound states does not affect the calculated decoupling free energies. Finally, we review these questions and pitfalls in the context of a few common practical situations: binding to a symmetric receptor (equivalent binding sites), binding of a symmetric ligand (equivalent poses), and formation of a symmetric complex, in the case of homodimerization.

Full PDF

BBuilding an intuition for binding free energy calculations: bound statedeﬁnition, restraints, and symmetry.

E. Duboué-Dijon

1, 2, a) and J. Hénin

1, 2, b) CNRS, Université de Paris, UPR 9080, Laboratoire de Biochimie Théorique, 13 rue Pierre et Marie Curie, 75005, Paris,France Institut de Biologie Physico-Chimique – Fondation Edmond de Rothschild, PSL Research University, Paris,France (Dated: 12 February 2021)

The theory behind computation of absolute binding free energies using explicit-solvent molecular simulations is well-established, yet somewhat complex, with counter-intuitive aspects. This leads to frequent frustration, common mis-conceptions, and sometimes, erroneous numerical treatment. To improve this, we present the main practically relevantsegments of the theory with constant reference to physical intuition. We pinpoint the role of the implicit or explicitdeﬁnition of the bound state (or the binding site), to make a robust link between an experimental measurement and acomputational result. We clarify the role of symmetry, and discuss cases where symmetry number corrections havebeen misinterpreted. In particular, we argue that symmetry corrections as classically presented are a source of con-fusion, and could be advantageously replaced by restraint free energy contributions. We establish that contrary to acommon intuition, partial or missing sampling of some modes of symmetric bound states does not affect the calculateddecoupling free energies. Finally, we review these questions and pitfalls in the context of a few common practical sit-uations: binding to a symmetric receptor (equivalent binding sites), binding of a symmetric ligand (equivalent poses),and formation of a symmetric complex, in the case of homodimerization.

I. INTRODUCTION

Binding free energy calculations aim to quantify the bind-ing between two interacting chemical species. Here we fo-cus on explicit, formally exact methods, although many high-throughput approximate methods exist.

In a biological con-text, absolute binding free energy calculations are increas-ingly used in a number of applications, for instance theprediction of ligand-enzyme afﬁnities in the search of ef-fective inhibitors, or the study of the stability of protein-protein or protein-nucleic acid complexes.

Such calcu-lations are typically performed to predict the afﬁnity of aligand to a biomolecule—for instance in the drug discov-ery community to replace long or expensive experiments.They can also be compared with available experimental data,for instance to test the quality of a simulation that can then beused, thanks to the atomic level details of the simulations, togain additional insight into the binding process or decomposeit into different components. Hence, it is crucial to make surethat the calculated binding free energies or binding constantsare comparable to experimentally determined values, whichrequires special care in the theoretical deﬁnitions as well as inthe computational protocol.The computation of absolute binding free energies has beenthe central topic of a number of works beginning in the 1980s,with seminal theoretical contributions that laid the statis-tical physical foundations of the methods, as well as the prac-tical applications to biological systems.

Thanks to theincrease in computational power, the democratization of sim-ulation tools and availability of pedagogical tutorials, e.g. a) Electronic mail: [email protected] b) Electronic mail: [email protected] as listed in 25, binding free energy calculations are no longerreserved to a small community of experts, but are applied in awide variety of contexts, outside the groups that specialize indeveloping the techniques and software.However, a few key issues are not as consensual as wouldbe expected in a mature ﬁeld, and the related literature is stillscattered, partly contradictory or confusing, and some notionsare hardly accessible to non-specialists. This is in particu-lar the case of the use of symmetry corrections and the ef-fects of partial sampling, which are, as we will see, intimatelylinked to the use of binding restraints and to the question ofthe binding site deﬁnition. If is difﬁcult for a non-specialistto acquire a good physical intuition of these complex matters,which, in our experience, can lead to uncritical application ofill-understood formulae, sometimes erroneously, and is ham-pering the wider adoption of absolute binding free energy cal-culations.Our goal here is to present a clear version of the theory, withconstant reference to physical intuition, enabling practitionersto understand in depth each step of a calculation and prop-erly handle possibly tricky steps such as standard state, re-straint, and symmetry corrections. Only a rigorous treatmentof these points makes it possible to obtain well-deﬁned stan-dard free energies and report meaningful comparisons withexperimental data. We present typical cases to illustrate com-mon problems encountered in practice, and discuss how toproperly treat them, hoping that these will serve as more gen-eral guidelines applicable to a broad range of cases.In this work, we limit our discussion and practical exam-ples to classical all-atom simulations with explicit solvent.For implicit solvent approaches, one may refer for instanceto Ref. 26,27. Two families of methods can be used to quan-tify binding from simulations: spatial and energy-based. Inthe ﬁrst case, simulations are used to compute the statisti-cal distribution of spatial conﬁgurations of the receptor and a r X i v : . [ phy s i c s . c h e m - ph ] F e b ligand. This may be done in unbiased simulations, or, morecommonly, using enhanced sampling techniques and estimat-ing a potential of mean force (PMF)—or more generally a freeenergy surface. This case is treated in detail in Section II C.In the second case, the binding free energy is expressed asfunction of interaction energies between the receptor, ligand,and solvent. Statistics on these interaction energies are col-lected in “alchemical” simulations where the potential energyis modiﬁed so that the simulations samples often unphysical,but mathematically well-deﬁned states. The most commonform of this approach is the double decoupling method. Thisis described in Section II D.In a ﬁrst part, we present a pedagogical review of the twomain classes of binding free energy calculation methods, fo-cusing in more details on the double decoupling method thatwe will use in our examples. A second theoretical part is de-voted to how to properly account for symmetry (both of lig-and or binding sites) and partial sampling. Finally we analyze3 typical test cases that are chosen to cover the range of sit-uations where symmetry of the ligand, receptor, or complexneeds to be accounted for. II. BINDING SITE DEFINITION AND FREE ENERGYESTIMATORS: A PEDAGOGICAL REVIEWA. A few deﬁnitions and notations

We consider a binding equilibrium involving three chemi-cal species: a “receptor” R , a “ligand” L , and a complex RL .The receptor could be a macromolecule with a binding pocketthat completely surrounds a much smaller ligand, or the struc-tures and roles of both species could be similar: the receptorand ligand can even be the very same species in the case of ho-modimerization (section IV C, where we show, however, thatthe macroscopic binding constant is different from that forheteroassociation). For simplicity, we use the conventionalnames receptor and ligand to cover all cases, but the theorydoes not depend on each of these having any speciﬁc property.Note, however, that differences between them (of size, ﬂexi-bility, etc.) do have a practical impact on the convergence ofnumerical quantities from simulations. We will refer looselyto those species as molecules, even though they might be non-molecular species such as monoatomic ions or noble gases.The binding equilibrium writes R + L (cid:29) RL . Under con-stant temperature and pressure conditions, the Gibbs free en-ergy is stationary at equilibrium: ∆ G bind = µ RL − ( µ R + µ L ) = ideal solution , which isrelevant to many chemical and biological applications. Fora more general treatment including the non-ideal case, seeRef. 28. If the chemical potentials of the solutes exhibit theideal concentration dependence, the equilibrium condition 1 writes: ∆ G ◦ bind = − RT ln (cid:18) [ RL ] C ◦ [ R ][ L ] (cid:19) (2) K ◦ bind = e − ∆ G ◦ bind / RT = [ RL ] C ◦ [ R ][ L ] (3)where C ◦ is the standard concentration, commonly taken tobe 1 mol/L. Equation 3 is the law of mass action in termsof volume concentrations. The goal of afﬁnity calculationsis to estimate ∆ G ◦ bind , or equivalently, K ◦ bind . Note that therelated quantity K bind is often deﬁned without standard statenormalization, which then appears in the alternate bind-ing free energy relationship ∆ G ◦ bind = − k B T ln ( K bind C ◦ ) . B. Estimation from direct simulation

In principle, a molecular dynamics (MD) trajectory of a so-lution containing R and L molecules, run long enough to showmany binding and unbinding events, provides enough infor-mation to compute the afﬁnity. The probability of associa-tion in the microscopic simulation system can be related tothe macroscopic binding equilibrium. In practice, the timescales required to sample the binding equilibrium are oftenout of reach, especially in the case of strong binding. For thatreason, direct simulation is rarely used in practice for quantita-tive afﬁnity estimation. Furthermore, the theoretical treatmentnecessary to rigorously connect the microscopic and macro-scopic statistics is non-trivial, and becomes more complexstill in the presence of long-range interactions between thesolutes. In the end, this apparently intuitive approach raisesconsiderations that can be counter-intuitive. Because of thelimited adoption of this method, we will not discuss it in fur-ther detail here, but we refer the interested reader to the exist-ing literature.

Closely related to direct simulation, however, is anenhanced-sampling variant that relies on computing the freeenergy along binding coordinates. C. ∆ G ◦ bind from the Potential of Mean Force In practice, most binding simulations rely on a unitary sys-tem of volume V containing 1 R and 1 L . Calling x the prob-ability of the bound state (or bound fraction) in that system,and applying the law of mass action (Eq. 3) leads to: K ◦ bind = x ( − x ) VV ◦ , (4)where V ◦ = / C ◦ is the standard volume. However, it hasbeen shown that this expression is not always applicable toa microscopic system such as a simulation box, which exhibitsa ﬁnite-size bias. Correcting for this bias, and assuming thatthe solutes do not interact outside of the bound state, K ◦ bind may instead be expressed as: K ◦ bind = x ( − x ) VV ◦ (5)In the dilute limit that is relevant to the deﬁnition of thestandard state, the complex is mostly dissociated, so that thebound probability x (cid:28)

1. Eq. 5 then simpliﬁes to K ◦ bind = x VV ◦ . (6)We consider a simple case where the association processis described by a one dimensional potential of mean force(PMF) w ( r ) along a single coordinate r —typically the dis-tance between the ligand and receptor molecules. The PMF w ( r ) is related to the radial distribution function g ( r ) through w ( r ) = − RT ln [ g ( r )] . This does not assume spherical sym-metry of the site or the complex, but rather incorporates allinformation about the energetics of binding into w ( r ) . Wealso suppose that the binding site can be delineated by a givenrange [ r min , r max ] of r . In this case, the binding constant canbe obtained by integration of the radial distribution functionover the binding site ( i.e. ensemble of "bound" conﬁgura-tions), hence the well-known relationship: K ◦ bind = V ◦ (cid:90) site π r g ( r ) dr = V ◦ (cid:90) site e − β w ( r ) π r dr (7)The full derivation of this formula in a more general casecan be found in ref. 17. In the limit of non-interacting species,as in an ideal gas, K ◦ bind simply measures the volume of thebinding site with respect to the standard volume: K ◦ bind = V site V ◦ (8)Note that this is a special case of both Eq. 7 (with w ( r ) =

0) and Eq. 6 (with x = V site / V ◦ ), illustrating the consistencybetween those expressions.Although inﬁnitely dilute conditions ( x →

0) are necessaryto derive Eq. 7, in practice, the PMF between the two interact-ing species—typically obtained with enhanced sampling tech-niques such as umbrella sampling—can be estimated in a ﬁ-nite simulation box, which only needs to be large enough sothat the two species are non-interacting (in practice, weaklyenough) at the largest distance considered for the PMF calcu-lation. Application of this PMF approach to complex cases,with the use of restraints and the associated adequate correc-tions, has been amply discussed from a practical and numeri-cal perspective elsewhere.

This well-established relation immediately raises three im-portant points. First, as is now usually known, the bindingfree energy is not simply the well depth of the PMF. Second,the standard state has to be explicitly accounted for in bind-ing free energy calculations by dividing the integral by thestandard volume. Finally, K ◦ bind explicitly depends on a “site”deﬁnition, which we comment on in detail in section II E. D. ∆ G ◦ bind from alchemical double decoupling

1. Thermodynamic cycle.

The PMF method to calculate binding free energies, thougharguably the most intuitive, becomes difﬁcult to handle and converge for complex geometries. Thus, many applicationsfavor the so-called “double decoupling” alchemical approach,which is formally equivalent to the PMF calculation, providedall terms are properly estimated and a consistent deﬁnition ofthe binding site is used.

It builds upon the fact that, since the Gibbs free energy is astate function, ∆ G ◦ bind can be calculated using any thermody-namic path between the two states "RL" and "R+L". In par-ticular, the double decoupling method makes use oftwo alchemical transformations, decoupling the ligand fromits environment respectively in the bulk and in the bindingsite. For these respective transformations, free energy differ-ences ∆ G ∗ bulk and ∆ G ∗ site are estimated using numerical esti-mators such as the exponential formula, Bennett’s AcceptanceRatio (BAR) or its multi-state variant MBAR, or Thermo-dynamic Integration (TI). To improve convergence of thesequantities, the transformations go through a number of inter-mediate states with sufﬁcient overlap.

Note that in theory, ∆ G ◦ contains a P ∆ V term, but that term very nearly cancelsout when computing the difference ∆∆ G ∗ = ∆ G ∗ bulk − ∆ G ∗ site .As this term does not intervene in any of the speciﬁc issuesthat we discuss, we will omit it henceforth. ∆ G ∗ bulk and ∆ G ∗ site are excess free energies of decoupling,and ∆∆ G ∗ is the excess free energy of binding , in the sense thatin a system of non-interacting particles, their values would bezero. As a result, the ideal-gas, “cratic” contribution to thebinding free energy due to translational entropy can only beaccounted for by other terms. Therefore, the standard bindingfree energy is not simply obtained as the difference of thosetwo terms— i.e. ∆ G ◦ bind (cid:54) = ∆∆ G ∗ as was sometimes done inthe literature —because one has to take into account thestandard state reference, as proposed in 1986 by Jan Her-mans and Shankar Subramaniam, as well as correct for anyrestraints used to estimate ∆ G ∗ site , as we will discuss later (sec-tion II D 2).The complete typical thermodynamic cycle employed isrepresented in Fig. 1. If intramolecular interactions are notperturbed, ∆ G ∗ bulk corresponds to decoupling the ligand fromthe bulk into the gas phase, and is simply related to the solva-tion free energy of the ligand: ∆ G ∗ bulk = − ∆ G solv . Note thatthis is not the case if internal interactions within the ligandare also turned off. ∆ G ∗ site is the free energy of decoupling theligand in the “bound” state. In current practice, the cor-responding alchemical transformation is performed using a re-straining potential V rest (red triangle in Fig. 1) that ensures thatthe ligand stays in the active site during the whole transforma-tion. Note that if the ligand is charged, the overall charge ofthe ﬁnite simulation box changes during the alchemical trans-formations, which introduces artifactual free energy contribu-tions that must be accounted for. For a detailed discussion ofthese issues and simple expression of correction terms, onecan refer to Ref. 45 for instance.In addition to the two decoupling steps already discussed,one needs to evaluate the free energies associated with re-straining the decoupled ligand, ∆ G V ◦ → restdecoupled —which is oftendone in the same step as taking into account the standard statereference—and removing the restraints on the coupled ligandin the bound state ∆ G rest → sitecoupled . The desired standard bindingfree energy ∆ G ◦ bind is then obtained combining all the differ-ent steps: ∆ G ◦ bind = ∆ G ∗ bulk + ∆ G V ◦ → restdecoupled − ∆ G ∗ site + ∆ G rest → sitecoupled (9)

2. Use of restraints and corrections.

Restraints not only allow faster convergence of the free en-ergy calculation because they limit the amount of phase spaceto explore in the (partly) decoupled state(s), but are also, as previously noted, necessary from a theoret-ical perspective, even in the bound state, contrary to what issometimes stated in the literature. Speciﬁcally, restraintsare necessary to enforce sampling of (some approximation of)the bound state, because unbinding events, even rare, are in-compatible with the estimation of ∆ G ∗ site , which is the freeenergy to decouple the ligand from the binding site and thusrequires the ligand in its coupled state to explore only “bound”conﬁgurations. Without restraints, the ligand would eventu-ally leave the active site even in the fully coupled state. Thisis especially apparent in the case of weak binding, when ithappens spontaneously in a relatively short simulation time,but would also be the case even for strong binding in thelimit of inﬁnitely long sampling. It has been historically over-looked because of the kinetic stability of the bound state onthe timescale of relatively short simulation times. Unbindingwas a rare event that could have been treated as a numeri-cal accident without raising theoretical questions. The lackof restraints causes particularly noticeable numerical issueswhen the backwards transformation ( recoupling ) is sampled,as unbinding then becomes likely, creating a hysteresis in thefree energy. In contrast, one could in principle avoid using re-straints in the decoupled state (thus obtaining directly the freeenergy associated with the process RL rest → R + L*), but thecalculation would then suffer from severe convergence issues,as pointed out repeatedly.

To discuss the practical evaluation and magnitude of thetwo terms, ∆ G V ◦ → restdecoupled and ∆ G rest → sitecoupled , it is useful to presentthe commonly used types of restraints. While it is temptingto maintain the ligand in the binding site at all stages of thealchemical transformation with a simply restrain on the dis-tance between the receptor and ligand, the free rotation of theligand allowed by such simple restraints may make samplingof all different conformations complicated, especially in theintermediate stages of the transformation. Explicit restraintson each translational and rotational degree of freedom havethus been suggested, possibly in addition to the internalligand RMSD , or using an external ligand RMSD as solerestraint coordinate. Independently of the chosen degrees offreedom, the restraining potential can be of different forms,typically either harmonic or ﬂat-bottom harmonic, i.e. actingonly if the chosen degree of freedom is outside a given spec-iﬁed range. Harmonic restraints would typically restrict theligand position to remain very close to the reference boundstate, while ﬂat-bottom potentials allow for free movement ofthe ligand in the "bound" region. The free energy associated with adding the restraints on theligand in the decoupled state ( i.e. gas phase), initially freelyevolving in the standard volume, ∆ G V ◦ → restdecoupled , does not admita simple expression in general. With arbitrary complex re-straints, one can estimate this term in a separate calculation,for instance gradually switching off the restraints and comput-ing the associated free energy with standard techniques. For speciﬁc forms of restraints however, ∆ G V ◦ → restdecoupled can bemore straightforwardly obtained. For instance, when the sixbody rigid rotations and translations are restrained using har-monic potentials as suggested by reference 18, ∆ G V ◦ → restdecoupled ad-mits, with a few assumptions, a simple analytical expressionthat depends only on the shape of the harmonic potentials. An-other simple case is when a restraining potential is applied onthe distance between the ligand and receptor. ∆ G V ◦ → restdecoupled canthen be numerically estimated by integration of the restraintpotential U rest : ∆ G V ◦ → restdecoupled = − RT ln (cid:18) QV ◦ (cid:19) , where (10) Q = (cid:90) ∞ π r e − β U rest ( r ) dr . (11)Finally, one has to estimate the free energy associated withreleasing the restraints on the coupled ligand in the bindingsite ∆ G rest → sitecoupled . It is important here to stress that this is not the free energy to remove all restraints on the coupled systemand let the ligand freely evolve in the whole box . Instead,it corresponds to the free energy difference between the lig-and freely evolving in the binding site and the ligand evolv-ing under the restraints chosen for the alchemical transforma-tion, which are designed to keep it, more or less strongly, inthe binding site. Evaluating this term obviously presupposesprior deﬁnition of the so-called bound state, i.e. the ensembleof conﬁgurations that are considered as bound. We will comeback to this point in Section II E. Evaluation of this term hasoften been overlooked because restraints are usually chosen toweakly perturb ligand ﬂuctuations in the bound state. In thetypical case of strong ligand-protein binding, weak restraintsin practice give a negligible contribution in the bound state.Flat-bottom restraints can be speciﬁcally designed so that theunbiased region covers the bound state ensemble, and the re-straints therefore have a negligible free energy contribution. One can even use them directly as the deﬁnition of the boundstate, so that by deﬁnition ∆ G rest → sitecoupled =

0. In contrast, har-monic restraints may, depending on their strength, be associ-ated with much more signiﬁcant corrections. Note that in thisframework one can choose as strong restraining potentials asdesired for convergence of the alchemical transformation, pro-vided their impact on sampling of the bound state is properlyestimated and corrected for.

In general, if the bound state is deﬁned with an indicatorfunction I site ( x ) , where x denotes all the coordinates of thesystem ( I site = I site = ∆ G rest → sitecoupled can be estimated as: ∆ G rest → sitecoupled = − RT ln (cid:68) I site ( x ) e + β U rest ( x ) (cid:69) rest (12) bound + ++R L R L*R L* rest RL* rest RL rest RL ∆ G° bind ∆ G* bulk ∆ G=0 −∆ G* site ∆ G decoupled V° → rest ∆ G coupled rest → site FIG. 1. Thermodynamic cycle for double decoupling with ligand restraints. R: unbound receptor. L: ligand coupled with its environment.L*: decoupled ligand. RL: receptor-ligand complex. The transformation R + L ∗ rest → RL ∗ rest can be seen as a translation of the non-interactingligand, with zero contribution to the free energy. where (cid:104) . . . (cid:105) denotes the ensemble average over conﬁgurationsin presence of the restraining potential. Eq. 12 can be simplynumerically estimated on the fully coupled simulation win-dow if the binding site is well sampled in presence of the re-straints (as was for instance done by us and others, wherea simpliﬁed version of Eq. 12 was reported, which in thosecases is numerically equivalent to Eq. 12). If this is not thecase, for instance when using strong harmonic restraints, thenEq. 12 would not numerically converge and ∆ G rest → sitecoupled has tobe calculated through a separate free energy calculation. Inparticular, if the restraint acts on a single coordinate z and thebinding site can also be simply deﬁned by a criterion on thatsame coordinate, then ∆ G rest → sitecoupled can be calculated from thefree energy proﬁle along z : e − β ∆ G rest → sitecoupled = (cid:82) e − β A ( z ) I site ( z ) e + β U rest ( z ) dz (cid:82) e − β A ( z ) dz (13) E. ∆ G ◦ bind should be dependent on, but not sensitive to, thesite deﬁnition. From the PMF deﬁnition of the binding constant (Eq. 7), itis clear that K ◦ bind , and equivalently the binding free energy,formally depends on the binding site deﬁnition. The integral in Eq. 7 does not converge at large distances, so a range of in-tegration obviously needs to be deﬁned. As discussed above,in the double decoupling framework, a deﬁnition of the bind-ing site is also needed to estimate ∆ G rest → sitecoupled . This has histor-ically raised many questions and is still often a matter ofconfusion for non-specialists. The origin of the confusion isthat it is tempting to argue that since experiments do not needto deﬁne the binding site, calculations should not either. This reasoning, however, has repeatedly being demonstratedto be ﬂawed.

All theoretical derivations of a bindingfree energy need to deﬁne the conﬁgurations that are consid-ered as forming a complex, the bound state, often through abinding site indicator function I , taken as 1 for bound conﬁg-urations and 0 otherwise. Alternative expressions allow-ing unrestricted integration do not yield the desired bindingfree energy but rather relate to the second virial coefﬁcient,probing non speciﬁc interactions as well as binding. How todeﬁne the bound state in then a question that naturally arises.As discussed in several works, experimental determi-nations of binding constants implicitly deﬁne the bound stateas the conﬁgurations producing the signal used for detection.Different techniques being sensitive to different types of com-plexes, the choice of a bound state deﬁnition in calculationsshould in principle depend on the experiment one wants tocompare with. For instance, spectroscopic techniques typi-cally detect only closely associated ligand-protein complexes,while calorimetry gives a signal coming from all conforma-tions giving rise to even small heat transfer (even non site-speciﬁc association). Explicit calculation of the experimentalsignal often being out-of-reach, simulations usually assume atwo-state model with "bound" and "unbound" conﬁgurations,which is a simpliﬁed description of the possibly complex asso-ciation process, especially in case of long range, e.g. electro-static, interactions, where determining which conﬁgurationscontribute to the signal is not trivial. Note that one strength ofsimulations is to be able to decompose the total binding freeenergy into the contributions of different binding modes, oftenhard to access experimentally. In practice, for strong bindingprocesses, the outcome of the simulations is insensitive to theexact binding site deﬁnition, i.e. the integration limits in Eq. 7,so long as all the low free energy regions are included in thebound state. This is illustrated by Fig 1 in Ref. 15. A reason-able choice is to put the boundary at a maximum of free energyso that the calculated binding constant are relatively insensi-tive to the its exact location. Note that the restraining potentialitself can be used as the deﬁnition of the binding site, the in-dicator function I then being taken as I = e − β V rest , which re-moves the need to correct for the restraints in the bound state,as ∆ G rest → sitecoupled =

0. Such a deﬁnition of the binding site is oftenimplicitly taken with ﬂat-bottom restraints.

III. ACCOUNTING FOR SYMMETRYA. Symmetry number corrections in binding free energies

1. Origin of the classic symmetry correction

Perhaps the most inﬂuential expression for a binding freeenergy is that proposed by the landmark paper of Gilson etal. In addition to the decoupling free energies and pressure-volume term, their main expression for the absolute bindingfree energy includes a correction ∆ G sym = RT ln (cid:18) σ RL σ R σ L (cid:19) , (14)where σ x denotes the symmetry number of species x . Thissuggests that the binding free energy should include an ex-plicit term to account for changes in symmetry between theseparated components and the complex. Gilson et al.’s ex-pression also features separate corrections for restraint con-tributions to the calculated free energies with respect to thestandard state.The symmetry term arises from the expressions of thechemical potential of each species based on its classical par-tition function, as formulated by Chandler and Pratt (seein particular section II, paragraphs A to E in that work).Quantum-mechanically, all atoms of the same element are in-distinguishable, so two conﬁgurations that differ by e.g. swap-ping the positions of two carbon atoms are really one and thesame microstate. When deﬁning a classical partition function,all atoms are labeled as if they were distinguishable. If the po- tential energy function treats atoms of the same element iden-tically, the resulting atomic partition function must be dividedby the number of permutations of atoms of the same elementin the system to compensate for the overcounting.This is not the case in force-ﬁeld based classical simula-tions, where molecules are deﬁned by non-dissociative bonds(usually harmonic) between arbitrarily labeled atoms. As aresult, most atom permutations break the bonded structure ofthe molecule and lead to high energies, and the resulting con-ﬁguration does not contribute to the computed partition func-tion. However, if a molecule is symmetric, some permutationsof atoms, called symmetry permutations—for instance, swap-ping the two oxygen atoms of a carboxylate group—preservethe molecular structure and the energy. All symmetry-relatedconﬁgurations thus contribute equally to the partition func-tion. Therefore, deﬁning non-dissociative bonds leads to dou-ble or multiple counting only for symmetry-related conﬁgura-tions . Dividing the partition function by the molecular sym-metry number corrects for this overcounting. To be precise,the relevant molecular symmetry number is the number of per-mutations of atoms within the molecule that leave the poten-tial energy unchanged. In practice, it is the symmetry numberof the molecular graph, deﬁned by non-dissociative bonds inthe force ﬁeld.This runs counter to an intuitive understanding of the mean-ing of σ RL . Crucially, it is not the symmetry number thata chemist would ascribe to the complex; for example, onewould ascribe the monodentate acetate-cation complex (Sec-tion IV A 2) a symmetry number of 1—as opposed to 2 for freeacetate—because intuitively, the bound cation breaks the sym-metry of the carboxylate group. Inserting these into Eq. 14would lead to a spurious correction term ∆ G sym = − RT ln 2.If one considers the complete statistics of an unrestrained sim-ulation of the complex, though, the cation is just as likely tobind with either oxygen atom of acetate, therefore the mon-odentate complex retains the order-2 symmetry of acetate, ina statistical, if not instantaneous, sense.

2. Is the symmetry correction necessary?

If non-covalent binding is described only by dissociativeenergy terms, as typically done in force-ﬁeld simulations ofbiomolecules, the overcounting of bound ( RL ) and unbound( R + L ) conﬁgurations is precisely the same (the potential en-ergy function has the same symmetry in the bound and un-bound states), so that σ RL = σ R σ L . In this case, the symmetrycorrection of Eq. 14 becomes: ∆ G sym = RT ln (cid:18) σ RL σ R σ L (cid:19) = . (15)Computing this term using the chemist’s notion of molecu-lar symmetry instead yields unwarranted corrections, as hap-pened in our own past work. In this regard, we think that ourconvention (Section II D 2) leaves much less room for ambi-guity: we introduce separately a deﬁnition of the bound state,and a set of binding restraints, giving rise to the free energyterm linking the two: ∆ G rest → sitecoupled .Thus, we may sample the bound state under restraints thatmake it less symmetric than the true complex (consideringonly one orientation of a symmetric ligand, or one among aset of symmetric sites). This should obviously be accountedfor, but since this can only be imposed as external restraints ontop of the force ﬁeld description, we argue that these symme-try corrections are really corrections to the restraint free en-ergy, and, in the framework of the thermodynamic cycle Fig.1, should therefore be accounted for within ∆ G rest → sitecoupled . Thisseems to us more intuitive, and less error-prone, than relyingon symmetry considerations that are rooted in deeper strata ofthe underlying statistical-mechanical theory.Our formalism differs from Gilson et al.’s, because theirdeﬁnition of the complex includes an indicator function ofbound conﬁgurations, which is also used as a restraint in sim-ulations. Such restraint terms can behave as non-dissociativebonds. If they are integrated into the deﬁnition of the com-plex and reduce its symmetry, then they give rise to a non-zero symmetry correction. Thus, the symmetry correction ofGilson et al. plays the role of the symmetry contributions to ∆ G rest → sitecoupled in our formalism. Crucially though, for a givenbinding process, the restraints may or may not break the sym-metry of the complex: therefore the symmetry correction de-pends on arbitrary choices made in the restraint scheme (foran example, see III C).In summary, the symmetry correction does not directly ac-count for the difference in symmetry between the complex (asdeﬁned by a chemist) and its isolated components, as couldbe surmised based on equation 14. Rather, it accounts forthe difference between the symmetry of the complex and thesymmetry of the restraints used to enforce it in the simulation.This is the spirit of the formalism of Hermans and Wang ,who state that in the presence of symmetric binding orienta-tions, “application of the body restraint potential effectivelybreaks the symmetry of the molecule and artiﬁcially restrictsthe molecule to a single one of these,” and that this causes theconventional expressions to overestimate the effect of orienta-tional restraints, requiring a symmetry correction. B. Partial sampling: “if you’ve seen one, you’ve seen ’em all”

Whether or not there are restraints that restrict the symme-try of the bound complex, the ensemble of conﬁgurations thatare effectively sampled in ﬁnite time can be smaller than thetrue ensemble, and in particular, some modes of a symmet-ric distribution can be poorly sampled, or not sampled at all.In this case, it has been argued that unsampled modes lead tomissing terms in the calculation of partition functions, whichbiases the calculated free energies and requires a symmetrycorrection. We argue here that this statement is incorrect inthe case of alchemical perturbation. It is, however, true in thecase of restraint free energy perturbation, when the restraintpotential is asymmetric, as discussed in Section IV B, or if therestraint potentials are perturbed in the course of the alchemi-cal transformation.The argument would be valid only if the partition functionsfor the relevant states were estimated separately by integration over conﬁguration space based on independent simulations.Then any region missed in sampling in the complex wouldbe effectively unaccounted for. However, in alchemical free-energy calculations, this is not the case: partition functionsfor individual end states are never estimated as integrals , butrather, their ratios are computed as ensemble averages . Recallthat the free energy of alchemical perturbation from state A tostate B can be written : e − β ∆ F AB = (cid:68) e − β ∆ U AB (cid:69) A (16)Therefore, unsampled or partially sampled regions do notresult in missing terms in a sum used to estimate one particularpartition function, but missing terms in an ensemble average.Numerically this ensemble average is estimated by an aver-age over conﬁgurations generated in state A . Suppose that theconﬁgurations are drawn from two symmetric modes Γ and ∆ .Then conﬁgurations from Γ and ∆ yield the same distributionof values ∆ U AB , and the computed average does not dependon the number of samples drawn from each mode. Thus, thenumber of transitions (if any) observed between symmetricmodes during the course of the simulation does not impact theconvergence of the free energy estimate.A signiﬁcant source of confusion about this question maycome from the process of stratiﬁcation, whereby discrete in-termediate states between A and B are simulated to computethe total free energy difference. This suggests that both endpoints of the transformation are not sampled identically, and e.g. only one of the modes is kinetically accessible in thecoupled state, but the whole space is readily sampled in thedecoupled state. This has been argued to require individ-ual symmetry corrections for each “pair of states” along thetransformation. In a stratiﬁed setting, the states A and B above would correspond to intermediate states, not the ﬁnalend-states. That is, as shown above, the free energy differ-ence for each window is unbiased by unbalanced sampling ofsymmetric modes. The fact that this happens to various de-grees over different windows is immaterial. This remains truewhen combining data from all states at once to estimate thefree energy using MBAR, as the estimator only depends onthe statistical distribution of comparison energies between thestates, which is itself insensitive to which of the symmetricmodes the samples were collected from.Shortly put, the ﬁrst step in all free energy estimators isto take a set of conﬁguration samples x and map it to a setof comparison energy samples ∆ U AB ( x ) (or for TI, energyderivatives ∂ U ( λ , x ) / ∂ λ ). At this step, symmetry-relatedconﬁgurations map to the same energy, and their distributionamong symmetric modes becomes irrelevant. This is illus-trated numerically below in the case of acetate-cation binding(IV A 2), where we show that partial sampling of symmetricmodes does not affect free energy estimates. C. The case of symmetry-breaking restraints

1. Cancellation of symmetry eﬀects in restraint free energies

A corollary of the result above, perhaps even more counter-intuitive, is that any restraint that has no other effect than re-stricting sampling to one of the symmetric modes does notaffect the decoupling free energy . Suppose that the same bind-ing process is studied with two numerical approaches, eithera symmetry-compatible restraint or a symmetry-breaking re-straint. Despite the difference in restraints, the two approachesgive the same values for both decoupling free energies ∆ G ∗ bulk (of course) and ∆ G ∗ site (because of the arguments above). Onemay then wonder if the change in restraint contributions to thefree energy causes the two approaches to disagree with eachother. In fact, differences appear in both restraint free energies ∆ G V ◦ → restdecoupled and ∆ G rest → sitecoupled , and these two contributions cancelout in Equation 9, yielding the same binding free energy forboth schemes.This is not a purely theoretical consideration, as differ-ent applications make one or the other approach more con-venient in practice. For example, complexes with spatiallydistinct binding sites (IV A 1) are more naturally describedwith a restraint surrounding a single site. Conversely, com-plexes with fast-exchanging binding modes such as the differ-ent orientations of a benzene ligand might be easier to sim-ulate with a simple translational restraint encompassing allbinding modes. Therefore it is useful to have a consistentframework to describe these two cases. The thermodynamiccycle and notations of section II D 2 are well-suited for that.

2. Analytical treatment under simplifying assumptions

To make the symmetry contributions to the restraint freeenergies apparent, we describe a case where ∆ G V ◦ → restdecoupled and ∆ G rest → sitecoupled admit simple analytical expressions. In most prac-tical applications, at least one of these terms has to be com-puted numerically. Suppose that we want to estimate the bind-ing afﬁnity of a bound state that comprises either n symmetry-equivalent binding poses (symmetric ligand) or n symmetry-equivalent binding sites (symmetric receptor): these two casesare largely equivalent from a formal perspective. Crucially,we apply ﬂat-bottom restraints that do not perturb binding toan individual mode. Different deﬁnitions of both the site andthe restraints are possible. Table I summarizes the value of thedifferent free energy terms depending on these choices. Asstated above, the values of the excess free energies of decou-pling, ∆ G ∗ bulk and ∆ G ∗ site , are independent from the symmetryof the restraint scheme or that of the site deﬁnition. We call V rest the restraint volume for one mode. If we use an asym-metric site deﬁnition, i.e. limited to a single binding site orbinding pose, then we naturally employ asymmetric restraints(ﬁrst line of Table I) and the obtained binding free energy isthen: ∆ G ◦ bind, asym = ∆∆ G ∗ − RT ln ( V rest / V ◦ ) . (17) In contrast, if we deﬁne the binding site as symmetric, thendifferent restraints can be used (last two lines of Table I), ei-ther obeying or breaking the symmetry of the binding site.Both setups eventually yield the same standard free energy offorming the n -fold-symmetric complex: ∆ G ◦ bind = ∆∆ G ∗ − RT ln ( V rest / V ◦ ) − RT ln ( n ) . (18) IV. TYPICAL PRACTICAL PROBLEM CASES

Below, we describe three typical situations that cover mostof the binding cases where symmetry issues are encountered.They include two ways in which binding can break symme-try (either a symmetric receptor or a symmetric ligand, thesymmetry of which is lost in the complex), and a case wherebinding increases symmetry, when two identical molecules as-sociate to form a symmetric homodimer.

A. Case 1: symmetric receptor – Equivalent sites

1. Equivalent binding sites on a protein complex

A common case of equivalent symmetric binding sites isencountered with protein multimers, when symmetric bindingsites are found on the supramolecular complex. This is, forinstance, the case of phenol binding to the insulin hexamer, which is relevant for pharmaceutical preparation of insulin,and has been recently studied by simulations. FIG. 2. Structure of the insulin hexamer bound to six phenolmolecules, rendered with VMD. We will ﬁrst consider the ﬁrst binding free energy, that is,the binding of one ligand to the protein multimer.

Site def. Restraint ∆ G ∗ bulk − ∆ G ∗ site ∆ G V ◦ → restdecoupled ∆ G rest → sitecoupled ∆ G ◦ bind asym asym ∆∆ G ∗ − RT ln ( V rest / V ◦ ) ∆∆ G ∗ − RT ln ( V rest / V ◦ ) sym asym ∆∆ G ∗ − RT ln ( V rest / V ◦ ) − RT ln ( n ) ∆∆ G ∗ − RT ln ( V rest / V ◦ ) − RT ln ( n ) sym sym ∆∆ G ∗ − RT ln ( n × V rest / V ◦ ) ∆∆ G ∗ − RT ln ( V rest / V ◦ ) − RT ln ( n ) TABLE I.

Decomposition of the absolute binding free energy (Eq. 9), in the case of binding modes with or without order- n symmetry,with symmetric or asymmetric ﬂat-bottom restraints. Simple analytical expressions apply because of assumptions listed in the text. In amore general case, restraint free energies may be estimated numerically, and the RT ln ( n ) terms listed here are, at convergence, accounted forby the numerical estimators, and need not be explicitly added. See section IV B for a discussion of cases where these might not converge inpractice, and strategies to handle it. Typically, in such a case, one would model the complexwith a single ligand bound, and perform alchemical decou-pling using, for instance, a ﬂat-bottom or harmonic restrainton the distance between the ligand center-of-mass and thatof selected protein atoms around the binding site. This re-straint is asymmetric within the framework developed in sec-tion III C.In practice, it is usual to ﬁrst consider binding to a singlesite, and evaluate the restraint contributions with respect tothis single site deﬁnition.This procedure yields the microscopic binding constant toa single site K ◦

1, asym , which ignores the other, symmetric sites.This is the inverse of the microscopic dissociation constant ofthe biophysics literature. The overall ﬁrst binding constant toany of of the n identical sites is: K ◦ = n K ◦

1, asym (19)To predict binding of more ligands to the mono-ligandedcomplex, we must take into account the fact that this complexis less symmetric than the isolated receptor, so nothing morecan be said without further simplifying assumptions. Typi-cally, we expect multimeric proteins like insulin to exhibitcooperative binding. If, however, one assumes independentbinding in the n different sites, then the binding constant of i ligands to the receptor can be expressed as a general functionof K ◦

1, asym as: K ◦ i = n − i + i K ◦

1, asym (20)In the more realistic framework of cooperative binding, wewould have to simulate explicitly the second binding event,with sufﬁcient timescales to allow relaxation of the interac-tions that couple the two sites, and taking into account fur-ther changes in symmetry, depending on which combinationof sites is populated. This is evidently a much more involvedenterprise, and beyond our scope here.

2. Cation binding to carboxylate groups: an example of asymmetric binding site and partial sampling.

In the case discussed above of symmetric binding siteson large objects such as proteins, sampling is usuallyattempted—with restraints chosen accordingly— of only oneof the binding sites at a time. However, symmetric “receptors” can also present two symmetric binding modes that are spa-tially close, in which case one would more spontaneously at-tempt sampling both modes at once. We will discuss in detailsuch a situation, showing how it relates and differs from theprevious one, on the typical case of ion binding to a symmetriccarboxylate group—acetate for the sake of simplicity—thatwe recently encountered, and for which we needed bindingpredictions to help interpret experiments. Different modes of cation binding to carboxylate groupshave been described.

A cation can interact directly witheither with both carboxylate oxygen atoms in the bidentatebinding mode, or with only one of the two oxygens in themonodentate mode. Binding can also occur through watermolecules without any direct interactions, with the formationof solvent-shared ion pairs. For the purpose of this discussion,we now focus on the monodentate binding mode, our goalbeing to properly estimate the 1:1 monodentate binding of acation ( e.g. Mg + ) to acetate. The binding site correspond-ing to monodentate binding is symmetric, since the cation caninteract symetrically with either of the two oxygen atoms aspictured in Fig. 3. Mg O1O FIG. 3. a) Simulation snapshot of a magnesium cation interactingwith an acetate anion in a monodentate mode. b) Qualitative schemeof the two symmetric monodentate binding lobes. The ﬁgures wereprepared using the VMD visualization software. The distance d between the carbon atom of the carboxy-late group and the cation is the natural coordinate, previouslyused in the literature, to deﬁne the monodentate bind-ing mode and separate it from bidentate and solvent-sharedbinding modes. For instance, in the case of Mg + bind-ing to acetate, monodentate bound geometries correspond to2 . < d < . d < . d > d > . d < . and the magnesium ion. Thefree energy associated with each alchemical transformationis obtained using the BAR algorithm. Other simulationdetails are provided in the Methods section (see SupportingInformation). Table II summarizes the free energies asso-ciated with each step of the thermodynamic cycle (Fig. 1),including the appropriate second order correction to ∆ G ∗ bulk due to the change in overall charge during the alchemicaltransformation. ∆ G rest → sitecoupled is estimated on the fully coupledsimulation window using Eq. 12, the site being deﬁned strictlyas 2 . < d < . ∆ G rest → sitecoupled is small (in the presentcase, negligible). ∆ G V ◦ → restdecoupled is evaluated with Eqs. 10-11 bynumerical integration of the restraining potential. The ﬁnalstandard binding free energy ∆ G ◦ bind = − . − is thenobtained combining all the different terms according to Eq. 9.Since our deﬁnition of the “bound state” encompasses bothof the two symmetric binding modes (to both oxygen atoms),the obtained binding free energy ∆ G ◦ bind directly measures thebinding to both of the symmetric sites, without need for anysymmetry correction, despite the symmetry of the bindingsite.We did not comment so far on the actual conformationssampled during the alchemical simulation. In practice, in thecoupled state and in the ﬁrst windows of the alchemical trans-formation, only one of the two binding sites is visited, dueto kinetic trapping of the cation. Intuitively, it would thenbe tempting to try and correct for this partial sampling, sinceonly half of the bound conﬁgurations are visited in the fullycoupled state. However, we argue that counter-intuitively anddespite partial sampling, the decoupling free energy is not af-fected and no corrections are needed. This derives from thefact that the BAR estimator that we employ depends only thepotential energy distributions in each decoupling window. Asdiscussed in III B, so long as the partial sampling does notmodify the potential energy distribution, then the same freeenergy is obtained, irrespective of the sampling. We use thefollowing numerical experiments to illustrate our point.First, we can reanalyze the data from our initial alchemi-cal transformation (Table II) where only binding to O1 wassampled in the ﬁrst few windows, and keep in all windows only the conformations corresponding to this binding mode.We now obtain ∆ G ∗ site = . ± . − , which is notdifferent, within the error bars of our calculations, from the value initially obtained using all the samples. To further il-lustrate our point, we can use a slightly different Mg + forceﬁeld, so that both monodentate binding modes are sampledin all the windows of the alchemical transformation. We nowevaluate ∆ G ∗ site using, in all the windows, only the conforma-tions where the cation is closer to O1 [resp. O2]. The ob-tained ∆ G ∗ site values do not signiﬁcantly change—1647 . ± . . ± . − respectively—and are identicalwithin the error bars to that obtained using all the samples, ∆ G ∗ site = . ± . − .In both these examples, we showed that ∆ G ∗ site does not de-pend at all on the extent of sampling of one or both symmet-ric binding modes in all or a few windows of the alchemicaltransformation. As detailed in section III B, this stems fromthe use of a free energy estimator that depends only on thedistribution of comparison potential energies ( ∆ U AB ) in eachwindow. In our case, due to the symmetry of the two bindingmodes, the distribution is the same irrespective of the (partial)sampling, and thus the same is true for the computed bindingfree energy. B. Case 2: symmetric ligand – Phenol binding to lysozyme.

In contrast with large biomolecules, small ligands ( e.g. phe-nol, benzene, may exhibit molecular symmetry. As an exam-ple, we consider the classic case of phenol binding to the engi-neered binding sites of the lysozyme from phage T4.

Vari-ous mutant lysozymes (L99A, L99A/M102Q, L99A/M102H)have been shown to bind a variety of organic molecules, suchas phenol, with afﬁnities that depend on the more or less polarnature of the cavity. This system has been extensively charac-terized experimentally and has become a benchmark for bind-ing free energy (afﬁnity) calculations.

FIG. 4. L99A/M102H mutant of T4 lysozyme complexed with phe-nol. The protein is rendered as cartoon representation and coloredbased on secondary structure elements. Residues from the bindingsite are rendered as a smooth surface and colored based on residuetype (white: hydrophobic, green: polar, blue: basic). Phenol is ren-dered as spaceﬁll and colored by element. Rendered using VMD based on PDB structure 4I7L. Due to the symmetry of the phenol molecule, 180 ◦ rotationof the aromatic ring around the CO bond axis leads macro-1 Free energy term ∆ G ∗ bulk ∆ G ∗ site ∆ G V ◦ → restdecoupled ∆ G rest → sitecoupled ∆ G ◦ bind kJ mol − − scopically to the exact same binding pose. However, the twosymmetric binding poses can be artiﬁcially distinguished insimulations because of the labelling of each atom.In double decoupling free energy calculations, differentkinds of restraints can be employed. If simple harmonic (orﬂat bottom) restraints on the distance between the phenol cen-ter of mass and the protein binding site are used, then thesampling of both symmetric poses is in principle (if not inpractice) allowed. In contrast, if orientational restraints (forinstance, overall ligand RMSD with respect to the bindingsite ) are employed, then the sampling is restricted to oneof the two symmetric modes during the alchemical transfor-mation. As explained in Section II D, assuming numericalconvergence, the different restraint schemes give the samevalue of ∆ G ◦ bind , provided the restraint terms ∆ G rest → sitecoupled and ∆ G V ◦ → restdecoupled are properly taken into account.However, we should underline that all protocols arenot equivalent in terms of the rate of convergence of the ∆ G rest → sitecoupled term. If the restraint is symmetric, then for thesame reason as those developed in section III C, the restraintfree energy is totally insensitive to partial sampling of one ofthe two modes, which makes its convergence depend solelyon relaxation within a binding pose, hence generally faster.However, with an asymmetric restraint (allowing sampling ofonly one of the two modes), the estimation of the restraint freeenergy ∆ G rest → sitecoupled crucially depend on proper sampling of thetwo modes, because the perturbation now involves an asym-metric potential energy term, so comparison energies dependon the sampled mode (if the allowed mode is sampled in therestraint FEP, but not the forbidden mode, a spurious RT ln ( ) appears in the free energy). Hence, for numerical efﬁciency, it is generally advisable to use restraints of the same sym-metry as the binding site deﬁnition . If, however, asymmetricrestraints are preferred for any reason, and assuming that theyforbid one of the poses entirely, then we recommend decom-posing the evaluation of ∆ G rest → sitecoupled into two steps: ﬁrst, esti-mate the restraint free energy with respect to only one of thebinding poses; then add the analytical correction − RT ln ( ) totake into account the existence of the other equivalent pose. C. Case 3: symmetric complex – Homodimerization

Biomolecules such as proteins frequently assemble intomultimers, which are homomultimers (homodimers etc.) ifthey are formed by the assembly of identical molecules. Themacroscopic thermodynamic description of homodimeriza-tion is slightly different from that of heterodimerization. Topinpoint the effect of symmetry on the deﬁnition and calcu-lation of a homodimerization constant, we describe a thoughtexperiment wherein the very same physical process can be de-scribed as either homo- or heterodimerization. We reason at the macroscopic level because it allows for a more direct con-nection with the macroscopic binding free energies that weseek to characterize.Consider a solution containing solutes that can be arbitrar-ily considered of the same type M , or as distinct types A and B —a concrete example would be methane molecules contain-ing either C or C , which could be considered identical ordifferent depending on the method used to detect them.We ﬁrst consider the heterodimerization equilibrium: A + B (cid:10) AB Suppose that the system is large enough that it obeys the lawof mass action, and that the total concentration of both A and B is 1. We also assume that A and B have exactly thesame behavior with respect to dimerization. The population ofdimers can be thought of as evenly split between AA , BB , AB ,and BA , the latter two being of course the same object; thusthe heterodimer is twice as concentrated as each homodimer.This can be arrived at by a symmetry argument: this is thedistribution that would be obtained by random assignment ofthe labels A and B to otherwise identical molecules. If wedeﬁne x ≡ [ AB ] , then [ A ] = [ B ] = x /

2. The free monomerconcentrations are [ A ] = [ B ] = − x .We can write the law of mass action for heterodimerization: K ◦ AB ≡ [ AB ] C ◦ [ A ][ B ] = xC ◦ ( − x ) (21)If we compute a PMF w AB ( r ) between A and B , we can di-rectly apply Eq. (7) to the heterodimerization, and obtain thevalue of K ◦ AB . Note that this requires no particular assumptionon the possible homodimerization of species A and B sepa-rately.Now re-analyze the very same physical system, ignoringthe labels A and B and regarding all solutes as the genericmonomer M . The homodimerization equilibrium writes:2 M (cid:10) M And [ M ] = [ AB ] + [ A ] + [ B ] = x , and [ M ] = [ A ] + [ B ] = ( − x ) .The law of mass action for homodimerization is: K ◦ M ≡ [ M ] C ◦ [ M ] = xC ◦ [ ( − x )] (22)Comparing equations 21 and 22: K ◦ M = K ◦ AB (23)That is, keeping the same interactions between themonomers, the homodimerization constant is half the het-erodimerization constant. This thought experiment may seem2contrived, but any homodimerization equilibrium can be re-duced to it by arbitrarily labeling each half of the monomers(isotope labeling is again a concrete analogy). Thus, this re-sult is valid for homodimerization in general.Since A and B are identical to M with respect to intermolec-ular interactions, the self-association PMF of M is that of A and B : w MM ( r ) = w AB ( r ) . Therefore the homodimerizationconstant K ◦ M can be expressed by modifying Eq. (7): K ◦ M = V ◦ (cid:90) site π r e − β w MM ( r ) dr (24)This expression has been arrived at by another route. However, on many occasions, the factor of 2 has been omitted,including by us.

The resulting discrepancy of RT ln ( ) is small enough to be usually inconspicuous, given the er-ror margins of computational—and experimental—estimatesof binding free energies. V. CONCLUSION

We have recalled the main exact theoretical results allow-ing for the practical determination of macroscopic, abso-lute binding free energies based on explicit-solvent molecu-lar simulations. Adding to the existing practical advice onhow to perform absolute binding free calculations using al-chemical transformations, we point to several controversialsteps in the calculations that can lead to erroneous numericaltreatment–even though the resulting error is often hidden inthe large error bars of the alchemical transformation itself. Weargue that a key step to obtain well-deﬁned binding constants(or binding free energies) that can be compared to experi-mental data is to precisely deﬁne the binding site (or mode),the thermodynamics of which you wish to characterize. Thisshould be done as consistently as possible with the experimen-tal measurements that can be predicted by or compared withsimulations. This often requires simpliﬁcations (such as as-suming a two-state model), unless direct computation of theexperimental signal is possible.Another key point is to carefully correct for the restraintsboth in the decoupled and coupled state, in a way that is con-sistent with the site deﬁnition. We show that this clariﬁesseveral practical situations and, in the case of symmetric re-ceptor or ligand, eliminates the need for a speciﬁc often ill-interpreted symmetry correction term. Instead, symmetry ef-fects are accounted for when evaluating the restraint contri-butions to the free energy in both the coupled and decoupledstates, and we discuss the potential pitfalls of different pro-tocols. We have also shown that counter to a very commonintuition, in symmetric cases, estimators of the excess free en-ergies of decoupling are not affected by total, partial, or non-existent sampling of some symmetry-equivalent modes.Finally, we have shown that the computation of macro-scopic constants for homodimerization equilibria requires aspecial correction factor, and explained its origin. VI. COMPUTATIONAL DETAILS

Computation of the binding free energy of Mg + to acetate(IV A 2) was performed using the Gromacs 5.1.1 software. The simulation box contained one acetate anion, one cationand 1723 water molecules. The simulations were performedin the constant temperature/constant pressure (NpT) ensem-ble, using the same setup as previously reported. Watermolecules were described by the SPC/E force ﬁeld, and twodifferent force ﬁelds for Mg + , giving rise to two differentbehaviors with respect to sampling of one or both symmetricsites, were used. Note that these force ﬁelds are known tostrongly overestimate ion-acetate binding, as we discussed ina previous work, and their use here is only meant to illustratetypical issues encountered in free energy calculations. Freeenergy calculations were performed using the double decou-pling procedure described in II D. The free energy differenceassociated with each alchemical transformation was recon-structed using the Bennett Acceptance Ratio (BAR) methodas implemented in Gromacs. The restraint used during thealchemical transformation in the bound state is a ﬂat bottomwell potential, ﬂat for 2 . < d < .

8, and harmonic on eachside with a k = − nm − force constant. ACKNOWLEDGMENTS

This work was supported by the “Initiative d’Excellence”program from the French State (Grants “DYNAMO”, ANR-11-LABX-0011, and “CACSICE”, ANR-11-EQPX-0008).Computational work was performed using HPC resourcesfrom LBT/HPC. We thank our colleagues H. Martinez-Seara,P. Jungwirth and V. Palivec (UOCHB, Prague, CZ), C. H.Robert (LBT, IBPC, Paris), and G. Brannigan (Rutgers Cam-den, NJ, USA) for stimulating discussions.

DATA AVAILABILITY STATEMENT

The data that support the ﬁndings of this study are openlyavailable in Zenodo athttp://doi.org/10.5281/zenodo.4519498. P. A. Kollman, I. Massova, C. Reyes, B. Kuhn, S. Huo, L. Chong, M. Lee,T. Lee, Y. Duan, W. Wang, O. Donini, P. Cieplak, J. Srinivasan, D. A. Case,and T. E. Cheatham, “Calculating structures and free energies of complexmolecules: Combining molecular mechanics and continuum models,” Acc.Chem. Res. , 889–897 (2000). S. Holderbach, L. Adam, B. Jayaram, R. C. Wade, and G. Mukher-jee, “RASPD+: Fast protein-ligand binding free energy prediction us-ing simpliﬁed physicochemical features,” Front. Mol. Biosci. (2020),10.3389/fmolb.2020.601065. C. Chipot, A. E. Mark, V. S. Pande, and T. Simonson, “Applications of freeenergy calculations to chemistry and biology,” in

Free Energy CalculationsTheory and Applications in Chemistry and Biology , edited by C. Chipot andA. Pohorille (Springer, 2007) pp. 463–492. D. L. Mobley and M. K. Gilson, “Predicting binding free energies: Fron-tiers and benchmarks,” Annu. Rev. Biophys. , 531–558 (2017). D. L. Mobley and P. V. Klimovich, “Perspective: Alchemical free energycalculations for drug discovery,” J. Chem. Phys. , 230901 (2012). M. Aldeghi, A. Heifetz, M. J. Bodkin, S. Knapp, and P. C. Biggin, “Predic-tions of ligand selectivity from absolute binding free energy calculations,”J. Am. Chem. Soc. , 946–957 (2017). L. Jiang, Y. Gao, F. Mao, Z. Liu, and L. Lai, “Potential of mean forcefor protein-protein interaction studies,” Proteins: Structure, Function andGenetics , 190–196 (2002). D. Suh, S. Jo, W. Jiang, C. Chipot, and B. Roux, “String method for pro-tein–protein binding free-energy calculations,” J. Chem. Theory Comput. , 5829–5844 (2019). T. Siebenmorgen and M. Zacharias, “Computational prediction of pro-tein–protein binding afﬁnities,” Wiley Interdiscip. Rev.: Comp. Mol. Sci. , 1–18 (2020). D. Jakubec and J. Vondrášek, “Efﬁcient Estimation of Absolute BindingFree Energy for a Homeodomain-DNA Complex from NonequilibriumPulling Simulations,” J. Chem. Theory Comput. , 2034–2041 (2020). L. F. Song and K. M. Merz, “Evolution of alchemical free energy methodsin drug discovery,” J. Chem. Inf. Model. , 5308–5318 (2020). Z. Cournia, B. K. Allen, T. Beuming, D. A. Pearlman, B. K. Radak, andW. Sherman, “Rigorous free energy simulations in virtual screening,” J.Chem. Inf. Model. , 4153–4169 (2020). J. Hermans and S. Subramaniam, “The Free Energy of Xenon Bindingto Myoglobin from Molecular Dynamics Simulation,” Israel J. Chem. ,225–227 (1986). B. Roux, M. Nina, R. Pomès, and J. Smith, “Thermodynamic stabilityof water molecules in the bacteriorhodopsin proton channel: a moleculardynamics free energy perturbation study,” Biophys. J. , 670–681 (1996). M. K. Gilson, J. A. Given, B. L. Bush, and J. A. McCammon, “Thestatistical-thermodynamic basis for computation of binding afﬁnities: Acritical review,” Biophys. J. , 1047–1069 (1997). V. Helms and R. C. Wade, “Hydration energy landscape of the active sitecavity in cytochrome p450cam,” Proteins: Structure, Function, and Genet-ics , 381–396 (1998). H. Luo and K. Sharp, “On the calculation of absolute macromolecular bind-ing free energies,” Proc. Nat. Acad. Sci. , 10399–10404 (2002). S. Boresch, F. Tettinger, M. Leitgeb, and M. Karplus, “Absolute BindingFree Energies: A Quantitative Approach for Their Calculation,” J. Phys.Chem. B , 9535–9551 (2003). M. Mihailescu and M. K. Gilson, “On the theory of noncovalent binding,”Biophys. J. , 23–36 (2004). H.-J. Woo and B. Roux, “Calculation of absolute protein-ligand bindingfree energy from computer simulations.” Proc. Nat. Acad. Sci. , 6825–30 (2005). D. L. Mobley, J. D. Chodera, and K. A. Dill, “On the use of orientationalrestraints and symmetry corrections in alchemical free energy calculations,”J. Chem. Phys. , 084902 (2006). S. Miyamoto and P. A. Kollman, “Absolute and relative binding free energycalculations of the interaction of biotin and its analogs with streptavidinusing molecular dynamics/free energy perturbation approaches,” Proteins:Struct. Func. Bioinf. , 226–245 (1993). Y. Deng and B. Roux, “Calculation of standard binding free energies: Aro-matic molecules in the T4 lysozyme L99A mutant,” J. Chem. Theory Com-put. , 1255–1273 (2006). A. S. J. S. Mey, B. Allen, H. E. B. Macdonald, J. D. Chodera, M. Kuhn,J. Michel, D. L. Mobley, L. N. Naden, S. Prasad, A. Rizzi, J. Scheen, M. R.Shirts, G. Tresadern, and H. Xu, “Best Practices for Alchemical Free En-ergy Calculations,” Living J. Comp. Mol. Sci. , 1–51 (2020). J. M. Swanson, R. H. Henchman, and J. A. McCammon, “Revisiting FreeEnergy Calculations: A Theoretical Connection to MM/PBSA and DirectCalculation of the Association Free Energy,” Biophys. J. , 67–74 (2004). S. Genheden and U. Ryde, “The MM/PBSA and MM/GBSA methods toestimate ligand-binding afﬁnities Samuel,” Expert Opin. Drug Discov. ,449–461 (2015). R. Salari, T. Joseph, R. Lohia, J. Hénin, and G. Brannigan, “A Streamlined,General Approach for Computing Ligand Binding Free Energies and ItsApplication to GPCR-Bound Cholesterol,” J. Chem. Theory Comput. ,6560–6573 (2018). W. L. Jorgensen, “Interactions between Amides in Solution and the Ther-modynamics of Weak Binding,” J. Am. Chem. Soc. , 3770–3771 (1989). J. Pranata and W. L. Jorgensen, “Monte Carlo Simulations Yield AbsoluteFree Energies of Binding theory and the OPLS potential functions lar dy-namics,” Tetrahedron , 2491–2501 (1991). D. Shoup and A. Szabo, “Role of diffusion in ligand binding to macro-molecules and cell-bound receptors,” Biophys. J. , 33–39 (1982). D. H. D. Jong, L. V. Schafer, A. H. D. Vries, S. J. Marrink, H. J. C.Berendsen, and H. Grubmüller, “Determining Equilibrium Constants forDimerization Reactions from Molecular Dynamics Simulations,” J. Com-put. Chem. , 1919–1928 (2011). A. Jost Lopez, P. K. Quoika, M. Linke, G. Hummer, G. Hummer, andJ. Köﬁnger, “Quantifying Protein-Protein Interactions in Molecular Simu-lations,” J. Phys. Chem. B , 4673–4685 (2020). Y. Deng and B. Roux, “Computations of standard binding free energieswith molecular dynamics simulations,” J. Phys. Chem. B , 2234–2246(2009). S. Doudou, N. A. Burton, R. H. Henchman, S. Doudou, N. A. Burton,and R. H. Henchman, “Standard Free Energy of Binding from a One-Dimensional Potential of Mean Force,” J. Chem. Theory Comput. , 909–918 (2009). J. C. Gumbart, B. Roux, and C. Chipot, “Standard binding free energiesfrom computer simulations: What is the best strategy?” J. Chem. TheoryComput. , 794–802 (2013). R. A. Corey, O. N. Vickery, M. S. P. Sansom, and P. J. Stansfeld, “Insightsinto membrane protein–lipid interactions from free energy calculations,”Journal of Chemical Theory and Computation , 5727–5736 (2019). C. H. Bennett, “Efﬁcient estimation of free energy differences from MonteCarlo data,” J. Comput. Phys. , 245–268 (1976). M. R. Shirts and J. D. Chodera, “Statistically optimal analysis of samplesfrom multiple equilibrium states.” J. Chem. Phys. , 124105 (2008). J. G. Kirkwood, “Statistical mechanics of ﬂuid mixtures,” J. Chem. Phys. , 300–313 (1935). N. Lu, D. A. Kofke, and T. B. Woolf, “Improving the efﬁciency and relia-bility of free energy perturbation calculations using overlap sampling meth-ods,” J. Comput. Chem. , 28–40 (2003). J. Janin, “For Guldberg and Waage, With Love and Cratic Entropy,” Pro-teins Struct. Funct. Gen. , 0–1 (1996). W. L. Jorgensen, J. K. Buckner, S. Boudon, and J. Tirado-Rives, “Efﬁcientcomputation of absolute free energies of binding by computer simulations.Application to the methane dimer in water,” J. Chem. Phys. , 3742–3746(1988). H. Fujitani, Y. Tanida, and A. Matsuura, “Massively parallel computationof absolute binding free energy with well-equilibrated states,” Phys. Rev. E , 1–12 (2009). T. Simonson and B. Roux, “Concepts and protocols for electrostatic freeenergies,” Mol. Sim. , 1090–1101 (2016). V. Helms and R. Wade, “Thermodynamics of water mediating protein-ligand interactions in cytochrome p450cam: a molecular dynamics study,”Biophys. J. , 810–824 (1995). E. Gallicchio and R. M. Levy,

Advances in Protein Chemistry and Struc-tural Biology , 1st ed., Vol. 85 (Elsevier Inc., 2011) pp. 27–80. H. Fujitani, Y. Tanida, M. Ito, G. Jayachandran, C. D. Snow, M. R. Shirts,E. J. Sorin, and V. S. Pande, “Direct calculation of the binding free energiesof FKBP ligands,” J. Chem. Phys. , 1–5 (2005). M. Kumar, T. Simonson, G. Ohanessian, and C. Clavaguéra, “Structureand thermodynamics of Mg:phosphate interactions in water: A simulationstudy,” ChemPhysChem , 658–665 (2015). D. M. De Oliveira, S. R. Zukowski, V. Palivec, J. Hénin, H. Martinez-Seara,D. Ben-Amotz, P. Jungwirth, and E. Duboué-Dijon, “Binding of divalentcations to acetate: Molecular simulations guided by Raman spectroscopy,”Phys. Chem. Chem. Phys. , 24014–24027 (2020). J. Wang, Y. Deng, and B. Roux, “Absolute binding free energy calcula-tions using molecular dynamics simulations with restraining potentials,”Biophys. J. , 2798–2814 (2006). R. D. Groot, “The association constant of a ﬂexible molecule and a singleatom: Theory and simulation,” J. Chem. Phys. , 3537–3549 (1992). D. Chandler and L. R. Pratt, “Statistical mechanics of chemical equilibriaand intramolecular structures of nonrigid molecules in condensed phases,”J. Chem. Phys. , 2925–2940 (1976). D. N. LeBard, J. Hénin, R. G. Eckenhoff, M. L. Klein, and G. Brannigan,“General anesthetics predicted to block the GLIC pore with micromolar afﬁnity,” PLoS Comput. Biol. , e1002532 (2012). J. Hermans and L. Wang, “Inclusion of loss of translational and rotationalfreedom in theoretical estimates of free energies of binding. application toa complex of benzene and mutant t4 lysozyme,” J. Am. Chem. Soc. ,2707–2714 (1997). R. W. Zwanzig, “High-temperature equation of state by a perturbationmethod. i. nonpolar gases,” J. Chem. Phys. , 1420–1426 (1954). U. Derewenda, Z. Derewenda, E. J. Dodson, G. G. Dodson, C. D. Reynolds,G. D. Smith, C. Sparks, D. Swenson, Z. Derewenda, C. D. Reynolds,U. Derewenda, C. Sparks, E. J. Dodson, D. Swenson, and G. D. Smith,“Phenol stabilizes more helix in a new symmetrical zinc insulin hexamer,”Nature , 594–596 (1989). V. Palivec, C. M. Viola, M. Kozak, T. R. Ganderton, K. Kˇrížková, J. P.Turkenburg, P. Haluˆsková, L. Žáková, J. Jiráˆcek, P. Jungwirth, and A. M.Brzozowski, “Computational and structural evidence for neurotransmitter-mediated modulation of the oligomeric states of human insulin in storagegranules,” J. Biol. Chem. , 8342–8355 (2017). W. Humphrey, A. Dalke, and K. Schulten, “Vmd: visual molecular dynam-ics,” J. Mol. Graph. , 33–8, 27–8 (1996). C. R. Cantor and P. R. Schimmel,

Biophysical Chemistry. Part III: The Be-havior of Biological Macromolecules (WH Freeman, 1980). T. Dudev and C. Lim, “Effect of Carboxylate-Binding Mode on Metal Bind-ing / Selectivity,” Acc. Chem. Res. , 53–56 (2007). H. Einspahr and C. E. Bugg, “The geometry of calcium carboxylate inter-actions in crystalline complexes,” Acta Cryst. B , 1044–1052 (1981). M. D. Daily, M. D. Baer, and C. J. Mundy, “Divalent Ion ParameterizationStrongly Affects Conformation and Interactions of an Anionic BiomimeticPolymer,” J. Phys. Chem. B , 2198–2208 (2016). T. Martinek, E. Duboué-Dijon, Š. Timr, P. E. Mason, K. Baxová, H. E. Fis-cher, B. Schmidt, E. Pluharová, and P. Jungwirth, “Calcium ions in aque-ous solutions: Accurate force ﬁeld description aided by ab initio molecular dynamics and neutron scattering,” J. Chem. Phys. , 222813 (2018). P. Li, B. P. Roberts, D. K. Chakravorty, and K. M. Merz, “Rational designof particle mesh ewald compatible lennard-jones parameters for +2 metalcations in explicit solvent,” J. Chem. Theory Comput. , 2733–2748 (2013). K. M. Callahan, N. N. Casillas-Ituarte, M. Roeselová, H. C. Allen, and D. J.Tobias, “Solvation of magnesium dication: molecular dynamics simulationand vibrational spectroscopic study of magnesium chloride in aqueous so-lutions.” J. Phys. Chem. A , 5141–5148 (2010). A. E. Eriksson, W. A. Baase, J. A. Wozniak, and B. W. Matthews, “Acavity-containing mutant of t4 lysozyme is stabilized by buried benzene,”Nature , 371–373 (1992). M. Merski and B. K. Shoichet, “Engineering a model protein cavity tocatalyze the kemp elimination,” Proc. Nat. Acad. Sci. , 16179–16183(2012). S. E. Boyce, D. L. Mobley, G. J. Rocklin, A. P. Graves, K. A. Dill, and B. K.Shoichet, “Predicting ligand binding afﬁnity with alchemical free energymethods in a polar model binding site,” J. Mol. Biol. , 747–763 (2009). J. Doma´nski, G. Hedger, R. B. Best, P. J. Stansfeld, and M. S. P. San-som, “Convergence and sampling in determining free energy landscapes formembrane protein association,” J. Phys. Chem. B , 3364–3375 (2016). J. Hénin, A. Pohorille, and C. Chipot, “Insights into the recognition andassociation of transmembrane α -helices. The free energy of α -helix dimer-ization in glycophorin A,” J. Am. Chem. Soc. , 8478–8484 (2005). J. Hénin, A. Pohorille, and C. Chipot, “Erratum: Insights into the recogni-tion and association of transmembrane α -helices. the free energy of α -helixdimerization in glycophorin A,” J. Am. Chem. Soc. , 9510–9510 (2010). D. Van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, andH. J. C. Berendsen, “GROMACS: Fast, ﬂexible, and free,” J. Comput.Chem. , 1701–1718 (2005). H. J. C. Berendsen, J. R. Grigera, and T. P. Straatsma, “The missing termin effective pair potentials,” J. Phys. Chem.91