Bayesian active learning for optimization and uncertainty quantification in protein docking
Bioinformatics
Advance Access Publication Date: Day Month Year
Manuscript Category
Structural Bioinformatics
Yue Cao and Yang Shen∗

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States.
∗To whom correspondence should be addressed.
Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Abstract
Motivation:
Ab initio protein docking represents a major challenge for optimizing a noisy and costly "black box"-like function in a high-dimensional space. Despite progress in this field, there is no docking method available for rigorous uncertainty quantification (UQ) of its solution quality (e.g., interface RMSD or iRMSD).
Results:
We introduce a novel algorithm, Bayesian Active Learning (BAL), for optimization and UQ of such black-box functions and flexible protein docking. BAL directly models the posterior distribution of the global optimum (or native structures for protein docking), with active sampling and posterior estimation iteratively feeding each other. Furthermore, we use complex normal modes to represent a homogeneous Euclidean conformation space suitable for high-dimensional optimization and construct funnel-like energy models for encounter complexes. Over a protein docking benchmark set and a CAPRI set including homology docking, we establish that BAL significantly improves against both starting points by rigid docking and refinements by particle swarm optimization, providing a top-3 near-native prediction for one third of the targets. BAL also generates tight confidence intervals, with a half-range around 25% of the iRMSD and a confidence level of 85%. Its estimated probability of a prediction being native or not achieves a binary classification AUROC of 0.93 and an AUPRC over 0.60 (compared to 0.14 by chance), and is also found to help rank predictions. To the best of our knowledge, this study represents the first uncertainty quantification solution for protein docking, with theoretical rigor and comprehensive assessment.
Availability:
Source codes are available at https://github.com/Shen-Lab/BAL.
Contact: [email protected]
Supplementary information: https://github.com/Shen-Lab/BAL/tree/master/Paper_SI/
Protein-protein interactions underlie many cellular processes, as has been increasingly revealed by quickly advancing high-throughput experimental methods. However, compared to the binary information about which protein pairs interact, structural knowledge about how proteins interact remains relatively scarce (Mosca et al., 2013). Protein docking helps close this gap by computationally predicting the 3D structures of protein-protein complexes given individual proteins' 1D sequences or 3D structures (Smith and Sternberg, 2002).
Ab initio protein docking is often recast as an optimization problem over an energy (or other objective) function. For the type of objective functions relevant to protein docking, neither the analytical form nor gradient information helps global optimization: the functions are non-convex and extremely rugged, so their gradients are too local to inform the global landscape. The objective functions are therefore often treated as de facto "black-box" functions for global optimization. Meanwhile, these functions are very expensive to evaluate. Various methods, especially refinement-stage methods, have progressed to effectively sample the high-dimensional conformational space against expensive function evaluations (Gray et al., 2003; Moal and Bates, 2010; Shen, 2013; Jiménez-García et al., 2017; Marze et al., 2018; Pfeiffenberger and Bates, 2018).

While solving such optimization problems remains a great challenge, quantifying the uncertainty of protein-docking results in some quality of interest (very often interface RMSD, or iRMSD) is not addressed by any protein-docking method. Even though such information is much needed by end users, current protein-docking methods often generate a rank-ordered list of results without giving quality estimation and uncertainty quantification for individual results and without providing the confidence in whether the entire list contains a quality result (for instance, a near-native protein-complex model with iRMSD ⩽ 4 Å).

Uncertainty originating from computational models and algorithms can be classified as epistemic uncertainty (Der Kiureghian and Ditlevsen, 2009). For instance, energy models as objective functions provide noisy and approximate observations of the assumed ground truth, the Gibbs free energy; and iterative sampling techniques suffer from both the approximation of the search space (e.g., rotamerized side chains) and insufficient data in the approximated space (e.g., small numbers of samples considering the high dimensionality of the search space).
In addition, uncertainty in protein structure data (e.g., X-ray crystal structures of proteins being "averaged" versions of their native conformations and derived from fitting observed diffraction patterns), which can be classified as aleatoric uncertainty, also enters protein docking methods when crystal structures are used as ground-truth native structures for training objective functions or tuning parameters in protein-docking methods.

Whereas the forward propagation of aleatoric uncertainty in protein structure data to structure-determined quantities has been studied empirically (Li et al., 2016) and theoretically (Rasheed et al., 2017), the much more difficult, inverse quantification of uncertainty in predicted protein or protein-complex structures, originating from epistemic uncertainty in computational methods, still lacks a mathematically rigorous solution. A unique challenge for uncertainty quantification (UQ) in protein docking is that the desired quality of interest here is directly determined by the optimum itself rather than by the optimal value. In other words, closeness to native structures (for instance, measured by iRMSD) is an indicator of the usefulness of docking results, but closeness to native structures' energy values is not necessarily so. Therefore, UQ in protein docking has to be solved jointly with function optimization when finding the inverse mapping from a docking objective function to its global optimum is neither analytically tractable nor empirically cheap.

In this study, we introduce a rigorous Bayesian framework to simultaneously perform function optimization and uncertainty quantification for expensive-to-evaluate black-box objective functions.
To that end, our Bayesian active learning (BAL) iteratively and adaptively generates samples and updates posterior distributions of the global optimum. We propose a posterior in the form of the Boltzmann distribution, with an adaptive annealing schedule that gradually shifts the search from exploration to exploitation, and with an objective-function estimator (surrogate) that is a non-parametric Kriging regressor. The iteratively updated posterior carries the belief (and the uncertainty) about where the global optimum is given historic samples and guides next-iteration samples, which presents an efficient data-collection scheme for both optimization and UQ. Compared to typical Bayesian optimization methods (Shahriari et al., 2016) that first model the posterior of the objective function and then optimize the resulting functional, our BAL framework directly models the posterior of the global optimum and avoids the intensive computations in both steps of typical Bayesian optimization methods. Compared to another work (Ortega et al., 2012) that also models the posterior of the global optimum, we provide both theoretical and empirical results showing that our BAL has a consistent and unbiased estimator as well as a globally uncertainty-aware and dimension-dependent annealing schedule.

We also make innovative contributions in the application domain of protein docking. Specifically, we design a machine learning-based objective function that estimates binding affinities for docked encounter complexes as well as assesses the quality of interest, iRMSD, for docking results. We also re-parameterize the search space for both external rigid-body motions (Shen et al., 2008) and internal flexibility into a low-dimensional, homogeneous, and isotropic space suitable for high-dimensional optimization, using our (protein) complex normal modes (cNMA) (Oliwa and Shen, 2015).
Considering that protein docking refinement often starts with initial predictions representing separate conformational clusters/regions, we use estimated local posteriors over individual regions to construct local and global partition functions; we then calculate the probability that the prediction for each conformational cluster, each conformational cluster itself, or the entire list of conformational clusters is near-native.

The rest of the paper is organized as follows. In Materials and Methods, we first give a mathematical formulation of the optimization and the UQ, then introduce our Bayesian active learning (BAL), which iteratively updates sampling and posterior estimation. We next introduce the parameterization of the search space that allows concurrent and homogeneous sampling of external rigid-body and internal flexible-body motions, as well as newly developed machine learning models serving as the noisy energy function that estimates the binding free energy for encounter complexes. We end Materials and Methods with uncertainty quantification for protein docking. In Results and Discussion, using a comprehensive protein docking benchmark set involving unbound docking and a CAPRI set involving homology docking, we assess optimization results for BAL in comparison to starting structures from ZDOCK and refined structures by particle swarm optimization (PSO). We further assess the uncertainty quantification results: accuracy of the confidence levels and tightness of the confidence regions, as well as the confidence scores on the near-nativeness of predictions. Lastly, before reaching Conclusions, we visualize the estimated energy landscape and confirm that funnel-like energy landscapes do exist near native structures in the homogeneous conformational space blending external rigid-body and internal flexible-body motions.
We consider a black-box function f(x) (e.g., ∆G, the change in the Gibbs free energy upon protein-protein interaction) that can only be evaluated at any sample with an expensive yet noisy observation y(x) (e.g., modeled energy difference). Our goal in optimization is x* = argmin_{x∈X} f(x) (e.g., the native structure of a protein complex). And our goal in uncertainty quantification is the probability that x̂, a prediction of x*, falls in an interval [lb, ub] of quality relative to x*:

P(lb ⩽ Q(x̂, x*) ⩽ ub) = 1 − σ,

where 1 − σ is the confidence level; [lb, ub] is the confidence interval in the solution quality; and Q(·,·), the quality of interest measuring some distance or dissimilarity, can be a Euclidean norm (as in our assessment for test functions), another distance metric, or other choices of users (for instance, iRMSD as in our assessment for protein docking, with ub = 4 Å).

We address the problem above from a Bayesian perspective: instead of treating x* as a fixed point, we model x* as a random variable and construct its probability distribution, p(x*|D), given samples D = {(x, y)}. This probability distribution, carrying the belief and the uncertainty about the location of x*, is a prior when D = ∅ (no sample) and a posterior otherwise. Considering the cost of function evaluation, we iteratively collect new samples in iteration t (where all samples collected by the end of the t-th iteration are denoted D^(t)) based on the latest estimated posterior, p(x*|D^(t−1)); and we update the posterior p(x*|D^(t)) based on D^(t). An illustration of the iterative approach is given in Fig. 1. For optimization, we set x̂ to be the best sample, the one with the lowest y value, given a computational budget (reflected in the number of samples or iterations).
For UQ, given the posterior p(x*|D^(t)) in the final iteration, one can propagate the inferred uncertainty in x* forward to that in the quality of interest, Q(x̂, x*), for the found x̂, using techniques such as Markov chain Monte Carlo.

Fig. 1: Illustration of the Bayesian active learning (BAL) algorithm. (A): A typical energy landscape projected onto the first principal component (PC1). The dashed line indicates the location of the optimal solution. (B)-(D): The samples (dots) and the Kriging regressors (light curves) at the stage of the 1st, 4th, and 10th iteration, respectively. Samples are colored from cold to hot by iteration, and those in the same iteration have the same color. (E): The entropy (measuring uncertainty) of the posterior decreases as the number of samples increases. Its quick drop from 30 to 100 samples corresponds to a drastic change of the Kriging regressor, which suggests increasing exploitation in possible function basins. After 100 samples, the entropy decreases more slowly, meaning that the regressor only undergoes small adjustments. (F)-(H): The corresponding posterior distributions for (B)-(D), respectively.
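As a minimal numerical sketch of this forward propagation (not the paper's implementation): draw samples of x* from the posterior, map each sample to the quality of interest, and read an equal-tailed confidence interval off the empirical quantiles. Here the posterior quality samples are faked with a normal distribution purely for illustration.

```python
import numpy as np

def confidence_interval(quality_samples, sigma=0.10):
    """Empirical equal-tailed confidence interval at level 1 - sigma.

    quality_samples: Q(x_hat, x*) evaluated on samples drawn from the
    posterior p(x* | D), e.g., by Markov chain Monte Carlo.
    """
    lb = np.quantile(quality_samples, sigma / 2)
    ub = np.quantile(quality_samples, 1 - sigma / 2)
    return lb, ub

# Toy usage: stand-in posterior quality samples.
rng = np.random.default_rng(0)
q = rng.normal(loc=3.0, scale=0.5, size=100_000)
lb, ub = confidence_interval(q, sigma=0.10)
```

By construction the interval covers about 90% of the propagated samples, mirroring the desired confidence level 1 − σ.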
We propose to use the Boltzmann distribution to describe the posterior:

p(x*|D^(t)) ∝ exp(−ρ · f̂(x)),

where f̂(x) is an estimator of f(x), and ρ is a parameter (sometimes 1/(RT), where R is the gas constant and T the temperature of the molecular system).

To iteratively guide the expensive sampling and balance between exploration and exploitation in a data-efficient way, we choose ρ to follow an adaptive annealing schedule over iteration t:

ρ_t = ρ_0 · exp( h(p^(t−1))^(−1) · n_t^(1/d) ),

where ρ_0, the initial ρ, is a parameter; h(p^(t−1)) is the (continuous) entropy of the last-iteration posterior, a shorthand notation for h(p(x*|D^(t−1))); n_t = |D^(t)| is the number of samples collected so far; and d is the dimensionality of the search space X.

This annealing schedule is inspired by adaptive simulated annealing (ASA) (Ingber, 2000), especially the exponential form and the n_t^(1/d) term. However, we use the h(p^(t−1))^(−1) term rather than a constant as in ASA, so that we exploit all historic samples D^(t). In this way, as the uncertainty of x* decreases, ρ_t increases and shifts the search toward exploitation.

The function estimator f̂(x) also updates iteratively according to the incrementally growing n_t samples D^(t) = {(x_i, y_i)}_{i=1}^{n_t}. We use a consistent and unbiased Kriging regressor (Chilès and Delfiner, 2012), which is known to be the best linear unbiased estimator (BLUE):

f̂(x) = f_0(x) + κ(x)^T (K + εI)^(−1) (y − f_0),

where f_0(x) is the prior for E[f(x)]; κ(x) is the kernel vector whose i-th element is the kernel, a measure of similarity, between x and x_i; K is the kernel matrix whose (i, j) element is the kernel between x_i and x_j; y and f_0 are the vectors of y_1, . . . , y_{n_t} and f_0(x_1), . . .
, f_0(x_{n_t}), respectively; and ε reflects the noise in the observation and is estimated to be 2.1 as the prediction error for the training set.

We derive the Kriging regressor in the Supporting Information (SI) Sec. 1.2.2. We will use the regressor to evaluate binding energy and estimate iRMSD for UQ over multiple regions in Sec. 2.5.

For a sequential sampling policy that balances exploration and exploitation during the search for the optimum, we choose Thompson sampling (Russo et al., 2017), which samples a batch of points in the t-th iteration based on the latest posterior p(x*|D^(t−1)). This seemingly simple policy has been found to be theoretically (Agrawal and Goyal, 2011) and empirically (Chapelle and Li, 2011) competitive compared to other updating policies such as the Upper Confidence Bound (Shahriari et al., 2016). In our case, it is straightforward to implement given the posterior on x*.

There are multiple reasons to collect in each iteration a batch of samples rather than a single one. First, given the high dimension of the search space, it is desirable to collect adequate data before updating the posterior. Second, batch sampling weakens the correlation among samples and makes them more independent, which benefits the convergence rate of the Kriging regressor. Last, parallel computing can be trivially applied to batch sampling, which significantly improves the algorithm's throughput.

Fig. 1 illustrates the algorithm's behavior. The initial samples drawn from a uniform distribution lead to a relatively flat posterior whose maximum is off the function optimum (Fig. 1F). As the iterations progress, the uncertainty about the optimum gradually decreases (Fig. 1E) and newer samples are increasingly focused (Fig. 1C,D), as the posteriors become narrower with peaks shifting toward the function optimum (Fig. 1G,H).

In our docking study, d = 12 for a homogeneous space spanned by complex normal modes (see Sec. 2.3).
We construct a prior and collect 30 samples in the first iteration and 20 in each of the subsequent iterations. We limit the number of iterations (samples) to 31 (630) as a way to impose a computational budget.
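A toy, one-dimensional sketch of the overall loop (Kriging surrogate, Boltzmann posterior over a candidate grid, Thompson-style batch sampling, entropy-driven annealing) may help fix ideas. All settings here (grid, bandwidth, noise level, the toy objective, batch sizes) are illustrative choices, not the paper's; for simplicity the annealing schedule keeps d = 12 as in the docking application, even though the toy search space is one-dimensional.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # toy rugged objective; global minimum near x ~ 0.2
    return (x - 0.3) ** 2 + 0.05 * np.sin(25 * x)

def rbf(A, B, l):
    return np.exp(-np.subtract.outer(A, B) ** 2 / l**2)

def kriging(X, y, Xq, l=0.1, eps=1e-4):
    # f_hat(x) = k(x)^T (K + eps I)^{-1} y  (zero prior mean f0, for brevity)
    K = rbf(X, X, l)
    return rbf(Xq, X, l) @ np.linalg.solve(K + eps * np.eye(len(X)), y)

grid = np.linspace(0.0, 1.0, 501)          # candidate set standing in for X
X = rng.uniform(0.0, 1.0, 30)              # first iteration: 30 uniform samples
y = f(X) + rng.normal(0.0, 0.005, X.size)  # noisy observations y(x)
rho, d = 1.0, 12                           # initial rho; d as in docking
for t in range(10):                        # subsequent iterations, 20 samples each
    fhat = kriging(X, y, grid)
    post = np.exp(-rho * (fhat - fhat.min()))    # Boltzmann posterior on the grid
    post /= post.sum()
    batch = rng.choice(grid, size=20, p=post)    # Thompson-style batch draw
    X = np.concatenate([X, batch])
    y = np.concatenate([y, f(batch) + rng.normal(0.0, 0.005, batch.size)])
    h = -(post * np.log(post + 1e-12)).sum()     # entropy of the posterior
    rho = 1.0 * np.exp((1.0 / h) * X.size ** (1.0 / d))  # annealing schedule
x_best = X[np.argmin(y)]                   # prediction: best sample so far
```

As the posterior sharpens, its entropy drops, ρ grows, and the batches concentrate in low-energy basins, mirroring the exploration-to-exploitation shift described above.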
The kernel in the Kriging regressor for the posterior is a measure of similarity. For test functions defined in a Euclidean space, we use the radial basis function (RBF) kernel:

κ(x_i, x_j) = exp(−‖x_i − x_j‖² / l²),

where ‖x_i − x_j‖, a measure of dissimilarity, is the Euclidean distance and l, the bandwidth of the kernel, is set as l = l_0 · n_t^(−1/d) following (Györfi et al., 2002). l_0, dependent on the search space, is set at 2.0 for docking without particular optimization.

For protein docking, we replace the Euclidean distance in the RBF kernel with the interface RMSD (iRMSD) between two sample structures. iRMSD captures sample dissimilarity relevant to function-value dissimilarity and is independent of search-space parameterization.

For this purpose, we also have to address two technical issues. First, protein interface information is determined by x* and thus unknown. We instead use the putative interface seen in the samples (see more details in the SI Sec. 2.4). Second, kernel calculation with iRMSD is time-consuming. The time complexity of iRMSD calculation is O(N) and that of the regressor update is O(Nn), where N, the number of interfacial atoms, can easily reach hundreds or thousands, and n, the number of samples, can also be large. To save computing time, we develop a fast RMSD calculation method that reduces its time complexity from O(N) down to O(1) (see details in SI Sec. 2.1).
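The kernel and its shrinking bandwidth can be sketched with a pluggable dissimilarity, so that the Euclidean distance used for test functions and the iRMSD used for docking share one code path. A minimal sketch, assuming the bandwidth schedule l = l_0 · n_t^(−1/d) as in the text; the toy distance matrix is a stand-in, not real iRMSD values.

```python
import numpy as np

def bandwidth(l0, n_t, d):
    # l = l0 * n_t**(-1/d): the kernel narrows as samples accumulate
    return l0 * n_t ** (-1.0 / d)

def rbf_kernel(dist, l):
    # dist: pairwise dissimilarity matrix (Euclidean distance or iRMSD)
    return np.exp(-dist**2 / l**2)

# toy pairwise dissimilarities among 4 samples
dist = np.array([[0.0, 1.0, 2.0, 3.0],
                 [1.0, 0.0, 1.0, 2.0],
                 [2.0, 1.0, 0.0, 1.0],
                 [3.0, 2.0, 1.0, 0.0]])
K_early = rbf_kernel(dist, bandwidth(2.0, n_t=30, d=12))   # after 30 samples
K_late = rbf_kernel(dist, bandwidth(2.0, n_t=630, d=12))   # after 630 samples
```

With more samples the bandwidth shrinks, so the same pair of structures looks less similar to the regressor, sharpening the local fit.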
Current Bayesian optimization methods typically model the posterior distribution of f(x) rather than that of x* = argmin_{x∈X} f(x) directly. After modeling the posterior distribution over the functional space (a common non-parametric way is through Gaussian processes), they subsequently sample the functional space and optimize sample functions. For instance, Villemonteix et al. (2006) used Monte Carlo sampling; and Hernández-Lobato et al. (2014) discretized the functional space to approximate the sample paths of Gaussian processes using a finite number of basis functions, then optimized each sample path to get one sample of x*. This two-step approach involves intensive sampling and (non-convex) optimization, which is computationally intensive and not pragmatically useful for protein docking.

To our knowledge, Ortega et al. presented the only other study that directly models the posterior distribution over the optimum (Ortega et al., 2012). Both methods fall in the general category of Bayesian optimization and use consistent non-parametric regressors. However, we prove in Sec. 1.2 of the SI that their regressor is biased whereas our Kriging regressor is unbiased. We explain in Sec. 1.1 there that their annealing schedule (temperature control) only considers the pairwise distance between samples without location awareness and is independent of the dimensionality d, whereas ours has a term involving location-aware global uncertainty and generalizes well to various dimensions.
Beyond those theoretical comparisons, we also include empirical results to show the superior optimization and UQ performance of BAL.

The rest of the Materials and Methods section involves methods specific to the protein docking problem: parameterization, dimensionality reduction, and range reduction of the search space X; a machine learning model as y(x), i.e., an energy model for encounter complexes; and uncertainty quantification for a predicted structure or a list of predictions, as well as the corresponding quality estimation for ranking predictions or classifying their nativeness.

In protein docking, the full search space X captures the degrees of freedom for all atoms involved. Let one protein be the receptor, whose position is fixed, and the other the ligand. And let N_R, N_L, and N be the numbers of atoms for the receptor, the ligand, and the complex, respectively. Then X = R^(3N−6) is a Euclidean space whose dimension easily reaches thousands for a small protein complex without surrounding solvent molecules. If accuracy is sacrificed for speed, proteins can be (unrealistically) considered rigid, and X = SE(3) = R³ × SO(3) (ligand translations and rotations) is a Riemannian manifold. Docking methods fall in the spectrum between these two ends, which are represented by all-atom molecular dynamics and FFT rigid docking, respectively. For instance, one can consider locally rigid pieces of a protein rather than a globally rigid protein, so that X becomes the product of many SE(3) for local rigidity (Mirzaei et al., 2015); or one can model individual proteins' internal flexible-body motions with normal modes on top of the ligand rigid-body motions, so that X becomes the product of R^K (where K ≪ N_{R/L}) and SE(3) (Moal and Bates, 2010).

From the perspective of optimization and UQ, both the high dimensionality of R^(3N−6) and the geometry of the lower-dimensional manifold present challenges.
Almost all dimensionality-reduction efforts in protein docking impose conditions (such as the aforementioned local or global rigidity) in the full Euclidean space and lead to embedded manifolds that are difficult to (globally) optimize over. The challenge from the manifold has been either disregarded in protein docking or addressed by the local tangent space (Shen et al., 2007, 2008; Mirzaei et al., 2015).

Could, and how could, the dimensionality of the conformational space be reduced while its geometry maintains the homogeneity and isotropy of a Euclidean space and its basis vectors span conformational changes of proteins upon interactions? In this subsection we give a novel approach that answers this question for the first time. In contrast to common conformational sampling that separates internal flexible-body motions (often Euclidean) and external rigid-body motions (a manifold) (Marze et al., 2018), we re-parameterize the space into a Euclidean space spanned by complex normal modes (Oliwa and Shen, 2015), blending both flexible- and rigid-body motions. The mapping preserves the distance metric of the original full space. We further reduce the dimensionality and the range of the resulting space.

We previously introduced complex normal mode analysis, cNMA (Oliwa and Shen, 2015), to model conformational changes of proteins during interactions. Using encounter complexes from rigid docking, cNMA extends the anisotropic network model (ANM) to capture both conformational-selection and induced-fit effects. After the Hessian matrix is projected to remove the rigid-body motion of the receptor, its non-trivial eigenvectors μ_j (j = 1, . . . , 3N−6) form orthonormal basis vectors. We showed that μ_j^R, the receptor components of the complex normal modes, better capture the direction of individual proteins' conformational changes than conventional NMA does (Oliwa and Shen, 2015).
We also showed that the re-scaled eigenvalues for these components, λ_j^R = λ_j / ‖μ_j^R‖², can be used to construct features for machine learning and predict the extent of the conformational changes.

In this study we focus on the motions of a whole complex rather than individual proteins, and develop sampling techniques for protein docking. Each single complex normal mode simultaneously captures concerted flexible-body motions of the individual proteins (receptor and ligand) and the rigid-body motion of the non-fixed protein (ligand). Together they span a homogeneous and isotropic Euclidean space where the distance between two points is exactly the RMSD between the corresponding complex structures.

For dimensionality reduction in the resulting space, we choose the first K_1 non-trivial eigenvectors μ_j ranked by increasing eigenvalues λ_j; and we additionally include K_2 eigenvectors μ_j (not in the first K_1) ranked by increasing λ_j^R. In other words, we choose the K_1 slowest modes for the complex and another K_2 for the receptor as the set of basis vectors B. K_1 and K_2 are set at 9 and 3, respectively, leading to a dimension of the reduced space of d = 12. A supplemental video illustrates the motion of the slowest such modes.

For range reduction in the dimension-reduced space, we perturb a starting complex structure C⃗_0 ∈ R^(3N) along the aforementioned basis vectors to generate samples C⃗, while enforcing a prior on the scaling factor s in the first iteration.
Specifically,

C⃗ = C⃗_0 + Σ_{j∈B} r_j · (s/√λ_j) · μ_j,

where r_j, the coefficient of the j-th normal mode μ_j, is uniformly sampled on the surface of the d-dimensional standard sphere with unit radius. The scaling factor s is given by

s = τ_R · √N_R / ‖ Σ_{j∈B} (r_j/√λ_j) · μ_j^R ‖,

where τ_R is the estimated conformational change (measured by RMSD) between the unbound and the bound receptor. Note that the μ_j^R are not orthonormal to each other.

We previously predicted τ_R by a machine learning model giving RMSD̂_R, a single value for each receptor (Chen et al., 2017). Here we replace RMSD̂_R with a predicted distribution, by multiplying it by a truncated normal distribution (with fitted mean and standard deviation) supported on [0, 2.5]. The latter distribution is derived by fitting the ratios between the actual and the predicted values, RMSD_R / RMSD̂_R, for 50 training protein complexes (see more details about datasets in Sec. 2.7 and about distribution fitting in Sec. 2.2 of the SI). Therefore, our parameterization produces x = s · r ∈ R^d, whose prior is derived as above.

Since the ligand components of complex normal modes include simultaneous flexible- and rigid-body motions, conformational sampling could lead to severely distorted ligand geometry. We thus further restrict the ligand perturbation ∆_L (flexible- and rigid-body together):

∆_L = (1/√N_L) · ‖ Σ_{j∈B} r_j · (s/√λ_j) · μ_j^L ‖ ⩽ ∆_L^0.

We set ∆_L^0 at 6 Å according to the average size of binding-energy attraction basins seen in conformational clusters (Kozakov et al., 2005). For samples generated from the aforementioned prior or the updated posterior, we reject those violating the ligand perturbation limit. We discuss the feasibility of the search region in SI Sec.
2.3.

y(x) for Sampling

We introduce a new energy model based on binding affinities K'_d(x) of structure samples x, which are often encounter complexes. The model assumes that K'_d correlates with K_d, the binding affinity of the native complex, and deteriorates as the sample's iRMSD increases (the encounter complex is less native-like):

K'_d(x) = K_d · exp(α · (iRMSD(x))^q),

where α and q are hyper-parameters optimized through cross-validation. In other words, we assume that the fractional loss of binding affinity is exponential in a polynomial of iRMSD. Therefore, the binding energy, a machine learning model y(x; w) with parameters w, can be represented as

y(x; w) = RT ln(K'_d(x)) = RT ln(K_d) + RT·α · (iRMSD(x))^q.

Given an observed or regressed y(x) value, one can estimate iRMSD(x) with a given K_d.

We train machine learning models, including ridge regression with a linear or RBF kernel and random forest, for y(x; w). The 8 features include changes upon protein interaction in energy terms such as internal energies in bond, angle, dihedral, and Urey-Bradley terms, van der Waals, the non-polar contribution of the solvation energy based on solvent-accessible surface area (SASA), and electrostatics modeled by Generalized Born with a simple SWitching (GBSW), all of which are calculated in the CHARMM27 force field.

We use the same training set of 50 protein pairs (see details in SI Sec. 2.5) as in predicting the extent of conformational change. From rigid docking and conformational sampling we generate 13,004 complex samples for the 50 protein pairs, including 6,464 near-native and 6,540 non-native examples in the training set. Hyper-parameters of ridge regression with the RBF kernel as well as of random forest are optimized by cross-validation. Model parameters w are then trained again over the entire training set with the best hyper-parameters. More details can be found in Sec. 2.6 of the SI.
For the assessment, we use test set a of 26 protein pairs (again in SI Sec. 2.5) and generate 20 samples similarly for each of the 10 initial docking results for each pair, leading to 5,200 cases.
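The affinity-decay model and its inversion (estimating iRMSD from a regressed energy with a given K_d) can be sketched directly. The values of α, q, and ln K_d below are arbitrary placeholders, not the cross-validated hyper-parameters.

```python
RT = 0.593  # kcal/mol at room temperature (~298 K)

def binding_energy(iRMSD, lnKd, alpha, q):
    # y(x) = RT ln K'_d(x) = RT ln K_d + RT * alpha * iRMSD^q
    return RT * lnKd + RT * alpha * iRMSD ** q

def estimate_iRMSD(y, lnKd, alpha, q):
    # invert the model: iRMSD = ((y - RT ln K_d) / (RT alpha))^(1/q)
    return ((y - RT * lnKd) / (RT * alpha)) ** (1.0 / q)

# round-trip with placeholder hyper-parameters
y = binding_energy(3.0, lnKd=-20.0, alpha=0.2, q=1.5)
irmsd = estimate_iRMSD(y, lnKd=-20.0, alpha=0.2, q=1.5)
```

The inversion is what lets the Kriging-regressed energies of posterior samples be translated into the iRMSD values used for uncertainty quantification.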
A unique challenge for protein docking refinement is that, instead of optimization and UQ in a single region X, we may perform them in K separate regions X_i (i = 1, . . . , K), where each X_i is a promising conformational region/cluster represented by an initial docking result. This is often necessitated by the fact that the extremely rugged energy landscape is populated with low-energy basins separated by frequent high-energy peaks in a high-dimensional space, and is thus preferably searched over multiple stages (Vajda and Kozakov, 2009).

Our goal of UQ for protein docking results is to determine, for each x̂_i, the prediction in X_i (the i-th structure model), its quality bounds [lb_i, ub_i] such that

P(lb_i ⩽ Q(x̂_i, x*) ⩽ ub_i) = 1 − σ,

where the quality of interest Q(x̂, x*) here is the iRMSD between a predicted and the native structure and 1 − σ is a desired confidence level.

To that end, we forward-propagate the uncertainty from x* (the native structure) to iRMSD, given the final posterior p(x*|D^(t)) in individual regions (local posteriors). Specifically, we generate 1,000,000 samples following the local posterior using Markov chain Monte Carlo, evaluate their binding energies using the Kriging regressor, and estimate their iRMSDs by inverting our binding-affinity prediction formula. We then use these sample iRMSD values to determine confidence intervals [lb_i, ub_i] for various confidence levels 1 − σ, so that P(iRMSD < lb_i) = P(iRMSD > ub_i) = σ/2.

We next calculate the probability that a prediction x̂_i is near-native, i.e., P(Q(x̂_i, x*) ⩽ 4 Å) (Méndez et al., 2005). Calculating this quantity demands the probability that the native structure lies in the i-th conformational region/cluster, P(x* ∈ X_i) (P(X_i) in short), as well as the probability that it lies in any of the K regions, P(x* ∈ ∪_{i=1}^K X_i) (P(U_K) in short). Following the chain rule, we reach

P(iRMSD(x̂_i, x*) ⩽
4) = P(iRMSD(x̂_i, x*) ⩽ 4 | X_i) · P(X_i | U_K) · P(U_K).

Here we use the fact that {x* ∈ X_i} ⊂ {x* ∈ ∪_{i=1}^K X_i} and assume that {iRMSD(x̂_i, x*) ⩽ 4} ⊆ {x* ∈ X_i} (the range of conformational clusters in iRMSD is usually wider than 4 Å). We discuss how to calculate each of the three terms of the product.

P(iRMSD(x̂_i, x*) ⩽ 4 | X_i)

If the native structure x* (unknown) is contained in the i-th region/cluster X_i, what is the chance that the predicted structure x̂_i (known) is within 4 Å? We again use forward uncertainty propagation starting with the local posterior p(x*|D^(t)) in X_i. We sample 100,000 structures following the posterior with Markov chain Monte Carlo, calculate their iRMSDs to the prediction x̂_i, and empirically determine the portion within 4 Å as the probability of interest here. Notice that the native interface is unknown, so the putative interface is used instead.

P(X_i | U_K)

If the native structure is contained in at least one of the K regions, what is the chance that it is in X_i? Following statistical mechanics, we reach

P(X_i | U_K) = Z_i / Z = ∫_{x∈X_i} exp(−f̂_i(x)/(RT)) dx / Σ_{j=1}^K ∫_{x∈X_j} exp(−f̂_j(x)/(RT)) dx,

where Z_i and Z are the local and global partition functions, respectively, and f̂_i(x) is the final Kriging regressor. Different regions are assumed to be mutually exclusive. The integrals are calculated by Monte Carlo sampling. Another approach is to replace 1/(RT) above with ρ_i, the final ρ for temperature control in X_i. In practice we did not find a significant performance difference between the two approaches, partly because the final ρ_i in various clusters/regions reached similar values for the same protein complex.

P(U_K)

What is the chance that the native structure is within the union of the initial regions, i.e., that some initial region is near-native? The way to calculate P(U_K) is very similar to that in uncertainty quantification.
Specifically, 100,000 structures are sampled following the posterior of each region X_i, evaluated for binding energy using the Kriging regressor f̂_i(x), and estimated with iRMSD inversely using the binding-affinity predictor formula. We empirically calculate the portion q_i in which sample iRMSD values are above 4Å. Assuming independence among regions, we reach P(U_K) = 1 − ∏_{i=1}^K q_i.

We use the comprehensive protein docking benchmark set 4.0 (Hwang et al., 2010) of 176 protein pairs that diversely and representatively covers sequence and structure space, interaction types, and docking difficulty. We split them into a training set and test sets a and b with stratified sampling to preserve the composition of difficulty levels in each set. The "training" set is not used for tuning BAL parameters (Sec. 2.2). Rather, it is just for training the energy model (y(x; w) in Sec. 2.4) and the conformational-change extent prediction (τ_R in Sec. 2.3.3). The training and test a sets contain 50 and 26 pairs with known K_d values (Kastritis and Bonvin, 2010), respectively, and test set b contains 100 pairs with K_d values predicted from sequence alone (Yugandhar and Gromiha, 2014).

We also use a smaller yet more challenging CAPRI set of 15 recent CAPRI targets (Chen et al., 2017). Unlike the benchmark set for unbound docking, the CAPRI set contains 11 cases of homology docking, 8 of which start with just sequences for both proteins and demand homology models of the structures before protein docking. Whereas the benchmark test set has 86 (68%), 22 (18%), and 18 (14%) cases classified as rigid, medium, and flexible, respectively, the corresponding statistics for the CAPRI set are 4 (27%), 5 (33%), and 6 (40%). Their K_d values are also predicted from sequence alone.

The complete lists of the benchmark sets and the CAPRI set, with difficulty classification, are provided in Sec.
2.5 of the SI.

For each protein pair, we use 10 distinct encounter complexes as starting structures (K = 10). As reported previously (Chen et al., 2017), those for the benchmark sets are top-10 cluster representatives by ZDOCK, kindly provided by the Weng group, and those for the CAPRI set are top-10 models generated by the ZDOCK webserver.

We first tested our BAL algorithm on four non-convex test functions of various dimensions and compared it to particle swarm optimization (PSO) (Kennedy and Eberhart, 1995; Clerc and Kennedy, 2002), an advanced optimization algorithm behind a very successful protein-docking method, SwarmDock (Moal and Bates, 2010). Detailed settings are provided in Sec. 2.7 of the SI.

For optimization we assess ||x̂ − x*||, the distance between the predicted and actual global optima, a measure of direct relevance to the quality of interest in protein docking, iRMSD. Compared to PSO, BAL made predictions that are, on average, closer to the global optima with smaller standard deviations, and the improvement margins increased with increasing dimensions (Table S8 in the Supplementary Material).

For UQ we assess r, the distance upper bound at 90% confidence, i.e., P(||x̂ − x*|| ≤ r) = 1 − σ = 90%. The metrics to assess r include η, the relative error (η = |r/||x̂ − x*|| − 1|), and P̂, the portion of the confidence intervals from 100 runs that actually encompass the corresponding global optimum. We found in Table S9 that our confidence intervals are usually tight judging from η, and that they contain the global optima with portions P̂ close to 90%, the desired confidence level. The portions agreed less with the desired confidence level for some functions as the dimensionality increases, which suggests the challenge of optimization and UQ in higher dimensions.

We compare the performances of three machine learning models over the training and test sets for the energy model (Sec. 2.4.2).
As no actual binding affinities of encounter complexes are available, we estimated the iRMSD values based on the random forest model's binding-energy prediction (Sec. 2.4.1) and compared them to the actual iRMSD (of native interfaces) using RMSE for absolute error. Random forest gave the best performances and is thus used as the energy model y(x) hereinafter. Specifically, performances are split over encounter complexes of varying quality (iRMSD) in Fig. 2A and Fig. 2B. The random forest model (blue bars) led to an RMSE of 0.70Å (1.0Å) for the near-native samples in the training (test) set. The RMSEs increased slowly with iRMSD up to a cutoff and sharply beyond it (a region too far from the native for refinement).

We also assess how "funnel-like" the energy model is. We thus calculated for each protein pair the Spearman's ranking coefficient ρ between the energy model and the actual iRMSD. The random forest of MM-GBSW features showed the highest ρ of 0.72 and 0.60 for the training and the test sets, respectively (Fig. 2C), albeit with large deviation across protein pairs.

We lastly assess the energy model's ability to rank across protein pairs. Specifically, we estimated each native protein complex's binding energy by setting iRMSD to zero in the energy model and compared the estimated and actual binding energies using RMSE and Pearson's r in the supplemental Table S11. The random-forest energy model achieved 2.45 (4.78) kcal/mol in RMSE and a Pearson's r of 0.79 (0.75) in binding energy ΔG = −RT ln(K_d) for the training (test) set.

Table 1. Uncertainty quantification performances of BAL on protein docking, based on η, the relative error in iRMSD, and P̂, the portion of confidence intervals from 100 runs encompassing the global optima. For η, means (and standard deviations in parentheses) are reported at each confidence level 1 − σ for the training, test a, test b, and CAPRI sets.
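The funnel-likeness criterion above reduces to a rank correlation between model energies and iRMSD values over sampled encounter complexes. A minimal sketch in pure NumPy (`spearman_rho` is an illustrative helper that assumes no tied values; it is not the paper's code):

```python
import numpy as np

def spearman_rho(energy, irmsd):
    """Spearman's rank correlation; +1 means a perfectly funnel-like
    model (lower energy always pairs with lower iRMSD). No-tie case."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    rx, ry = rank(np.asarray(energy)), rank(np.asarray(irmsd))
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Toy example: energies increase monotonically with iRMSD, so rho = 1.
print(spearman_rho([-40.0, -35.5, -20.1, -5.0], [0.8, 2.3, 6.7, 11.2]))  # 1.0
```

With real data, the per-pair ρ values would be averaged across a benchmark as in Fig. 2C; in practice a tie-aware implementation (e.g. `scipy.stats.spearmanr`) should be preferred.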
We show the improvements of PSO and BAL solution quality (measured by the decrease of iRMSD) against the starting ZDOCK solutions in Fig. S4 of the Supplemental Material. In terms of the amount of improvement, BAL improved iRMSD by 1.2Å, 0.74Å, and 0.76Å for the training, test, and CAPRI sets, respectively, outperforming PSO's corresponding measures of 0.82Å, 0.45Å, and 0.49Å. It also outperformed PSO for the more challenging near-native cases (note that BAL's iRMSD improvement for the near-native test set or CAPRI set was almost neutral). In terms of the portion with improvement, BAL improved iRMSD in the near-native cases for 75%, 68%, and 73% of the training, test, and CAPRI sets, respectively, whereas the corresponding statistics for PSO were 59%, 50%, and 53%. More split statistics based on docking difficulty can be found in Fig. S5.

We also compared BAL and PSO solutions head-to-head over subsets of varying difficulty levels for protein docking (Fig. 3). Overall, BAL's solutions are better (or significantly better, by at least 0.5Å) than those of PSO for 70%-80% (31%-45%) of the cases, which was relatively insensitive to the docking difficulty level.

Both PSO and BAL use a single trajectory of 31 iterations and 630 samples for each region/cluster. Most of the time is spent on local structure minimization and energy evaluations using CHARMM (Brooks et al., 2009). The BAL running time for optimization of each cluster thus grows almost linearly with the size of the protein pair (Fig. S8), ranging from 7 hours for a 200-residue complex to 13 hours for a 1700-residue one.
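The confidence intervals assessed next come from the forward uncertainty propagation described earlier: draw posterior samples of x*, map each to an iRMSD value, and cut σ/2 off each tail of the resulting empirical distribution. Stripped of the MCMC sampler and the Kriging/affinity models, the interval construction is an empirical quantile computation. A sketch on synthetic iRMSD samples (the gamma stand-in distribution and the helper name are purely illustrative):

```python
import numpy as np

def confidence_interval(irmsd_samples, sigma=0.15):
    """Equal-tailed (1 - sigma) interval [lb, ub] such that, empirically,
    P(iRMSD < lb) = P(iRMSD > ub) = sigma / 2."""
    lb, ub = np.quantile(irmsd_samples, [sigma / 2.0, 1.0 - sigma / 2.0])
    return float(lb), float(ub)

rng = np.random.default_rng(0)
# Stand-in for iRMSD values propagated from 10^6 posterior samples.
samples = rng.gamma(shape=4.0, scale=1.5, size=1_000_000)
lb, ub = confidence_interval(samples, sigma=0.15)  # 85% confidence level
coverage = np.mean((samples >= lb) & (samples <= ub))
```

The same propagated samples also yield the near-native probability terms, e.g. the portion of samples within 4Å of a given prediction.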
We next assess the UQ results for protein docking. Similar to that for test functions, we assess r_{1−σ}, the half length of the (1 − σ) confidence interval [lb, ub], i.e., P(lb ≤ iRMSD(x̂, x*) ≤ ub) = 1 − σ. The metrics to assess r_{1−σ} include η, the relative error (η = |r_{1−σ}/iRMSD(x̂, x*) − 1|), and P̂, the portion of the confidence intervals that actually contain the corresponding native structure across all docking runs (10 for the 10 models of each protein pair in each set).

Table 1 shows that the portions matched the confidence levels well over all four data sets. Test set b and the CAPRI set did not have actual K_d values available and were thus impacted further by the uncertainty of K_d prediction, although the impact did not appear significant. There was a trade-off between the confidence level and the length of the confidence interval, as narrower confidence intervals (with smaller η) corresponded to lower confidence levels. A balance seems to be at the 85% confidence level, where the relative iRMSD uncertainty is around 25%.

For quality estimation, two metrics are used for assessing the performance. The first is Spearman's ρ for ranking protein-docking predictions (structure models) for each pair. The second is the area under the Precision-Recall Curve (AUPRC) for the binary classification of each prediction being near-native or not.
Considering that the near-natives are minorities among all predictions, AUPRC is a more meaningful measure than the more common AUROC.

With these two metrics we assess four scoring functions on predictions x̂_i: (1) ΔE(x̂_i), the MM-GBSW binding energy, i.e., the sum of the 8 features; (2) our random-forest energy model y(x̂_i); (3) P(X_i | U_K), the conditional probability that the ith prediction's region is near-native given that there is at least one such region in the top K predictions; and (4) P(iRMSD(x̂_i, x*) ≤ 4Å), the unconditional probability that the ith prediction is near-native.

For the ranking assessment, from Fig. 4 we find that, whereas the original MM-GBSW model achieved merely 0.2 in Spearman's ρ, our energy model using the same 8 terms as features in random forest drastically improved the ranking performance, with a Spearman's ρ around 0.6 for the training, benchmark test, and CAPRI test sets. Furthermore, the confidence scores P(X_i | U_K) and P(iRMSD(x̂_i, x*) ≤ 4Å) further improved ranking. In particular, the unconditional probability for a prediction to be near-native achieved around 0.70 in ρ even for the benchmark and the CAPRI test sets. Note that this probability, a confidence score on the prediction's near-nativeness, was derived from the posterior distribution of x*; thus it uses both enthalpic and entropic contributions.

For the binary assessment on classifying the nativeness of predictions, the test set is split into a, 26 pairs with known K_d values, and b, 100 pairs with predicted ones. From Table 2 we conclude that the MM-GBSW energy model performed close to random (AUROC close to 0.5), whereas the random-forest energy model using the same features drastically improved AUROC to around 0.8 and AUPRC to 0.54~0.62 across sets. Since AUROC is uninformative for highly imbalanced data (for instance, near-native predictions are 14% over all data sets), we focus on AUPRC.
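Because near-natives are a small minority, the precision-recall trade-off is what the comparison should capture, and AUPRC can be estimated directly from ranked confidence scores. A self-contained sketch using the average-precision estimator (toy data; this is one common AUPRC convention, not necessarily the exact integration used for Table 2):

```python
import numpy as np

def average_precision(scores, labels):
    """AUPRC via average precision: the mean of the precision values at
    the ranks where each true near-native is recovered."""
    order = np.argsort(scores)[::-1]          # sort by descending confidence
    y = np.asarray(labels, dtype=float)[order]
    tp = np.cumsum(y)                          # true positives at each cutoff
    precision = tp / np.arange(1, len(y) + 1)
    return float((precision * y).sum() / y.sum())

# Toy set: 2 near-natives among 10 predictions (positives ranked 1st and 3rd).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(average_precision(scores, labels))  # (1/1 + 2/3) / 2 = 0.833...
```

A 14% positive rate means a random classifier scores about 0.14 in AUPRC, which is the baseline against which the values in Table 2 should be read.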
The next three probabilities from our BAL's confidence scores improved the AUPRC to nearly 0.80 for the training set and above 0.60 for the test sets. The additional uncertainty in K_d prediction for test set b and the CAPRI set did not noticeably impact the performance compared to test set a.

Dataset     Assessment   ΔE(x̂_i)   y(x̂_i)   P(X_i|U_K)   P(iRMSD(x̂_i,x*)≤4Å|U_K)   P(iRMSD(x̂_i,x*)≤4Å)
Training    AUROC        0.489      0.806     0.903        0.944                       0.967
            AUPRC        0.241      0.624     0.684        0.771                       0.796
Testing a   AUROC        0.460      0.810     0.892        0.929                       0.939
            AUPRC        0.199      0.550     0.592        0.613                       0.634
Testing b   AUROC        0.490      0.789     0.847        0.898                       0.927
            AUPRC        0.203      0.540     0.571        0.609                       0.615
CAPRI       AUROC        0.491      0.771     0.844        0.893                       0.919
            AUPRC        0.214      0.561     0.600        0.610                       0.614
Table 2. Binary quality estimation for our 5 scoring functions on a prediction x̂_i being near-native or not: the MM-GBSW energy model ΔE(x̂_i), the random-forest energy model y(x̂_i), and 3 BAL-determined probabilities: that a region/cluster X_i is near-native given a native-containing list, that the prediction x̂_i in that region is near-native given a native-containing list, and that x̂_i is near-native.

We summarize our docking results (BAL predictions x̂_i ranked by the confidence scores on their nativeness, P(iRMSD(x̂_i, x*) ≤ 4Å)) in Table 3, and compare them to the ZDOCK starting results (ranked by cluster size, roughly reflecting entropy) and the PSO refinement results (using the same energy model as BAL and ranked by the energy model). We use N_K to denote the number of targets with at least one near-native prediction in the top K, and F_K the fraction of such targets among all in a given set (training, benchmark test, or CAPRI test set). Compared to the ZDOCK starting results and PSO refinements, BAL improved the portion of targets acceptable within the top 3 predictions from 23% and 26%, respectively, to 32% for the benchmark test set. Similar improvements were found for the CAPRI set. The portion for the top 10 from BAL reached 40%, compared to ZDOCK's 33%, over the benchmark test set. Notice that BAL only refined
                 ZDOCK                               PSO                                 BAL
Dataset (size)   N_3 (F_3)  N_5 (F_5)  N_10 (F_10)   N_3 (F_3)  N_5 (F_5)  N_10 (F_10)   N_3 (F_3)  N_5 (F_5)  N_10 (F_10)
Training (50)    11 (22%)   13 (26%)   17 (34%)      15 (30%)   17 (34%)   20 (40%)      19 (38%)   20 (40%)   22 (44%)
Test (126)       29 (23%)   33 (26%)   41 (33%)      33 (26%)   39 (31%)   45 (36%)      40 (32%)   42 (33%)   50 (40%)
CAPRI (15)       2 (13%)    3 (20%)    4 (27%)       3 (20%)    4 (27%)    5 (33%)       4 (27%)    5 (33%)    5 (33%)
Table 3. Summary of docking results, measured by the number and the portion of targets in each set that have an acceptable (near-native) top-3, top-5, or top-10 prediction.

of homology docking. Compared to the starting points from initial rigid docking as well as the refinement from PSO, BAL shows significant improvement, accomplishing a top-3 near-native prediction for about one third of the benchmark and CAPRI sets. Its UQ results achieve tight uncertainty intervals whose radius is 25% of iRMSD at an 85% confidence level, as attested by empirical results. Moreover, its estimated probability of a prediction being near-native achieves an AUROC over 0.93 and an AUPRC over 0.60 (more than 4 times that of random classification).

Besides the optimization and UQ algorithms, we for the first time represent the conformational space for protein docking as a flat Euclidean space spanned by complex normal modes, blending flexible- and rigid-body motions and anticipating protein conformational changes: a homogeneous and isotropic space friendly to high-dimensional optimization. We also construct a funnel-like energy model using machine learning to associate the binding energies of encounter complexes sampled in docking with their iRMSD. These innovations also contribute to the excellent performance of BAL, and lead to direct visualization of binding energy funnels and protein association pathways in conformational degrees of freedom.
Acknowledgements
We thank Thom Vreven and Zhiping Weng for the ZDOCK rigid-dockingdecoys for the benchmark set, Haoran Chen for helping on cNMA, andYuanfei Sun for proofreading the manuscript. Part of the CPU time wasprovided by the Texas A&M High Performance Research Computing.
Funding
This work was supported by the National Institutes of Health(R35GM124952) and the National Science Foundation (CCF-1546278).
Conflict of interest:
None declared.
References
Agrawal, S. and Goyal, N. (2011). Analysis of Thompson Sampling for the multi-armed bandit problem. arXiv:1111.1797 [cs].
Brooks, B. R. et al. (2009). CHARMM: the biomolecular simulation program. Journal of Computational Chemistry, (10), 1545–1614.
Chapelle, O. and Li, L. (2011). An Empirical Evaluation of Thompson Sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc.
Chen, H. et al. (2017). Predicting protein conformational changes for unbound and homology docking: learning from intrinsic and induced flexibility. Proteins: Structure, Function, and Bioinformatics, (3), 544–556.
Chilès, J.-P. and Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty, 2nd Edition.
Clerc, M. and Kennedy, J. (2002). The particle swarm-explosion, stability, and convergence in a multidimensional complex space. IEEE Transactions on Evolutionary Computation, (1), 58–73.
Der Kiureghian, A. and Ditlevsen, O. (2009). Aleatory or epistemic? Does it matter? Structural Safety, (2), 105–112.
Gray, J. J. et al. (2003). Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology, (1), 281–299.
Györfi, L. et al. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer-Verlag, New York.
Hernández-Lobato, J. M. et al. (2014). Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 918–926, Cambridge, MA, USA. MIT Press.
Hwang, H. et al. (2010). Protein–Protein Docking Benchmark Version 4.0. Proteins, (15), 3111–3114.
Ingber, L. (2000). Adaptive simulated annealing (ASA): lessons learned. CoRR, cs.MS/0001018.
Jiménez-García, B. et al. (2017). LightDock: a new multi-scale approach to protein–protein docking. Bioinformatics, (1), 49–55.
Kastritis, P. L. and Bonvin, A. M. J. J. (2010). Are Scoring Functions in Protein–Protein Docking Ready To Predict Interactomes? Clues from a Novel Binding Affinity Benchmark. Journal of Proteome Research, (5), 2216–2225.
Kennedy, J. and Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks (ICNN'95).
Kozakov, D. et al. (2005). Optimal clustering for detecting near-native conformations in protein docking. Biophysical Journal, (2), 867–875.
Li, W. et al. (2016). Estimation of Uncertainties in the Global Distance Test (GDT_ts) for CASP Models. PLOS ONE, (5), e0154786.
Marze, N. A. et al. (2018). Efficient Flexible Backbone Protein–Protein Docking for Challenging Targets. Bioinformatics (Oxford, England).
Mirzaei, H. et al. (2015). Energy minimization on manifolds for docking flexible molecules. Journal of Chemical Theory and Computation, (3), 1063–1076.
Moal, I. H. and Bates, P. A. (2010). SwarmDock and the Use of Normal Modes in Protein–Protein Docking. International Journal of Molecular Sciences, (10), 3623–3648.
Mosca, R. et al. (2013). Interactome3D: adding structural details to protein networks. Nature Methods, (1), 47.
Méndez, R. et al. (2005). Assessment of CAPRI predictions in rounds 3–5 shows progress in docking procedures. Proteins: Structure, Function, and Bioinformatics, (2), 150–169.
Oliwa, T. and Shen, Y. (2015). cNMA: a framework of encounter complex-based normal mode analysis to model conformational changes in protein interactions. Bioinformatics, (12), i151–i160.
Ortega, P. et al. (2012). A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 3005–3013. Curran Associates, Inc.
Pfeiffenberger, E. and Bates, P. A. (2018). Refinement of protein-protein complexes in contact map space with metadynamics simulations. Proteins: Structure, Function, and Bioinformatics.
Rasheed, M. et al. (2017). Statistical Framework for Uncertainty Quantification in Computational Molecular Modeling. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Russo, D. et al. (2017). A Tutorial on Thompson Sampling. arXiv:1707.02038 [cs].
Shahriari, B. et al. (2016). Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, (1), 148–175.
Shen, Y. (2013). Improved flexible refinement of protein docking in CAPRI rounds 22–27. Proteins: Structure, Function, and Bioinformatics, (12), 2129–2136.
Shen, Y. et al. (2007). Optimizing noisy funnel-like functions on the Euclidean group with applications to protein docking. In Decision and Control, 2007, 46th IEEE Conference on, pages 4545–4550. IEEE.
Shen, Y. et al. (2008). Protein docking by the underestimation of free energy funnels in the space of encounter complexes. PLoS Computational Biology, (10), e1000191.
Smith, G. R. and Sternberg, M. J. (2002). Prediction of protein–protein interactions by docking methods. Current Opinion in Structural Biology, (1), 28–35.
Vajda, S. and Kozakov, D. (2009). Convergence and combination of methods in protein–protein docking. Current Opinion in Structural Biology, (2), 164–170.
Villemonteix, J. et al. (2006). An informational approach to the global optimization of expensive-to-evaluate functions. CoRR, abs/cs/0611143.
Yugandhar, K. and Gromiha, M. M. (2014). Protein–protein binding affinity prediction from amino acid sequence. Bioinformatics, 30