Learning to Optimize in Swarms
Yue Cao, Tianlong Chen, Zhangyang Wang, Yang Shen
Departments of Electrical and Computer Engineering & Computer Science and Engineering
Texas A&M University, College Station, TX 77840
{cyppsp,wiwjp619,atlaswang,yshen}@tamu.edu
November 19, 2019
Abstract
Learning to optimize has emerged as a powerful framework for various optimization and machine learning tasks. Current such "meta-optimizers" often learn in the space of continuous optimization algorithms that are point-based and uncertainty-unaware. To overcome these limitations, we propose a meta-optimizer that learns in the algorithmic space of both point-based and population-based optimization algorithms. The meta-optimizer targets a meta-loss function consisting of both cumulative regret and entropy. Specifically, we learn and interpret the update formula through a population of LSTMs embedded with sample- and feature-level attentions. Meanwhile, we estimate the posterior directly over the global optimum and use an uncertainty measure to help guide the learning process. Empirical results over non-convex test functions and the protein-docking application demonstrate that this new meta-optimizer outperforms existing competitors. The code and the supplementary information are publicly available at: https://github.com/Shen-Lab/LOIS

1 Introduction

Optimization provides a mathematical foundation for solving quantitative problems in many fields, along with numerical challenges. The no-free-lunch theorem indicates the non-existence of a universally best optimization algorithm for all objectives. To manually design an effective optimization algorithm for a given problem, many efforts have been spent on tuning and validating pipelines, architectures, and hyperparameters. For instance, in deep learning, there is a gallery of gradient-based algorithms specific to high-dimensional, non-convex objective functions, such as Stochastic Gradient Descent [22], RMSProp [25], and Adam [16]. Another example is ab initio protein docking, whose energy functions as objectives have extremely rugged landscapes and are expensive to evaluate.
Gradient-free algorithms are thus popular there, including Markov chain Monte Carlo (MCMC) [12] and Particle Swarm Optimization (PSO) [19].

To overcome the laborious manual design, an emerging approach of meta-learning (learning to learn) takes advantage of the knowledge learned from related tasks. In meta-learning, the goal is to learn a meta-learner that can solve a set of problems, where each sample in the training or test set is a particular problem. As in classical machine learning, the fundamental assumption of meta-learning is the generalizability from solving the training problems to solving the test ones. For optimization problems, a key to meta-learning is how to efficiently utilize the information in the objective function and explore the space of optimization algorithms.

In this study, we introduce a novel framework in meta-learning, where we train a meta-optimizer that learns in the space of both point-based and population-based optimization algorithms for continuous optimization. To that end, we design a novel architecture where a population of RNNs (specifically, LSTMs) jointly learn iterative update formulae for a population of samples (or a swarm of particles). To balance exploration and exploitation in search, we directly estimate the posterior over the optimum and include in the meta-loss function the differential entropy of the posterior. Furthermore, we embed feature- and sample-level attentions in our meta-optimizer to interpret the learned optimization strategies. Our numerical experiments, including global optimization for non-convex test functions and an application of protein docking, endorse the superiority of the proposed meta-optimizer.

2 Related Work

Meta-learning originated from the field of psychology [27, 14]. [4, 6, 5] optimized a learning rule in a parameterized learning-rule space. [30] used an RNN to automatically design a neural network architecture.
More recently, learning to learn has also been applied to sparse coding [13, 26, 9, 18], plug-and-play optimization [23], and so on.

In the field of learning to optimize, [1] proposed the first framework, where gradients and function values were used as the features for an RNN. A coordinate-wise structure of the RNN relieved the burden of the enormous number of parameters, so that the same update formula was used independently for each coordinate. [17] used the history of gradients and objective values as states and step vectors as actions in reinforcement learning. [10] also used an RNN to train a meta-learner to optimize black-box functions, including Gaussian process bandits, simple control objectives, and hyperparameter tuning tasks. Lately, [28] introduced a hierarchical RNN architecture, augmented with additional architectural features that mirror the known structure of optimization tasks.

The target applications of previous methods are mainly focused on training deep neural networks, except [10], which focuses on optimizing black-box functions. There are three limitations of these methods. First, they learn in a limited algorithmic space, namely point-based optimization algorithms that use gradients or not (including SGD and Adam). So far there is no method in learning to learn that reflects population-based algorithms (such as evolutionary and swarm algorithms) proven powerful in many optimization tasks. Second, their learning is guided by a limited meta loss, often the cumulative regret in sampling history, which primarily drives exploitation. One exception is the expected improvement (EI) used by [10] under Gaussian processes.
Last but not least, these methods do not interpret the process of learning update formulae, despite the previous usage of attention mechanisms in [28].

To overcome the aforementioned limitations of current learning-to-optimize methods, we present a new meta-optimizer with the following contributions:

• (Where to learn): We learn in an extended space of both point-based and population-based optimization algorithms;
• (How to learn): We incorporate the posterior into the meta loss to guide the search in the algorithmic space and balance the exploitation-exploration trade-off;
• (What more to learn): We design a novel architecture where a population of LSTMs jointly learn iterative update formulae for a population of samples, and we embed sample- and feature-level attentions to explain the formulae.
3 Method

We use the following convention for notations throughout the paper. Scalars, vectors (column vectors unless stated otherwise), and matrices are denoted in lowercase, bold lowercase, and bold uppercase, respectively. The superscript $'$ indicates vector transpose.

Our goal is to solve the following optimization problem:

$$x^* = \arg\min_{x \in \mathbb{R}^n} f(x).$$

Iterative optimization algorithms, either point-based or population-based, have the same generic update formula:

$$x^{t+1} = x^t + \delta x^t,$$

where $x^t$ and $\delta x^t$ are the sample (a single sample is called a "particle" in swarm algorithms) and the update (a.k.a. the step vector) at iteration $t$, respectively. The update is often a function $g(\cdot)$ of historic sample values, objective values, and gradients. For instance, in point-based gradient descent,

$$\delta x^t = g(\{x^\tau, f(x^\tau), \nabla f(x^\tau)\}_{\tau=1}^t) = -\alpha \nabla f(x^t),$$

where $\alpha$ is the learning rate. In particle swarm optimization (PSO), assuming that there are $k$ samples (particles), the update for particle $j$ is determined by the entire population:

$$\delta x_j^t = g(\{\{x_j^\tau, f(x_j^\tau), \nabla f(x_j^\tau)\}_{j=1}^k\}_{\tau=1}^t) = w\,\delta x_j^{t-1} + r_1 (x_j^{t*} - x_j^t) + r_2 (x^{t*} - x_j^t),$$

where $x_j^{t*}$ and $x^{t*}$ are the best positions (with the smallest objective values) of particle $j$ and among all particles, respectively, during the first $t$ iterations; and $w$, $r_1$, $r_2$ are hyperparameters, often randomly sampled from a fixed distribution (e.g., the standard Gaussian distribution) at each iteration.

In most modern optimization algorithms, the update formula $g(\cdot)$ is analytically determined and fixed during the whole process. Unfortunately, similar to what the No Free Lunch Theorem suggests in machine learning, there is no single best algorithm for all kinds of optimization tasks. Every state-of-the-art algorithm has its own best-performing problem set or domain.
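As a concrete reference point, the generic PSO update can be sketched in a few lines of NumPy. This is only a minimal illustration of the classic base algorithm, not the meta-optimizer; the inertia weight `w` and the uniform resampling of `r1`, `r2` at every iteration are assumptions of this sketch (the text only requires the hyperparameters to be drawn from some fixed distribution).

```python
import numpy as np

def pso_step(x, v, personal_best, global_best, w=0.7):
    """One PSO update for a population x of shape (k, n).

    v is the previous step delta-x of shape (k, n); personal_best is
    x_j^{t*} of shape (k, n); global_best is x^{t*} of shape (n,).
    r1, r2 are resampled at every iteration (here from U[0, 1),
    an assumption for this sketch).
    """
    k, _ = x.shape
    r1 = np.random.rand(k, 1)
    r2 = np.random.rand(k, 1)
    delta = w * v + r1 * (personal_best - x) + r2 * (global_best - x)
    return x + delta, delta
```

Note that the whole update is a fixed, linear function of the population history, which is exactly the restriction that the learned update formula relaxes.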
Therefore, it makes sense to learn the optimal update formula $g(\cdot)$ from the data in the specific problem domain, which is called "learning to optimize". For instance, in [1], the function $g(\cdot)$ is parameterized by a recurrent neural network (RNN) with input $\nabla f(x^t)$ and the hidden state from the last iteration: $g(\cdot) = \mathrm{RNN}(\nabla f(x^t), h^{t-1})$. In [10], the inputs of the RNN are $x^t$, $f(x^t)$, and the hidden state from the last iteration: $g(\cdot) = \mathrm{RNN}(x^t, f(x^t), h^{t-1})$.

We describe the details of our population-based meta-optimizer in this section. Compared to previous meta-optimizers, we employ $k$ samples whose update formulae are learned from the population history and are individually customized, using attention mechanisms. Specifically, our update rule for particle $i$ can be written as

$$g_i(\cdot) = \mathrm{RNN}_i\big(\alpha_i^{\mathrm{inter}}(\{\alpha_j^{\mathrm{intra}}(\{S_j^\tau\}_{\tau=1}^t)\}_{j=1}^k),\ h_i^{t-1}\big),$$

where $S_j^t = (s_{j1}^t, s_{j2}^t, s_{j3}^t, s_{j4}^t)$ is an $n \times 4$ feature matrix for particle $j$ at iteration $t$, $\alpha_j^{\mathrm{intra}}(\cdot)$ is the intra-particle attention function for particle $j$, and $\alpha_i^{\mathrm{inter}}(\cdot)$ is the $i$-th output of the inter-particle attention function. $h_i^{t-1}$ is the hidden state of the $i$-th LSTM at iteration $t-1$.

For typical population-based algorithms, the same update formula is adopted by all particles. We follow this convention and set $g_1(\cdot) = g_2(\cdot) = \dots = g_k(\cdot)$, which implies $\mathrm{RNN}_i = \mathrm{RNN}$ and $\alpha_j^{\mathrm{intra}}(\cdot) = \alpha^{\mathrm{intra}}(\cdot)$.

We will first introduce the feature matrix $S_j^\tau$ and then describe the intra- and inter-particle attention modules.
Considering the expressiveness and the searchability of the algorithmic space, we consider the update formulae of both point- and population-based algorithms and choose the following four features for particle $i$ at iteration $t$:

• gradient: $\nabla f(x_i^t)$;
• momentum: $m_i^t = \sum_{\tau=1}^t (1-\beta)\beta^{t-\tau} \nabla f(x_i^\tau)$;
• velocity: $v_i^t = x_i^{t*} - x_i^t$;
• attraction: $\frac{\sum_j \exp(-\alpha d_{ij}^2)\,(x_j^t - x_i^t)}{\sum_j \exp(-\alpha d_{ij}^2)}$, over all $j$ such that $f(x_j^t) < f(x_i^t)$, where $\alpha$ is a hyperparameter and $d_{ij} = \|x_i^t - x_j^t\|_2$.

These four features include two from point-based algorithms using gradients and two from population-based algorithms. Specifically, the first two are used in gradient descent and Adam. The third feature, velocity, comes from PSO, where $x_i^{t*}$ is the best position (with the lowest objective value) of particle $i$ in the first $t$ iterations. The last feature, attraction, is from the Firefly algorithm [29]. The attraction on particle $i$ is the weighted average of $x_j^t - x_i^t$ over all $j$ such that $f(x_j^t) < f(x_i^t)$, where the weight of particle $j$ is the Gaussian similarity between particles $i$ and $j$. For the particle with the smallest $f(x_i^t)$, we simply set this feature vector to zero. In this paper, we use $\beta = 0.9$ and $\alpha = 1$.

It is noteworthy that each feature vector is of dimension $n \times 1$, where $n$ is the dimension of the search space. Besides, the update formula in each base algorithm is linear w.r.t. its corresponding feature. To learn a better update formula, we incorporate these features into our model of deep neural networks, which is described next.

Fig. 1a depicts the overall architecture of our proposed model. We use a population of LSTMs and design two attention modules: a feature-level ("intra-particle") attention and a sample-level ("inter-particle") attention.
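The four features can be sketched for a single particle as follows. This is a minimal NumPy illustration under the stated hyperparameters ($\beta = 0.9$, $\alpha = 1$); the function and argument names (`features`, `best_pos`, etc.) are chosen here for exposition and are not taken from the paper's code.

```python
import numpy as np

def features(i, X, F, grads, momentum_prev, best_pos, beta=0.9, alpha=1.0):
    """The four input features for particle i, each of dimension n.

    X: (k, n) current positions; F: (k,) objective values;
    grads: (k, n) gradients; momentum_prev: running momentum m_i^{t-1};
    best_pos: (k, n) per-particle best positions x_i^{t*}.
    """
    g = grads[i]                               # gradient feature
    m = beta * momentum_prev + (1 - beta) * g  # momentum (exp. moving average)
    v = best_pos[i] - X[i]                     # velocity toward x_i^{t*}
    better = F < F[i]                          # particles with lower objective
    if better.any():
        d2 = np.sum((X[i] - X[better]) ** 2, axis=1)
        w = np.exp(-alpha * d2)                # Gaussian similarity weights
        a = (w[:, None] * (X[better] - X[i])).sum(0) / w.sum()
    else:
        a = np.zeros_like(X[i])                # best particle: zero attraction
    return g, m, v, a
```

Stacking the four returned vectors column-wise gives the $n \times 4$ feature matrix $S_i^t$ fed to the attention modules.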
For particle $i$ at iteration $t$, the intra-particle attention module reweights each feature based on the context vector $h_i^{t-1}$, the hidden state from the $i$-th LSTM in the last iteration. The reweighted features of all particles are fed into an inter-particle attention module, together with a $k \times k$ distance-similarity matrix. The inter-particle attention module learns the information from the other $k-1$ particles and lets it affect the update of particle $i$. The outputs of the inter-particle attention module are then sent into $k$ identical LSTMs for individual updates.

For the intra-particle attention module, we use the idea from [2, 3, 8]. As shown in Fig. 1b, given that the $j$-th input feature of the $i$-th particle at iteration $t$ is $s_{ij}^t$, we have

$$b_{ij}^t = v_a' \tanh(W_a s_{ij}^t + U_a h_i^{t-1}), \qquad p_{ij}^t = \frac{\exp(b_{ij}^t)}{\sum_{r=1}^4 \exp(b_{ir}^t)},$$

where $v_a \in \mathbb{R}^n$ and $W_a, U_a \in \mathbb{R}^{n \times n}$ are weight matrices, and $h_i^{t-1} \in \mathbb{R}^n$ is the hidden state from the $i$-th LSTM at iteration $t-1$.
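A minimal sketch of these two equations for one particle, with randomly initialized weights standing in for the learned parameters $v_a$, $W_a$, $U_a$ (in the trained model these are shared across particles and learned end to end):

```python
import numpy as np

def intra_attention(S, h, Wa, Ua, va):
    """Feature-level attention for one particle.

    S: (n, 4) feature matrix with columns s_{i1}^t..s_{i4}^t;
    h: (n,) hidden state h_i^{t-1}; Wa, Ua: (n, n); va: (n,).
    Returns the reweighted feature c_i^t of shape (n,) and the
    four attention weights p_{i*}^t.
    """
    # b_{ij}^t = va' tanh(Wa s_{ij}^t + Ua h_i^{t-1}) for each feature j
    b = np.array([va @ np.tanh(Wa @ S[:, j] + Ua @ h)
                  for j in range(S.shape[1])])
    p = np.exp(b - b.max())          # numerically stable softmax
    p = p / p.sum()
    c = S @ p                        # c_i^t = sum_r p_{ir}^t s_{ir}^t
    return c, p
```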
Figure 1: (a) The architecture of our meta-optimizer for one step. We have $k$ particles. For each particle, we use gradient, momentum, velocity, and attraction as features. The features of each particle are sent into an intra-particle (feature-level) attention module, together with the hidden state of the previous LSTM. The outputs of the $k$ intra-particle attention modules, together with a kernelized pairwise similarity matrix $Q^t$ (yellow box in the figure), form the input of an inter-particle (sample-level) attention module. The role of the inter-particle attention module is to capture the cooperativeness of all particles in order to reweight features and send them into $k$ LSTMs. The LSTMs' outputs, $\delta x$, are used for generating new samples. (b) The architectures of the intra- and inter-particle attention modules.

Here, $b_{ij}^t$ is the output of the fully-connected (FC) layer and $p_{ij}^t$ is the output after the softmax layer. We then use $p_{ij}^t$ to reweight the input features:

$$c_i^t = \sum_{r=1}^4 p_{ir}^t s_{ir}^t,$$

where $c_i^t \in \mathbb{R}^n$ is the output of the intra-particle attention module for the $i$-th particle at iteration $t$.

For inter-particle attention, we model $\delta x_i^t$ for each particle $i$ under the impacts of the other $k-1$ particles. Specific considerations are as follows.

• The closer two particles are, the more they impact each other's update. Therefore, we construct a kernelized pairwise similarity matrix $Q^t \in \mathbb{R}^{k \times k}$ (column-normalized) as a weight matrix, with elements $q_{ij}^t = \frac{\exp(-\|x_i^t - x_j^t\|_2^2)}{\sum_{r=1}^k \exp(-\|x_r^t - x_j^t\|_2^2)}$.
• The more similar two particles are in their intra-particle attention outputs ($c_i^t$, local suggestions for updates), the more they impact each other's update. Therefore, we introduce another weight matrix $M^t \in \mathbb{R}^{k \times k}$ with elements $m_{ij}^t = \frac{\exp((c_i^t)' c_j^t)}{\sum_{r=1}^k \exp((c_r^t)' c_j^t)}$ (normalized after column-wise softmax).

As shown in Fig.
1b, the output of the inter-particle module for the $j$-th particle is

$$e_j^t = \gamma \sum_{r=1}^k m_{rj}^t q_{rj}^t c_r^t + c_j^t,$$

where $\gamma$ is a hyperparameter that controls the relative contribution of the other $k-1$ particles to the $j$-th particle. In this paper, $\gamma$ is set to 1 without further optimization.

Cumulative regret is a common meta loss function: $L(\phi) = \sum_{t=1}^T \sum_{j=1}^k f(x_j^t)$. However, this loss function has two main drawbacks. First, it does not reflect any exploration. If the search algorithm used for training the optimizer does not employ exploration, it can easily be trapped in the vicinity of a local minimum. Second, for population-based methods, this loss function tends to drag all particles to quickly converge to the same point.

To balance the exploration-exploitation trade-off, we adopt the work of [7], which built a Bayesian posterior distribution over the global optimum $x^*$ as $p(x^* \mid \bigcup_{t=1}^T D^t)$, where $D^t$ denotes the samples at iteration $t$: $D^t = \{(x_j^t, f(x_j^t))\}_{j=1}^k$. We claim that, in order to reduce the uncertainty about the whereabouts of the global minimum, the next samples can be chosen to minimize the entropy of the posterior, $h\big(p(x^* \mid \bigcup_{t=1}^T D^t)\big)$. Therefore, we propose the following loss function for a function $f(\cdot)$:

$$\ell_f(\phi) = \sum_{t=1}^T \sum_{j=1}^k f(x_j^t) + \lambda\, h\Big(p\big(x^* \mid \bigcup_{t=1}^T D^t\big)\Big),$$

where $\lambda$ controls the balance between exploitation and exploration and $\phi$ is the vector of model parameters.

Following [7], the posterior over the global optimum is modeled as a Boltzmann distribution:

$$p\big(x^* \mid \bigcup_{t=1}^T D^t\big) \propto \exp(-\rho \hat{f}(x)),$$

where $\hat{f}(x)$ is a function estimator and $\rho$ is the annealing constant. In the original work of [7], both $\hat{f}(x)$ and $\rho$ are updated over iterations $t$ for active sampling.
In our work, they are fixed since the complete training sample paths are available at once.

Specifically, for a function estimator based on the samples in $\bigcup_{t=1}^T D^t$, we use a Kriging regressor [11], which is known to be the best linear unbiased estimator (BLUE):

$$\hat{f}(x) = \bar{f}(x) + (\kappa(x))' (K + \epsilon I)^{-1} (y - \bar{f}),$$

where $\bar{f}(x)$ is the prior for $\mathbb{E}[f(x)]$ (we use $\bar{f}(x) = 0$ in this study); $\kappa(x)$ is the kernel vector whose $i$-th element is the kernel, a measure of similarity, between $x$ and $x_i$; $K$ is the kernel matrix whose $(i,j)$-th element is the kernel between $x_i$ and $x_j$; $y$ and $\bar{f}$ are the vectors consisting of $y_1, \dots, y_{n_t}$ and $\bar{f}(x_1), \dots, \bar{f}(x_{n_t})$, respectively; and $\epsilon$ reflects the noise in the observations and is often estimated by the average training error (set at 2.1 in this study).

For $\rho$, we follow the annealing schedule in [7] with a one-step update:

$$\rho = \rho_0 \cdot \exp(h_0)^{-\frac{\left|\bigcup_{t=1}^T D^t\right|}{n}},$$

where $\rho_0$ is the initial value of $\rho$ ($\rho_0 = 1$ without further optimization here); $h_0$ is the initial entropy of the posterior with $\rho = \rho_0$; and $n$ is the dimensionality of the search space.

In total, our meta loss for $m$ functions $f_q(\cdot)$ ($q = 1, \dots, m$), analogous to $m$ training examples, with L2 regularization is

$$L(\phi) = \frac{1}{m} \sum_{q=1}^m \ell_{f_q}(\phi) + C \|\phi\|^2.$$

To train our model we use the Adam optimizer, which requires gradients. The first-order gradients are calculated numerically through TensorFlow, following [1]. We use a coordinate-wise LSTM to reduce the number of parameters. In our implementation the length of the LSTM is set to 20. For all experiments, the optimizer is trained for 10,000 epochs with 100 iterations in each epoch.
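The Kriging estimator and the resulting Boltzmann posterior can be sketched on a finite set of query points as follows. The RBF kernel, its bandwidth, and the plug-in entropy over the discrete query set are assumptions of this sketch (the text does not specify the kernel), with the prior mean $\bar{f}(x) = 0$ as stated above.

```python
import numpy as np

def kriging_posterior(X, y, Xq, eps=0.1, gamma=1.0, rho=1.0):
    """Kriging estimate f_hat and Boltzmann posterior on query points Xq.

    X: (m, n) sampled points; y: (m,) objective values; Xq: (q, n).
    The RBF kernel exp(-gamma * ||.||^2) and the discrete plug-in
    entropy are illustrative assumptions. The prior mean is 0, so
    f_hat(x) = kappa(x)' (K + eps I)^{-1} y.
    """
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    K = rbf(X, X)                                # kernel matrix
    kq = rbf(Xq, X)                              # kernel vectors kappa(x)
    f_hat = kq @ np.linalg.solve(K + eps * np.eye(len(X)), y)
    p = np.exp(-rho * (f_hat - f_hat.min()))     # p(x*) ∝ exp(-rho f_hat)
    p = p / p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()     # plug-in entropy estimate
    return f_hat, p, entropy
```

In the meta loss, this entropy term is what discourages the swarm from collapsing prematurely onto a single point.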
4 Experiments
We test our meta-optimizer on convex quadratic functions, non-convex test functions, and an optimization-based application with extremely noisy and rugged landscapes: protein docking.
4.1 Learn to optimize convex quadratic functions

In this case, we minimize a convex quadratic function:

$$f(x) = \|A_q x - b_q\|_2^2,$$

where $A_q \in \mathbb{R}^{n \times n}$ and $b_q \in \mathbb{R}^{n \times 1}$ are parameters whose elements are sampled from i.i.d. normal distributions for the training set. We compare our algorithm with SGD, Adam, PSO, and DeepMind's LSTM (DM_LSTM) [1]. Since different algorithms have different population sizes, for a fair comparison we fix the total number of objective-function evaluations (sample updates) at 1,000 for all methods. The population size $k$ of our meta-optimizer and PSO is set to 4, 10, and 10 in the 2D, 10D, and 20D cases, respectively. During the testing stage, we sample another 128 pairs of $A_q$ and $b_q$ and evaluate the current best function value at each step, averaged over the 128 functions. We repeat the procedure 100 times in order to obtain statistically significant results.

As seen in Fig. 2, our meta-optimizer performs better than DM_LSTM in the 2D, 10D, and 20D cases. Both meta-optimizers perform significantly better than the three baseline algorithms (except that PSO had similar convergence in 2D).
Figure 2: The performance of different algorithms for quadratic functions in (a) 2D, (b) 10D, and (c) 20D. The mean and the standard deviation over 100 runs are evaluated every 50 function evaluations.

We also compare our meta-optimizer's performance with and without the guiding posterior in the meta loss. As shown in the supplemental Fig. S1, including the posterior improves optimization performance, especially in higher dimensions. Meanwhile, posterior estimation in higher dimensions presents more challenges. The impact of the posterior will be further assessed in the ablation study.
4.2 Learn to optimize non-convex Rastrigin functions

We then test the performance on a non-convex test function, the Rastrigin function:

$$f(x) = \sum_{i=1}^n x_i^2 - \sum_{i=1}^n \alpha \cos(2\pi x_i) + \alpha n,$$

where $\alpha = 10$. We consider a broad family of similar functions $f_q(x)$ as the training set:

$$f_q(x) = \|A_q x - b_q\|_2^2 - \alpha\, c_q' \cos(2\pi x) + \alpha n, \qquad (1)$$

where $A_q \in \mathbb{R}^{n \times n}$, $b_q \in \mathbb{R}^{n \times 1}$, and $c_q \in \mathbb{R}^{n \times 1}$ are parameters whose elements are sampled from i.i.d. normal distributions, and $\cos(\cdot)$ is applied element-wise. The Rastrigin function is clearly the special case in this family with $A = I$, $b = (0, \dots, 0)'$, and $c = (1, \dots, 1)'$.

During the testing stage, 100 i.i.d. trajectories are generated in order to reach statistically significant conclusions. The population size $k$ of our meta-optimizer and PSO is set to 4, 10, and 10 for 2D, 10D, and 20D, respectively. The results are shown in Fig. 3. In the 2D case, our meta-optimizer and PSO perform about the same, while DM_LSTM performs much worse. In the 10D and 20D cases, our meta-optimizer outperforms all other algorithms. It is interesting that PSO is the second best among all algorithms, which indicates that population-based algorithms have unique advantages here.
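Sampling one training objective from the family in Eq. (1) can be sketched as below; the function names are chosen here for illustration and are not from the paper's code, while the i.i.d. standard-normal parameters follow the text.

```python
import numpy as np

def sample_rastrigin_family(n, alpha=10.0, rng=None):
    """Draw one training objective f_q from the family in Eq. (1).

    A_q, b_q, c_q have i.i.d. standard-normal entries, as in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    c = rng.standard_normal(n)

    def f(x):
        return (np.sum((A @ x - b) ** 2)
                - alpha * np.sum(c * np.cos(2 * np.pi * x))
                + alpha * n)
    return f

def rastrigin(x, alpha=10.0):
    """The special case A=I, b=0, c=1; f(0) = 0 at the global optimum."""
    n = len(x)
    return np.sum(x ** 2) - alpha * np.sum(np.cos(2 * np.pi * x)) + alpha * n
```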
Figure 3: The performance of different algorithms for Rastrigin functions in (a) 2D, (b) 10D, and (c) 20D. The mean and the standard deviation over 100 runs are evaluated every 50 function evaluations.
We also examine the transferability from convex to non-convex optimization. The hyperparameter $\alpha$ in the Rastrigin family controls the level of ruggedness of the training functions: $\alpha = 0$ corresponds to a convex quadratic function and $\alpha = 10$ to the rugged Rastrigin function. Therefore, we choose three different values of $\alpha$ (0, 5, and 10) to build training sets and test the three resulting trained models on the 10D Rastrigin function. From the results in the supplemental Fig. S2, our meta-optimizer's performance improves when it is trained with increasing $\alpha$. The meta-optimizer trained with $\alpha = 0$ made limited progress over iterations, which indicates the difficulty of learning from convex functions to optimize non-convex rugged functions. The one trained with $\alpha = 5$ has seen significant improvement.

4.3 Interpretation of the learned update formulae

In an effort to rationalize the learned update formula, we choose the 2D Rastrigin test function to illustrate the interpretation analysis. We plot sample paths of our algorithm, PSO, and Gradient Descent (GD) in Fig. 4a. Our algorithm finally reaches the funnel (or valley) containing the global optimum ($x = 0$), while PSO finally reaches a suboptimal funnel. At the beginning, samples of our meta-optimizer are more diverse due to the entropy control in the loss function. In contrast, GD is stuck in a local minimum close to its starting point after 80 samples.

Figure 4: (a) Paths of the first 80 samples of our meta-optimizer, PSO, and GD for 2D Rastrigin functions. Darker shades indicate newer samples. (b) The feature-attention distribution over the first 20 iterations for our meta-optimizer. (c) The percentage of the trace of $\gamma Q^t \odot M^t + I$ (reflecting self-impact on updates) over iterations $t$.

To further show which factor contributes the most to each update, we plot the feature-weight distribution over the first 20 iterations.
For particle $i$ at iteration $t$, the output of its intra-particle attention module is a weighted sum of its four features, $c_i^t = \sum_{r=1}^4 p_{ir}^t s_{ir}^t$, so we sum $p_{ir}^t$ for the $r$-th feature over all particles $i$. The resulting (normalized) weight distribution over the four features, reflecting the contribution of each feature at iteration $t$, is shown in Fig. 4b. In the first 6 iterations, the population-based features contribute most to the updates; point-based features start to play an important role later.

Finally, we examine, in the inter-particle attention module, the extent to which particles work collaboratively or independently. To show this, we plot the percentage of the diagonal part of $\gamma Q^t \odot M^t + I$, namely $\mathrm{tr}(\gamma Q^t \odot M^t + I) \,/\, \sum(\gamma Q^t \odot M^t + I)$, where $\odot$ denotes the element-wise product and the sum in the denominator runs over all matrix elements, as shown in Fig. 4c. It can be seen that, at the beginning, particles work more collaboratively; with more iterations, particles become more independent. However, we note that the trace (reflecting self-impacts) contributes 67%-69% over the iterations while the off-diagonals (impacts from other particles) contribute above 30%, which demonstrates the importance of collaboration, a unique advantage of population-based algorithms.

4.4 Ablation study

How and why our algorithm outperforms DM_LSTM is both interesting and important for unveiling the underlying mechanism of the algorithm. In order to understand each part of our algorithm in depth, we performed an ablation study to progressively show each part's contribution.
Starting from the DM_LSTM baseline ($B_1$), we incrementally crafted four algorithms: running DM_LSTM $k$ times under different initializations and choosing the best solution ($B_2$); using $k$ independent particles, each with the two point-based features, the intra-particle attention module, and the hidden state ($B_3$); adding the two population-based features and the inter-particle attention module to $B_3$ so as to convert the $k$ independent particles into a swarm ($B_4$); and, eventually, adding the entropy term to the meta loss of $B_4$, resulting in our proposed model.

We tested the five algorithms ($B_1$-$B_4$ and the proposed model) on the 10D and 20D Rastrigin functions with the same settings as in Section 4.2, and compared the function minimum values returned by these algorithms (means ± standard deviations over 100 runs, each using 1,000 function evaluations). We make the following observations: i) $B_1$ vs. $B_2$: their performance gap is marginal, which shows that our performance gain is not simply due to having $k$ independent runs; ii) $B_2$ vs. $B_3$ and $B_3$ vs. $B_4$: whereas including intra-particle attention in $B_3$ already notably improves the performance compared to $B_2$, including the population-based features and inter-particle attention in $B_4$ results in the largest performance boost. This confirms that our algorithm benefits mainly from the attention mechanisms; iii) proposed vs. $B_4$: adding the entropy of the posterior gains further, thanks to its balancing of exploration and exploitation during the search.

4.5 Learn to optimize for protein docking

We bring our meta-optimizer to a challenging real-world application. In computational biology, structural knowledge about how proteins interact with each other is critical but remains relatively scarce [20]. Protein docking helps close this gap by computationally predicting the 3D structures of protein-protein complexes given individual proteins' 3D structures or 1D sequences [24].
Ab initio protein docking represents a major challenge of optimizing a noisy and costly function in a high-dimensional conformational space [7].

Mathematically, the problem of ab initio protein docking can be formulated as optimizing an extremely rugged energy function $f(x) = \Delta G(x)$, the Gibbs binding free energy for conformation $x$. We calculate the energy function in a CHARMM 19 force field as in [19] and shift it so that $f(x) = 0$ at the origin of the search space. We parameterize the search space as $\mathbb{R}^{12}$ as in [7]. The resulting $f(x)$ is fully differentiable in the search space. For computational efficiency and batch training, we only consider 100 interface atoms. We choose a training set of 25 protein-protein complexes from the protein docking benchmark set 4.0 [15] (see Supp. Table S1 for the list), each of which has 5 starting points (the top-5 models from ZDOCK [21]). In total, our training set includes 125 instances. During testing, we choose 3 complexes (with 1 starting model each) of different levels of docking difficulty. For comparison, we also use the training set from Eq. (1) ($n = 12$). All methods, including PSO and both versions of our meta-optimizer, use $k = 10$ particles and 40 iterations in the testing stage.

As seen in Fig. 5, both meta-optimizers achieve lower-energy predictions than PSO, and the performance gains increase as the docking difficulty level increases. The meta-optimizer trained on other protein-docking cases performs similarly to the one trained on the Rastrigin family in the easy case and outperforms the latter in the difficult case.
Figure 5: The performance of PSO, our meta-optimizer trained on the Rastrigin function family, and our meta-optimizer trained on real energy functions for three docking cases of different difficulty levels: (a) rigid (easy; 1AY7_2), (b) medium (2HRK_1), and (c) flexible (difficult; 2C0L_1).
5 Conclusion

Designing a well-behaved optimization algorithm for a specific problem is a laborious task. In this paper, we extend point-based meta-optimizers into population-based meta-optimizers, where update formulae for a sample population are jointly learned in the space of both point- and population-based algorithms. In order to balance exploitation and exploration, we introduce the entropy of the posterior over the global optimum into the meta loss, together with the cumulative regret, to guide the search of the meta-optimizer. We further embed intra- and inter-particle attention modules to interpret each update. We apply our meta-optimizer to quadratic functions, Rastrigin functions, and a real-world challenge: protein docking. The empirical results demonstrate that our meta-optimizer outperforms competing algorithms. The ablation study shows that the performance improvement is directly attributable to our algorithmic innovations, namely the population-based features, the intra- and inter-particle attentions, and the posterior-guided meta loss.
Acknowledgments
This work is in part supported by the National Institutes of Health (R35GM124952 to YS). Part of the computing time is provided by Texas A&M High Performance Research Computing.
References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[3] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? In Advances in Neural Information Processing Systems, pages 4261–4271, 2018.

[4] Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier. On the search for new learning rules for ANNs. Neural Processing Letters, 2(4):26–30, 1995.

[5] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. pages 6–8. Univ. of Texas, 1992.

[6] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, July 1991.

[7] Yue Cao and Yang Shen. Bayesian active learning for optimization and uncertainty quantification in protein docking. arXiv preprint arXiv:1902.00067, 2019.

[8] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang. ABD-Net: Attentive but diverse person re-identification. ICCV, 2019.

[9] Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pages 9061–9071, 2018.

[10] Yutian Chen, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P. Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning, pages 748–756. JMLR.org, 2017.

[11] Jean-Paul Chilès and Pierre Delfiner. Geostatistics: Modeling Spatial Uncertainty, 2nd Edition. 2012.

[12] Jeffrey J. Gray, Stewart Moughon, Chu Wang, Ora Schueler-Furman, Brian Kuhlman, Carol A. Rohl, and David Baker. Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology, 2003.

[13] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406. Omnipress, 2010.

[14] Harry F. Harlow. The formation of learning sets. Psychological Review, 56(1):51, 1949.

[15] Howook Hwang, Thom Vreven, Joël Janin, and Zhiping Weng. Protein–protein docking benchmark version 4.0. Proteins, 78(15):3111–3114, November 2010.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.

[18] Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. ALISTA: Analytic weights are as good as learned weights in LISTA.
Proteins , 78(15):3111–3114, Novem-ber 2010.[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980 , 2014.[17] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprintarXiv:1606.01885 , 2016.[18] Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Alista:Analytic weights are as good as learned weights in lista.
ICLR , 2019.[19] Iain H. Moal and Paul A. Bates. SwarmDock and the Use of Normal Modesin Protein-Protein Docking.
International Journal of Molecular Sciences ,11(10):3623–3648, September 2010.[20] Roberto Mosca, Arnaud Céol, and Patrick Aloy. Interactome3d: addingstructural details to protein networks.
Nature methods , 10(1):47, 2013.[21] Brian G. Pierce, Kevin Wiehe, Howook Hwang, Bong-Hyun Kim, ThomVreven, and Zhiping Weng. ZDOCK server: interactive docking predictionof protein–protein complexes and symmetric multimers.
Bioinformatics ,30(12):1771–1773, 02 2014.[22] Herbert Robbins and Sutton Monro. A stochastic approximation method.
The annals of mathematical statistics , pages 400–407, 1951.1523] Ernest K Ryu, Jialin Liu, Sicheng Wang, Xiaohan Chen, Zhangyang Wang,and Wotao Yin. Plug-and-play methods provably converge with properlytrained denoisers.
ICML , 2019.[24] Graham R Smith and Michael JE Sternberg. Prediction of protein–proteininteractions by docking methods.
Current opinion in structural biology ,12(1):28–35, 2002.[25] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide thegradient by a running average of its recent magnitude.
COURSERA: Neuralnetworks for machine learning , 2012.[26] Zhangyang Wang, Qing Ling, and Thomas S Huang. Learning deep (cid:96)
Thirtieth AAAI Conference on Artificial Intelligence , 2016.[27] Lewis B Ward. Reminiscence and rote learning.
Psychological Monographs ,49(4):i, 1937.[28] Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Ser-gio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. In
Proceedings ofthe 34th International Conference on Machine Learning-Volume 70 , pages3751–3760. JMLR. org, 2017.[29] Xin-She Yang. Firefly algorithms for multimodal optimization. In
Proceed-ings of the 5th International Conference on Stochastic Algorithms: Founda-tions and Applications , SAGA’09, pages 169–178, Berlin, Heidelberg, 2009.Springer-Verlag.[30] Barret Zoph and Quoc V Le. Neural architecture search with reinforcementlearning. arXiv preprint arXiv:1611.01578arXiv preprint arXiv:1611.01578