Estimation of Distribution Algorithm for Protein Structure Prediction
Daniel Bonetti, Alexandre Delbem, Dorival Leão, Jochen Einbeck
Institute of Mathematical Sciences and Computation, University of Sao Paulo, SP, Brazil
Jochen Einbeck
Department of Mathematical Sciences, Durham University, Durham, UK
Abstract
Proteins are essential for maintaining life. For example, knowing the structure of a protein, cell regulatory mechanisms of organisms can be modeled, supporting the development of disease treatments or the understanding of relationships between protein structures and food attributes. However, discovering the structure of a protein can be a difficult and expensive task, since it is hard to explore the large search space to predict even a small protein. Template-based methods (coarse-grained, homology, threading etc.) depend on Prior Knowledge (PK) of proteins determined using other methods, such as X-Ray Crystallography or Nuclear Magnetic Resonance. On the other hand, template-free methods (full-atom and ab initio) rely on the physical-chemical properties of atoms to predict protein structures. In comparison with other approaches, Estimation of Distribution Algorithms (EDAs) can require significantly less PK, suggesting that they could be adequate for proteins with a low level of PK. Finding an EDA able to handle both prediction quality and computational time is a difficult task, since these objectives are strongly inversely correlated. We developed an EDA specific for the ab initio Protein Structure Prediction (PSP) problem using the full-atom representation. We developed one univariate and two bivariate probabilistic models in order to design a proper EDA for PSP. The bivariate models make relationships between the dihedral angles φ and ψ within an amino acid. Furthermore, we compared the proposed EDA with other approaches from the literature. We noticed that even a relatively simple algorithm such as Random Walk can find the correct solution, but it would require a large amount of prior knowledge (biased prediction). On the other hand, our EDA was able to correctly predict with no prior knowledge at all, characterizing such a prediction as pure ab initio.

Keywords: Estimation of Distribution Algorithm, Protein Structure Prediction, ab initio, full-atom, van der Waals energy, Expectation-Maximization.

∗ Corresponding author
Email address: [email protected] (Daniel Bonetti)

Preprint submitted to Journal of LaTeX Templates, January 7, 2019
1. Introduction
The path from the discovery of a new disease to the development of its cure can take about ten years and involve costs of five billion dollars [1]. One of the main reasons for this amount of money and time is the problem of finding the tertiary structure of the protein responsible for the disease. Most of the existing methods to determine protein structures are experimental. They attempt to look at the proteins as they are present in nature. For instance, the X-Ray Crystallography experimental method shoots an x-ray beam into a protein crystal. It then creates a diffraction map of hydrogen atoms and, together with information about the primary structure of the protein, it is possible to construct the tertiary structure. Nuclear Magnetic Resonance needs a solution of high concentration containing the target protein. This solution is submitted to a process that excites the atom spins and, depending on how much the atoms move to a different spin state, it is possible to construct the tertiary structure of a protein. Both of these experimental methods have disadvantages and they cannot always determine the tertiary structure of a protein [2].

Thus, in order to avoid the high costs and time needed by experimental methods, computational methods (in silico) have been created. The method to determine the primary structure of proteins is well established nowadays, but, as we know, determining the equivalent tertiary structure is not. Thus, from the sequence of amino acids, computational methods explore the search space in order to find feasible protein configurations. We know that proteins in nature stabilize at the minimum energy state. Therefore, computational methods look for protein configurations that represent the minimum energy state; in other words, we are predicting the protein structure. This problem is known as Protein Structure Prediction (PSP). There are two main computational approaches to predict proteins.
One uses knowledge of proteins that were determined using other methods, such as experimental ones, to infer the new ones (template-based). The other, used in this work, is called template-free. It does not use information about any existing structures to predict new ones. Instead, it uses energy potentials to evaluate how good a configuration is. Then, when a configuration with a lower energy value is found, we have a hypothesis that the correct structure has been found. However, the search space of protein configurations is huge and this task cannot be performed using exact methods [3].

Our goal is to find the correct set of dihedral angles φ, ψ, χ's (Section 2.1), i.e. the variables of the problem, that yields the protein configuration with the lowest possible energy. Thus, PSP can be treated as an optimization problem in which we want to minimize the energy of a protein configuration (our fitness) by changing the dihedral angles (the variables). There are different ways to represent protein configurations in the computer. One could use a coarse-grained representation of amino acids, called the HP model [4], but it cannot represent the protein for practical purposes. Therefore, we use the full-atom model, considering all atoms of the protein configuration [5].

Despite the efforts of in silico methods in trying to find correct protein configurations, an appropriate algorithm that can work properly in a pure ab initio way is still missing, which is especially relevant when one wants to predict a protein that has low similarity with known structures. There are some Evolutionary Algorithms (EAs) and other optimization approaches with useful results for PSP [6, 7, 8, 9]. The better the optimization technique, the better its chances of finding proper solutions.
Nevertheless, this is not a trivial task, since the optimization algorithm has to deal with the large search space of possible 3D protein structures in an efficient manner.

Thus, considering the characteristics of the PSP problem, we built an Estimation of Distribution Algorithm (EDA) [10] specially designed for such a problem. Basically, an EDA constructs models of the distribution of variable values from promising regions of the search space in order to improve the optimization process. To sample better solutions, we need a good probabilistic model. First, we developed a Univariate model-based Optimization algorithm (UNIO). Secondly, taking advantage of properties of the problem we are tackling, we modeled the [φ, ψ] pair within the same residue as correlated, yielding a bivariate probabilistic model. We proposed two different ways to estimate a probability distribution for each pair of variables: (i) Kernel Density Estimation model-based Optimization (KDEO) [11] and (ii) Finite Gaussian Mixtures model-based Optimization (FGMO) [12]. That yielded three novel EDAs for PSP: UNIO, KDEO and FGMO.

In order to evaluate our probabilistic models, we measured three aspects: the quality of the protein configurations (RMSD, root-mean-square deviation) [13], the computational time consumption and the energy values of the predicted structures (Section 4). As expected, the bivariate models (KDEO and FGMO) were able to find better solutions in terms of energy and RMSD. Furthermore, we compared KDEO with other optimization approaches from the literature: Random Walk (RW) [14], Monte Carlo (MC) [15], Genetic Algorithm (GA) [16] and Differential Evolution (DE) [17]. We discovered that all of these optimization techniques were able to find the correct protein configuration, in a specific case, when there is enough previous knowledge about a promising region of the search space.
For instance, RW is known as a poor optimization technique; thus, in order for it to find the correct protein configuration, we introduced a bias, reducing the search space to a neighborhood of the dihedral angles of the native protein. On the other hand, the proposed EDA can, in general, find adequate solutions without search space reduction, that is, without bias. Therefore, we may conclude that the proposed EDA is more adequate than the other investigated approaches when predicting proteins with low similarity to known structures.

In the next section, we present the Protein Structure Prediction problem. In Section 3, we present our proposed EDA for PSP. Then, the results are shown in Section 4. Finally, the conclusions of this work are drawn in Section 5.
2. Protein Structure Prediction problem
Proteins are essential in almost all processes of life. They have different functions in living beings and the function of each protein is associated with its shape. For instance, a protein responsible for the transportation of other molecules has a shape appropriate to carry such a molecule along the organism to its destination, e.g. hemoglobin, capable of transporting oxygen. There are several other functions of proteins, for instance, defense, control, regulation, breaking covalent bonds etc. We say that the function of a protein is associated with its shape, also called the tertiary structure [18].

When we know the protein responsible for causing a disease, we can develop treatments and inhibit that protein's function. For instance, some viruses have the capability of cutting human cells, entering them and then making copies of themselves, spreading the disease along the rest of the body. However, when we know the shape of the protein responsible for cutting the cell, we can develop medicines with the complementary structure, which bind the virus to the medicine molecule instead of the human cell. This will probably lead to the death of the virus and reduce its effects on the organism [3].

Every protein has an identification that matches its shape. This is called the primary structure (or sequence of amino acids) and it stores the whole residue chain of a protein. Nowadays, it is relatively simple to isolate the primary structure of a protein. However, finding the tertiary structure equivalent to the primary structure is a very complicated task. This is the main reason why the sequence database is growing fast [19] while the structure database is growing relatively slowly [20].

The current methods to find the tertiary structure of proteins are expensive and require years of trial and error. The experimental methods X-Ray Crystallography (XRC) and Nuclear Magnetic Resonance (NMR) are by far the most used methods to determine protein structures.
However, it is not always possible to use these methods, since they have disadvantages: for instance, sometimes it is not possible to obtain a protein crystal, in which case XRC is no longer available. NMR does not need a protein crystal, but it only works for small proteins [13].

Based on the drawbacks of the experimental methods, many computational methods that try to find the structure of proteins were developed. These are known as in silico methods. Based on the sequence of amino acids, they look for feasible protein configurations in the search space and try to predict the protein structure. The in silico methods are divided into two different ways of prediction. The first is based on prior knowledge of proteins that were already determined by other methods, such as XRC and NMR. These are promising techniques and there is much research about them. One problem of these techniques is that their predictions can be biased toward the experimental methods. The second kind is called template-free and does not make use of knowledge of other proteins. Instead, it uses energy potentials in order to compute the energy of a protein configuration. The energy potential is a way to describe how well a protein configuration represents a protein in real life. Knowing that proteins stabilize in a state of minimum energy, we want to find the configurations with the lowest energy. In this work, we use a template-free approach with full-atom representation [21].
The protein configuration can often be represented using the dihedral angles φ, ψ, χ's. Each residue in a protein configuration chain has its own set of [φ, ψ] angles and the number of χ's depends on the type of residue. All these angles range from −180 to 180 degrees, the search space range. The φ angle is the dihedral angle between the atoms N − Cα within the same residue and ψ is related to the Cα − C angle. The set of all dihedral angles φ's, ψ's and χ's from all residues of a protein configuration has all the information needed to evaluate the protein. Thus, the variables of the problem can be understood as a vector of dihedral angles ranging over [−180, 180].

Figure 1: An example of the dihedral angles [φ, ψ] from the sequence of amino acids using a full-atom representation (the side chains are hidden).
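In code, a candidate solution is then simply a flat vector of angles constrained to this range. A minimal sketch (the residue and χ counts and the uniform random initialization are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_conformation(n_residues, n_chi):
    """A candidate solution: [phi, psi] per residue plus all chi angles,
    each drawn uniformly from the search range [-180, 180] degrees."""
    m = 2 * n_residues + n_chi  # problem length
    return rng.uniform(-180.0, 180.0, size=m)

angles = random_conformation(n_residues=5, n_chi=3)
print(angles.shape)  # (13,)
```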
Several potential energies contribute to the protein energy. There are bonded and non-bonded potential energies and each of them has a specific contribution to the molecule stabilization. We know that non-bonded potential energies have the largest contribution to the molecule energy. In our algorithm, called ProtPred [7], we have five bonded potential energies (bond stretching, angle bending, Urey-Bradley, improper dihedral and torsional angle) and four non-bonded potential energies (van der Waals, electrostatic, solvation and hydrogen bonds). Despite having all these potential energies implemented in our algorithm, we decided to use only the van der Waals energy in this paper, since it has the highest contribution to the molecule energy. Besides, it makes the results easier to understand, so we can keep the focus on the evolutionary process. Thus, the fitness function is determined only by the van der Waals energy.

The van der Waals energy models the attraction and repulsion among atoms. In general, the Lennard-Jones potential (also known as the Lennard-Jones 12−6 potential) describes the interaction between a pair of atoms i and j, given the Euclidean distance between them (d_i,j) and the van der Waals radii constants R of atoms i and j:

v_ij = d_i,j / (R_i + R_j).    (1)

The van der Waals energy is very repulsive at short distances, since the electron clouds of the atoms start to overlap. At this distance, the energy rapidly increases and tends to infinity. The equilibrium point, known as the van der Waals contact, happens when the Euclidean distance between the atom pair is neither too far nor too close. This is the point of minimum energy of an atom pair. If the atoms of a pair are too far from each other, they will not have any type of interaction and the energy will tend to zero. We used a cutoff of 8 Å in order to avoid unnecessary computation and, to avoid dealing with large numbers, we also set a tapering-off (see Figure 2). In case v_ij is smaller than 0.8, we assume a constant C [22]. The Lennard-Jones potential used in our EDA for PSP is described by:

f_LJ(v_ij) = A·v_ij^−12 − B·v_ij^−6,  if v_ij > 0.8,
             C,                        if v_ij ≤ 0.8,    (2)

where A and B are constants experimentally determined based on characteristics of the environment, and C is given by A·v_ij^−12 − B·v_ij^−6 with v_ij = 0.8. The total van der Waals energy accumulates the pairwise interactions, where n is the number of atoms, leading to:

E_vdw = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} f_LJ(v_ij).    (3)

Figure 2: The modified van der Waals function we used, highlighting the tapering-off (when atoms are too close), the van der Waals contact (ideal distance) and the cutoff (when they are too far).
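Equations 1-3 can be sketched as follows. The constants A and B, as well as the toy coordinates and radii, are illustrative placeholders, since the experimentally determined values are not given here:

```python
import numpy as np

def vdw_energy(coords, radii, A=1.0, B=2.0, cutoff=8.0, taper=0.8):
    """Total van der Waals energy (Eqs. 1-3); A and B are illustrative."""
    C = A * taper ** -12 - B * taper ** -6   # constant below the tapering point
    n = len(coords)
    energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = np.linalg.norm(coords[i] - coords[j])
            if d > cutoff:                   # beyond the cutoff: no interaction
                continue
            v = d / (radii[i] + radii[j])    # normalized distance (Eq. 1)
            if v <= taper:                   # tapering-off region
                energy += C
            else:                            # Lennard-Jones 12-6 form (Eq. 2)
                energy += A * v ** -12 - B * v ** -6
    return energy

# three collinear toy atoms
coords = np.array([[0.0, 0, 0], [1.5, 0, 0], [3.0, 0, 0]])
radii = np.array([0.8, 0.8, 0.8])
print(vdw_energy(coords, radii))
```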
Thus, in order to have the minimum global energy of a molecule it is necessary to find a compromise between the partial energies among atoms so as to get the least van der Waals energy. To find good van der Waals energy values we need to change the dihedral angles of our protein configuration. Changing the dihedral angles leads to a change in the Cartesian coordinates of the atoms in the molecule as well. Thus, there is a set of dihedral angles that implies the best positioning of atoms, yielding a global minimum in van der Waals energy.

3. Estimation of Distribution Algorithm for Protein Structure Prediction
In the previous section, we concretized the problem we want to tackle. In this section, we describe how we can find promising solutions using EDAs, i.e. how the EDA can find a good set of dihedral angles that expresses a low-energy molecule.

EDAs are a relatively new class of EAs. They are optimization techniques that build probabilistic models from a promising set of solutions in order to sample the offspring. In some cases they also have the capability of accounting for correlations between variables. In the literature, EDAs for binary and discrete variables are well described, since the probabilistic models and variable relationships are relatively easy to understand [23]. In fact, using a simplified representation of proteins, [24] showed that it is possible to achieve relevant results with an EDA for the PSP problem.

However, dealing with dihedral angles (in the continuous search space) is relatively more difficult, since it is not possible to map the combination of all variable values. Besides, the probabilistic model should be able to deal with multimodality, since the distributions of dihedral angles in the PSP problem are non-parametric. We also want something that can handle the variable relationship between [φ, ψ] within the same residue, so it has to be bivariate.

There are some real-valued EDAs in the literature as well. However, they would not be totally appropriate for the PSP problem. For example, the Univariate Marginal Distribution Algorithm (UMDA_c) [25] works in the continuous search space but can deal with neither multimodality nor variable relationships. The Bivariate Marginal Distribution Algorithm (BMDA) can model the variable relationship but not the multimodality aspect [26]. Since then, EDAs able to tackle both the multimodality and the variable relationship aspects were developed, as in the case of EGNA [25], IDEA [27, 28] and PBIL [29].
The real-valued Bayesian Optimization Algorithm (rBOA) [30] can also handle both multimodality and variable relationships. It was shown that rBOA can outperform the mIDEA [31] in several benchmark problems. However, evidence that rBOA is able to predict protein structures using the full-atom representation has not been provided in the literature yet.

Knowing that the dihedral angles φ and ψ within the same residue have a strong relationship, we designed an EDA in which there is no need to learn the variable relationship: we treat the dihedral angles [φ, ψ] of each residue as correlated variables. Besides, statistical mechanisms used in previous EDAs, for instance normal kernels and mixtures of Gaussian distributions, were used to build our new EDA for full-atom PSP. We already showed in a previous work that EDAs can successfully be applied to PSP using full-atom and ab initio modeling [32]. In this work, we show three different approaches that were developed. Detailed information about each of them is given in the next sections.

3.1. Univariate model-based Optimization

The Univariate model-based Optimization (UNIO) is the simplest algorithm. It does not deal with variable relationships. Instead, it deals with multimodality in an efficient way. From the promising (selected) individuals, the process of a one-dimensional kernel density estimation (KDE) is simulated for each variable involved in the problem. However, creating a kernel distribution for each problem variable would require high computational costs, since it is necessary to iterate over all observations (the selected set size) per variable.

KDE is based on the sums of the differences between the point of interest x and all observations within a data set (x_1, x_2, ..., x_n), scaled by a bandwidth value h. In order to sample new values from our kernel distribution we first need to build a Probability Density Function (PDF).
This can be done by calling the KDE function as many times as required to fill the range of the x values. That would require high computing time per generation. For instance, if the selected set size (of the EDA) is n = 500 and the number of points needed to build the PDF is 400, then for 100 dihedral angles (a 100-dimensional problem) we would need to call the KDE function at least 500 × 400 × 100 = 20,000,000 times per generation. Instead, we simulate sampling from the KDE directly: we draw a random point x from the selected set and add to it a perturbation with distribution N(0, h²). Consider the selected individuals S_ij, where i = 1...n and j = 1...m, with n the number of selected individuals and m the number of variables of the problem. We want to fill the offspring O_kj, where k = 1...o and o is the offspring size. For a random point drawn from the selected set S_i for a variable j, we add some perturbation to it and put this new value into O_kj for k = j = 1. Then we repeat this process for j = 1...m (for all variables of the first individual of the offspring) and then for all the remaining individuals (k = 1...o). Algorithm 1 shows how this technique works.

Algorithm 1
Random Points – It creates the offspring from the selected set.
Require: Selected individuals S, size of the selected set n, size of the offspring o, number of problem variables m
Ensure: Offspring O
for j = 1 to m do
    u ← sample o values from the discrete uniform distribution ranging over [1, n]
    O_j ← S_uj + N(0, h²)
end for

We carried out an experiment to find out whether this kernel simplification is promising. We noticed that the simplification of KDE indeed does not have the same accuracy as the KDE, but it is much faster. So we are compromising a little accuracy for a large reduction in computational cost.

Figure 3 shows an example of the comparison between our proposed simplification (Random Points) and the traditional KDE. From a sample data set with three modes, we simulated new values using KDE and our proposed simplification. We repeated this procedure 30 times. The red line represents the true density. Density plots of the simulated values with KDE and our proposed simplification are shown, as well as the smoothed standard deviations.
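The Random Points technique above can be sketched as follows, assuming a fixed perturbation scale h in the role of the bandwidth (the selected set here is random toy data):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_points_offspring(S, o, h=5.0):
    """Algorithm 1 sketch: for each variable, resample selected values
    uniformly and perturb them with N(0, h^2)."""
    n, m = S.shape                      # n selected individuals, m variables
    O = np.empty((o, m))
    for j in range(m):
        u = rng.integers(0, n, size=o)  # uniform discrete indices into S
        O[:, j] = S[u, j] + rng.normal(0.0, h, size=o)
    return O

S = rng.uniform(-180, 180, size=(500, 10))  # toy selected set
O = random_points_offspring(S, o=200)
print(O.shape)  # (200, 10)
```

This costs one uniform draw and one Gaussian draw per offspring value, instead of evaluating a full kernel density curve per variable.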
Figure 3: Comparison of simulating data with KDE and Random Points: (a) a comparison with the true density of the data and (b) a time comparison.

3.2. Kernel Density Estimation model-based Optimization

Considering that all variables can interact with each other, we could create an m-dimensional KDE. Nevertheless, that would be very difficult to handle. Hence, we decided to use the two-dimensional kernel density estimation, yielding the KDEO (Kernel Density Estimation model-based Optimization). It correlates the dihedral angles [φ, ψ] of amino acids within the same residue. We know that these two variables are strongly correlated, since rotations in φ generally produce stereochemical constraints for its closest neighbor dihedral angle ψ. Moreover, an implicit correlation between [φ, ψ] within the same residue is made every time one looks at the Ramachandran plot [33], since the plot itself shows a density map between the angles φ and ψ.

In KDEO, the ψ values are generated conditional on φ. Firstly, the KDE is created for φ and a new value φ′ is sampled from its distribution. Then, a two-dimensional KDE map of [φ, ψ] is created. The conditional KDE is taken at the value closest to φ′ in the two-dimensional KDE (in the x-axis direction). Finally, a new value ψ′ is sampled from the two-dimensional KDE (in the y-axis direction).

For the two-dimensional case, kernel density estimates are obtained via Equation 4:

f̂_h(x_1, x_2) = (1/n) Σ_{i=1}^{n} [1/(h_1 h_2)] K((x_1 − φ_i)/h_1) K((x_2 − ψ_i)/h_2),    (4)

where K(.) is the kernel. In this work, we used the normal kernel

K(x) = (2π)^{−1/2} e^{−x²/2}    (5)

and the bandwidth ĥ is calculated using Equation 6:

ĥ = 4 · 1.06 · σ̂ · n^{−1/5},  if σ̂ < t,
    4 · 1.06 · t · n^{−1/5},  if σ̂ ≥ t,    (6)

where 4 is a multiplicative factor [34, 35] and t equals

t = (Q_3 − Q_1) / 1.34,    (7)

where Q_1 and Q_3 are the first and third quartiles of the data. Before creating the kernels, we check whether each pair [φ, ψ] has a normal distribution using the Anderson-Darling test.
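The kernel path of KDEO can be sketched as below: the bandwidth rule of Equations 6-7 plus a draw of ψ′ conditional on each φ′ from a discretized two-dimensional normal-kernel map (Equations 4-5). The grid resolution and the toy correlated angles are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def bandwidth_nrd(x):
    """Eqs. 6-7: 4 * 1.06 * min(sigma, (Q3 - Q1) / 1.34) * n^(-1/5)."""
    n = len(x)
    q1, q3 = np.percentile(x, [25, 75])
    t = (q3 - q1) / 1.34
    return 4.0 * 1.06 * min(np.std(x, ddof=1), t) * n ** (-0.2)

def sample_psi_given_phi(phi, psi, phi_new, grid_size=361):
    """Sample psi' conditional on each phi' from a discretized 2D
    normal-kernel density map, using the map column closest to phi'."""
    h1, h2 = bandwidth_nrd(phi), bandwidth_nrd(psi)
    grid = np.linspace(-180.0, 180.0, grid_size)
    psi_new = np.empty_like(phi_new)
    for i, p in enumerate(phi_new):
        w = np.exp(-0.5 * ((p - phi) / h1) ** 2)  # phi-direction kernel weights
        dens = (w[:, None] *
                np.exp(-0.5 * ((grid[None, :] - psi[:, None]) / h2) ** 2)).sum(axis=0)
        psi_new[i] = rng.choice(grid, p=dens / dens.sum())  # conditional draw
    return psi_new

# toy correlated angles: psi roughly equals -phi
phi = rng.normal(-60.0, 10.0, size=500)
psi = -phi + rng.normal(0.0, 5.0, size=500)
psi_prime = sample_psi_given_phi(phi, psi, phi_new=np.array([-60.0, -50.0]))
```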
If the p-value is larger than 5%, the KDEO kernel is bypassed and a bivariate normal distribution is used instead. Otherwise, the KDEO kernel is used. This strategy is interesting because at the beginning of the evolutionary process we can have many modes; however, as the evolutionary process starts to converge, the kernel is no longer necessary for all pairs [φ, ψ] (see Algorithm 2).

Algorithm 2
Two-dimensional Kernel Sampling – It creates the offspring from the selected set.
Require:
Selected individuals S, the number of protein residues r, size of the new data o
Ensure:
Offspring O
for i = 1 to r do
    φ ← get the φ values of residue i from S
    ψ ← get the ψ values of residue i from S
    if p-value of the Anderson-Darling test of [φ, ψ] < 0.05 then
        φ′ ← Random Points(φ, o)
        P ← 2D Kernel Density Map(φ, ψ)
        ψ′ ← sample from the PDF P conditional on φ′
    else
        φ′, ψ′ ← sample o values from the two-dimensional Gaussian N([µ_φ, µ_ψ], Σ_φψ)
    end if
    O_i ← [φ′, ψ′]
end for

For instance, similarly to [φ, ψ], let x_1 and x_2 be two vectors of data distributed as shown in Figure 4a. First, a new vector x_1′ is sampled from the independent distribution of x_1. Then, the two-dimensional KDE map is created for the pair [x_1, x_2]. For each new point of x_1′, new points x_2′ are sampled conditional on the previous value of x_1′. Looking at Figure 4b, given a sampled value x_1′, we pick the closest point of the kernel map of x_1 and x_2 where x_1 equals that value, and from the resulting conditional density P(x_2 | x_1) we sample the new x_2′ values.

Figure 4: Two-dimensional KDE example: (a) the data x_1, x_2; (b) the univariate distribution of x_1 (top), the kernel map created for x_1, x_2 (middle) and the distribution of x_2 given the value of x_1 (bottom).

3.3. Finite Gaussian Mixtures model-based Optimization

As a KDEO alternative, we developed an EDA called Finite Gaussian Mixtures model-based Optimization (FGMO). Despite having an implementation of the m-dimensional FGMO, we used only the two-dimensional version in this paper.
The complete animation can be accessed at: http://lcrserver.icmc.usp.br/~daniel/ani/kde2d/

In a finite Gaussian mixture, each mixture component k (cluster) has its own mean µ_k, covariance matrix Σ_k and mixture weight π_k. For a given number of mixtures K, we want to know the set of parameters θ = [µ, Σ, π] for all mixture components. The Expectation-Maximization (EM) algorithm is often used to estimate the parameters of Gaussian mixtures [36]. In order to get the estimate θ̂, the EM algorithm takes a starting setting of parameters θ̂ and iterates between the E-Step and the M-Step. The algorithm converges when the change of the log-likelihood between two iterations is less than a defined value (1.5 in our experiments). The E-Step updates a probability matrix w_{j,k} via:

w_{j,k} = π̂_k f(x_j, µ̂_k, Σ̂_k) / Σ_{l=1}^{K} π̂_l f(x_j, µ̂_l, Σ̂_l),    (8)

where f(x, µ̂, Σ̂) can be defined as
f(x, µ̂, Σ̂) = [1 / ((2π)^{p/2} |Σ̂|^{1/2})] exp{ −(1/2) (x − µ̂)^T Σ̂^{−1} (x − µ̂) },    (9)

and the M-Step updates the parameters via

π̂_k = (1/n) Σ_{i=1}^{n} w_ik,    (10)

µ̂_k = Σ_{i=1}^{n} w_ik x_i / Σ_{i=1}^{n} w_ik,    (11)

Σ̂_k = [1 / Σ_{i=1}^{n} w_ik] Σ_{i=1}^{n} w_ik (x_i − µ̂_k)(x_i − µ̂_k)^T.    (12)

For each pair [φ, ψ] from the selected set we run a two-dimensional EM for a given number of mixture components K, yielding the estimate θ̂. From θ̂ we randomly select a mixture component using a uniform distribution with weights π_k and sample the new [φ′, ψ′] values at once, using the component's parameters (see Algorithm 4). The θ̂ is then used to sample all the different offspring individuals. Then, for the [φ, ψ] of the next residue we estimate a new θ̂ and use these new parameters to sample the values for this next residue. This process continues until it reaches the last residue of the molecule.

For instance, consider a protein with 5 residues. The selected individuals will have 5 [φ, ψ] pairs per individual, where i = 1...n is the individual index in the selected set and n is the total number of selected individuals. For the first pair [φ, ψ] we perform a complete EM run and get θ̂. It contains µ_k, π_k and Σ_k for the K mixtures. Then, the first pair [φ′, ψ′] is generated by sampling values from θ̂. Next, θ̂ is estimated for the second pair and the new [φ′, ψ′] values are simulated from that bivariate mixture. This continues until residue number 5. It is important to notice that, in this case, there are five EM algorithms running at each generation. This can have high computational costs, since EM is an iterative process.
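The E-Step and M-Step of Equations 8-12 can be sketched as below. The quantile-based initialization, the toy data and the fixed iteration count are simplifying assumptions of this illustration (the paper instead monitors the log-likelihood change):

```python
import numpy as np

rng = np.random.default_rng(3)

def em_gmm(X, K, iters=50):
    """EM for a K-component Gaussian mixture (Eqs. 8-12)."""
    n, p = X.shape
    # spread-out initial means: data points spaced along the first axis
    idx = np.argsort(X[:, 0])
    mu = X[idx[np.linspace(0, n - 1, K).astype(int)]].copy()
    Sigma = np.array([np.cov(X.T) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities w[j, k] (Eqs. 8-9)
        w = np.empty((n, K))
        for k in range(K):
            d = X - mu[k]
            inv = np.linalg.inv(Sigma[k])
            norm = (2 * np.pi) ** (p / 2) * np.linalg.det(Sigma[k]) ** 0.5
            w[:, k] = pi[k] * np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, inv, d)) / norm
        w /= w.sum(axis=1, keepdims=True)
        # M-step: update weights, means, covariances (Eqs. 10-12)
        Nk = w.sum(axis=0)
        pi = Nk / n
        mu = (w.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (w[:, k, None] * d).T @ d / Nk[k]
    return pi, mu, Sigma

# toy data: two well-separated bivariate clusters
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(200, 2))])
pi_hat, mu_hat, Sigma_hat = em_gmm(X, K=2)
```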
In order to speed up the algorithm, we replaced the general PDF (Equation 9) by a specific expression for the two-dimensional case, as shown in Equation 13:

f(x_1, x_2) = [1 / (2π σ_1 σ_2 √(1 − ρ²))] exp{ −z / (2(1 − ρ²)) },    (13)

with

z ≡ (x_1 − µ_1)²/σ_1² − 2ρ(x_1 − µ_1)(x_2 − µ_2)/(σ_1 σ_2) + (x_2 − µ_2)²/σ_2².

Besides, before computing the FGMO we check whether the standard deviations of both φ and ψ are small (say, below 0.01). If so, we bypass the EM algorithm and use a bivariate Gaussian instead (see Algorithm 3).

During the EM iterations, we also needed to treat special cases to avoid division by zero in Equation 8, which may happen when there is a very far outlier with a very small standard deviation in the data set. In this case, we stop the EM iterations and use the last valid θ̂.

Figure 5 shows an example of how the FGMO works. At the first iteration, the parameters are set to the initial condition. Then, along the EM iterations, the parameters tend to converge. Figure 5b shows the parameters after convergence. Finally, new values are sampled (Figure 5c) from the fitted mixture model.

Algorithm 3 Two-dimensional Finite Gaussian Mixtures model-based Optimization – Sample the offspring using FGMO
Require:
Selected individuals S, the number of protein residues r, size of the offspring o, number of mixture components K
Ensure:
Offspring O
for i = 1 to r do
    φ ← get the φ values of residue i from S
    ψ ← get the ψ values of residue i from S
    if standard deviation of φ > 0.01 and standard deviation of ψ > 0.01 then
        θ̂ ← Expectation-Maximization(φ; ψ, K)
        [φ′; ψ′] ← sample o individuals from the fitted model θ̂
    else
        φ′, ψ′ ← sample o values from the two-dimensional Gaussian N([µ_φ, µ_ψ], Σ_φψ)
    end if
    O_i ← [φ′; ψ′]
end for

Algorithm 4 Two-dimensional Finite Gaussian Mixtures – Simulating
Require:
The fitted model θ̂ and the size of the new data n
Ensure:
Simulated sx ← Sample from distribution U (0 , c ← Cumulative sum of ˆ θ π k ← while x < c k do k ← k + 1 end while s ← Sample from N (ˆ θ µ k , ˆ θ σ k ) 19 x −4 −2 0 2 4 6 8 − − ll ll lllll ll l llll llll llll ll l ll lll lllllllll lll ll lll lll llllllll ll lllll lll lll llll l lll l lll llll llll l ll llll lll lll lll llll ll l l lll ll ll llll ll lll ll lll ll lll lll ll lll lll llllll lll lllll llll lllll l lll lll llllll lll lll ll llll llll l lll lll ll ll llll lllll l lll ll lll l lllll lll ll l lll l l ll lll lll lll ll lllll llll ll l l llll llll ll ll lll l llll ll llll l llllll ll l llll llll ll lllllll lll ll ll llllll lll lll ll ll l l lllllll ll l ll l ll l ll lll lll lll llllll lll llll l lllll llllll lll ll ll ll llll llllllll lll ll lll lllllll llllll lll llll lllllll ll l lll llll l ll lll ll lll lllll llllll llll l l (a) x x −4 −2 0 2 4 6 8 − − lll lllllll ll lll ll l lll ll lllll lll ll lll lllll ll ll llll lllll ll ll ll llll lll l lll lll ll ll lll llll l ll l lll ll lll lll lll ll ll llll lll lll l lll l ll l llll ll ll lllll l ll lllll lllll lll ll ll lllll l llllll lllllll l ll ll l lllll ll ll l lllll llll llllllll llll ll l llllll ll lll l lll lll l lll l ll lllll lll ll ll ll ll ll lll ll llll l ll ll lll lll ll llll ll llll l lll ll ll llll l llll llll lll ll ll l ll ll ll ll llll lll llll lll lllll lllllll ll llllllll ll l ll ll lll llllll l lllll l lll ll lll ll lllll llll ll lll ll ll lllll ll lllll lll llll l llll ll llllll llll lll ll ll l lll lll ll lllll ll l lll lllll lll lll lll lll l lll l l (b) l lll ll l lll ll l ll llll ll lll ll lll lll l l lll lll lll ll lllll llll ll ll lll lll ll ll l llll lll ll l ll lll ll lll lll llll lll ll l l lll ll ll ll ll lll ll lllll lll llll lll ll ll ll l ll llllll llll ll ll ll l llll l l l ll llll ll lll lll lll ll ll l l ll ll ll llll ll ll l lllll llll ll lll lllll ll l ll ll llll lll lll ll l ll ll l lll ll lllll ll l ll lllll ll llll ll l llll l ll l ll l lll ll llllll lll lll l ll 
ll ll l ll ll ll lll llll l ll lll l ll ll ll ll l lll l lll llll llll l llllll l ll ll lll lll ll ll ll ll llll lll l l ll lll ll l l ll ll ll l ll lll lll l lll ll ll l ll l llllll llll l lll ll lllll ll llll ll l ll lllllll llll lll lll lll ll l ll ll lllll lll ll ll −2 0 2 4 6 8 − x x ' (c) Figure 5: FGMO example. (a) The first iteration of the EM algorithm. The original dataslightly overlapped with two mixture components. The red dots represent the means of eachmixture and the contour curves their densities, (b) the last iteration of the EM and (c) thesampled values.
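This sampling step can be sketched in Python (illustrative parameter layout for a one-dimensional Gaussian mixture; ProtPred itself is written in C, so this is not the paper's implementation):

```python
import random
import bisect

def sample_mixture(weights, means, sigmas, n):
    """Draw n values from a fitted 1-D Gaussian mixture.

    weights, means, sigmas describe the fitted model (theta-hat).
    A component k is picked by locating a uniform draw in the
    cumulative weights; a value is then drawn from N(means[k], sigmas[k]).
    """
    cum = []
    total = 0.0
    for w in weights:
        total += w
        cum.append(total)
    samples = []
    for _ in range(n):
        x = random.random()                            # x ~ U(0, 1)
        k = min(bisect.bisect_left(cum, x), len(cum) - 1)  # first k with x <= c_k
        samples.append(random.gauss(means[k], sigmas[k]))
    return samples
```

The cumulative-sum lookup is the standard inverse-CDF trick for categorical sampling; a binary search (`bisect`) replaces the linear `while` loop of the pseudocode.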
4. Results
The results show that the three proposed EDAs (UNIO, KDEO and FGMO) performed properly for PSP. For each proposed method, we evaluated three different aspects: (1) the computational cost, measuring the overall time of a prediction; (2) the van der Waals energy value of the best individual at the last generation; and (3) the RMSD, which represents how similar a solution is to the native protein.

We have to be careful when evaluating (3), since we are only using van der Waals energy in our fitness function. In this case, we are bypassing other potentials that are less relevant to stabilizing the molecule. Adding other potentials in pure ab initio PSP would require a much more complex algorithm, e.g. the multi-objective method proposed by [9]. This paper shows that we can find adequate pure ab initio protein configurations using only van der Waals energy and the proposed EDAs. We compared KDEO against other optimization methods from the literature: Random Walk (RW) [14], Monte Carlo (MC) [15], Genetic Algorithm (GA) [37] and Differential Evolution (DE) [17]. Appendix A shows a pairwise comparison using the Wilcoxon test [38] for all evaluated methods, considering the three aspects: van der Waals energy, RMSD and running time.

4.1. Experimental setup
The experiments were run on the 20-node cluster of the Laboratorio de Computacao Reconfiguravel at ICMC-USP. Each node has an Intel Core i7 2.67 GHz processor with 4 physical cores (8 logical cores with Hyper-Threading), 32 GB of RAM, and runs 64-bit Debian 4.6.3-14. Moreover, each node has two network adapters: one is used for the file system (NFS) and the other for communication operations in MPI [39].

We selected four small proteins from the Protein Data Bank (PDB) [40], all containing α-helices. This may be favorable for the optimization model used, which is based on van der Waals energy only. Table 1 shows the PDB ID, the number of residues and the problem length (m). Figure 6 shows the shape and the [φ, ψ] dihedral-angle distributions of each native structure.

Table 1: Protein sizes in experiments.

Protein  Residues  Problem length
1A11     25         95
2LVG     40        169
2KK7     52        229
2X43     67        268

Convergence is reached when a tested algorithm either performs one million evaluations or the standard deviation of the population fitness falls below 0.

ProtPred is entirely written in the C language and uses efficient libraries such as GSL and CBLAS [41] for most algebraic and statistical operations. Some functions were taken from the statistical language R [42] and translated into C. The van der Waals energy uses an efficient implementation based on cell lists, as we proposed in an earlier work [43]. Furthermore, the tested approaches have several different parameters to set, so we used MPI to distribute each algorithm configuration throughout our cluster. Despite this, we needed about 12,000 hours of CPU time to obtain all the results shown in this paper.

Figure 6: Native proteins: (a) three-dimensional structures and (b) the distribution of the [φ; ψ] dihedral angles of the corresponding proteins.

Before running the sequence of experiments, we determined a set of EDA parameters for each probabilistic model used (Table 2). For each combination of parameters, the experiments were repeated 30 times with different seeds for each protein.
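The pairwise method comparisons in Appendix A are based on Wilcoxon tests over these 30 runs. A minimal rank-sum (Mann-Whitney) sketch with a normal approximation and no tie correction, for illustration only (not the exact test implementation used for the paper's tables):

```python
import math

def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) p-value using the
    normal approximation; assumes no ties, for illustration only."""
    combined = sorted((v, 0 if i < len(a) else 1)
                      for i, v in enumerate(list(a) + list(b)))
    # 1-based rank of each observation; sum the ranks of sample a
    r1 = sum(rank + 1 for rank, (_, grp) in enumerate(combined) if grp == 0)
    n1, n2 = len(a), len(b)
    u = r1 - n1 * (n1 + 1) / 2.0             # Mann-Whitney U for sample a
    mu = n1 * n2 / 2.0                       # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    # two-sided p-value from the standard normal tail
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
```

Two samples with clearly separated values give a small p-value; interleaved samples give a large one, matching the 0.05 threshold used in Appendix A.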
Table 2: Parameters used in experiments with the probabilistic models.
       Pop. size  Selected size  Tournament size  Mixtures
UNIO   2000       2000           2                -
KDEO   500        250            2                -
FGMO   200        100            2                10

Figure 7 shows the results obtained for the smallest protein, 1A11. Considering the energy aspect only, FGMO was best, although it produced some high-energy outliers. KDEO performed best when considering the RMSD. The running-time results were partially surprising, since a sophisticated probabilistic model such as FGMO was faster than UNIO in most cases. This happens because the selected size used by UNIO needed to be 20 times larger than that of FGMO.

In the PSP problem, a trade-off between energy and RMSD is sometimes difficult to obtain, since the predicted energy can exceed the native energy value when using only van der Waals energy. A scatterplot of van der Waals energy against RMSD (Figure 7d for protein 1A11) allows us to evaluate this trade-off. We also highlighted in yellow the set of points that dominate the others, in terms of multi-criteria optimization [44]. These solutions in yellow approximate a Pareto front. The points from UNIO do not appear in the front, meaning that neither its RMSD nor its energy was better than FGMO or KDEO.

The results for the protein 2LVG, with 40 residues, are shown in Figure 8. Although FGMO showed a large variance in energy, it also found the smallest values, while UNIO and KDEO had more concentrated points. KDEO was the only method capable of finding RMSD values below 5.0 and also reached a better average than UNIO and FGMO. The running time followed the same pattern as for the 1A11 protein: KDEO was the slowest of the three and FGMO the fastest.

In the scatterplot of Figure 8d, most of the upper-left points in the front, with the smallest energies, belong to FGMO, and the lower-left points belong to KDEO. This means FGMO minimized the energy better than KDEO, but KDEO found solutions closest to the configuration found in nature.

For protein 2KK7, with 52 residues (Figure 9), FGMO again had widely spaced energy values.
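The RMSD used as criterion (3) compares a predicted structure with the native one after optimal rigid-body superposition. A minimal Kabsch-style sketch (illustrative only; ProtPred's own implementation is in C and may differ):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two N x 3 coordinate sets after optimal
    rigid-body superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)             # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                        # covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])         # guard against improper rotation
    R = Vt.T @ D @ U.T                 # optimal rotation
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

A rotated and translated copy of a structure has RMSD near zero, as expected for an alignment-invariant similarity measure.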
Considering only the average, FGMO would be the worst, but it was able to find the best energy value. This means that FGMO can reach better solutions than the others but sometimes gets stuck at a local optimum, worsening its average. Considering the RMSD aspect, KDEO was the only one able to find solutions with RMSD below 7.

Figure 7: EDA for protein 1A11 with the three proposed methods: (a) energy plot; (b) RMSD plot; (c) time needed to run; and (d) the scatter plot between energy and RMSD with the Pareto front highlighted in yellow.
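The yellow fronts highlighted in the energy-versus-RMSD scatterplots are obtained by a dominance filter. A minimal sketch, with both objectives minimized (illustrative, not the paper's code):

```python
def pareto_front(points):
    """Return the non-dominated subset of (energy, rmsd) pairs,
    where both objectives are minimized."""
    front = []
    for p in points:
        # p is dominated if some other point is at least as good
        # in both objectives
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

For example, among the pairs (1, 5), (2, 3), (3, 4) and (4, 1), the point (3, 4) is dominated by (2, 3) and the remaining three approximate the front.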
Figure 8: EDA for protein 2LVG with the three proposed methods: (a) density plot of the energy; (b) RMSD plot; (c) time needed to run; and (d) the scatter plot between energy and RMSD with the front highlighted in yellow.

We compared all non-bonded energies (van der Waals, electrostatic, solvation and hydrogen bonding) of each best predicted protein configuration against the native structure of each protein. To compute the energy of the native proteins, we converted the XYZ Cartesian coordinates from the PDB file into dihedrals and then used these values as input to one fitness-function call of ProtPred.
Figure 9: EDA for protein 2KK7 with the three proposed methods: (a) density plot of the energy; (b) RMSD plot; (c) time needed to run; and (d) the scatter plot between energy and RMSD with the Pareto front highlighted in yellow.
Figure 10: EDA for protein 2X43 with the three proposed methods: (a) density plot of the energy; (b) RMSD plot; (c) time needed to run; and (d) the scatter plot between energy and RMSD with the Pareto front highlighted in yellow.

Table 3: Best values out of 30 runs found for each protein and for each method. The columns Energy, RMSD and Time are not related to each other, meaning that the values in a row may come from different runs. The best values are highlighted.

Protein  Method  Energy (kcal/mol)  RMSD (Å)  Time (min.)
1a11     RW      -40.4              5.160     24.6
1a11     MC      -113.3             6.335     36.0
1a11     GA      -144.7             4.709     39.0
1a11     DE      -167.9             2.976     37.1
1a11     UNIO    -148.9             3.677     …
1a11     KDEO    -162.5             2.338     …
…        …       -268.8             …         …
…        …       -345.3             …         …
…        …       -413.1             …         …

Figure 11 shows a comparison between the configuration with the best energy value found by the proposed EDA and the native configuration. For all four
Figure 11: Comparison of the individual energies between the proposed EDA and the native protein configuration for proteins 1A11, 2LVG, 2KK7 and 2X43.

proteins evaluated, the EDA managed to minimize the van der Waals energy more than the native energy. This was expected, because only the van der Waals energy was minimized. However, in the native proteins the electrostatic (charge-charge) energy was better than in the structures found by the proposed EDA. Thus, if we attempted to decrease the electrostatic energy of the configuration obtained by our EDA, the van der Waals energy would probably increase. The solvation and hydrogen-bonding energies of the EDA were also higher than the native ones, apart from the hydrogen bonding of protein 2X43. Therefore, a multi-objective algorithm would be a more appropriate strategy to deal correctly with all these energies, as proposed by [9].

4.4. Comparison with other Evolutionary Algorithms
We showed that FGMO can produce more diversified solutions, yielding promising energy values. However, KDEO found a better compromise between energy and RMSD. For this reason, we selected KDEO to be compared against other heuristics from the literature. The parameters used with the other heuristics are given in Table 4.
Table 4: Parameters used with RW, MC, GA and DE.
                     RW    MC    GA          DE
Pop. size            100   1000  2000        1000
Sel. pressure        -     1.2   Tournament  -
Step                 2     -     -           -
Cross. rate          -     -     0.3         0.2
Mutation rate        -     -     0.5         -
Mutation factor      -     -     -           0.5
Differential weight  -     -     -           0.2

Figure 12 shows the scatterplot between van der Waals energy and RMSD for all proteins used in this work. As expected, RW was the worst in all cases and MC the second worst. GA was worse than DE, although GA had some points mixed with those of DE for the two larger proteins (2KK7 and 2X43). Finally, KDEO was the best in all cases considering only the RMSD.

For the smallest protein, 1A11 (Figure 12a), DE found better energy values than KDEO, although KDEO reached a better RMSD. For protein 2LVG (Figure 12b), DE also found better energy values; however, the best RMSD found by DE was 7.847, while most of the RMSD values found by KDEO were below 7.775. For protein 2KK7 (Figure 12c), KDEO found both a better energy and a better RMSD than all other heuristics, whose RMSD values were mostly above 8. Finally, for protein 2X43 (Figure 12d), KDEO also found better solutions, with many points close to
Figure 12: Comparison of the energy values between our EDA and the other heuristics: (a) protein 1A11; (b) protein 2LVG; (c) protein 2KK7 and (d) protein 2X43.

-400 energy and RMSD close to 10. No other heuristic managed to appear in the Pareto front. Therefore, it appears that the EDA becomes more efficient as the protein size increases.

Each protein has its own set of parameters (population size, selection pressure etc.) that works best, but in these experiments we fixed the same parameters for all four proteins, so that we could also evaluate how the parameters would behave for an unknown set of proteins, without calibrating them beforehand.

Finally, we ran another experiment to show the strength of the EDA against the other heuristics. We created a hypothetical chart that correlates each heuristic with the prior-knowledge level it needs to be successful (Figure 13a). The vertical axis is ordered by prior-knowledge level, i.e., how much the search space needs to be reduced so that the heuristic can hit the correct solution. P1 means that the heuristic uses the whole search space when creating the initial population (as in all previous experiments, where it ranges over (−180, +180) in the φ and ψ directions). We then subsequently shrank the search space in 5-degree offsets toward the correct solution, in this case the dihedral angles of protein 1A11 (see Figure 6b; the middle of the red lines is [φ, ψ] = [−…, −…]).
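One plausible reading of this search-space splitting is sketched below; the 5-degree step comes from the text, while the exact meaning of each level Pk (here: the sampling interval narrows by 5 degrees per side per level, clipped to the dihedral range) is our assumption:

```python
def restricted_interval(target, level, step=5.0):
    """Dihedral interval sampled at prior-knowledge level P1, P2, ...

    P1 (level=1) is the full range (-180, +180); each further level
    narrows the interval by `step` degrees on both sides, keeping it
    centred on the target angle and clipped to the dihedral range.
    """
    if level == 1:
        return -180.0, 180.0           # no prior knowledge at all
    half_width = 180.0 - (level - 1) * step
    lo = max(-180.0, target - half_width)
    hi = min(180.0, target + half_width)
    return lo, hi
```

Under this reading, a pure ab initio run corresponds to level P1 and a heavily biased run to a high level, matching the vertical axis of Figure 13a.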
5. Conclusion
We developed three different probabilistic models for an Estimation of Distribution Algorithm specific to ab initio Protein Structure Prediction with real-valued variables. We refer to these three methods as UNIO for the Univariate model-based Optimization, KDEO for the two-dimensional Kernel Density Estimation model-based Optimization and FGMO for the Finite Gaussian Mixtures model-based Optimization, yielding three different EDAs for PSP. The difference among these three EDAs is how they extract representative information from the selected individuals and use it to sample the offspring.

The first, UNIO, is the simplest. It is similar to UMDA_c but can efficiently handle problems with multiple modes instead of one. Secondly, knowing that the dihedral angles φ and ψ of amino acids are strongly correlated, we modeled the pair [φ, ψ] as correlated in our two-dimensional probabilistic models, KDEO and FGMO. The results have shown that accounting for this correlation enables us to find better protein configurations in the search space.
Figure 13: Comparison of protein configurations between five different heuristics for protein 1A11: (a) chart showing success and failure according to the prior-knowledge level and (b) the minimum search space needed for each heuristic to succeed in its predictions.

α-helices in our predictions, where van der Waals energy alone might represent the atomic interactions well. For the largest protein used in the experiments, 2X43, the RMSD is still not as good as we expected. We believe this may occur because van der Waals energy alone cannot correctly handle the loop between the two helices. On the other hand, for the three smaller proteins, 1A11, 2LVG and 2KK7, the configurations we found were good enough, producing low RMSD values.

We also compared our EDA (more specifically, KDEO) with other heuristics from the literature: Random Walk, Monte Carlo, Genetic Algorithm and Differential Evolution. We noticed that the number of solutions used to compose new ones is a critical aspect in designing a good optimization algorithm. For example, the weakest heuristic, RW, uses no information about the evolutionary process to infer new solutions, so it is completely random. MC (usually better than RW) uses information about a single solution from the current population to infer new ones. GA (usually better than MC) makes use of two individuals from the population to compose the offspring. DE (usually better than GA) uses information about three individuals to infer the new ones. Thus, we noticed a trend relating the number of individuals used to compose the offspring and the strength of the optimization algorithm.
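For instance, DE's classic rand/1 scheme builds each mutant from three distinct population members; a generic sketch (not necessarily the exact DE variant used in the comparison [17]):

```python
import random

def de_rand_1(population, i, F=0.5):
    """Build a DE 'rand/1' mutant for individual i from three
    distinct randomly chosen population members a, b, c:
    mutant = a + F * (b - c)."""
    candidates = [k for k in range(len(population)) if k != i]
    a, b, c = random.sample(candidates, 3)
    return [
        population[a][j] + F * (population[b][j] - population[c][j])
        for j in range(len(population[i]))
    ]
```

An EDA goes one step further along this trend: instead of three individuals, it summarizes the whole selected set in a probabilistic model before sampling.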
This is why we believe EDAs are better and more efficient optimization techniques: they use a set of promising solutions (the selected individuals) and try to extract representative statistical knowledge from this set (the probabilistic model) in order to infer better solutions, leading the whole evolutionary process in promising directions. We compared the EDA against RW, MC, GA and DE to show why an EDA can perform better than these earlier algorithms. We do not take RW as a serious competitor for the EDA; we merely wanted to emphasize the importance of using several potential solutions (the mechanism EDAs use) to build new ones, in contrast to approaches that do not, as illustrated in Figure 13a.

Until now, we had not seen an EDA for PSP with ab initio and full-atom representation approaches. We therefore used some knowledge of the behavior of proteins, modeling the [φ; ψ] correlation and developing our EDA specifically for the PSP problem.

At first glance, comparing the energy aspect of the two-dimensional probabilistic model-based techniques, KDEO and FGMO, KDEO appears better in most cases. However, we noticed that the number of mixture components used in FGMO has a strong effect on the energy values. We therefore still need an automated and efficient way to estimate the ideal number of mixture components for each pair [φ; ψ]. Although FGMO needs an iterative algorithm (EM) to estimate its parameters, its computational cost stays close to that of UNIO, mainly because we use the optimized PDF function (Equation 13 instead of Equation 9) and a small set of selected individuals.

Besides possible computational and statistical improvements, the fitness function could still be improved by adding other non-covalent energies, such as solvation, hydrogen bonding and electrostatics, to our experiments.
However, before adding these energies, one needs to ensure that they are computed as efficiently as possible, as already achieved for the van der Waals and solvation energies in previous works [45]. Otherwise, they would become the bottleneck of the whole algorithm. We also know that the energy effects acting in proteins are conflicting and sometimes operate on different scales. Thus, the next step is to bring the multi-objective approach of the GA from [9] to our EDA.

Appendix A. Statistical analysis
We performed the pairwise Wilcoxon test to determine whether there are relevant differences between the evaluated methods. The comparison was performed for every combination of methods for all four evaluated proteins (1A11, 2LVG, 2KK7 and 2X43). The p-values greater than 0.05 are highlighted in bold; they indicate that there is no significant difference between the corresponding methods, according to the Wilcoxon test. Table A.5 shows the test for the van der Waals energy values. In this table there is a single value in bold, indicating that, according to the Wilcoxon test, most results concerning van der Waals energy were different for all proteins evaluated. Table A.6 shows the test for the RMSD of the best protein configuration found over all runs for each method. Although the Wilcoxon tests for RW-KDEO and DE-KDEO gave high p-values for almost all proteins, Figure 12 shows that KDEO was the only method capable of finding RMSD values below 7.

Table A.5: P-value comparison with the pairwise Wilcoxon test evaluating the van der Waals energy.

            1A11     2LVG     2KK7     2X43
RW-MC       6.3e-10  5.9e-10  5.8e-10  5.9e-10
RW-GA       6.3e-10  5.9e-10  5.8e-10  5.9e-10
RW-DE       6.3e-10  5.9e-10  5.8e-10  5.9e-10
RW-KDEO     6.3e-10  5.9e-10  5.8e-10  5.9e-10
MC-GA       6.3e-10  5.9e-10  5.8e-10  5.9e-10
MC-DE       6.3e-10  5.9e-10  5.8e-10  5.9e-10
MC-KDEO     6.3e-10  9.5e-09  5.8e-10  5.9e-10
GA-DE       6.3e-10  5.9e-10  5.8e-10  1.6e-07
GA-KDEO     6.3e-10  7.3e-07  5.8e-10  5.9e-10
DE-KDEO     4.3e-09  3.1e-09  5.8e-10  5.9e-10
UNIO-KDEO   2.8e-07  2.7e-07  5.8e-10  5.9e-10
UNIO-FGMO   3.2e-06  …        …        …

Table A.6: P-value comparison with the pairwise Wilcoxon test evaluating the RMSD.

            1A11     2LVG     2KK7     2X43
RW-MC       7.3e-04  5.4e-07  2.1e-04  1.4e-01
RW-GA       …        …        …        …
MC-GA       9.3e-03  …        …        …
MC-DE       3.4e-03  6.0e-04  …        …
GA-DE       …        …        …        …
UNIO-KDEO   2.9e-02  …        …        …
UNIO-FGMO   4.0e-02  …        …        …
DE-KDEO     6.6e-12  1.6e-03  9.1e-04  6.8e-16
UNIO-KDEO   1.5e-12  1.4e-11  1.8e-07  3.6e-16
UNIO-FGMO   1.7e-06  1.9e-05  …        …
KDEO-FGMO   1.0e-15  8.3e-12  3.5e-10  6.8e-16

Acknowledgment

The authors would like to thank the Brazilian research-funding agency FAPESP for the financial support given to this work (processes 2010/02928-2 and 2013/00919-4). Part of this work was carried out while the first author visited Durham University through the Durham/USP Research Exchange Programme.
References

[1] M. Herper, The cost of creating a new drug now $5 billion, pushing big pharma to change (November 2013).
[2] T. Lengauer, Bioinformatics - From Genomes to Drugs, Wiley, 2002.
[3] J. M. Bujnicki, Prediction of Protein Structures, Functions, and Interactions, Wiley, West Sussex, 2009.
[4] K. A. Dill, Theory for the folding and stability of globular proteins, Biochemistry 24 (6) (1985) 1501-1509.
[5] Y. Duan, P. A. Kollman, Computational protein folding: From lattice to all-atom, IBM Systems Journal 40 (2) (2001) 297-309. doi:10.1147/sj.402.0297.
[6] C. Hardin, T. Pogorelov, Z. Luthey-Schulten, Ab initio protein structure prediction, Current Opinion in Structural Biology 12 (2) (2002) 176-181.
[7] T. W. de Lima, R. A. Faccioli, P. H. R. Gabriel, A. C. B. Delbem, I. N. da Silva, Evolutionary approach to protein structure prediction with hydrophobic interactions, in: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO '07, ACM, New York, NY, USA, 2007, pp. 425-425. doi:10.1145/1276958.1277045.
[8] I. Berenboym, M. Avigal, Genetic algorithms with local search optimization for protein structure prediction problem, in: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, GECCO '08, ACM, New York, NY, USA, 2008, pp. 1097-1098. doi:10.1145/1389095.1389296.
[9] C. R. S. Brasil, A. C. B. Delbem, F. L. B. da Silva, Multiobjective evolutionary algorithm with many tables for purely ab initio protein structure prediction, Journal of Computational Chemistry 34 (20) (2013) 1719-1734. doi:10.1002/jcc.23315.
[10] P. Larranaga, J. A. Lozano, Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation, Vol. 2, Springer, New York, 2002.
[11] M. Wand, C. Jones, Kernel Smoothing, Monographs on Statistics and Applied Probability, Chapman & Hall, Boston, USA, 1995.
[12] G. McLachlan, D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics, Wiley, Hoboken, NJ, USA, 2004.
[13] A. D. Baxevanis, B. F. F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd edition, Wiley-Interscience, 2001.
[14] K. Pearson, The random walk, Nature 72 (1905) 294; 318; 342.
[15] N. Metropolis, S. Ulam, The Monte Carlo method, Journal of the American Statistical Association 44 (247) (1949) 335-341.
[17] R. Storn, K. Price, Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces, Journal of Global Optimization 11 (4) (1997) 341-359. doi:10.1023/A:1008202821328.
[18] C. Branden, J. Tooze, Introduction to Protein Structure, 2nd ed., Garland Publishing, New York, NY, USA, 1999.
[19] UniProt, UniProtKB/TrEMBL protein database release 2014_07 statistics (07 2014).
[20] RCSB PDB, Yearly growth of total structures (08 2014).
[21] R. Bonneau, D. Baker, Ab initio protein structure prediction: Progress and prospects, Annual Review of Biophysics and Biomolecular Structure 30 (2001) 173-189.
[22] Y. Cui, R. S. Chen, W. H. Wong, Protein folding simulation with genetic algorithm and supersecondary structure constraints, Proteins: Structure, Function, and Genetics 31 (1998) 247-257.
[23] M. Pelikan, D. E. Goldberg, E. Cantu-Paz, BOA: The Bayesian optimization algorithm, in: Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99.
[24] R. Santana, P. Larranaga, J. A. Lozano, Protein folding in simplified models with estimation of distribution algorithms, IEEE Transactions on Evolutionary Computation 12 (4) (2008) 418-438.
[25] P. Larranaga, R. Etxeberria, J. Lozano, J. Pena, et al., Optimization by learning and simulation of Bayesian and Gaussian networks.
[26] M. Pelikan, H. Muehlenbein, The bivariate marginal distribution algorithm, in: R. Roy, T. Furuhashi, P. Chawdhry (Eds.), Advances in Soft Computing, Springer London, 1999, pp. 521-535. doi:10.1007/978-1-4471-0819-1_39.
[27] P. A. N. Bosman, D. Thierens, IDEAs based on the normal kernels probability density function.
[28] P. A. Bosman, D. Thierens, Advancing continuous IDEAs with mixture distributions and factorization selection metrics, in: Proceedings of the Optimization by Building and Using Probabilistic Models (OBUPM) Workshop at the Genetic and Evolutionary Computation Conference GECCO-2001, Morgan Kaufmann, 2001, pp. 208-212.
[29] M. Gallagher, M. Frean, T. Downs, Real-valued evolutionary optimization using a flexible probability density estimator, in: Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 1, GECCO '99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, pp. 840-846.
[30] C. Ahn, R. Ramakrishna, D. Goldberg, Real-coded Bayesian optimization algorithm: Bringing the strength of BOA into the continuous world, in: K. Deb (Ed.), Genetic and Evolutionary Computation GECCO 2004, Vol. 3102 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2004, pp. 840-851. doi:10.1007/978-3-540-24854-5_86.
[31] M. Hauschild, M. Pelikan, A survey of estimation of distribution algorithms.
[32] D. R. F. Bonetti, A. Delbem, J. Einbeck, Bivariate estimation of distribution algorithms for protein structure prediction, in: 29th International Workshop on Statistical Modelling, Vol. 2, 2014, pp. 15-18.
[33] G. Ramachandran, C. Ramakrishnan, V. Sasisekharan, Stereochemistry of polypeptide chain configurations, Journal of Molecular Biology 7 (1) (1963) 95-99.
[34] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Vol. 26, CRC Press, 1986.
[35] W. N. Venables, B. D. Ripley, Modern Applied Statistics with S-Plus, Springer-Verlag, New York, NY, 1994.
[36] T. Moon, The expectation-maximization algorithm, IEEE Signal Processing Magazine 13 (6) (1996) 47-60.
[37] D. E. Goldberg, The Design of Innovation: Lessons from and for Competent Genetic Algorithms, Kluwer Academic Publishers, Norwell, USA, 2002.
[38] M. Hollander, D. Wolfe, Nonparametric Statistical Methods, Wiley Series in Probability and Statistics - Applied Probability and Statistics Section, Wiley, 1973.
[39] M. J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004.
[40] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne, The Protein Data Bank (2000).
[41] M. Galassi et al., GNU Scientific Library Reference Manual (3rd ed.), ISBN 0954612078 (01 2009).
[43] D. R. Bonetti, A. C. Delbem, G. Travieso, P. S. L. de Souza, Enhanced van der Waals calculations in genetic algorithms for protein structure prediction, Concurrency and Computation: Practice and Experience 25 (15) (2013) 2170-2186. doi:10.1002/cpe.2913.