Decoy Selection for Protein Structure Prediction Via Extreme Gradient Boosting and Ranking

Abstract

Identifying one or more biologically-active/native decoys among millions of non-native decoys is one of the major challenges in computational structural biology. The extreme imbalance between positive and negative samples (native and non-native decoys) in a decoy set makes the problem even harder. Consensus methods show varied success in decoy selection, despite issues with clustering large decoy sets and decoy sets that show little structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, lack of generalization over varied test cases remains a bottleneck for these methods. We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for decoy sets consisting mostly of low-quality decoys. ML-Select is a useful method for decoy selection. This work suggests further research into more effective ways to adopt machine learning frameworks to achieve robust performance for decoy selection in template-free protein structure prediction.


A PREPRINT

Nasrin Akhter, Gopinath Chennupati, Hristo Djidjev, and Amarda Shehu

Department of Computer Science, George Mason University, Fairfax, VA, 22030, USA
Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Bikini Atoll Rd., 87545, Los Alamos, USA
Department of Bioengineering, George Mason University, Fairfax, VA, 22030, USA
School of Systems Biology, George Mason University, Manassas, VA, 20110, USA
* Corresponding author: [email protected]

October 6, 2020

ABSTRACT

Background:

Results:

We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through a template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, all the while consistently offering better performance on varied test-cases. Moreover, ML-Select shows promising results even for the decoy sets consisting of mostly low-quality decoys.

Conclusions:

ML-Select is a useful method for decoy selection. This work suggests further research in finding more effective ways to adopt machine learning frameworks in achieving robust performance for decoy selection in template-free protein structure prediction.

Protein molecules play a vital role in controlling the biological activities of a cell. There are a number of attempts in wet laboratories to determine biologically-active/native tertiary structures as a route to decoding protein function [35]. Technological advances have now made it possible to generate hundreds of thousands of tertiary structures for a given amino-acid sequence, known as decoys, in a few CPU hours [50]. The multiplicity of decoys necessitates recognizing high-quality, near-native decoys among hundreds of thousands of decoys in an ensemble. Identifying these near-native decoys is a challenging problem in computational structural biology, and is known as decoy selection.

Template-free methods, which generate low-energy tertiary structures in the absence of one or more structural templates from homologous sequences, have now become prominent. The most popular ones include Rosetta [29] and Quark [56]. To compute low-energy structures, these methods employ stochastic optimization to find local minima of a selected energy/scoring function. It is well known that energy bias often does not lead to tertiary structures that are close to the native. Therefore, identifying near-natives from a large ensemble of decoys remains an open problem [27].

Consequently, other decoy selection strategies gained momentum due to the weak role of energy in recognizing near-native conformations, which is reflected in the Critical Assessment of protein Structure Prediction (CASP) [27] series of community-wide experiments. Clustering-based methods dominate the model quality assessment (MQA) performed in CASP. Clustering-based decoy selection methods work on the notion that decoys are randomly distributed around the native structure, which a consensus method ought to reveal. Clustering-based decoy selection performs better when the ensemble consists mostly of good-quality decoys. However, if the sampling of decoys in the decoy generation stage is sparse, resulting in many dissimilar decoys in an ensemble, consensus methods fail to recognize exceptionally good decoys [42]. Moreover, the time complexity incurred in clustering a large decoy ensemble creates another bottleneck.

In addressing the above challenges in decoy selection, we propose an alternative approach that takes advantage of the consensus methods and a machine learning technique. As described in [8], the protein energy landscape reveals important statistical information regarding the conformational organization and pathway. In this paper, we leverage the quantitative knowledge garnered from the energy landscape of a protein molecule in a machine learning framework to address the challenges in decoy selection. Supervised machine learning methods are gaining prominence in computational biology applications. These methods generate predictive models that learn subtle patterns from the data without making any prior assumptions [38]. One of the biggest challenges for these predictive models is to succeed even when the dataset is extremely imbalanced. Data imbalance is a common problem in computational biology and bioinformatics [61]. For instance, one of the benchmark proteins in our experiments contains only . of positive instances (near-natives) among , decoys.
Even in such a sparse decoy set, the proposed method successfully identifies the near-natives.

Our method works as follows: first, the method extracts local structures from the energy landscape probed through a template-free protein structure prediction method; next, a machine learning-based decoy selection method uses these local structures to select groups of good-quality decoys. The method outperforms the state-of-the-art decoy selection strategies in [4].

The diverse collection of decoy selection strategies can be categorized into single-model, multi-model, quasi-single, and machine learning (ML) methods. Single-model methods predict quality on a per-decoy basis [52]; these are physics-based and/or knowledge-based. Physics-based methods employ different atomic interactions such as electrostatic interactions, van der Waals interactions, and hydrogen bonding [7, 16, 28], whereas knowledge-based scoring functions employ statistical analysis of known native structures [36, 40, 51]. Of the two, knowledge-based methods are known to be more successful in predicting high-quality decoys [20, 47].

Cluster-based methods work on the premise that the decoys are randomly distributed around the 'true' answer [19, 33], which is not entirely valid due to the inherent bias associated with the template-free protein structure prediction methods used to generate the decoys. Apart from the high time complexity incurred by clustering a large decoy ensemble, cluster-based methods often fail to identify good-quality decoys (near-natives) for hard targets, which are more sparsely sampled [42]. Despite these bottlenecks, cluster-based decoy selection strategies have been the most popular methods in the decoy selection literature. Quasi-single models combine the single-model and consensus methods. First, some high-quality reference structures are selected; then the remaining decoys in the ensemble are compared with the reference structures [26].
These methods are shown to perform better [23, 27, 48].

Recent investigations employ machine learning (ML) methods for decoy selection [25, 34, 43]. For instance, the work in [39] uses a Support Vector Machine (SVM) together with the statistical scoring function GOAP [62] to distinguish native decoys from non-native ones. ML-based decoy selection methods are mostly single-model methods. These methods leverage structural features of proteins to assess decoy quality. The work in [5] employs non-negative matrix factorization (NMF) to select the best cluster of decoys and the best decoy in the decoy set, and can be extended to large scale using the distributed implementations [15] of NMF.

Deep learning has also become a popular approach to address ML problems in bioinformatics [31]. Along with a variety of applications, such as DNA sequencing [30], enzyme function prediction [32], de-novo prediction of membrane proteins [53], protein contact map prediction [55], and protein secondary structure prediction [54], deep learning has been successfully utilized for protein decoy selection as well. For instance, DeepQA, a deep belief network-based protein quality estimation (decoy selection) model, outperforms SVM-based methods and achieves state-of-the-art performance on the CASP dataset [10]. Convolutional neural network-based models have also seen success in protein decoy selection [10, 24, 49].

In this paper, we prefer to investigate shallow models, which, unlike deep architectures, do not place such high demands on the size of the training dataset in relation to the number of parameters. As our ability to expediently generate or obtain structure data grows, deep learning will surely provide an interesting way forward that we plan to pursue in tandem with strategies to reduce the dimensionality of the loss function.

In this paper, we apply an ML technique to a multi-model method that exploits local structures extracted from an energy landscape [44]. The proposed ML-based multi-model method offers promising results in terms of more true positives and fewer false positives.

First, we elaborate on the concept of energy landscape that forms the basis of our decoy selection method.

The energy landscape is an instance of a more general fitness landscape that comprises a set of points $X$, a neighborhood $N(X)$ defined on $X$, a distance metric on $X$, and a fitness function $f : X \to \mathbb{R}_{\ge 0}$ that assigns a fitness to every point in $X$. The points in $X$ acquire neighbors via the neighborhood function. In the context of decoy selection, the points $x \in X$ represent decoy structures, and the fitness function is typically an energy function. Effectively, the energy landscape of decoy structures characterizes the mapping of structures to their internal energy and provides important quantitative information about the structure space.

A protein energy landscape features an ensemble of structural states near or far from the native state and an extensive collection of intermediate states that shape the multi-modal and multi-dimensional nature of the landscape [45]. The concept of a basin is connected to a local/focal minimum. A focal minimum in a landscape is surrounded by a basin of attraction, which is the set of points on the landscape from which steepest descent/ascent converges to that focal optimum. Barriers separate basins and regulate transitions of a system between different structural states corresponding to basins in the landscape.

Under the energy landscape treatment, the biologically-active/native state(s) can be determined by identifying corresponding basins, which requires one to extract the underlying organization of decoys to identify basins in the landscape. One approach to achieve this objective is to embed the decoys in a connectivity data structure and utilize energies to identify basins. Consider a set $\Omega$ of decoys. $\Omega$ can be embedded in a nearest-neighbor graph (nn-graph) $G = (V, E)$ [11]. The vertex set $V$ is populated with the decoys, and the edge set $E$ is populated by inferring the neighborhood structure of the landscape.
The distance between two structures is measured via root-mean-squared-deviation (RMSD) after each of the structures is superimposed over some reference structure (arbitrarily chosen to be the first in the ensemble); the superimposition minimizes differences due to rigid-body motions. Each vertex $u \in V$ is connected to vertices $v \in V$ if $d(u, v) \le \epsilon$, where $\epsilon$ is a user-defined parameter. If the landscape has been sampled sparsely and in a non-uniform way, a small $\epsilon$ value may yield a disconnected graph. One way to prevent this is to increase $\epsilon$ while controlling the density of the resulting nn-graph via the number of nearest neighbors of $u$.

The local minima of the landscape can be detected by analyzing the nn-graph. A vertex $u \in V$ is a local minimum if $f(u) \le f(v)$ for all $v \in N(u)$, where $N(u)$ denotes the neighborhood of $u$. The remaining vertices are then assigned to basins as follows. Each vertex $u$ is associated with a negative gradient estimated by selecting the edge $(u, v)$ that maximizes the ratio $[f(u) - f(v)]/d(u, v)$. From each vertex $u$ that is not a local minimum, the negative gradient is followed (via the edge that maximizes the above ratio) until a local minimum is reached. Vertices that reach the same local minimum are assigned to the basin associated with that minimum.

The basins extracted from the energy landscape can be useful in decoy selection. The works in [3, 4] show that simple, ranking-based basin selection strategies outperform a standard clustering-based decoy selection method in terms of purity (the percentage of true positives, which penalizes the selected basin by the extent of false positives found in that basin). Basins can be ranked by a combination of basin characteristics. For instance, basins can be ranked by size (S) alone, or by a combination of size and the energy (S+E) of the focal minimum of that basin.
The size of a basin is the number of decoys that belong to it. Alternatively, size and energy can be used as conflicting objectives in a multi-objective, Pareto-based selection strategy. In multi-objective optimization, solution A dominates solution B if A is better than or equal to B for all optimization objectives and strictly better than B for at least one objective. In the context of basins, the Pareto Rank (PR) of basin A is the number of basins that dominate A. The Pareto Count (PC) of basin A is the number of basins that A dominates. Specifically, basins can be ranked by their PR, or by PR and PC (PR+PC). Empirical studies conducted in [4] demonstrate the superiority of the Pareto-based basin selection strategies over cluster-based, size-based, and energy-based decoy selection methods.
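The basin assignment and Pareto ranking described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the SBL implementation used in the paper: the function names `assign_basins` and `pareto_rank_count`, the dense pairwise distance matrix, and the convention that a larger basin size and a lower focal energy are the "better" directions of the two objectives are all choices made here for the example.

```python
import numpy as np


def assign_basins(dist, energy, eps):
    """Return, for each decoy, the index of the local minimum its descent reaches.

    dist   : (n, n) pairwise distance matrix (e.g. RMSD), assumed precomputed
    energy : length-n array of decoy energies
    eps    : distance cutoff defining neighbors in the nn-graph
    """
    dist = np.asarray(dist, dtype=float)
    energy = np.asarray(energy, dtype=float)
    n = len(energy)
    downhill = np.arange(n)  # local minima point to themselves
    for u in range(n):
        nbrs = np.flatnonzero(dist[u] <= eps)
        nbrs = nbrs[nbrs != u]
        if nbrs.size == 0:
            continue  # isolated vertex: trivially its own minimum
        # Ratio [f(u) - f(v)] / d(u, v); guard against zero distances.
        slope = (energy[u] - energy[nbrs]) / np.maximum(dist[u, nbrs], 1e-12)
        best = int(np.argmax(slope))
        if slope[best] > 0:  # some neighbor has strictly lower energy
            downhill[u] = nbrs[best]
    basin = np.empty(n, dtype=int)
    for u in range(n):
        v = u
        while downhill[v] != v:  # energy strictly decreases, so this terminates
            v = downhill[v]
        basin[u] = v
    return basin


def pareto_rank_count(size, energy):
    """Pareto Rank (number of dominators) and Pareto Count (number dominated).

    Assumes larger size and lower focal energy are preferred.
    """
    size = np.asarray(size, dtype=float)
    energy = np.asarray(energy, dtype=float)
    m = len(size)
    pr = np.zeros(m, dtype=int)
    pc = np.zeros(m, dtype=int)
    for a in range(m):
        for b in range(m):
            if a == b:
                continue
            no_worse = size[a] >= size[b] and energy[a] <= energy[b]
            better = size[a] > size[b] or energy[a] < energy[b]
            if no_worse and better:  # basin a dominates basin b
                pc[a] += 1
                pr[b] += 1
    return pr, pc
```

Basin sizes can then be obtained by counting how many decoys map to each minimum (for instance with `np.unique(basin, return_counts=True)`), and the focal energy of a basin is the energy of its minimum. The quadratic loops are kept for readability; a real ensemble of tens of thousands of decoys would need a sparse neighbor structure.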

Figure 1: Three components in one of the basin-graphs of .

Despite good performance, ranking-based decoy selection strategies are unable to perform consistently well over all test cases regardless of their difficulty levels. Neither S+E nor PR+PC provides fair performance (fewer false positives and more true positives in the selected clusters/basins) over all or most of the test cases. One would prefer a decoy selection method that provides reasonably good performance for all or most of the test cases regardless of difficulty level or heterogeneity in structural characteristics. This is the premise of the work presented in this paper.

Shortcomings of ranking-based basin selection strategies necessitate a new basin selection strategy. On that premise, we present a novel basin-based decoy selection method, referred to as ML-Select, that employs machine learning techniques. The method operates in two phases: the first phase captures n pure basins, while the second phase purifies the selected n basins and offers the top k purified basins as output. Both phases involve fitting a regression model and a selection approach (ranking) based on the regression results. To generalize across all possible difficulty levels of proteins, we randomly select two proteins per difficulty level (easy, medium, hard) to train the models. Therefore, the performance of our models does not depend on a particular test case or difficulty level. We now describe the two phases of ML-Select in further detail.

In the first phase, ML-Select predicts the purity of basins and ranks them based on the predicted values. We use two kinds of attributes, Pareto-based and graph-based, as features to build the regression model. The Pareto-based features are PR and PC, computed by treating basin size and focal energy as two conflicting optimization objectives [4]. Each basin is assigned two ranks based on its PR and PC values, which serve as two different features.

The graph-based feature, the number of connected components, characterizes a spatial attribute of the graphical representation of basins. The basins extracted from the nn-graph (of all the decoys in the dataset) using the Structural Bioinformatics Library (SBL) [11] are essentially bags of decoys. Estimating the spatial structure of the decoys in a specific basin is hard. Therefore, we consider the number of connected components as one of the features for ML-Select.

In order to easily recover the relative spatial organization of the decoys comprising a basin, we construct m different nearest-neighbor graphs using the decoys populating m different basins. We use pdist + 1 Å as the distance threshold to create the nearest-neighbor graphs, where pdist refers to the average pairwise distance between the decoys of the basin. Depending on the distances between the decoys in a basin, the corresponding graph may consist of one or more connected components, which signify the structural attribute of a basin. Figure 1 shows an example graphical representation of the components in a basin. We rank the basins based on the predicted purity and pass the top n basins to the second phase for further purification.

In the second phase, we predict the root-mean-squared-deviation (rmsd) of a decoy from the true native. The training set of this phase uses the same proteins as in the first phase. However, the features in the second phase are different from those of the first phase.
We use twenty features, of which three are knowledge-based potentials and the remaining are energy scores from the Rosetta suite [9]. The three knowledge-based features are RW, RWplus [63], and dDFIRE [57]. RW is a distance-dependent atomic potential and RWplus is a side-chain orientation-dependent potential; the third feature, dDFIRE, improves the DFIRE statistical potential by adding an orientation dependency. The remaining features are energy terms in the REF2015 scoring function [6] in the Rosetta suite of scoring functions. The Rosetta REF2015 energy terms are the Lennard-Jones attractive and repulsive terms that capture interactions between atoms in different residues, the Lazaridis-Karplus solvation energy, the intra-residue Lazaridis-Karplus solvation energy term, the asymmetric solvation energy term, the Lennard-Jones repulsive term that captures interactions between atoms in the same residue, the Coulombic electrostatic potential with a distance-dependent dielectric, the proline ring closure energy and energy of the psi angle of the preceding residue, the backbone-backbone hydrogen-bonding energy terms between atoms close and distant in the primary sequence, the sidechain-backbone and sidechain-sidechain hydrogen-bonding energy terms, the Ramachandran preferences term, the (backbone) omega dihedral term, the probability of an amino acid given torsion values for the phi and psi backbone angles, the internal energy of sidechain rotamers term (as derived from Dunbrack's statistics), and a special torsional potential term to keep the tyrosine hydroxyl in the plane of the aromatic ring.

The top n pure basins from the first phase are treated as test cases. That is, we build n regression models for the n basins that are passed to the second phase from the first phase. Each of these basins is further purified as follows. In a given basin from phase 1, if the predicted rmsd of a decoy exceeds a pre-defined threshold (dist_thresh, explained later in the implementation details), we remove that decoy from the test-case basin. Effectively, the decoys that are further away from the true native are removed from the selected basins. As a result, the purity of the selected basin improves. We rank the basins based on the resulting purity after the non-native decoy elimination and offer the top k basins as the result at the end of the second phase. The purification process in this phase risks eliminating good decoys (ones near the native). We mitigate this effect with a shift in the pre-defined distance threshold, dist_thresh ± τ, where τ ∈ { , , } of the pre-defined threshold.
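The purification step of the second phase can be illustrated with a short sketch. It assumes the predicted rmsd values are already available from a fitted regression model, interprets τ as a signed fraction of dist_thresh (so the cutoff becomes dist_thresh × (1 + τ)), and uses the hypothetical helper names `purify_basin` and `purity`; none of these details are confirmed by the paper beyond the thresholding idea itself.

```python
import numpy as np


def purify_basin(pred_rmsd, dist_thresh, tau=0.0):
    """Boolean mask of the decoys kept in a basin after purification.

    pred_rmsd   : predicted rmsd to the native for every decoy in the basin
    dist_thresh : pre-defined distance threshold (in Angstrom)
    tau         : assumed signed fractional shift of the threshold
    """
    cutoff = dist_thresh * (1.0 + tau)
    # Decoys predicted to lie farther than the cutoff are removed.
    return np.asarray(pred_rmsd, dtype=float) <= cutoff


def purity(is_near_native):
    """Fraction of decoys in a basin (or group of basins) that are near-native."""
    labels = np.asarray(is_near_native, dtype=bool)
    return float(labels.mean()) if labels.size else 0.0
```

On a benchmark with known natives, comparing `purity(labels)` with `purity(labels[purify_basin(pred, dist_thresh)])` shows how much a basin gains from the elimination, and a positive `tau` loosens the cutoff to retain borderline decoys.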
The effect of the threshold variation on purity is discussed later in the results.

We evaluate the performance of our approach using two metrics: percentage of true positives (n) and purity (p). At a given distance threshold dist_thresh (explained in the implementation details), n is the ratio of the number of true near-natives in the selected basin group $B_{1-x}$, where $x \in \{1, 2, 3\}$, to the total number of true near-natives in the decoy ensemble. This metric resembles sensitivity (recall, or true positive rate). However, even a significantly high n becomes less useful if the number of false positives in the selected basin is high, since a random draw from the selected basin would then have a lower probability of yielding a true near-native. The metric p accounts for this scenario by penalizing a large basin (or a group of selected basins) containing a large number of true and false positives to the extent of the false-positive population present in that basin. p is computed as the ratio of the number of true positives to the size of a basin (or a group of basins). Therefore, a basin with a large number of false positives results in a low purity regardless of the number of true positives in that basin. In essence, the purity metric resembles the precision of our method. Specifically, we discuss the performance of ML-Select and four other competing methods in terms of the purity metric due to its balanced treatment of false and true positives. We select these metrics, which focus on true and false positives rather than on true and false negatives, because we are more concerned with increasing the probability of selecting a true positive from the selected basins in a random draw, which can be achieved by minimizing the false positives and maximizing the true positives.

We use a distance threshold of 1 Å for creating the nn-graph of a decoy ensemble via SBL [11].
Since the Rosetta decoy generation protocol may produce sparse samples, a low threshold may result in a disconnected graph. To address this problem, we increase the initial threshold until the graph is connected. The minimum distance from a decoy in an ensemble to the true native is referred to as min_dist. For a protein with a known native structure, all decoys under the threshold dist_thresh are deemed near-natives. As there are three different categories of test cases, we set the dist_thresh parameter to determine the near-natives on a per-case basis. More specifically, dist_thresh is set to 2 Å for the easy cases (min_dist < ). For the medium cases ( ≤ min_dist < ), dist_thresh is either . or . For the hard cases ( < min_dist), we increase dist_thresh until one of the methods accumulates a non-zero number of near-natives in the top selected basins. Moreover, if a test case belongs to a particular category based on min_dist, but very few near-natives can be found according to that min_dist, we move that test case to the next difficulty level.

We use a boosting-based ensemble learning approach, XGBoost [21], to build the regression models. We use a linear regression model via XGBoost in both phase 1 and phase 2. XGBoost is a fast, scalable method that follows the principle of gradient boosting. XGBoost controls over-fitting well by using a more regularized model formalization [12]. We calculate the knowledge-based features as follows. We calculate the RW potentials using calRW and calRWplus, executable programs from the Zhang lab [2]. The dDFIRE potential is calculated using the dDFIRE program [1]. We use rounds of boosting to build our regression model. For training the regression models, we choose the top q pure basins and randomly draw q basins (total q basins) from the rest of the training data.
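The step of raising the nn-graph threshold until the graph becomes connected can be sketched with a small union-find routine. This is an illustrative example only: the helper names `count_components` and `connecting_threshold`, the dense distance matrix, and the fixed increment `step` are assumptions made here, since the text does not state how the threshold is increased.

```python
import numpy as np


def count_components(dist, eps):
    """Number of connected components of the graph with edges d(u, v) <= eps."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for u in range(n):
        for v in range(u + 1, n):
            if dist[u][v] <= eps:
                parent[find(u)] = find(v)  # merge the two components
    return len({find(i) for i in range(n)})


def connecting_threshold(dist, eps=1.0, step=0.5):
    """Increase eps by `step` (an assumed increment) until the graph is connected."""
    while count_components(dist, eps) > 1:
        eps += step
    return eps
```

The same `count_components` helper could also serve as the graph-based feature of the first phase, by applying it to the decoys of a single basin with `eps` set to their average pairwise distance plus 1 Å.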

We use easy, medium, and hard proteins for training the models. For testing, we use an easy, a medium, or a hard protein that has not been used in the training dataset. To test/evaluate on a protein from the training set, another protein takes its place for training. Eventually, all 18 proteins are tested and there is no overlap between the training and testing data. To address the randomness in the training phase, we run the models on the test data for times, and report the average p and n. We use for q in this experiment. Construction of the nn-graph by SBL takes from to hours depending on the lengths (number of amino acids) of the proteins and the size of the decoy ensembles. Construction of the regression models takes about a minute. Once the model has been built, testing it on a new dataset with runs takes about 12 seconds. Basin-Size and Basin-Size+Energy take about 20 seconds to test a new dataset. The runtimes for Pareto-Rank and Pareto-Rank+Count are and seconds, respectively.

Table 1: Testing dataset (* denotes proteins with a predominant β fold and a short helix).

Difficulty  PDB ID  Fold   Length  |Ω|      min_dist (Å)
Easy                α+β    61      58,745   0.
                    β      68      68,000   0.
                    α+β    64      60,000   0.
                    α+β    88      60,000   0.
                    α+β    74      60,500   0.
Medium              β      53      61,000   0.
                    α      70      58,491   1.
                    β*     64      65,000   1.
                    α+β    65      60,000   1.
                    α+β    69      51,724   1.
                    β      66      66,000   1.
Hard                β*     99      60,000   1.
                    α      93      54,626   2.
                    α      78      57,000   3.
                    α      123     54,795   3.
                    coil   62      60,000   3.
                    α      83      55,000   4.
                    β      146     53,000   9.

We experimented with eighteen proteins of different lengths and folds. These proteins constitute a benchmark dataset often used by decoy generation algorithms [17, 37, 41, 46, 58, 59]. We used the Rosetta template-free (decoy generation) protocol to generate around , to , decoys per target. Table 1 presents all eighteen proteins arranged into three categories (easy, medium, and hard). The difficulty level has been determined using the minimum distance (min_dist) between the generated decoys and a known native conformation of the corresponding protein. The size of the decoy ensemble |Ω| for each target is shown in the |Ω| column.

Figure 2 provides a visual comparison of the methods with respect to the quality of the selected decoys in the top three basins. We present three representative cases from the easy, medium, and hard categories. Each plot shows the decoys as two-dimensional dots where the x-axis tracks the lRMSD of each decoy and the y-axis tracks the Rosetta REF2015 (all-atom) energy (measured in Rosetta Energy Units, REUs). Decoys in each basin are colored in maroon, gold, and navy to distinguish between the top three basins.

The protein with known native structure under PDB id , shown in the first column in Figure 2, presents an easy case. ML-Select, shown in the top row, captures the best-quality decoys (near-natives, low lRMSD from the native) in the top three basins (p: 99. ). All the decoys in the top three basins are within Å of the known native. On the other hand, the top three basins selected by the four other strategies contain decoys with larger lRMSD, which lowers the purity (as low as ). For instance, Pareto-Rank captures very few decoys in the top three basins. Moreover, some of these decoys are more than Å away from the native.

Although ML-Select obtains basins of smaller size compared to those of the existing strategies for the medium case, , the quality of the selected decoys is better, which results in higher purity ( , , . for $B_1$, $B_{1-2}$, $B_{1-3}$, respectively). In contrast, the larger basins, selected by Basin-Size, PR, and PR+PC, suffer from low purity due to the presence of numerous non-near-natives (minimum . and maximum . ). Basin-Size+Energy performs fairly in this scenario (p: 94. for $B_{1-2}$). However, purity diminishes as more basins are added in the selection ( . for $B_{1-3}$). Evidently, it is more likely that a random draw would yield a near-native from the top basin (or group of basins) if ML-Select is employed to perform the selection.

ML-Select excels even in the hard cases, as shown for the protein with known native structure under PDB id . The quality of the decoys selected by ML-Select is as good as the Rosetta structure prediction protocol can sample (p: 94. for $B_1$). None of the existing basin-based strategies provides any near-native in its selected basins. That is, all the top basins selected by the four other decoy selection strategies contain only false positives (decoys with larger lRMSD from the native ( ≥ Å)).

Figure 3 compares the top basins selected by ML-Select with the top clusters selected by a state-of-the-art clustering-based model quality estimation method, MUFOLD-CL [60]. Since larger clusters are considered to have tighter distributions and are typically used for near-native model selection in practice [60], we select the three largest clusters resulting from MUFOLD-CL as the top three clusters for comparison. As shown in Figure 3, the top three clusters resulting from MUFOLD-CL are much larger; they contain near-natives as well as many non-natives. The presence of many non-natives lowers purity. For instance, for the easy protein , despite containing . near-natives in the top cluster, purity is only . This is due to the presence of many non-natives.

Table 2 compares ML-Select with four basin-based decoy selection strategies proposed in [4] on the easy, medium, and hard test cases. The comparison focuses on the p metric over $B_{1-x}$ groups of decoys, where x varies from 1 to 3. The results with respect to the n metric and the size (s) of each $B_{1-x}$ are also shown.
Empirical evaluation conducted in [4] shows that the four existing selection methods outperform a clustering-based decoy selection strategy. Figure 4 compares the five selection strategies in terms of the p metric. The x-axis shows the test cases while the y-axis tracks the purity (p) achieved by each method. The bold font indicates the best result among all the experimental methods.

The purity of the top basin for all five selection strategies (except for PR, which performs much worse than the others) is comparable for the easy cases ( , , , tig , and ). However, the purity diminishes as more basins are added to the selection for the four existing selection strategies (Size, Size+Energy, PR, PR+PC). For instance, ML-Select scores more than for the top basins (B − ) for all the easy test cases, whereas Basin-Size can achieve only . for , Basin-Size+Energy can provide only purity for , and PR+PC achieves purity for .

For the medium-difficulty cases, the purity improvements resulting from ML-Select are prominent. ML-Select outperforms the four existing selection strategies in out of cases for $B_{1-x}$, where $x \in [1, 3]$. For instance, ML-Select achieves a maximum of and a minimum of purity for and , whereas the remaining four methods achieve a minimum of purity and a maximum of purity.

The hard cases present the most challenging decoy ensembles. Even for these challenging decoy sets, ML-Select significantly outperforms the four existing selection strategies in out of test cases ( , , , , and ) for all sizes of basin selections (i.e., $B_{1-x}$, $x \in [1, 3]$). For two other cases ( and ), ML-Select performs better for the top basin for , and for when $x \in [2, 3]$. For instance, for the most difficult test case , ML-Select obtains about purity whereas the four other methods fail to provide a single true positive (0% purity).

Table 3 compares ML-Select with MUFOLD-CL on the easy, medium, and hard test cases. For all cases, the top three clusters are fairly large, which lowers purity. For instance, the smallest of the top clusters (on ) contains of all the decoys in the decoy set of size , . The near-native presence in this decoy set is only . . As a result, despite containing . near-natives, the abundant non-natives populating the top cluster lower its purity. In contrast, ML-Select is more precise; it selects basins of much smaller size that consist mostly of near-natives, resulting in much higher purity.

Figure 4 shows that ML-Select offers reasonably good performance for a variety of test cases, which is not the case with the basin-based strategies. For instance, PR performs quite well for and for $B_1$, but it fails badly for , , and . As a result, one cannot rely on this selection strategy to achieve good purity on a new test case.
Contrarily, ML-Select guarantees reasonably good purity over all the test cases (except for one test case, ).Hence, ML-Select stands out as a more reliable decoy selection strategy than the four existing selection methods.Figure 5 shows that ML-Select performs much better than MUFOLD-CL in terms of the purity metric. However,MUFOLD-CL has been able to provide some near-natives for the medium-difﬁculty protein on which ML-Select7 PREPRINT - O

Table 2: Comparison of the five basin-selection strategies. The top G − x groups of decoys selected from each selection strategy, with x limited to , are analyzed. When analyzing B − x , the top x basins are merged. The analysis lists the metrics (M): percentage of near-native decoys (n); the purity (p), which is the proportion of near-native decoys relative to the size of a group; and the relative size (s, is proportional to |Ω|) of each basin.

M ML-Select Basin-Size Basin-Size+Energy PR PR+PC
B B − B − B B − B − B B − B − B B − B − B B − B −

99% 87.8% 79.3% 99% 89.4% 81.5% 0% 0% 80% 0% 0% 0%
s 0.002% 0.003% 0.004% 0.43% 0.48% 0.5% 0.43% 0.47% 0.5% 0.001% 0.003% 0.02% 0.02% 0.05% 0.08%
1hz6a n 4.6% 4.6% 4.6% 35.9% 35.9% 44.7% 35.9% 35.9% 35.9% 0.02% 0.02% 0.02% 2.2% 2.6% 4.6%
p 99.8%

0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
s 0.002% 0.004% 0.01% 0.07% 0.12% 0.17% 0.04% 0.1% 0.16% 0.002% 0.02% 0.05% 0.02% 0.04% 0.06%
1ail n 1.4% 3.8% 3.8% 0% 0% 0% 0% 0% 0.92% 0% 0% 0% 0% 0% 0.3%
p

0% 0% 0% 0% 0% 3% 0% 0% 0% 0% 0% 1.6%
s 0.01% 0.023% 0.025% 0.14% 0.22% 0.3% 0.05% 0.13% 0.17% 0.001% 0.005% 0.008% 0.034% 0.063% 0.11%
1c8ca n 0.8% 1.0% 1.1% 1.1% 6.2% 8.6% 1.4% 6.5% 7.6% 0.11% 0.11% 0.11% 0.06% 0.11% 1.21%
p

0% 0% 7.6%
s 0.01% 0.02% 0.03% 0.06% 0.12% 0.18% 0.05% 0.1% 0.17% 0.02% 0.02% 0.021% 0.03% 0.07% 0.11%
1fwp n 1.84% 4.5% 4.5% 0% 0% 0% 0% 0% 0% 9.3% 9.3% 9.3% 0% 1.3% 1.3%
p

0% 3.7% 2.4%
s 0.003% 0.008% 0.01% 0.06% 0.12% 0.17% 0.05% 0.1% 0.15% 0.017% 0.019% 0.02% 0.03% 0.05% 0.08%
1sap n 2.63% 2.63% 2.63% 9.3% 14.8% 20.9% 0% 1.5% 10.8% 0% 0% 0% 0.4% 0.8% 1.6%
p 87.8% 71.7% 70.6% 85% 84.6% 88.3% 0% 26.9% 65.4% 0% 0% 0%
s 0.21% 0.25% 0.26% 0.8% 1.2% 1.7% 0.2% 0.4% 1.2% 0.002% 0.003% 0.005% 0.03% 0.07% 0.12%
1hhp n 12.2% 18.3% 24.2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
p

0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
s 0.012% 0.02% 0.03% 0.06% 0.13% 0.19% 0.06% 0.1% 0.16% 0.007% 0.03% 0.08% 0.03% 0.06% 0.08%
2ezk n 1.3% 1.3% 1.3% 0% 0% 0% 1.83% 1.83% 1.83% 0% 0% 0% 0% 0% 0%
p

0% 0% 0% 51.6% 19.8% 14% 0% 0% 0% 0% 0% 0%
s 0.03% 0.045% 0.51% 0.09% 0.16% 0.23% 0.06% 0.15% 0.21% 0.01% 0.02% 0.03% 0.03% 0.07% 0.11%
1aoy n 0.11% 0.23% 0.29% 0.12% 0.12% 0.15% 0.03% 0.2% 0.5% 0% 0% 0.08% 0% 0.1% 0.18%
p

0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
s 0.028% 0.029% 0.034% 0.07% 0.14% 0.2% 0.06% 0.11% 0.17% 0.065% 0.075% 0.077% 0.03% 0.07% 0.09%
1isua n 0.021% 0.043% 0.064% 0.06% 0.13% 0.56% 0.02% 0.11% 0.11% 0% 0% 0% 0.02% 0.17% 0.17%
p

0% 1.6% 3.1%

0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
s 0.035% 0.044% 0.054% 0.056% 0.11% 0.17% 0.051% 0.11% 0.16% 0.01% 0.024% 0.026% 0.025% 0.055% 0.081%

obtains purity. However, MUFOLD-CL's performance in terms of purity is low as well. This is due to the much bigger cluster size and the scarcity of near-natives in the decoy sets.

Table 3: Statistical significance of five methods over eighteen test cases determined through Friedman tests with Hommel's post-hoc analysis at α = 0.05. The best method is marked with an asterisk (*), while the boldface presents the significance of the respective method when compared with the best method.

Top Basins   Method              Average Rank   p-value    p Hommel
B            PR                  3.889          2.101E-6   0.0125
             Basin-Size          3.306          2.76E-4    0.0167
             PR+PC               3.306          2.76E-4    0.025
             Basin-Size+Energy   3.11           0.001      0.05
             ML-Select*
B −          PR                  4.028          5.53E-7    0.0125
             Basin-Size          3.417          1.19E-4    0.0167
             PR+PC               3.139          8.99E-4    0.025
             Basin-Size+Energy   3.028          0.002      0.05
             ML-Select*
B −          PR                  3.833          7.47E-6    0.0125
             Basin-Size          3.444          1.83E-4    0.0167
             PR+PC               3.306          5.04E-4    0.025
             Basin-Size+Energy   2.944          0.005      0.05
             ML-Select*

The first column indicates the number of basins under consideration in the prediction of purity. The second column shows the methods, while the third column presents the average rank calculated from Friedman's test [18], which rejects the null hypothesis. Upon rejection of the null hypothesis, Hommel's post-hoc analysis helps to determine the statistical significance of the new technique (ML-Select) when compared to the existing methods. The fourth and fifth columns show the p-value and Hommel's critical value, respectively. The lowest average rank identifies the best method (ML-Select), which is marked with an asterisk (*). A method is said to be significantly different from the best method if its p-value is less than the corresponding p-Hommel value at α = 0.05; such values are shown in boldface. Overall, ML-Select is the best for all three basin sizes. Therefore, ML-Select significantly outperforms the existing basin-based selection strategies.

dist_thresh on Performance

We varied the dist_thresh parameter in the second phase to monitor any performance deviations in ML-Select. Here we summarize our findings. The improvement in the purity of the selected basins is insignificant when we alter the pre-defined distance threshold, dist_thresh ± τ, where τ ∈ , , . In out of test cases, the purity varied; however, when dist_thresh is increased by , we see only an insignificant improvement. For example, the purity of the top basins for increases from to . when dist_thresh is raised by . For all the remaining test cases, the improvement in the purity is insignificant. Overall, altering the distance threshold by a factor has an insignificant impact on the predicted purity.

The results presented in this paper suggest that the energy landscape probed by a template-free protein structure prediction method can be leveraged for decoy selection and warrants further investigation.
In particular, energy is often ignored in favor of structural similarity in clustering-based decoy selection strategies. The work presented in this paper has demonstrated that energy, when utilized in the context of the energy landscape, can be successfully employed to identify near-native decoys from a decoy ensemble.

Observations on results from clustering-based selection methods show that these methods fail to identify exceptionally good decoys for sparsely distributed decoy ensembles. Since a clear consensus is often not available, as near-native decoys are usually scarce and far away from the rest of the decoys, consensus-based methods such as clustering-based selection struggle to yield good performance on such challenging datasets. As shown in this paper, basins in the energy landscape can improve decoy selection performance. In particular, supervised learning methods applied to basins extracted from an energy landscape can not only provide better decoy selection performance, but also prove resilient against sparsely distributed decoy ensembles.

Specifically, this paper presents a novel decoy selection method, ML-Select, that employs a supervised machine learning method to identify basins comprising mostly near-native decoys. ML-Select utilizes both energy- and graph-based characteristics of basins to successfully select near-native basins even for the challenging datasets consisting of only a few near-natives. Results presented in this paper also show that ML-Select is able to provide good performance for varied test cases irrespective of the difficulty level of the decoy ensemble.

Although ML-Select shows promise in decoy selection in template-free protein structure prediction, further investigation is warranted to address the current limitations. For instance, while ML-Select is able to provide a good-quality basin, this method does not assess the quality of individual decoys in the selected basin. However, the selected basin offers an informative set from which the best decoy(s) can be identified with the help of further ranking and more investigation. Further work will concentrate on utilizing decoy characteristics to incorporate a weighting scheme for identifying the best decoy(s) from a decoy ensemble. The line of inquiry pursued in this paper demonstrates a promising direction for advancing decoy selection research.

We proposed a novel machine learning strategy, ML-Select, for purifying the basins generated from the energy landscapes. Our experimental results indicate the utility of basins in the energy landscape probed by a template-free structure prediction method for automatic decoy selection. The model has been evaluated in terms of purity (which favors fewer false positives and more true positives) and compared against four existing basin-based decoy selection strategies that perform better than a cluster-based selection strategy. We showed that ML-Select performs significantly better than all four basin-based selection strategies. Moreover, the performance of ML-Select is highly reliable, unlike the inconsistent dominance of basin-based methods over the cluster-based method. Finally, we validate the use of machine learning techniques in decoy selection, while suggesting further research in this direction for advancing the state of decoy selection. In the future, we would like to investigate the use of other machine learning strategies and/or heuristics (similar to [13, 14]) that initially predict the difficulty of a protein and use an ensemble of algorithms in predicting the purity of the basins for the respective class of proteins.
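
The significance claim above rests on the rank-based comparison summarized in Table 3. As a rough illustration of how such a comparison against the best-ranked method can be carried out, the Python sketch below converts average Friedman ranks into two-sided p-values with the normal approximation described in [18]. It is a sketch under stated assumptions, not the authors' code: the function name is hypothetical, the ML-Select rank is derived rather than read from the table, and the step-down thresholds α/m, α/(m−1), ..., α are used only because they coincide with the critical values listed in Table 3 (0.0125, 0.0167, 0.025, 0.05); the full Hommel procedure is not implemented.

```python
import math

def posthoc_vs_best(avg_ranks, n_datasets, alpha=0.05):
    """Compare every method with the best-ranked one (normal approximation)."""
    k = len(avg_ranks)
    std_err = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    best = min(avg_ranks, key=avg_ranks.get)
    pvals = []
    for method, rank in avg_ranks.items():
        if method == best:
            continue
        z = abs(rank - avg_ranks[best]) / std_err
        pvals.append((method, math.erfc(z / math.sqrt(2.0))))  # two-sided p-value
    pvals.sort(key=lambda item: item[1])  # smallest p-value first
    m = len(pvals)
    # Assumed step-down thresholds: alpha/m, alpha/(m-1), ..., alpha.
    return best, [(method, p, alpha / (m - i), p < alpha / (m - i))
                  for i, (method, p) in enumerate(pvals)]

# Average ranks from the first block of Table 3 (18 test cases, 5 methods).
ranks = {"PR": 3.889, "Basin-Size": 3.306, "PR+PC": 3.306,
         "Basin-Size+Energy": 3.11}
k = len(ranks) + 1
# Assumption: the ML-Select rank is not legible in the table, so it is derived
# from the fact that the k average ranks must sum to k(k+1)/2.
ranks["ML-Select"] = k * (k + 1) / 2 - sum(ranks.values())

best, rows = posthoc_vs_best(ranks, n_datasets=18)
for method, p, threshold, significant in rows:
    print(f"{method:18s} p={p:.3e} threshold={threshold:.4f} "
          f"significant={significant}")
```

Under these assumptions the computed p-values agree with the first block of Table 3 to the reported precision, which suggests the table follows this normal-approximation scheme; the other two blocks can be checked the same way by swapping in their average ranks.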

ML – Machine Learning
PDB – Protein Data Bank
RMSD – Root Mean Squared Deviation
PR – Pareto Rank
PC – Pareto Count
SBL – Structural Bioinformatics Library

Computations were run on Darwin, a research computing heterogeneous cluster (URL: https://darwin.lanl.gov).

The research was supported by Los Alamos National Laboratory (LANL) LDRD ER grant (20160317ER). Parts of this research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. DE-AC52-06NA25396. This work is also supported in part by the National Science Foundation Grant No. 1900061. This material is additionally based upon work supported by (while serving at) the National Science Foundation. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Publication costs are funded by the National Science Foundation. The funder has no role in the research and writing of the paper.

All software and data are available upon demand.

10 Authors’ contributions

NA drafted the manuscript. NA, GC, DH, and AS revised the manuscript. NA designed and executed the experiments, while GC, DH, and AS supervised the design and analysis of methods. NA implemented the majority of the code, while GC implemented the graph features. NA, GC, DH, and AS conceptualized the methods. All authors provided critical feedback on the manuscript, and read and approved the final manuscript.

11 Ethics approval and consent to participate

Not applicable.

12 Consent to publish

Not applicable.

13 Competing interests

The authors declare that they have no competing interests.

Figure 2: Visualization of selected decoys for three target proteins (indicated by the PDB id of their native structure). Decoys are plotted by their lRMSD from the native structure and their Rosetta REF2015 all-atom energy.

Figure 3: Visualization of decoys selected by ML-Select and MUFOLD-CL for three target proteins (indicated by the PDB id of their native structure). Decoys are plotted by their lRMSD from the native structure and their Rosetta REF2015 all-atom energy.

[Figure 4 plot. Panels: B, B −, B −; y-axis: Purity (p); x-axis: test cases; legend: ML-Select, S, S+E, PR, PR+PC]

Figure 4: Comparison of the five selection strategies ML-Select, Size (S), Size+Energy (S+E), Pareto-Rank (PR), and Pareto-Rank+Count (PR+PC), in terms of the p metric, for the easy, medium, and hard test cases. The top row shows the results for the easy cases, the second row is for the medium cases, and the bottom row shows the results for the hard cases. Metric p, purity, measures the percentage of near-native decoys in the x selected basins while penalizing the basins by the extent of false positive presence. Results are shown for x ∈ { , }.
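
As a minimal sketch of how the purity plotted in Figures 4 and 5 could be computed, the Python snippet below merges the top x basins and returns the n, p, and s metrics named in the Table 2 caption. Several details are assumptions made for illustration and are not taken from the paper: a decoy is treated as near-native when its lRMSD to the native structure is at most dist_thresh, n is read as the share of all near-native decoys in the ensemble that land in the merged group, s is read as the group size divided by |Ω|, and the function name, data layout, and toy numbers are hypothetical.

```python
def group_metrics(basins, x, dist_thresh, ensemble_size, ensemble_near_natives):
    """Return (n, p, s) as fractions for the merged top-x basins.

    basins: ranked list of basins, each a list of decoy lRMSDs
            (hypothetical layout chosen for this sketch).
    """
    # Merge the top x basins into one group of decoys.
    merged = [lrmsd for basin in basins[:x] for lrmsd in basin]
    # Assumed near-native criterion: lRMSD within dist_thresh of the native.
    hits = sum(1 for lrmsd in merged if lrmsd <= dist_thresh)
    n = hits / ensemble_near_natives if ensemble_near_natives else 0.0
    p = hits / len(merged) if merged else 0.0  # purity of the group
    s = len(merged) / ensemble_size            # size relative to |Omega|
    return n, p, s

# Toy data (hypothetical): three ranked basins from an ensemble of 100 decoys,
# 10 of which are near-native at a 3.0 threshold.
toy_basins = [[1.0, 1.5, 6.0], [2.0, 7.0], [8.0, 9.0, 9.5]]
for x in (1, 2, 3):
    n, p, s = group_metrics(toy_basins, x, dist_thresh=3.0,
                            ensemble_size=100, ensemble_near_natives=10)
    print(f"top {x} basin(s): n={n:.1%} p={p:.1%} s={s:.1%}")
```

In the toy data, purity falls as lower-ranked basins are merged in, because the added basins contribute more false positives than near-natives. The same function can be called with a shifted dist_thresh to probe how sensitive the purity is to the threshold.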

[Figure 5 plot. Panels: B, B −, B −; y-axis: Purity (p); x-axis: test cases; legend: ML-Select, MUFOLD-CL]

Figure 5: Comparison of ML-Select and MUFOLD-CL, in terms of the p metric, for the easy, medium, and hard test cases. The top row shows the results for the easy cases, the second row is for the medium cases, and the bottom row shows the results for the hard cases. Metric p, purity, measures the percentage of near-native decoys in the x selected basins while penalizing the basins by the extent of false positive presence. Results are shown for x ∈ { , }.

References

[1] ddfire/dfire2 energy calculation. Accessed on 07.08.2018.
[2] Rw potential. Accessed on 07.05.2018.
[3] N. Akhter, G. Chennupati, K. L. Kabir, H. Djidjev, and A. Shehu. Unsupervised and supervised learning over the energy landscape for protein decoy selection. Biomolecules, 9(10):607, 2019.
[4] N. Akhter and A. Shehu. From extraction of local structures of protein energy landscapes to improved decoy selection in template-free protein structure prediction. Molecules, 23(1):216, 2018.
[5] N. Akhter, R. Vangara, G. Chennupati, B. S. Alexandrov, H. Djidjev, and A. Shehu. Non-negative matrix factorization for selection of near-native protein tertiary structures. In , pages 70–73. IEEE, 2019.
[6] R. F. Alford, A. Leaver-Fay, J. R. Jeliazkov, M. J. O'Meara, F. P. DiMaio, H. Park, M. V. Shapovalov, P. D. Renfrew, V. K. Mulligan, K. Kappel, et al. The Rosetta all-atom energy function for macromolecular modeling and design. Journal of Chemical Theory and Computation, 13(6):3031–3048, 2017.
[7] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. a. Swaminathan, and M. Karplus. CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4(2):187–217, 1983.
[8] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G. Wolynes. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins: Structure, Function, and Bioinformatics, 21(3):167–195, 1995.
[9] S. Burman and V. Mulligan. Scoring tutorial. Accessed on 06.20.2018.
[10] R. Cao, D. Bhattacharya, J. Hou, and J. Cheng. DeepQA: improving the estimation of single protein model quality with deep belief networks. BMC Bioinformatics, 17(1):495, 2016.
[11] F. Cazals and T. Dreyfus. The structural bioinformatics library: modeling in biomolecular science and beyond. Bioinformatics, 33(7):997–1004, 2017.
[12] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[13] G. Chennupati, R. M. A. Azad, and C. Ryan. Performance optimization of multi-core grammatical evolution generated parallel recursive programs. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pages 1007–1014. ACM, 2015.
[14] G. Chennupati, J. Fitzgerald, and C. Ryan. On the efficiency of multi-core grammatical evolution (MCGE) evolving multi-core parallel programs. In , pages 238–243. IEEE, 2014.
[15] G. Chennupati, R. Vangara, E. Skau, H. Djidjev, and B. Alexandrov. Distributed non-negative matrix factorization with determination of the number of latent features. The Journal of Supercomputing, 76(9):7458–7488, 2020.
[16] W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules J. Am. Chem. Soc. 1995, 117, 5179–5197. Journal of the American Chemical Society, 118(9):2309–2309, 1996.
[17] J. DeBartolo, G. Hocky, M. Wilde, J. Xu, K. F. Freed, and T. R. Sosnick. Protein structure prediction enhanced with evolutionary diversity: SPEED. 19(3):520–534, 2010.
[18] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
[19] T. Estrada, R. Armen, and M. Taufer. Automatic selection of near-native protein-ligand conformations using a hierarchical clustering and volunteer computing. In Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, pages 204–213. ACM, 2010.
[20] A. K. Felts, E. Gallicchio, A. Wallqvist, and R. M. Levy. Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the surface generalized Born solvent model. Proteins: Structure, Function, and Bioinformatics, 48(2):404–422, 2002.
[21] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[22] S. Garcia and F. Herrera. An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.
[23] Z. He, M. Alazmi, J. Zhang, and D. Xu. Protein structural model selection by combining consensus and single scoring methods. PLoS ONE, 8(9):e74006, 2013.
[24] J. Hou, T. Wu, R. Cao, and J. Cheng. Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins: Structure, Function, and Bioinformatics, 2019.
[25] D. M. Hurtado, K. Uziela, and A. Elofsson. Deep transfer learning in the assessment of the quality of protein models. arXiv preprint arXiv:1804.06281, 2018.
[26] X. Jing, K. Wang, R. Lu, and Q. Dong. Sorting protein decoys by machine-learning-to-rank. Scientific Reports, 6:31571, 2016.
[27] A. Kryshtafovych, A. Barbato, K. Fidelis, B. Monastyrskyy, T. Schwede, and A. Tramontano. Assessment of the assessment: evaluation of the model quality estimates in CASP10. Proteins: Structure, Function, and Bioinformatics, 82:112–126, 2014.
[28] T. Lazaridis and M. Karplus. Discrimination of the native from misfolded protein models with an energy function including implicit solvation 1. Journal of Molecular Biology, 288(3):477–487, 1999.
[29] A. Leaver-Fay, M. Tyka, S. M. Lewis, O. F. Lange, J. Thompson, R. Jacak, K. W. Kaufman, P. D. Renfrew, C. A. Smith, W. Sheffler, et al. Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. In Methods in Enzymology, volume 487, pages 545–574. Elsevier, 2011.
[30] Y. Li, R. Han, C. Bi, M. Li, S. Wang, and X. Gao. DeepSimulator: a deep simulator for nanopore sequencing. Bioinformatics, 34(17):2899–2908, 2018.
[31] Y. Li, C. Huang, L. Ding, Z. Li, Y. Pan, and X. Gao. Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods, 2019.
[32] Y. Li, S. Wang, R. Umarov, B. Xie, M. Fan, L. Li, and X. Gao. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics, 34(5):760–769, 2017.
[33] S. Lorenzen and Y. Zhang. Identification of near-native structures by clustering protein docking conformations. Proteins: Structure, Function, and Bioinformatics, 68(1):187–194, 2007.
[34] B. Manavalan, J. Lee, and J. Lee. Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms. PLoS ONE, 9(9):e106542, 2014.
[35] T. Maximova, R. Moffatt, B. Ma, R. Nussinov, and A. Shehu. Principles and overview of sampling methods for modeling macromolecular structure and dynamics. PLoS Computational Biology, 12(4):e1004619, 2016.
[36] B. J. McConkey, V. Sobolev, and M. Edelman. Discrimination of native protein structures using atom–atom contact scoring. Proceedings of the National Academy of Sciences, 100(6):3215–3220, 2003.
[37] J. Meiler and D. Baker. Coupled prediction of protein secondary and tertiary structure. Proceedings of the National Academy of Sciences of the United States of America, 100(21):12105–12110, 2003.
[38] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine learning: An artificial intelligence approach. Springer Science & Business Media, 2013.
[39] S. Mirzaei, T. Sidi, C. Keasar, and S. Crivelli. Purely structural protein scoring functions using support vector machine and ensemble learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016.
[40] S. Miyazawa and R. L. Jernigan. An empirical energy potential with a reference state for protein fold and sequence recognition. Proteins: Structure, Function, and Bioinformatics, 36(3):357–369, 1999.
[41] K. Molloy, S. Saleh, and A. Shehu. Probabilistic search and energy guidance for biased decoy sampling in ab-initio protein structure prediction. IEEE/ACM Trans Comput Biol and Bioinf, 10(5):1162–1175, 2013.
[42] J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, and A. Tramontano. Critical assessment of methods of protein structure prediction (CASP) – round X. Proteins: Structure, Function, and Bioinformatics, 82:1–6, 2014.
[43] S. P. Nguyen, Y. Shang, and D. Xu. DL-Pro: A novel deep learning method for protein model quality assessment. In Neural Networks (IJCNN), 2014 International Joint Conference on, pages 2071–2078. IEEE, 2014.
[44] R. Nussinov and P. G. Wolynes. A second molecular biology revolution? The energy landscapes of biomolecular function. Physical Chemistry Chemical Physics, 16(14):6321–6322, 2014.
[45] R. Nussinov and P. G. Wolynes. A second molecular biology revolution? The energy landscapes of biomolecular function. Phys Chem Chem Phys, 16(14):6321–6322, 2014.
[46] B. Olson and A. Shehu. Multi-objective stochastic search for sampling local minima in the protein energy surface. In ACM Conf on Bioinf and Comp Biol (BCB), pages 430–439, Washington, D. C., September 2013.
[47] B. Park and M. Levitt. Energy functions that discriminate X-ray and near-native folds from well-constructed decoys. Journal of Molecular Biology, 258(2):367–392, 1996.
[48] M. Pawlowski, L. Kozlowski, and A. Kloczkowski. MQAPsingle: A quasi single-model approach for estimation of the quality of individual protein structure models. Proteins: Structure, Function, and Bioinformatics, 84(8):1021–1028, 2016.
[49] R. Sato and T. Ishida. Protein model accuracy estimation based on local structure quality assessment using 3D convolutional neural network. PLoS ONE, 14(9):e0221347, 2019.
[50] A. Shehu. A review of evolutionary algorithms for computing functional conformations of protein molecules. In Computer-Aided Drug Discovery, pages 31–64. Springer, 2015.
[51] K. T. Simons, I. Ruczinski, C. Kooperberg, B. A. Fox, C. Bystroff, and D. Baker. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins: Structure, Function, and Bioinformatics, 34(1):82–95, 1999.
[52] K. Uziela and B. Wallner. ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics, 32(9):1411–1413, 2016.
[53] S. Wang, S. Fei, Z. Wang, Y. Li, J. Xu, F. Zhao, and X. Gao. PredMP: a web server for de novo prediction and visualization of membrane proteins. Bioinformatics, 35(4):691–693, 2018.
[54] S. Wang, J. Peng, J. Ma, and J. Xu. Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports, 6:18962, 2016.
[55] S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology, 13(1):e1005324, 2017.
[56] D. Xu and Y. Zhang. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins: Structure, Function, and Bioinformatics, 80(7):1715–1735, 2012.
[57] Y. Yang and Y. Zhou. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins: Structure, Function, and Bioinformatics, 72(2):793–803, 2008.
[58] G. Zhang, L. Ma, X. Wang, and X. Zhou. Secondary structure and contact guided differential evolution for protein structure prediction. IEEE/ACM Trans Comput Biol and Bioinf, 2018. preprint.
[59] G. J. Zhang, G. Zhou, X, X. F. Yu, H. Hao, and L. Yu. Enhancing protein conformational space sampling using distance profile-guided differential evolution. IEEE/ACM Trans Comput Biol and Bioinf, 14(6):1288–1301, 2017.
[60] J. Zhang and D. Xu. Fast algorithm for population-based protein structural model analysis. Proteomics, 13(2):221–229, 2013.
[61] X.-M. Zhao, X. Li, L. Chen, and K. Aihara. Protein classification with imbalanced data. Proteins: Structure, Function, and Bioinformatics, 70(4):1125–1132, 2008.
[62] H. Zhou and J. Skolnick. GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophysical Journal, 101(8):2043–2052, 2011.
[63] H. Zhou and Y. Zhou. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction.