Blind prediction of protein B-factor and flexibility
David Bramer and Guo-Wei Wei∗

Department of Mathematics, Michigan State University, MI 48824, USA
Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA

September 13, 2018
Abstract
The Debye-Waller factor, a measure of X-ray attenuation, can be experimentally observed in protein X-ray crystallography. Previous theoretical models have made strong inroads in the analysis of B-factors by linearly fitting protein B-factors from experimental data. However, the blind prediction of B-factors for unknown proteins is an unsolved problem. This work integrates machine learning and advanced graph theory, namely multiscale weighted colored graphs (MWCGs), to blindly predict B-factors of unknown proteins. MWCGs are local features that measure the intrinsic flexibility due to a protein structure. Global features that connect the B-factors of different proteins, e.g., the resolution of X-ray crystallography, are introduced to enable cross-protein B-factor predictions. Several machine learning approaches, including ensemble methods and deep learning, are considered in the present work. The proposed method is validated with hundreds of thousands of experimental B-factors. Extensive numerical results indicate that the blind B-factor predictions obtained from the present method are more accurate than the least squares fittings using traditional methods.
Keywords:
Weighted colored graph, Protein flexibility, Gradient boosted trees, Random forest, Convolutional neural network.

∗ Address correspondence to Guo-Wei Wei. E-mail: [email protected]

Introduction
The protein beta factor (B-factor), or temperature factor (Debye-Waller factor), is a measure of atomic mean squared displacement or uncertainty in X-ray or neutron scattering structure determination. For a given protein at a given temperature, a large B-factor is caused by atomic thermal fluctuation and a low attenuation rate. The latter also depends on the experimental modality. For example, the hydrogen atom has a low attenuation rate in X-ray scattering because of its small number of electrons, but has a normal attenuation rate in neutron scattering. For a given element type under the same experimental conditions, the B-factor of an atom is determined by its intrinsic flexibility and possible crystal packing effects. It has been previously shown that intrinsic flexibility correlates with important protein conformational variations [1]. That is, protein structural fluctuation provides an important link between the structure and function of a protein. As such, accurate prediction of protein B-factors is an important and meaningful metric for understanding protein structure, flexibility, and function [2].

One successful class of methods in protein B-factor prediction uses elastic mass-and-spring networks derived from Hooke's law. These models represent the alpha carbons of biological macromolecules as a mass-and-spring network and predict B-factors based on a harmonic potential. Each alpha carbon in a protein is regarded as a node in the network, and edges are weighted based on a potential function. In these models, a pair of nodes is connected by an edge if they fall within a predefined Euclidean cutoff distance. This approach captures the local non-covalent interactions between an individual alpha carbon atom and nearby alpha carbon atoms.

Normal mode analysis (NMA) was one of the first mass-and-spring methods used for protein B-factor prediction. This method is independent of time and makes use of a Hamiltonian matrix for atomic interactions.
Here the modes of the system correspond to motions in which all parts of the molecule move sinusoidally with the same frequency and phase. Moreover, the eigenvalues of the system correspond to characteristic frequencies that correlate with protein B-factors. Low-frequency modes correlate with operative motions, which can be useful for hinge detection. NMA has also been found to be useful in characterizing coarse-grained deformation of supramolecular complexes [1, 3-5].

The elastic network model (ENM) was introduced to reduce the computational cost of NMA by using a simplified spring network [6]. One successful ENM is the anisotropic network model (ANM). This model uses a simplified spring potential between each pair of residues, then determines the modes of the system via matrix diagonalization. ANM retains many of the insightful features of NMA but at a much lower computational cost [7-9].

The Gaussian network model (GNM) was introduced as a simplified method for B-factor prediction [8]. Similar to previous models, a graph network is constructed using alpha carbons as nodes, with edges based on a prescribed cutoff distance. GNM uses a distance-based Kirchhoff (or connectivity) matrix to represent the interaction between each pair of alpha carbon atoms (nodes). The expectation values of residue fluctuations, or mean-square fluctuations, are found in the diagonal terms of a covariance matrix. GNM provides good coarse-grained results with relatively low computational cost [10].

More recently, flexibility and rigidity index (FRI) methods have provided improved results. These methods construct graph centrality based on radial basis functions which scale distance non-linearly [11]. Fast FRI (fFRI) provides a version of FRI with a very low computational cost while still maintaining satisfactory results [12]. Anisotropic FRI (aFRI) offers a matrix version of FRI to compute protein anisotropic motions.
Moreover, the multiscale flexibility rigidity index (mFRI) is able to capture protein multiscale interactions using several radial basis functions with different parameterizations [13, 14].

Previously, the authors introduced a multiscale weighted colored graph (MWCG) model for protein flexibility analysis [15]. The MWCG is a geometric graph model that offers the most accurate and reliable protein flexibility analysis and B-factor prediction to date; it is about 40% more accurate than GNM [15]. The basic idea of MWCG is to color (label) a protein graph based on element interaction types. Each atom of a given element type represents a graph vertex, and subgraphs are defined according to specific heavy element types. A generalized centrality is defined for each subgraph vertex. Using various parameterizations of radial basis functions, this method is able to capture multiscale element-specific interactions. The MWCG method can be combined with various earlier FRI approaches, such as fFRI, mFRI, and aFRI, to further strengthen its power in the analysis of intrinsic protein flexibility. Additionally, MWCG works well not only for Cα carbons but also for all the atoms in a protein, i.e., non-Cα carbon, nitrogen, oxygen, and sulfur atoms. Hydrogen atoms can be treated similarly if they are available in the dataset [15].

All of the aforementioned methods are designed for the analysis of intrinsic protein flexibility due to protein structure and crystal packing. However, none was designed to predict the B-factors of an unknown protein. Indeed, all of these methods fit the experimental B-factors of a given protein by the least squares algorithm. They generally do a poor job of predicting flexibility across proteins. Stated differently, the fitting coefficients obtained from one protein are not applicable to a different protein in general.
This is largely due to the fact that protein B-factors also depend on a large number of effects, including X-ray crystal quality, crystal symmetry (i.e., space group), data collection method, data collection environment, equipment condition, etc. Consequently, the blind prediction of protein flexibility and B-factors remains a major challenge.

Recently, advances in graphics processing unit (GPU) computing and optimization have led to impressive biophysical predictions for various problems using machine learning, particularly deep learning techniques. In this work, we propose machine learning based methods for blind protein B-factor prediction. We introduce two sets of features: global ones and local ones. Global features are designed to represent crystal and experimental conditions across different proteins, while local features are devoted to describing structural and atomic properties within a protein structure. We compile and engineer local and global features from a large set of known protein data as a training set, then apply machine learning techniques to establish regression models which are used for the blind prediction of B-factors of unknown protein structures. In terms of machine learning procedures, we use a variety of local and global protein features of a labeled training set to construct regression models that can blindly predict the B-factors of a test set consisting of entirely new proteins. In this work, we explore random forest, gradient boosted decision tree, and deep learning methods for blind protein B-factor prediction. Using a large and diverse set of proteins from the Protein Data Bank ensures technical robustness. In addition to previously explored features such as MWCG kernels and element type, we also include secondary structural information and local packing density features to further improve our results.

The success of blind protein B-factor predictions depends crucially on the representation of biomolecular structures.
We employ MWCGs as local features to describe protein structures. A brief review of MWCGs is given below.
Graph theory concerns the relationship of a set of vertices, denoted V, in terms of pairwise connectivity, i.e., edges E. We use a graph to describe the non-covalent interactions in proteins. To improve our graph theory representation, we consider colored graphs in which different types of elements are labeled. We classify labeled protein atoms into subgraphs where colored edges correspond to element-specific interactions. Specifically, we label the j-th atom by its element type \alpha_j and position \mathbf{r}_j. As such, vertices are labeled as

V = \{ (\mathbf{r}_j, \alpha_j) \mid \mathbf{r}_j \in \mathbb{R}^3;\ \alpha_j \in \mathcal{C};\ j = 1, 2, \ldots, N \},

where \mathcal{C} = \{C, N, O, S\} is the set of elements whose pairwise interactions will be considered. Hydrogen is omitted from this list due to its absence from most PDB data and can be added without affecting the present description. The set of edges in the colored protein graph is built from the element-specific pairs

\mathcal{P} = \{CC, CN, CO, CS, NC, NN, NO, NS, OC, ON, OO, OS, SC, SN, SO, SS\}.

For example, the subset \mathcal{P}_k = \{CO\} contains all directed CO pairs in the protein such that the first atom is a carbon and the second one is an oxygen. The direction is maintained because the edge set E is a set of weighted and directed interaction kernels of various pairs of atoms,

E = \{ \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) \mid (\alpha_i \alpha_j) \in \mathcal{P}_k;\ k = 1, 2, \ldots, 16;\ i, j = 1, 2, \ldots, N \},   (1)

where \|\mathbf{r}_i - \mathbf{r}_j\| is the Euclidean distance between the i-th and j-th atoms, \eta_{ij} is a characteristic distance between the atoms, and (\alpha_i \alpha_j) is a directed pair of element types. Here \Phi_k is a correlation function and is chosen to have the following properties [12]:

\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) \to 1, as \|\mathbf{r}_i - \mathbf{r}_j\| \to 0, (\alpha_i \alpha_j) \in \mathcal{P}_k,   (2)
\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) \to 0, as \|\mathbf{r}_i - \mathbf{r}_j\| \to \infty, (\alpha_i \alpha_j) \in \mathcal{P}_k.   (3)

Our previous work [12] has shown that generalized exponential functions,

\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) = e^{-(\|\mathbf{r}_i - \mathbf{r}_j\|/\eta_{ij})^\kappa}, (\alpha_i \alpha_j) \in \mathcal{P}_k;\ \kappa > 0,   (4)

and generalized Lorentz functions,

\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) = \frac{1}{1 + (\|\mathbf{r}_i - \mathbf{r}_j\|/\eta_{ij})^\nu}, (\alpha_i \alpha_j) \in \mathcal{P}_k;\ \nu > 0,   (5)

are good choices which satisfy these assumptions.

The centrality metric used in this work is an extension of harmonic centrality to subgraphs with weighted edges defined by the generalized correlation functions,

\mu_i^k = \sum_{j=1}^{N} w_{ij} \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}), (\alpha_i \alpha_j) \in \mathcal{P}_k, \forall i = 1, 2, \ldots, N,   (6)

where w_{ij} is a weight function related to the element type. The WCG centrality in Eq. (6) describes the atomic-specific rigidity, which measures the stiffness at the i-th atom due to the k-th set of contact atoms.

To characterize protein multiscale interactions, we use the atomic-specific rigidity index from multiscale weighted colored graphs (MWCGs) introduced in our previous work [15]. The atomic rigidity of the i-th atom at the n-th scale due to the k-th set of interaction atoms is defined as

\mu_i^{k,n} = \sum_{j=1}^{N} w_{ij}^{n} \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}^{n}), (\alpha_i \alpha_j) \in \mathcal{P}_k,   (7)

where \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}^{n}) is a correlation kernel, \eta_{ij}^{n} a scale parameter, and w_{ij}^{n} an atomic-type-dependent parameter. We set w_{ij}^{n} = 1 in the present work.

While sulfur atoms play an important role in proteins, they are so sparse that their kernels have a negligible effect on the current model. Therefore, it is convenient to consider a subset of \mathcal{P} in practical computations,

\hat{\mathcal{P}} = \{CC, CN, CO, NC, NN, NO, OC, ON, OO\}.   (8)

We chose only the C, N, and O element types due to their high occurrence frequency and important biological relevance.

The Protein Data Bank (PDB) .pdb files provide the spatial atomic coordinates and the B-factor of each atom in a protein, as well as a variety of other observed data that can be used as features.
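As a minimal NumPy sketch of Eqs. (4)-(7) with w_ij = 1: the kernel limits of Eqs. (2)-(3) are easy to check numerically, and the rigidity index is a kernel-weighted sum over one element-pair subgraph. The coordinates below are hypothetical, and the default η and ν values are taken from Table 1's Lorentz (n = 1) row.

```python
import numpy as np

def exponential_kernel(r, eta, kappa=1.0):
    """Generalized exponential correlation function, Eq. (4)."""
    return np.exp(-(r / eta) ** kappa)

def lorentz_kernel(r, eta, nu=3.0):
    """Generalized Lorentz correlation function, Eq. (5)."""
    return 1.0 / (1.0 + (r / eta) ** nu)

def mwcg_rigidity(coords_i, coords_j, eta, kernel):
    """Rigidity index mu_i of Eq. (7) for one atom i against one set of
    contact atoms j (an element-pair subgraph P_k), with w_ij = 1."""
    dists = np.linalg.norm(coords_j - coords_i, axis=1)  # ||r_i - r_j||
    return kernel(dists, eta).sum()

# Toy example: one carbon with three nearby oxygens (hypothetical coordinates, Å)
ci = np.array([0.0, 0.0, 0.0])
cj = np.array([[1.5, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 16.0]])
mu_lorentz = mwcg_rigidity(ci, cj, eta=16.0, kernel=lorentz_kernel)
```

Each contact contributes a value in (0, 1], so the rigidity index grows with the number and proximity of neighbors, as Eq. (6) intends.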
In addition to the use of PDB spatial coordinates, this work makes use of global features provided in PDB files, such as the R-value, resolution, and number of heavy atoms. R-value and resolution are global measures of the quality of the atomic model obtained from crystallographic data. Another global feature we consider is the overall protein size. To allow the models to distinguish proteins of different sizes, we use one-hot encoding with ten size ranges, the smallest boundary being 500 heavy atoms. A protein's size feature takes on 1 for the size range containing its number of heavy atoms (carbon, nitrogen, or oxygen) and 0 for the remaining sizes. For example, a protein with 1700 heavy atoms has a 1 in the size range containing 1700 and 0 elsewhere. A frequency distribution of the size categories is provided in Figure 1. In total, there are 12 global protein features.
Figure 1: Frequency of the number of heavy elements within the proteins from the 364-protein dataset.
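The size encoding can be sketched as follows; the threshold values below are assumptions, since only the first range boundary (500) survives in the text above.

```python
import numpy as np

# Hypothetical size-range boundaries: the paper uses ten size ranges whose
# smallest boundary is 500 heavy atoms; the remaining values are assumed.
THRESHOLDS = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]

def size_one_hot(n_heavy, thresholds=THRESHOLDS):
    """One-hot protein-size feature: 1 in the first size range whose upper
    bound is >= the heavy-atom count, 0 elsewhere."""
    vec = np.zeros(len(thresholds), dtype=int)
    for i, t in enumerate(thresholds):
        if n_heavy <= t:
            vec[i] = 1
            break  # proteins larger than every bound get an all-zero vector
    return vec

vec = size_one_hot(1700)  # falls in the fourth range under these thresholds
```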
PDB files also contain amino acid information for each atom. Using one-hot encoding, we include amino acid information for each heavy atom, which results in 20 amino acid features. Similarly, we one-hot encode the four heavy element types (carbon, nitrogen, oxygen, and sulfur) for each atom, resulting in 4 additional features.

We use the MWCG rigidity index described in Section 2.1 to create feature vectors for the carbon, nitrogen, and oxygen interactions of each atom. Moreover, to capture multiscale interactions, we use three different kernel choices for each interaction type. This results in a total of 9 MWCG feature vectors. The parametrization of the kernels is chosen based on our previous work [15] and is provided in Table 1.

Kernel type          κ    η_n   ν
Lorentz (n = 1)      -    16    3
Lorentz (n = 2)      -    2     1
Exponential (n = 3)  1    31    -

Table 1:
Parameters used for correlation kernels in a parameter-free MWCG, based on previous results [15].

The packing density of the i-th atom is defined as

p_i^d = N_d / N,

where d is the given cutoff in angstroms, N_d is the number of atoms within the Euclidean cutoff distance of the i-th atom, and N is the total number of heavy atoms of the protein. The short-, medium-, and long-range packing density cutoffs used in this work are provided in Table 2.

Short        Medium             Long
d < d_1      d_1 ≤ d < d_2      d_2 ≤ d

Table 2:
Packing density parameters in distance d (Å).

We include secondary structural information generated using the STRIDE software. STRIDE provides secondary structural information about a protein given its atomic coordinates in a PDB file. STRIDE designates each atom as belonging to a helix (alpha helix, 3-10 helix, pi-helix), extended conformation, isolated bridge, turn, or coil. Additionally, STRIDE provides the φ and ψ angles and the residue solvent accessible area [16]. Taken together, this provides 12 secondary structural features.

Using the MWCG method, we apply Lorentz and exponential radial basis functions to construct multiscale images for each atom of a protein. To capture a large variety of scales, we construct multiscale kernels for each heavy atom of a protein using various values of κ, ν, and η. This results in 2D MWCG images F_i^k of dimension (8, 30), with entries

F_i^k = [ f_i^k(l, m, n) ],   (9)

where f_i^k(l, m, n) is the flexibility index of the i-th atom for the k-th interaction type (C, N, or O), l indexes the eight η values, m indexes the fifteen κ or ν values, and n indexes the type of radial basis function. Values of n = 1 and n = 2 correspond to exponential and Lorentz radial basis functions, respectively; the fifteen exponential (κ) columns and fifteen Lorentz (ν) columns together form the thirty columns of each image.

A grid search was implemented for each method to determine the hyperparameters provided in Sections 2.3.1, 2.3.2, and 2.3.3.

2.3.1 Random forest
Random forests are ensemble methods that can be used for either classification or regression tasks. Since the protein B-factor is a continuous quantity, B-factor prediction is a regression task. Random forests use a forest of n decision trees, and in the regression task the prediction output is the mean prediction of all the trees. Random forests have the added benefit of avoiding overfitting, are invariant to feature scaling, and can rank the importance of the features used in the model. Random forests are very robust for small- and medium-sized data sets.

Increasing the number of trees n generally improves the predictive power of a random forest model, but if n is too large the model is susceptible to overfitting the data set. In this work, we tested a variety of values for n to find a balance between performance and cost.

2.3.2 Gradient boosted trees

Gradient boosting is another ensemble method that iteratively assembles a number of so-called weak "learners" into a prediction model. Gradient boosting is a gradient descent method that optimizes an arbitrary differentiable loss function to minimize the residuals at each step. Gradient boosted trees incorporate decision trees at each step of the gradient boosting to improve its predictive power. Gradient boosted trees are advantageous because they can handle heterogeneous features, have strong predictive power, and are generally robust to outliers.

The gradient boosted tree method has several hyperparameters that can be tuned. In this work we optimize the hyperparameters using the standard practice of a grid search. The parameters used for testing are provided in Table 3. Any hyperparameters not listed were taken to be the default values provided by the Python scikit-learn package.

Parameter          Setting
Loss function      Quantile
Alpha              0.95
Estimators         1000
Learning rate      0.001
Max depth          4
Min samples leaf   9
Min samples split  9
Table 3:
Boosted gradient tree parameters used for testing. Parameters were determined using a grid search. Any hyperparameters not listed were taken to be the default values provided by the Python scikit-learn package.
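The Table 3 configuration can be sketched with scikit-learn as follows; the feature matrix below is synthetic stand-in data, not the protein features of Section 2.2, and the random forest tree count is one of the values one might try in the grid search.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in for the real feature matrix (atoms x features) and
# experimental B-factor labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Table 3 settings; anything not listed keeps the scikit-learn default.
gbt = GradientBoostingRegressor(
    loss="quantile", alpha=0.95, n_estimators=1000,
    learning_rate=0.001, max_depth=4,
    min_samples_leaf=9, min_samples_split=9,
).fit(X, y)

# Random forest counterpart; n_estimators is tuned by grid search in the paper.
rf = RandomForestRegressor(n_estimators=100).fit(X, y)

pred = gbt.predict(X[:5])
```

Both estimators expose `feature_importances_`, which is the basis for the feature ranking reported in the results section.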
2.3.3 Convolutional neural network

Neural networks were originally intended to model the way neurons function in the brain. In a neural network, a batch of signals or feature inputs is passed through activation functions in perceptrons, the functional units of the network. The weights of the network are then trained using a loss function over several epochs. Each epoch passes the training data set through the network, updating the weights according to the loss function. A neural network is considered deep when it has several "hidden" layers of perceptrons.

Convolutional neural networks (CNNs) have recently succeeded in classifying images. CNNs can extract features from images using convolutions with a pre-defined filter size. CNNs are advantageous because they can provide similar results without training the network on the full data set. In practice, one can extract high-level features by using several convolutions. In this work, we explore using a heat map of rigidity indices given by the three-channel MWCG image features. We then merge the CNN output into a neural network that contains additional global and local protein features. A diagram of the CNN architecture is given in Figure 2.

Figure 2:
The deep learning architecture using a convolutional neural network combined with a deep neural network. The plus symbol represents the concatenation of data sets.
The input of the CNN is a three-channel MWCG image of dimension (8, 30, 3). The model applies two convolutional layers with 2x2 filters, followed by a dropout layer at 0.5. This is followed by a dense layer, which is flattened and then joined with the other global and local features into a dense layer of 59 neurons, followed by a dropout layer of 0.5, another dense layer of 100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons, and finally a dense layer of 1 neuron. This results in a total of 21,584 trainable parameters for our network.

The convolutional neural network (CNN) has several hyperparameters that can be tuned. In this work we optimize the hyperparameters using the standard practice of a grid search. The parameters used for testing are provided in Table 4. Any hyperparameters not listed were taken to be the default values provided by the Python Keras package.

Parameter      Setting
Learning rate  0.001
Epochs         100
Batch size     100
Loss           Mean absolute error
Optimizer      Adam
Table 4:
Convolutional neural network (CNN) parameters used for testing. Parameters were determined using a grid search. Any hyperparameters not listed were taken to be the default values provided by the Python Keras package.
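The architecture of Figure 2 can be sketched in Keras. The convolution filter counts, the width of the pre-flatten dense layer, and the 48-feature width of the global/local input branch are not stated above and are assumptions here; only the stated dimensions (input (8, 30, 3), dense layers of 59, 100, 10, and 1 neurons, dropout rates, Adam with learning rate 0.001, and mean absolute error loss) come from the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

# MWCG image branch: an (8, 30, 3) rigidity image per atom (C, N, O channels).
img_in = keras.Input(shape=(8, 30, 3))
x = layers.Conv2D(4, (2, 2), activation="relu")(img_in)  # filter count assumed
x = layers.Conv2D(4, (2, 2), activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(10, activation="relu")(x)               # width assumed
x = layers.Flatten()(x)

# Remaining global/local features (amino acid, element type, size, STRIDE, ...);
# the feature count here is a placeholder.
feat_in = keras.Input(shape=(48,))

merged = layers.Concatenate()([x, feat_in])
h = layers.Dense(59, activation="relu")(merged)
h = layers.Dropout(0.5)(h)
h = layers.Dense(100, activation="relu")(h)
h = layers.Dropout(0.25)(h)
h = layers.Dense(10, activation="relu")(h)
out = layers.Dense(1)(h)  # predicted B-factor

model = keras.Model([img_in, feat_in], out)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="mean_absolute_error")
```

Because the filter counts are assumed, this sketch will not reproduce the 21,584 trainable parameters quoted above exactly; it only illustrates the two-branch merge of Figure 2.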
The RF, GBT, and CNN models were all trained and tested in the same manner. For each protein, a machine learning model is built using the entire dataset, excluding the data from the protein whose B-factors are to be predicted. Overall, there are more than 620,000 atoms in our dataset, so for each protein this provides a training set of roughly 600,000 data points (i.e., atoms). For each heavy atom, there is a set of features as described in Section 2.2 and a B-factor value (label). The features and the labels in the training set are used to train each machine learning model. Since we perform leave-one-out predictions, the data from each protein is taken as a test set when its B-factors are to be blindly predicted.

We implement the random forest and gradient boosted models using the scikit-learn Python package. For the CNN model, we also use the Python package Keras with TensorFlow as a backend.
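This leave-one-protein-out protocol can be expressed with scikit-learn's LeaveOneGroupOut, grouping atoms by their parent protein; the data and PDB IDs below are toy placeholders for the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: three "proteins" with ten atoms each; groups[i] labels the
# protein that atom i belongs to (hypothetical PDB IDs).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
groups = np.repeat(["1ABC", "2XYZ", "3DEF"], 10)

logo = LeaveOneGroupOut()
predictions = {}
for train_idx, test_idx in logo.split(X, y, groups):
    # Train on every atom except those of the held-out protein ...
    model = RandomForestRegressor(n_estimators=10).fit(X[train_idx], y[train_idx])
    # ... then blindly predict the held-out protein's B-factors.
    held_out = groups[test_idx][0]
    predictions[held_out] = model.predict(X[test_idx])
```

Each protein thus contributes nothing to the model that predicts it, which is what makes the prediction "blind".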
Our study uses two datasets, one from Park, Jernigan, and Wu [17] and the other from Refs. [12, 13]. The first contains three subsets of small, medium, and large proteins [17], and the latter contains 364 proteins [12, 13]; the latter dataset is an extended version of the first. All structures have a resolution of 3 Å or better, with an average resolution of 1.3 Å, and the sets include proteins that range from 4 to 3912 residues [17].

For the CNN, the feature datasets were standardized to mean 0 and variance 1. Proteins 1OB4, 1OB7, 2OLX, and 3MD5 are excluded from the dataset because the STRIDE software is unable to provide features for them. We exclude protein 1AGN due to known problems with this protein's data. Proteins 1NKO, 2OCT, and 3FVA are also excluded because they have residues with B-factors reported as zero, which is unphysical.
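The standardization step can be sketched with scikit-learn's StandardScaler. Fitting the scaling statistics on the training split only, then reusing them on the test protein, is our assumption of standard practice; the text above does not specify how the split was handled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features for a training set and a held-out protein.
rng = np.random.default_rng(2)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

scaler = StandardScaler().fit(X_train)   # statistics from training data only
X_train_std = scaler.transform(X_train)  # mean 0, variance 1 per feature
X_test_std = scaler.transform(X_test)    # same shift/scale, no leakage
```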
We successfully executed a leave-one-(protein)-out method to blindly predict the B-factors of all carbon, nitrogen, and oxygen atoms present in a given protein. For comparison with other existing methods, we also list results for predicted Cα B-factors, which are predicted in the same way as the other heavy atoms. Machine learning was used to train a B-factor prediction model using the structural and B-factor data from a training set as described in Sections 2.3.4 and 2.4. The model was then used to predict the B-factors of all heavy atoms in a given protein using only its structural data.

To quantitatively assess our method for B-factor prediction, we used the Pearson correlation coefficient,

PCC = \frac{\sum_{i=1}^{N} (B_i^e - \bar{B}^e)(B_i^t - \bar{B}^t)}{\left[ \sum_{i=1}^{N} (B_i^e - \bar{B}^e)^2 \sum_{i=1}^{N} (B_i^t - \bar{B}^t)^2 \right]^{1/2}},   (10)

where B_i^t, i = 1, 2, \ldots, N are the predicted B-factors from the proposed method and B_i^e, i = 1, 2, \ldots, N are the experimental B-factors from the PDB file. The terms B_i^t and B_i^e represent the i-th theoretical and experimental B-factors, respectively. Here \bar{B}^e and \bar{B}^t are the averaged B-factors.

Computational efficiency in B-factor prediction is an important consideration for large proteins. Table 5 lists the running times of GNM, RF, GBT, and CNN in our Python implementations. These results are depicted in Figure 3. The proteins used to evaluate the computational complexity were the same as those used by Opron et al. [12]. For this comparison we only predict B-factors for Cα atoms. Several proteins were excluded because GNM requires a prohibitively long CPU time to run. The timings exclude loading PDB files and feature data. The machine learning timings also exclude model training, since a trained model can be used for the prediction of all proteins.
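The evaluation metric of Eq. (10) can be implemented directly in NumPy; the B-factor arrays below are toy values, not data from the paper.

```python
import numpy as np

def pearson_cc(b_exp, b_pred):
    """Pearson correlation coefficient of Eq. (10) between experimental
    and predicted (theoretical) B-factors."""
    de = b_exp - b_exp.mean()    # B_i^e - mean(B^e)
    dt = b_pred - b_pred.mean()  # B_i^t - mean(B^t)
    return (de * dt).sum() / np.sqrt((de ** 2).sum() * (dt ** 2).sum())

b_exp = np.array([12.1, 15.3, 9.8, 20.4, 17.6])   # toy experimental B-factors
b_pred = np.array([11.0, 16.2, 10.5, 19.0, 18.1])  # toy predictions
pcc = pearson_cc(b_exp, b_pred)
```

The result agrees with `np.corrcoef`, since Eq. (10) is the standard sample correlation.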
The results show that GNM has a computational complexity of roughly O(N³) due to the matrix decomposition, while the machine learning algorithms are close to O(N), with N being the number of atoms. Power-law fits of CPU time t versus N confirm these scalings for GNM, RF, GBT, and CNN.

Figure 3: CPU efficiency comparison between the GNM [12], RF, GBT, and CNN algorithms: execution time in seconds (s) versus the number of residues. A set of 34 proteins, listed in Table 5, was used to evaluate the computational complexity.
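These scalings can be sanity-checked by fitting the slope of log t versus log N; the (N, t) pairs below are taken from a few rows of Table 5.

```python
import numpy as np

# (residues, CPU seconds) pairs taken from Table 5.
gnm = np.array([[125, 0.141], [448, 6.178], [1044, 75.801], [3912, 6365.668]])
gbt = np.array([[125, 0.000358], [448, 0.000878],
                [1044, 0.001814], [3912, 0.006198]])

def scaling_exponent(data):
    """Slope of log(t) vs log(N), i.e., the p in t ~ N**p."""
    slope, _ = np.polyfit(np.log(data[:, 0]), np.log(data[:, 1]), 1)
    return slope

gnm_exp = scaling_exponent(gnm)   # near 3: matrix diagonalization dominates
gbt_exp = scaling_exponent(gbt)   # near 1: prediction is close to linear in N
```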
PDB   N     GNM [12]   RF        GBT       CNN
3P6J  125   0.141      0.000455  0.000358  0.130
3R87  132   0.156      0.000464  0.000339  0.138
3KBE  140   0.187      0.000505  0.000384  0.149
1TZV  141   0.203      0.000473  0.000365  0.163
2VY8  149   0.219      0.000486  0.000359  0.156
3ZIT  152   0.234      0.000519  0.000365  0.148
2FG1  157   0.265      0.000518  0.000403  0.174
2X3M  166   0.312      0.000526  0.000382  0.182
3LAA  169   0.327      0.000514  0.000405  0.155
3M8J  178   0.375      0.000548  0.000412  0.178
2GZQ  191   0.468      0.000647  0.000454  0.195
4G7X  194   0.499      0.000631  0.000445  0.209
2J9W  200   0.546      0.000554  0.000424  0.208
3TUA  210   0.655      0.000602  0.000472  0.217
1U9C  221   0.733      0.000592  0.000486  0.198
3ZRX  221   0.718      0.000654  0.000515  0.216
3K6Y  227   0.765      0.000619  0.000490  0.189
3OQY  234   0.873      0.000619  0.000502  0.211
2J32  244   0.967      0.000625  0.000556  0.225
3M3P  249   1.029      0.000621  0.000525  0.220
1U7I  267   1.263      0.000647  0.000551  0.237
4B9G  292   1.669      0.000693  0.000574  0.256
4ERY  318   2.122      0.000775  0.000619  0.289
3MGN  348   2.902      0.000655  0.000552  0.267
2ZU1  360   3.136      0.000816  0.000675  0.337
2Q52  412   4.696      0.000900  0.000750  0.369
4F01  448   6.178      0.001016  0.000878  0.401
3DRF  547   11.154     0.001131  0.001033  0.512
3UR8  637   17.409     0.001307  0.001136  0.583
2AH1  939   61.012     0.001716  0.001605  0.800
1GCO  1044  75.801     0.001936  0.001814  0.905
1F8R  1932  654.127    0.003343  0.003163  1.745
1H6V  2927  2085.842   0.005205  0.004739  2.543
1QKI  3912  6365.668   0.006261  0.006198  3.560
Table 5:
CPU execution times, in seconds, from the efficiency comparison between GNM [12], RF, GBT, and CNN.

3.3 Machine learning performance

The results in Table 6 show that for the blind prediction of all heavy atoms, the convolutional neural network method performs best, with an overall average Pearson correlation coefficient of 0.69. The gradient boosted and random forest ensemble methods performed similarly, with Pearson correlation coefficients of 0.63 and 0.59, respectively. For comparison, Table 6 also lists the average Pearson correlation coefficients for Cα-only B-factor predictions, which are obtained in the same manner as for the other heavy atoms. These results can be compared with those of the parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM), and normal mode analysis (NMA), which, however, were obtained via least squares fitting of each protein.

Results for all-heavy-atom B-factor predictions for the small-, medium-, and large-sized protein data subsets [17] are given in Tables 7, 8, and 9. Table 10 shows the results for all-heavy-atom B-factor predictions of each protein in the Superset. The average Pearson correlation coefficients for the data subsets are provided in Table 6. All methods perform similarly on the different protein data subsets, with the convolutional neural network method performing best on the Superset for both all-heavy-atom and Cα-only B-factor predictions.

Prediction of Cα only
Protein set  RF    GBT   CNN   pfFRI [12]  GNM [12]  NMA [12]
Small        0.25  0.39  0.53  0.60        0.54      0.48
Medium       0.47  0.59  0.55  0.61        0.55      0.48
Large        0.50  0.57  0.62  0.59        0.53      0.49
Superset     0.49  0.57  0.66  0.63        0.57      NA

Prediction of all heavy atoms
Protein set  RF    GBT   CNN   pfFRI [12]  GNM [12]  NMA [12]
Small        0.44  0.49  0.56  NA          NA        NA
Medium       0.59  0.64  0.62  NA          NA        NA
Large        0.62  0.65  0.68  NA          NA        NA
Superset     0.59  0.63  0.69  NA          NA        NA
Table 6:
Average Pearson correlation coefficients (PCCs) for both all-heavy-atom and Cα-only B-factor predictions for the small-, medium-, and large-sized protein sets, along with the entire Superset of the 364-protein dataset. Predictions of the random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) are obtained by leave-one-protein-out (blind) prediction, while predictions of the parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM), and normal mode analysis (NMA) were obtained via least squares fitting of individual proteins. All machine learning models use all-heavy-atom information for training.

Our blind prediction result using the convolutional neural network is notable because it improves upon the best result in previous work, the single-protein parameter-free FRI (pfFRI) linear fitting of 0.63 [12]. It is noted that blind predictions are much more difficult than linear fittings. The result for single-protein GNM linear fittings of the same data set is 0.57 [12]. As reported in Table 10, no single method outperforms the others for every protein in the data set.
In terms of the average Pearson correlation coefficient for all-heavy-atom B-factor prediction, the convolutional neural network method outperforms the gradient boosted tree and random forest methods by 10% and 17%, respectively.

PDB ID  N    RF     GBT    CNN
1AIE    235  0.62   0.53   0.60
1AKG    108  0.41   0.51   0.70
1BX7    345  0.55   0.67   0.63
1ETL    76   0.27   0.03   0.48
1ETM    80   0.46   0.13   0.48
1ETN    77   0.33   0.25   0.20
1FF4    477  0.55   0.59   0.76
1GK7    321  0.53   0.73   0.72
1GVD    401  0.66   0.69   0.71
1HJE    73   -0.07  0.46   0.37
1KYC    138  0.43   0.30   0.32
1NOT    96   -0.18  0.81   0.63
1O06    142  0.51   0.64   0.65
1P9I    203  0.73   0.77   0.77
1PEF    153  0.60   0.64   0.76
1PEN    109  0.34   0.24   0.21
1Q9B    303  0.41   0.67   0.75
1RJU    257  0.71   0.75   0.73
1U06    432  0.55   0.68   0.61
1UOY    452  0.55   0.56   0.55
1USE    290  0.25   0.50   0.68
1VRZ    66   0.38   -0.17  0.09
1XY2    62   0.16   0.27   0.55
1YJO    55   0.36   0.12   0.02
1YZM    361  0.51   0.60   0.56
2DSX    386  0.36   0.44   0.56
2JKU    229  0.57   0.63   0.35
2NLS    269  0.45   0.49   0.70
2OL9    51   0.65   0.51   0.84
6RXN    345  0.56   0.71   0.82
Table 7:
Pearson correlation coefficients for cross-protein blind heavy-atom B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the small-sized protein set. Reported results use heavy atoms in both training and prediction.
Table 8:
Pearson correlation coefficients for cross-protein blind heavy-atom B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the medium-sized protein set. Reported results use heavy atoms in both training and prediction.
Table 9: Pearson correlation coefficients for cross-protein heavy atom blind B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the large-sized protein set. Results reported use heavy atoms in both training and prediction.
Table 10: Pearson correlation coefficients for cross-protein heavy atom blind B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the superset. Results reported use heavy atoms in both training and prediction.
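For reference, the Pearson correlation coefficient reported in these tables compares predicted and experimental B-factors atom by atom; in the standard notation used in earlier FRI fitting work [12], it is

```latex
\mathrm{PCC} =
\frac{\sum_{i=1}^{N}\bigl(B_i^{\mathrm{t}}-\bar{B}^{\mathrm{t}}\bigr)
      \bigl(B_i^{\mathrm{e}}-\bar{B}^{\mathrm{e}}\bigr)}
     {\Bigl[\sum_{i=1}^{N}\bigl(B_i^{\mathrm{t}}-\bar{B}^{\mathrm{t}}\bigr)^{2}
      \sum_{i=1}^{N}\bigl(B_i^{\mathrm{e}}-\bar{B}^{\mathrm{e}}\bigr)^{2}\Bigr]^{1/2}},
```

where $B_i^{\mathrm{t}}$ and $B_i^{\mathrm{e}}$ are the predicted and experimental B-factors of the $i$-th atom and bars denote averages over the $N$ atoms considered.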
Some low Pearson correlation coefficients indicate poor predictions by a given model. However, for almost every protein where one model performs poorly, another model performs satisfactorily. When the maximum correlation coefficient among the three methods is taken for each protein, the average all-heavy-atom correlation coefficient increases to 0.73 and the average Cα-only correlation coefficient increases to 0.72. This result is similar to that of the parameter-optimized FRI (opFRI) reported in our earlier work [12].

Both the random forest and gradient boosted tree methods can rank relative feature importance, which helps us understand the significant features in the model. Figure 4 shows the individual feature importance for the random forest averaged over the data set.

Figure 4: Individual feature importance for the random forest model averaged over the data set. Reported feature selection includes the use of heavy atoms in the model.
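The per-protein maximum over the three methods, used to obtain the 0.73 average, can be computed directly from the tables. The rows below are three entries copied from Table 7; the full computation simply runs over every protein.

```python
# Best-of-three PCC per protein; the rows are sample entries from Table 7.
pcc = {
    "1ETL": {"RF": 0.27, "GBT": 0.03, "CNN": 0.48},
    "1NOT": {"RF": -0.18, "GBT": 0.81, "CNN": 0.63},
    "2OL9": {"RF": 0.65, "GBT": 0.51, "CNN": 0.84},
}
best = {pdb: max(scores.values()) for pdb, scores in pcc.items()}
avg_best = sum(best.values()) / len(best)  # average of the per-protein maxima
```

Note how 1NOT, where the random forest fails outright (PCC of -0.18), is rescued by the gradient boosted tree (0.81), which is exactly the complementarity described above.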
We also include aggregated feature importance in Figure 5. In this figure, we sum the importance of the individual angle, secondary structure, MWCG, atom type, protein size, amino acid, and packing density features.
Figure 5: Average feature importance for the random forest model with the angle, secondary structure, MWCG, atom type, protein size, amino acid, and packing density features aggregated. Reported feature selection includes the use of heavy atoms in the model.
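The aggregation used for Figure 5 amounts to summing the importances of related individual features. In the sketch below, the feature names and importance values are hypothetical, chosen only to illustrate the grouping.

```python
# Group individual random-forest importances into aggregate categories
# (feature names and values are hypothetical illustrations).
importances = {
    "mwcg_cc": 0.08, "mwcg_co": 0.05, "mwcg_cn": 0.04,
    "packing_short": 0.02, "packing_medium": 0.04, "packing_long": 0.02,
    "resolution": 0.12, "r_value": 0.05,
}
groups = {
    "MWCG": ["mwcg_cc", "mwcg_co", "mwcg_cn"],
    "packing density": ["packing_short", "packing_medium", "packing_long"],
    "global": ["resolution", "r_value"],
}
# Each group's importance is the sum over its member features.
aggregated = {g: sum(importances[f] for f in feats) for g, feats in groups.items()}
```

Because the aggregation is a plain sum, the total importance is conserved; only its attribution changes from individual features to feature families.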
Figure 4 shows that the most important MWCG feature is the carbon-carbon interaction. This MWCG feature uses a Lorentz radial basis function with η = 16 and ν = 3, as detailed in Section 2.2. The remaining eight MWCG features all rank similarly, with the carbon-oxygen interaction ranked as the second most significant MWCG feature. This result validates that the model benefits from the multiscale property of the MWCG feature, which uses three different kernels to capture interactions at various length scales. Since all MWCG features have significance in the feature ranking, it follows that the element-specific property of the MWCG method is also a meaningful model feature.

Figure 4 also shows that the individual MWCG, amino acid type, and packing density features have low relative importance; however, considering their aggregate importance in Figure 5, we see that they contribute to the model. Figure 5 shows that the medium packing density feature was twice as important to the model as the short and long packing density features. The medium packing density may capture semi-local side chain interactions, which are important in protein flexibility. The short packing density likely captures only adjacent backbone information, while the long packing density adds only weak atomic interaction information to the model. Protein resolution is the most significant feature overall, followed by the MWCG features and the STRIDE-generated residue solvent accessible area feature. This highlights the importance of the quality of X-ray crystal structures and the difficulty of cross-protein B-factor prediction. Protein angles, secondary structure, and size play a less significant role in the model compared to the other features. Atom type has the lowest significance relative to the other features implemented in the model. Not surprisingly, we see that global features such as resolution and R-value are important components of the ensemble models.
The global feature of protein size plays a small role in the model.

Care must be taken when using feature ranking to understand feature importance. The ranking provided by these models is a relative ordering of the features that the models find most important, so highly correlated features may appear redundant, giving one of them a lower rank even though it may have significant predictive power. For example, R-value correlates strongly with resolution, so it is likely a meaningful feature; however, the use of resolution reduces the relative importance ranking of R-value in the model.

Among the three methods considered in this work, the convolutional neural network outperforms the gradient boosted tree and random forest by 10% and 17%, respectively. As reported in Table 10, no single machine learning method outperforms the others for every protein. Results for all machine learning methods could undoubtedly be improved by refining features, exploring new features, and further tuning hyperparameters.

In general, ensemble methods do not require as much parameter tuning as the CNN does. The random forest is the simplest and most robust method. To balance cost, time, and quality, only 500 trees were used for the random forest and 1000 trees for the gradient boosted method in this work. This may account for the increased performance of the gradient boosted tree method compared to the random forest. Ensemble methods are quite robust against overfitting, so adding more features would likely improve their results [18]. The gradient boosted trees use several hyperparameters, so these methods could benefit from further tuning.

The additional data, in the form of the MWCG images used in the convolutional neural network, likely explains its improved performance compared to the ensemble methods. More refined images and other novel image types could further improve results. Using the dropout strategy, CNNs are also robust against overfitting.
Since the CNN method involves a number of hyperparameters, it would likely benefit from more detailed parameter tuning. Additionally, a larger data set and more features would also improve the CNN performance. For example, including persistent homology [19] and differential geometry features might lead to a better CNN prediction.
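As a concrete illustration of the dropout strategy mentioned above, here is a minimal numpy sketch of inverted dropout; the rate and tensor shapes are illustrative and are not the paper's CNN settings.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero out a fraction `rate` of activations during
    training and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep  # keep each unit with probability `keep`
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones((4, 10))
y_train = dropout(x, rate=0.5, rng=rng, training=True)   # some units zeroed, rest rescaled
y_eval = dropout(x, rate=0.5, rng=rng, training=False)   # identity at evaluation time
```

The random masking prevents units from co-adapting during training, which is why dropout acts as a regularizer, while the rescaling by `1/keep` means no correction is needed at evaluation time.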
Conclusion

Protein flexibility is known to correlate strongly with protein function, and its prediction is important for our understanding of protein dynamics and transport. Our quantitative understanding of protein flexibility and function is greatly impeded by structural complexity and a large number of degrees of freedom. Many time-independent methods exist, such as NMA [4, 20, 5, 3], ENM [6], GNM [8, 9, 21], and FRI [11–13, 22], that dramatically simplify protein structural complexity and are able to analyze protein B-factors, which reflect protein flexibility among other things. Based on the hypothesis that the intrinsic physics lies in a low-dimensional space embedded in a high-dimensional data space, we introduced multiscale weighted colored graphs (MWCGs) to effectively reduce protein structural complexity and efficiently describe protein flexibility. However, none of the aforementioned methods is able to blindly predict the B-factors of an unknown protein. This work integrates advanced machine learning algorithms with two sets of features, i.e., global and local ones, to blindly predict protein flexibility and B-factors.

A few standard data sets involving more than 300 proteins (more than 600,000 B-factors) have been utilized to test the proposed method. We use the leave-one-protein-out scheme to blindly predict protein B-factors of both all heavy atoms and Cα atoms only. Extensive numerical experiments demonstrate that the present blind prediction is more accurate than least squares fitting using GNM or NMA in terms of Pearson correlation coefficients for Cα B-factor prediction. Further, we demonstrate the ability to effectively and blindly predict the B-factors of any heavy atom in a given protein.

Three standard machine learning algorithms, namely random forests, gradient boosted trees, and convolutional neural networks, are employed in the present study. Among them, convolutional neural networks perform best in B-factor prediction.
A variety of features were considered for these models, including local, semi-local, and global features. Local features, such as MWCGs, are designed to capture structural properties associated with intrinsic flexibility, while global features, such as X-ray crystal resolution, are used to enable cross-protein comparison and analysis. The proposed method is very efficient. However, there is still much room for novel and interesting features in future work. For example, many algebraic topology tools have been found very useful for protein analysis [23, 18, 19] and will likely pair well with machine learning approaches for protein flexibility prediction.

This work is a first step in using recent advances in machine learning techniques to blindly predict protein B-factors. To the authors' knowledge, this is the first work demonstrating that this is a feasible and robust prediction approach. It provides clear evidence that machine learning algorithms are useful in protein flexibility analysis. Results for all methods could undoubtedly be improved by a better mathematical description of intrinsic flexibility, larger data sets, and more advanced machine learning algorithms.

The proposed methods could be applied to a variety of interesting problems related to protein flexibility and function, including hinge detection, hot spot identification, allosteric site detection, pose prediction, protein folding, and computer-aided drug design.
Acknowledgment
This work was supported in part by NSF grants DMS-1721024 and DMS-1761320 and NIH grant 1R01GM126189-01A1.

References

[1] J. P. Ma, "Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes," Structure, vol. 13, pp. 373–380, 2005.

[2] H. Frauenfelder, S. G. Sligar, and P. G. Wolynes, "The energy landscapes and motions of proteins," Science, vol. 254, pp. 1598–1603, 1991.

[3] M. Tasumi, H. Takeuchi, S. Ataka, A. M. Dwivedi, and S. Krimm, "Normal vibrations of proteins: Glucagon," Biopolymers, vol. 21, pp. 711–714, 1982.

[4] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. States, S. Swaminathan, and M. Karplus, "CHARMM: A program for macromolecular energy, minimization, and dynamics calculations," J. Comput. Chem., vol. 4, pp. 187–217, 1983.

[5] M. Levitt, C. Sander, and P. S. Stern, "Protein normal-mode dynamics: Trypsin inhibitor, crambin, ribonuclease and lysozyme," J. Mol. Biol., vol. 181, no. 3, pp. 423–447, 1985.

[6] M. M. Tirion, "Large amplitude elastic motions in proteins from a single-parameter, atomic analysis," Phys. Rev. Lett., vol. 77, pp. 1905–1908, 1996.

[7] A. R. Atilgan, S. R. Durrell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar, "Anisotropy of fluctuation dynamics of proteins with an elastic network model," Biophys. J., vol. 80, pp. 505–515, 2001.

[8] I. Bahar, A. R. Atilgan, and B. Erman, "Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential," Folding and Design, vol. 2, pp. 173–181, 1997.

[9] I. Bahar, A. R. Atilgan, M. C. Demirel, and B. Erman, "Vibrational dynamics of proteins: Significance of slow and fast modes in relation to function and stability," Phys. Rev. Lett., vol. 80, pp. 2733–2736, 1998.

[10] T. Haliloglu, I. Bahar, and B. Erman, "Gaussian dynamics of folded proteins," Phys. Rev. Lett., vol. 79, no. 16, p. 3090, 1997.

[11] K. L. Xia and G. W. Wei, "A stochastic model for protein flexibility analysis," Phys. Rev. E, vol. 88, p. 062709, 2013.

[12] K. Opron, K. L. Xia, and G. W. Wei, "Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis," J. Chem. Phys., vol. 140, p. 234105, 2014.

[13] K. Opron, K. L. Xia, and G. W. Wei, "Communication: Capturing protein multiscale thermal fluctuations," J. Chem. Phys., vol. 142, p. 211101, 2015.

[14] D. D. Nguyen, K. L. Xia, and G. W. Wei, "Generalized flexibility-rigidity index," J. Chem. Phys., vol. 144, p. 234106, 2016.

[15] D. Bramer and G.-W. Wei, "Multiscale weighted colored graphs for protein flexibility and rigidity analysis," J. Chem. Phys., vol. 148, no. 5, p. 054103, 2018.

[16] M. Heinig and D. Frishman, "STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins," Nucleic Acids Res., vol. 32, no. suppl. 2, pp. W500–W502, 2004.

[17] J. K. Park, R. Jernigan, and Z. Wu, "Coarse grained normal mode analysis vs. refined Gaussian network model for protein residue-level structural fluctuations," Bull. Math. Biol., vol. 75, pp. 124–160, 2013.

[18] Z. X. Cang and G. W. Wei, "Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology," Bioinformatics, vol. 33, pp. 3549–3557, 2017.

[19] Z. X. Cang and G. W. Wei, "Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction," Int. J. Numer. Methods Biomed. Eng., vol. 34, p. e2914, 2018.

[20] N. Go, T. Noguti, and T. Nishikawa, "Dynamics of a small globular protein in terms of low-frequency vibrational modes," Proc. Natl. Acad. Sci., vol. 80, pp. 3696–3700, 1983.

[21] B. Brooks and M. Karplus, "Harmonic dynamics of proteins: normal modes and fluctuations in bovine pancreatic trypsin inhibitor," Proc. Natl. Acad. Sci., vol. 80, no. 21, pp. 6571–6575, 1983.

[22] K. Opron, K. L. Xia, Z. Burton, and G. W. Wei, "Flexibility-rigidity index for protein-nucleic acid flexibility and fluctuation analysis," J. Comput. Chem., vol. 37, pp. 1283–1295, 2016.

[23] K. L. Xia and G. W. Wei, "Persistent homology analysis of protein structure, flexibility and folding,"