Blind prediction of protein B-factor and flexibility
David Bramer and Guo-Wei Wei∗

Department of Mathematics, Michigan State University, MI 48824, USA
Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA

September 13, 2018
Abstract
The Debye-Waller factor, a measure of X-ray attenuation, can be experimentally observed in protein X-ray crystallography. Previous theoretical models have made strong inroads in the analysis of B-factors by linearly fitting protein B-factors from experimental data. However, the blind prediction of B-factors for unknown proteins is an unsolved problem. This work integrates machine learning and advanced graph theory, namely multiscale weighted colored graphs (MWCGs), to blindly predict B-factors of unknown proteins. MWCGs are local features that measure the intrinsic flexibility due to a protein structure. Global features that connect the B-factors of different proteins, e.g., the resolution of X-ray crystallography, are introduced to enable cross-protein B-factor predictions. Several machine learning approaches, including ensemble methods and deep learning, are considered in the present work. The proposed method is validated with hundreds of thousands of experimental B-factors. Extensive numerical results indicate that the blind B-factor predictions obtained from the present method are more accurate than the least squares fittings using traditional methods.
Keywords:
Weighted colored graph, Protein flexibility, Gradient boosted trees, Random forest, Convolutional neural network.

∗ Address correspondence to Guo-Wei Wei. E-mail: [email protected]

Introduction
The protein beta factor (B-factor), or temperature factor (Debye-Waller factor), is a measure of atomic mean squared displacement or uncertainty in X-ray or neutron scattering structure determination. For a given protein at a given temperature, a large B-factor is caused by atomic thermal fluctuation and a low attenuation rate. The latter also depends on the experimental modality. For example, the hydrogen atom has a low attenuation rate in X-ray scattering because of its small number of electrons, but has a normal attenuation rate in neutron scattering. For a given element type under the same experimental conditions, the B-factor of an atom is determined by its intrinsic flexibility and possible crystal packing effects. It has been previously shown that intrinsic flexibility correlates with important protein conformational variations [1]. That is, protein structural fluctuation provides an important link between the structure and function of a protein. As such, accurate prediction of protein B-factors is an important and meaningful metric for understanding protein structure, flexibility, and function [2].

One successful class of methods in protein B-factor prediction uses elastic mass-and-spring networks derived from Hooke's law. These models represent the alpha carbons of biological macromolecules as a mass-and-spring network and predict B-factors based on a harmonic potential. Each alpha carbon in a protein is regarded as a node in the network, and edges are weighted based on a potential function. In these models, a pair of nodes is connected by an edge if they fall within a predefined Euclidean cutoff distance. This approach captures the local non-covalent interactions between an individual alpha carbon atom and nearby alpha carbon atoms.

Normal mode analysis (NMA) was one of the first mass-and-spring methods used for protein B-factor prediction. This method is independent of time and makes use of a Hamiltonian matrix for atomic interactions.
Here the modes of the system correspond to motions in which all parts of the molecule move sinusoidally with the same frequency and phase. Moreover, the eigenvalues of the system correspond to characteristic frequencies that correlate with protein B-factors. Low-frequency modes correlate with operative motions, which can be useful for hinge detection. NMA has also been found to be useful in characterizing coarse-grained deformation of supramolecular complexes [1, 3-5].

The elastic network model (ENM) was introduced to reduce the computational cost of NMA by using a simplified spring network [6]. One successful ENM is the anisotropic network model (ANM). This model uses a simplified spring potential between each pair of residues, then determines the modes of the system via matrix diagonalization. ANM retains many of the insightful features of NMA but at a much lower computational cost [7-9].

The Gaussian network model (GNM) was introduced as a simplified method for B-factor prediction [8]. Similar to previous models, a graph network is constructed using alpha carbons as nodes, with edges based on a prescribed cutoff distance. GNM uses a distance-based Kirchhoff (or connectivity) matrix to represent the interaction between each pair of alpha carbon atoms (nodes). The expectation values of residue fluctuations, or mean-square fluctuations, are found in the diagonal terms of a covariance matrix. GNM provides good coarse-grained results with relatively low computational cost [10].

More recently, flexibility and rigidity index (FRI) methods have provided improved results. These methods construct graph centrality based on radial basis functions which scale distance non-linearly [11]. Fast FRI (fFRI) provides a version of FRI with a very low computational cost while still maintaining satisfactory results [12]. Anisotropic FRI (aFRI) offers a matrix version of FRI to compute protein anisotropic motions.
Moreover, the multiscale flexibility rigidity index (mFRI) is able to capture protein multiscale interactions using several radial basis functions with different parameterizations [13, 14].

Previously, the authors introduced a multiscale weighted colored graph (MWCG) model for protein flexibility analysis [15]. The MWCG is a geometric graph model that offers the most accurate and reliable protein flexibility analysis and B-factor prediction to date; it is about 40% more accurate than GNM [15]. The basic idea of MWCG is to color (label) a protein graph based on element interaction types. Each atom of a given element type represents a graph vertex, and subgraphs are defined according to specific heavy element types. A generalized centrality is defined for each subgraph vertex. Using various parameterizations of radial basis functions, this method is able to capture multiscale element-specific interactions. The MWCG method can be combined with various earlier FRI approaches, such as fFRI, mFRI, and aFRI, to further strengthen its power in the analysis of intrinsic protein flexibility. Additionally, MWCG works well not only for Cα carbons but also for all the atoms in a protein, i.e., non-Cα carbon, nitrogen, oxygen, and sulfur atoms. Hydrogen atoms can be treated similarly if they are available in the dataset [15].

All of the aforementioned methods are designed for the analysis of intrinsic protein flexibility due to protein structure and crystal packing. However, none was designed to predict the B-factors of an unknown protein. Indeed, all of these methods fit the experimental B-factors of a given protein by the least squares algorithm. They generally do a poor job of predicting flexibility across proteins. Stated differently, the fitting coefficients obtained from one protein are not applicable to a different protein in general.
This is largely due to the fact that protein B-factors also depend on a large number of effects, including X-ray crystal quality, crystal symmetry (i.e., space group), data collection method, data collection environment, equipment condition, etc. Consequently, the blind prediction of protein flexibility and B-factors remains a major challenge.

Recently, advances in graphics processing unit (GPU) computing and optimization have led to impressive biophysical predictions for various problems using machine learning, particularly deep learning techniques. In this work, we propose machine learning based methods for blind protein B-factor prediction. We introduce two sets of features: global ones and local ones. Global features are designed to represent crystal and experimental conditions across different proteins, while local features are devoted to describing structural and atomic properties within a protein structure. We compile and engineer local and global features from a large set of known protein data as a training set, then apply machine learning techniques to establish regression models which are used for the blind prediction of B-factors of unknown protein structures. In terms of machine learning procedures, we use a variety of local and global protein features of a labeled training set to construct regression models that can blindly predict the B-factors of a test set consisting of entirely new proteins. In this work, we explore random forest, gradient boosted decision tree, and deep learning methods for blind protein B-factor prediction. Using a large and diverse set of proteins from the Protein Data Bank ensures technical robustness. In addition to previously explored features such as MWCG kernels and element type, we also include secondary structural information and local packing density features to further improve our results.

The success of blind protein B-factor predictions depends crucially on the representation of biomolecular structures.
We employ MWCGs as local features to describe protein structures. A brief review of MWCGs is given below.
Graph theory concerns the relationship of a set of vertices, denoted V, in terms of pairwise connectivity, i.e., edges E. We use a graph to describe the non-covalent interactions in proteins. To improve our graph theory representation, we consider colored graphs in which different types of elements are labeled. We classify labeled protein atoms into subgraphs where colored edges correspond to element-specific interactions. Specifically, we label the j-th atom by its element type \alpha_j and position \mathbf{r}_j. As such, vertices are labeled as

V = \{ (\mathbf{r}_j, \alpha_j) \mid \mathbf{r}_j \in \mathbb{R}^3;\ \alpha_j \in \mathcal{C};\ j = 1, 2, \ldots, N \},

where \mathcal{C} = \{C, N, O, S\} is the set of elements whose pairwise interactions will be considered. Hydrogen is omitted from this list due to its absence from most PDB data and can be added without affecting the present description. The set of edges in the colored protein graph is built from the element-specific pairs

\mathcal{P} = \{CC, CN, CO, CS, NC, NN, NO, NS, OC, ON, OO, OS, SC, SN, SO, SS\}.

For example, the subset \mathcal{P}_k = \{CO\} contains all directed CO pairs in the protein such that the first atom is a carbon and the second one is an oxygen. The direction is maintained because the edge set E is a set of weighted and directed interaction kernels of various pairs of atoms,

E = \{ \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) \mid (\alpha_i \alpha_j) \in \mathcal{P}_k;\ k = 1, 2, \ldots, 16;\ i, j = 1, 2, \ldots, N \},   (1)

where \|\mathbf{r}_i - \mathbf{r}_j\| is the Euclidean distance between the i-th and j-th atoms, \eta_{ij} is a characteristic distance between the atoms, and (\alpha_i \alpha_j) is a directed pair of element types. Here \Phi_k is a correlation function and is chosen to have the following properties [12]:

\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) \to 1, as \|\mathbf{r}_i - \mathbf{r}_j\| \to 0, (\alpha_i \alpha_j) \in \mathcal{P}_k,   (2)
\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) \to 0, as \|\mathbf{r}_i - \mathbf{r}_j\| \to \infty, (\alpha_i \alpha_j) \in \mathcal{P}_k.   (3)

Our previous work [12] has shown that generalized exponential functions,

\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) = e^{-(\|\mathbf{r}_i - \mathbf{r}_j\|/\eta_{ij})^\kappa}, (\alpha_i \alpha_j) \in \mathcal{P}_k;\ \kappa > 0,   (4)

and generalized Lorentz functions,

\Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}) = \frac{1}{1 + (\|\mathbf{r}_i - \mathbf{r}_j\|/\eta_{ij})^\nu}, (\alpha_i \alpha_j) \in \mathcal{P}_k;\ \nu > 0,   (5)

are good choices which satisfy these assumptions.

The centrality metric used in this work is an extension of harmonic centrality to subgraphs with weighted edges defined by the generalized correlation functions,

\mu_i^k = \sum_{j=1}^{N} w_{ij} \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}), (\alpha_i \alpha_j) \in \mathcal{P}_k, \forall i = 1, 2, \ldots, N,   (6)

where w_{ij} is a weight function related to the element type. The WCG centrality in Eq. (6) describes the atomic-specific rigidity, which measures the stiffness at the i-th atom due to the k-th set of contact atoms.

To characterize protein multiscale interactions, we use the atomic-specific rigidity index from multiscale weighted colored graphs (MWCGs) introduced in our previous work [15]. The atomic rigidity of the i-th atom at the n-th scale due to the k-th set of interaction atoms is defined as

\mu_i^{k,n} = \sum_{j=1}^{N} w_{ij}^{n} \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}^{n}), (\alpha_i \alpha_j) \in \mathcal{P}_k,   (7)

where \Phi_k(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{ij}^{n}) is a correlation kernel, \eta_{ij}^{n} a scale parameter, and w_{ij}^{n} an atomic-type-dependent parameter. We set w_{ij}^{n} = 1 in the present work.

While sulfur atoms play an important role in proteins, they are so sparse that their kernels have a negligible effect on the current model. Therefore, it is convenient to consider a subset of \mathcal{P} in practical computations,

\hat{\mathcal{P}} = \{CC, CN, CO, NC, NN, NO, OC, ON, OO\}.   (8)

We chose only the C, N, and O element types due to their high occurrence frequency and important biological relevance.

The Protein Data Bank (PDB) .pdb files provide the spatial atomic coordinates and the B-factor of each atom in a protein, as well as a variety of other observed data that can be used as features.
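As a minimal NumPy sketch of Eqs. (4)-(7) with w_ij = 1: the kernel limits of Eqs. (2)-(3) are easy to check numerically, and the rigidity index is a kernel-weighted sum over one element-pair subgraph. The coordinates below are hypothetical, and the default η and ν values are taken from Table 1's Lorentz (n = 1) row.

```python
import numpy as np

def exponential_kernel(r, eta, kappa=1.0):
    """Generalized exponential correlation function, Eq. (4)."""
    return np.exp(-(r / eta) ** kappa)

def lorentz_kernel(r, eta, nu=3.0):
    """Generalized Lorentz correlation function, Eq. (5)."""
    return 1.0 / (1.0 + (r / eta) ** nu)

def mwcg_rigidity(coords_i, coords_j, eta, kernel):
    """Rigidity index mu_i of Eq. (7) for one atom i against one set of
    contact atoms j (an element-pair subgraph P_k), with w_ij = 1."""
    dists = np.linalg.norm(coords_j - coords_i, axis=1)  # ||r_i - r_j||
    return kernel(dists, eta).sum()

# Toy example: one carbon with three nearby oxygens (hypothetical coordinates, Å)
ci = np.array([0.0, 0.0, 0.0])
cj = np.array([[1.5, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 16.0]])
mu_lorentz = mwcg_rigidity(ci, cj, eta=16.0, kernel=lorentz_kernel)
```

Each contact contributes a value in (0, 1], so the rigidity index grows with the number and proximity of neighbors, as Eq. (6) intends.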
In addition to the use of PDB spatial coordinates, this work makes use of global features provided in PDB files, such as the R-value, resolution, and number of heavy atoms. R-value and resolution are global measures of the quality of the atomic model obtained from crystallographic data. Another global feature we consider is the overall protein size. To allow the models to distinguish proteins of different sizes, we use one-hot encoding with ten size ranges, the smallest boundary being 500 heavy atoms. A protein's size feature takes on 1 for the size range containing its number of heavy atoms (carbon, nitrogen, or oxygen) and 0 for the remaining sizes. For example, a protein with 1700 heavy atoms has a 1 in the size range containing 1700 and 0 elsewhere. A frequency distribution of the size categories is provided in Figure 1. In total, there are 12 global protein features.
Figure 1: Frequency of the number of heavy elements within the proteins from the 364-protein dataset.
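The size encoding can be sketched as follows; the threshold values below are assumptions, since only the first range boundary (500) survives in the text above.

```python
import numpy as np

# Hypothetical size-range boundaries: the paper uses ten size ranges whose
# smallest boundary is 500 heavy atoms; the remaining values are assumed.
THRESHOLDS = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]

def size_one_hot(n_heavy, thresholds=THRESHOLDS):
    """One-hot protein-size feature: 1 in the first size range whose upper
    bound is >= the heavy-atom count, 0 elsewhere."""
    vec = np.zeros(len(thresholds), dtype=int)
    for i, t in enumerate(thresholds):
        if n_heavy <= t:
            vec[i] = 1
            break  # proteins larger than every bound get an all-zero vector
    return vec

vec = size_one_hot(1700)  # falls in the fourth range under these thresholds
```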
PDB files also contain amino acid information for each atom. Using one-hot encoding, we include amino acid information for each heavy atom, which results in 20 amino acid features. Similarly, we one-hot encode the four heavy element types (carbon, nitrogen, oxygen, and sulfur) for each atom, resulting in 4 additional features.

We use the MWCG rigidity index described in Section 2.1 to create feature vectors for the carbon, nitrogen, and oxygen interactions of each atom. Moreover, to capture multiscale interactions, we use three different kernel choices for each interaction type. This results in a total of 9 MWCG feature vectors. The parametrization of the kernels is chosen based on our previous work [15] and is provided in Table 1.

Kernel type          κ    η_n   ν
Lorentz (n = 1)      -    16    3
Lorentz (n = 2)      -    2     1
Exponential (n = 3)  1    31    -

Table 1:
Parameters used for correlation kernels in a parameter-free MWCG, based on previous results [15].

The packing density of the i-th atom is defined as

p_i^d = N_d / N,

where d is the given cutoff in angstroms, N_d is the number of atoms within the Euclidean cutoff distance of the i-th atom, and N is the total number of heavy atoms of the protein. The short-, medium-, and long-range packing density cutoffs used in this work are provided in Table 2.

Short        Medium             Long
d < d_1      d_1 ≤ d < d_2      d_2 ≤ d

Table 2:
Packing density parameters in distance d (Å).

We include secondary structural information generated using the STRIDE software. STRIDE provides secondary structural information about a protein given its atomic coordinates in a PDB file. STRIDE designates each atom as belonging to a helix (alpha helix, 3-10 helix, pi-helix), extended conformation, isolated bridge, turn, or coil. Additionally, STRIDE provides the φ and ψ angles and the residue solvent accessible area [16]. Taken together, this provides 12 secondary structural features.

Using the MWCG method, we apply Lorentz and exponential radial basis functions to construct multiscale images for each atom of a protein. To capture a large variety of scales, we construct multiscale kernels for each heavy atom of a protein using various values of κ, ν, and η. This results in 2D MWCG images F_i^k of dimension (8, 30), with entries

F_i^k = [ f_i^k(l, m, n) ],   (9)

where f_i^k(l, m, n) is the flexibility index of the i-th atom for the k-th interaction type (C, N, or O), l indexes the eight η values, m indexes the fifteen κ or ν values, and n indexes the type of radial basis function. Values of n = 1 and n = 2 correspond to exponential and Lorentz radial basis functions, respectively; the fifteen exponential (κ) columns and fifteen Lorentz (ν) columns together form the thirty columns of each image.

A grid search was implemented for each method to determine the hyperparameters provided in Sections 2.3.1, 2.3.2, and 2.3.3.

2.3.1 Random forest
Random forests are ensemble methods that can be used for either classification or regression tasks. Since the protein B-factor is a continuous quantity, B-factor prediction is a regression task. Random forests use a forest of n decision trees, and in the regression task the prediction output is the mean prediction of all the trees. Random forests have the added benefit of avoiding overfitting, are invariant to feature scaling, and can rank the importance of the features used in the model. Random forests are very robust for small- and medium-sized data sets.

Increasing the number of trees n generally improves the predictive power of a random forest model, but if n is too large the model is susceptible to overfitting the data set. In this work, we tested a variety of values for n to find a balance between performance and cost.

2.3.2 Gradient boosted trees

Gradient boosting is another ensemble method that iteratively assembles a number of so-called weak "learners" into a prediction model. Gradient boosting is a gradient descent method that optimizes an arbitrary differentiable loss function to minimize the residuals at each step. Gradient boosted trees incorporate decision trees at each step of the gradient boosting to improve its predictive power. Gradient boosted trees are advantageous because they can handle heterogeneous features, have strong predictive power, and are generally robust to outliers.

The gradient boosted tree method has several hyperparameters that can be tuned. In this work we optimize the hyperparameters using the standard practice of a grid search. The parameters used for testing are provided in Table 3. Any hyperparameters not listed were taken to be the default values provided by the Python scikit-learn package.

Parameter          Setting
Loss function      Quantile
Alpha              0.95
Estimators         1000
Learning rate      0.001
Max depth          4
Min samples leaf   9
Min samples split  9
Table 3:
Boosted gradient tree parameters used for testing. Parameters were determined using a grid search. Any hyperparameters not listed were taken to be the default values provided by the Python scikit-learn package.
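The Table 3 configuration can be sketched with scikit-learn as follows; the feature matrix below is synthetic stand-in data, not the protein features of Section 2.2, and the random forest tree count is one of the values one might try in the grid search.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in for the real feature matrix (atoms x features) and
# experimental B-factor labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Table 3 settings; anything not listed keeps the scikit-learn default.
gbt = GradientBoostingRegressor(
    loss="quantile", alpha=0.95, n_estimators=1000,
    learning_rate=0.001, max_depth=4,
    min_samples_leaf=9, min_samples_split=9,
).fit(X, y)

# Random forest counterpart; n_estimators is tuned by grid search in the paper.
rf = RandomForestRegressor(n_estimators=100).fit(X, y)

pred = gbt.predict(X[:5])
```

Both estimators expose `feature_importances_`, which is the basis for the feature ranking reported in the results section.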
2.3.3 Convolutional neural network

Neural networks were originally intended to model the way neurons function in the brain. In a neural network, a batch of signals or feature inputs is passed through activation functions in perceptrons, the functional units of the network. The weights of the network are then trained using a loss function over several epochs. Each epoch passes the training data set through the network, updating the weights according to the loss function. A neural network is considered deep when it has several "hidden" layers of perceptrons.

Convolutional neural networks (CNNs) have recently succeeded in classifying images. CNNs can extract features from images using convolutions with a pre-defined filter size. CNNs are advantageous because they can provide similar results without training the network on the full data set. In practice, one can extract high-level features by using several convolutions. In this work, we explore using a heat map of rigidity indices given by the three-channel MWCG image features. We then merge the CNN output into a neural network that contains additional global and local protein features. A diagram of the CNN architecture is given in Figure 2.

Figure 2:
The deep learning architecture using a convolutional neural network combined with a deep neural network. The plus symbol represents the concatenation of data sets.
The input of the CNN is a three-channel MWCG image of dimension (8, 30, 3). The model applies two convolutional layers with 2x2 filters, followed by a dropout layer at 0.5. This is followed by a dense layer, which is flattened and then joined with the other global and local features into a dense layer of 59 neurons, followed by a dropout layer of 0.5, another dense layer of 100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons, and finally a dense layer of 1 neuron. This results in a total of 21,584 trainable parameters for our network.

The convolutional neural network (CNN) has several hyperparameters that can be tuned. In this work we optimize the hyperparameters using the standard practice of a grid search. The parameters used for testing are provided in Table 4. Any hyperparameters not listed were taken to be the default values provided by the Python Keras package.

Parameter      Setting
Learning rate  0.001
Epochs         100
Batch size     100
Loss           Mean absolute error
Optimizer      Adam
Table 4:
Convolutional neural network (CNN) parameters used for testing. Parameters were determined using a grid search. Any hyperparameters not listed were taken to be the default values provided by the Python Keras package.
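The architecture of Figure 2 can be sketched in Keras. The convolution filter counts, the width of the pre-flatten dense layer, and the 48-feature width of the global/local input branch are not stated above and are assumptions here; only the stated dimensions (input (8, 30, 3), dense layers of 59, 100, 10, and 1 neurons, dropout rates, Adam with learning rate 0.001, and mean absolute error loss) come from the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

# MWCG image branch: an (8, 30, 3) rigidity image per atom (C, N, O channels).
img_in = keras.Input(shape=(8, 30, 3))
x = layers.Conv2D(4, (2, 2), activation="relu")(img_in)  # filter count assumed
x = layers.Conv2D(4, (2, 2), activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(10, activation="relu")(x)               # width assumed
x = layers.Flatten()(x)

# Remaining global/local features (amino acid, element type, size, STRIDE, ...);
# the feature count here is a placeholder.
feat_in = keras.Input(shape=(48,))

merged = layers.Concatenate()([x, feat_in])
h = layers.Dense(59, activation="relu")(merged)
h = layers.Dropout(0.5)(h)
h = layers.Dense(100, activation="relu")(h)
h = layers.Dropout(0.25)(h)
h = layers.Dense(10, activation="relu")(h)
out = layers.Dense(1)(h)  # predicted B-factor

model = keras.Model([img_in, feat_in], out)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="mean_absolute_error")
```

Because the filter counts are assumed, this sketch will not reproduce the 21,584 trainable parameters quoted above exactly; it only illustrates the two-branch merge of Figure 2.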
The RF, GBT, and CNN models were all trained and tested in the same manner. For each protein, a machine learning model is built using the entire dataset, excluding the data from the protein whose B-factors are to be predicted. Overall, there are more than 620,000 atoms in our dataset, so for each protein this provides a training set of roughly 600,000 data points (i.e., atoms). For each heavy atom, there is a set of features as described in Section 2.2 and a B-factor value (label). The features and the labels in the training set are used to train each machine learning model. Since we perform leave-one-out predictions, the data from each protein is taken as a test set when its B-factors are to be blindly predicted.

We implement the random forest and gradient boosted models using the scikit-learn Python package. For the CNN model, we also use the Python package Keras with TensorFlow as a backend.
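This leave-one-protein-out protocol can be expressed with scikit-learn's LeaveOneGroupOut, grouping atoms by their parent protein; the data and PDB IDs below are toy placeholders for the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: three "proteins" with ten atoms each; groups[i] labels the
# protein that atom i belongs to (hypothetical PDB IDs).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
groups = np.repeat(["1ABC", "2XYZ", "3DEF"], 10)

logo = LeaveOneGroupOut()
predictions = {}
for train_idx, test_idx in logo.split(X, y, groups):
    # Train on every atom except those of the held-out protein ...
    model = RandomForestRegressor(n_estimators=10).fit(X[train_idx], y[train_idx])
    # ... then blindly predict the held-out protein's B-factors.
    held_out = groups[test_idx][0]
    predictions[held_out] = model.predict(X[test_idx])
```

Each protein thus contributes nothing to the model that predicts it, which is what makes the prediction "blind".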
Our study uses two datasets, one from Park, Jernigan, and Wu [17] and the other from Refs. [12, 13]. The first contains three subsets of small, medium, and large proteins [17], and the latter contains 364 proteins [12, 13]; the latter dataset is an extended version of the first. All structures have a resolution of 3 Å or better, with an average resolution of 1.3 Å, and the sets include proteins that range from 4 to 3912 residues [17].

For the CNN, the feature datasets were standardized to mean 0 and variance 1. Proteins 1OB4, 1OB7, 2OLX, and 3MD5 are excluded from the dataset because the STRIDE software is unable to provide features for them. We exclude protein 1AGN due to known problems with this protein's data. Proteins 1NKO, 2OCT, and 3FVA are also excluded because they have residues with B-factors reported as zero, which is unphysical.
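The standardization step can be sketched with scikit-learn's StandardScaler. Fitting the scaling statistics on the training split only, then reusing them on the test protein, is our assumption of standard practice; the text above does not specify how the split was handled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features for a training set and a held-out protein.
rng = np.random.default_rng(2)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

scaler = StandardScaler().fit(X_train)   # statistics from training data only
X_train_std = scaler.transform(X_train)  # mean 0, variance 1 per feature
X_test_std = scaler.transform(X_test)    # same shift/scale, no leakage
```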
We successfully executed a leave-one-(protein)-out method to blindly predict the B-factors of all carbon, nitrogen, and oxygen atoms present in a given protein. For comparison with other existing methods, we also list results for predicted Cα B-factors, which are predicted in the same way as the other heavy atoms. Machine learning was used to train a B-factor prediction model using the structural and B-factor data from a training set as described in Sections 2.3.4 and 2.4. The model was then used to predict the B-factors of all heavy atoms in a given protein using only its structural data.

To quantitatively assess our method for B-factor prediction, we used the Pearson correlation coefficient,

PCC = \frac{\sum_{i=1}^{N} (B_i^e - \bar{B}^e)(B_i^t - \bar{B}^t)}{\left[ \sum_{i=1}^{N} (B_i^e - \bar{B}^e)^2 \sum_{i=1}^{N} (B_i^t - \bar{B}^t)^2 \right]^{1/2}},   (10)

where B_i^t, i = 1, 2, \ldots, N are the predicted B-factors from the proposed method and B_i^e, i = 1, 2, \ldots, N are the experimental B-factors from the PDB file. The terms B_i^t and B_i^e represent the i-th theoretical and experimental B-factors, respectively. Here \bar{B}^e and \bar{B}^t are the averaged B-factors.

Computational efficiency in B-factor prediction is an important consideration for large proteins. Table 5 lists the running times of GNM, RF, GBT, and CNN in our Python implementations. These results are depicted in Figure 3. The proteins used to evaluate the computational complexity were the same as those used by Opron et al. [12]. For this comparison we only predict B-factors for Cα atoms. Several proteins were excluded because GNM requires a prohibitively long CPU time to run. The timings exclude loading PDB files and feature data. The machine learning timings also exclude model training, since a trained model can be used for the prediction of all proteins.
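The evaluation metric of Eq. (10) can be implemented directly in NumPy; the B-factor arrays below are toy values, not data from the paper.

```python
import numpy as np

def pearson_cc(b_exp, b_pred):
    """Pearson correlation coefficient of Eq. (10) between experimental
    and predicted (theoretical) B-factors."""
    de = b_exp - b_exp.mean()    # B_i^e - mean(B^e)
    dt = b_pred - b_pred.mean()  # B_i^t - mean(B^t)
    return (de * dt).sum() / np.sqrt((de ** 2).sum() * (dt ** 2).sum())

b_exp = np.array([12.1, 15.3, 9.8, 20.4, 17.6])   # toy experimental B-factors
b_pred = np.array([11.0, 16.2, 10.5, 19.0, 18.1])  # toy predictions
pcc = pearson_cc(b_exp, b_pred)
```

The result agrees with `np.corrcoef`, since Eq. (10) is the standard sample correlation.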
The results show that GNM has a computational complexity of roughly O(N³) due to the matrix decomposition, while the machine learning algorithms are close to O(N), with N being the number of atoms. Power-law fits of CPU time t versus N confirm these scalings for GNM, RF, GBT, and CNN.

Figure 3: CPU efficiency comparison between the GNM [12], RF, GBT, and CNN algorithms: execution time in seconds (s) versus the number of residues. A set of 34 proteins, listed in Table 5, was used to evaluate the computational complexity.
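These scalings can be sanity-checked by fitting the slope of log t versus log N; the (N, t) pairs below are taken from a few rows of Table 5.

```python
import numpy as np

# (residues, CPU seconds) pairs taken from Table 5.
gnm = np.array([[125, 0.141], [448, 6.178], [1044, 75.801], [3912, 6365.668]])
gbt = np.array([[125, 0.000358], [448, 0.000878],
                [1044, 0.001814], [3912, 0.006198]])

def scaling_exponent(data):
    """Slope of log(t) vs log(N), i.e., the p in t ~ N**p."""
    slope, _ = np.polyfit(np.log(data[:, 0]), np.log(data[:, 1]), 1)
    return slope

gnm_exp = scaling_exponent(gnm)   # near 3: matrix diagonalization dominates
gbt_exp = scaling_exponent(gbt)   # near 1: prediction is close to linear in N
```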
PDB   N     GNM [12]   RF        GBT       CNN
3P6J  125   0.141      0.000455  0.000358  0.130
3R87  132   0.156      0.000464  0.000339  0.138
3KBE  140   0.187      0.000505  0.000384  0.149
1TZV  141   0.203      0.000473  0.000365  0.163
2VY8  149   0.219      0.000486  0.000359  0.156
3ZIT  152   0.234      0.000519  0.000365  0.148
2FG1  157   0.265      0.000518  0.000403  0.174
2X3M  166   0.312      0.000526  0.000382  0.182
3LAA  169   0.327      0.000514  0.000405  0.155
3M8J  178   0.375      0.000548  0.000412  0.178
2GZQ  191   0.468      0.000647  0.000454  0.195
4G7X  194   0.499      0.000631  0.000445  0.209
2J9W  200   0.546      0.000554  0.000424  0.208
3TUA  210   0.655      0.000602  0.000472  0.217
1U9C  221   0.733      0.000592  0.000486  0.198
3ZRX  221   0.718      0.000654  0.000515  0.216
3K6Y  227   0.765      0.000619  0.000490  0.189
3OQY  234   0.873      0.000619  0.000502  0.211
2J32  244   0.967      0.000625  0.000556  0.225
3M3P  249   1.029      0.000621  0.000525  0.220
1U7I  267   1.263      0.000647  0.000551  0.237
4B9G  292   1.669      0.000693  0.000574  0.256
4ERY  318   2.122      0.000775  0.000619  0.289
3MGN  348   2.902      0.000655  0.000552  0.267
2ZU1  360   3.136      0.000816  0.000675  0.337
2Q52  412   4.696      0.000900  0.000750  0.369
4F01  448   6.178      0.001016  0.000878  0.401
3DRF  547   11.154     0.001131  0.001033  0.512
3UR8  637   17.409     0.001307  0.001136  0.583
2AH1  939   61.012     0.001716  0.001605  0.800
1GCO  1044  75.801     0.001936  0.001814  0.905
1F8R  1932  654.127    0.003343  0.003163  1.745
1H6V  2927  2085.842   0.005205  0.004739  2.543
1QKI  3912  6365.668   0.006261  0.006198  3.560
Table 5:
CPU execution times, in seconds, from the efficiency comparison between GNM [12], RF, GBT, and CNN.

3.3 Machine learning performance

The results in Table 6 show that for the blind prediction of all heavy atoms, the convolutional neural network method performs best, with an overall average Pearson correlation coefficient of 0.69. The gradient boosted and random forest ensemble methods performed similarly, with Pearson correlation coefficients of 0.63 and 0.59, respectively. For comparison, Table 6 also lists the average Pearson correlation coefficients for Cα-only B-factor predictions, which are obtained in the same manner as for the other heavy atoms. These results can be compared with those of the parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM), and normal mode analysis (NMA), which, however, were obtained via least squares fitting of each protein.

Results for all-heavy-atom B-factor predictions for the small-, medium-, and large-sized protein data subsets [17] are given in Tables 7, 8, and 9. Table 10 shows the results for all-heavy-atom B-factor predictions of each protein in the Superset. The average Pearson correlation coefficients for the data subsets are provided in Table 6. All methods perform similarly on the different protein data subsets, with the convolutional neural network method performing best on the Superset for both all-heavy-atom and Cα-only B-factor predictions.

Prediction of Cα only
Protein set  RF    GBT   CNN   pfFRI [12]  GNM [12]  NMA [12]
Small        0.25  0.39  0.53  0.60        0.54      0.48
Medium       0.47  0.59  0.55  0.61        0.55      0.48
Large        0.50  0.57  0.62  0.59        0.53      0.49
Superset     0.49  0.57  0.66  0.63        0.57      NA

Prediction of all heavy atoms
Protein set  RF    GBT   CNN   pfFRI [12]  GNM [12]  NMA [12]
Small        0.44  0.49  0.56  NA          NA        NA
Medium       0.59  0.64  0.62  NA          NA        NA
Large        0.62  0.65  0.68  NA          NA        NA
Superset     0.59  0.63  0.69  NA          NA        NA
Table 6:
Average Pearson correlation coefficients (PCCs) for both all-heavy-atom and Cα-only B-factor predictions for the small-, medium-, and large-sized protein sets, along with the entire Superset of the 364-protein dataset. Predictions of the random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) are obtained by leave-one-protein-out (blind) prediction, while predictions of the parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM), and normal mode analysis (NMA) were obtained via least squares fitting of individual proteins. All machine learning models use all-heavy-atom information for training.

Our blind prediction result using the convolutional neural network is notable because it improves upon the best result in previous work, the single-protein parameter-free FRI (pfFRI) linear fitting of 0.63 [12]. It is noted that blind predictions are much more difficult than linear fittings. The result for single-protein GNM linear fittings of the same data set is 0.57 [12]. As reported in Table 10, no single method outperforms the others for every protein in the data set.
In terms of the average Pearson correlation coefficient for all-heavy-atom B-factor prediction, the convolutional neural network method outperforms the gradient boosted tree and random forest methods by 10% and 17%, respectively.

PDB ID  N    RF     GBT    CNN
1AIE    235  0.62   0.53   0.60
1AKG    108  0.41   0.51   0.70
1BX7    345  0.55   0.67   0.63
1ETL    76   0.27   0.03   0.48
1ETM    80   0.46   0.13   0.48
1ETN    77   0.33   0.25   0.20
1FF4    477  0.55   0.59   0.76
1GK7    321  0.53   0.73   0.72
1GVD    401  0.66   0.69   0.71
1HJE    73   -0.07  0.46   0.37
1KYC    138  0.43   0.30   0.32
1NOT    96   -0.18  0.81   0.63
1O06    142  0.51   0.64   0.65
1P9I    203  0.73   0.77   0.77
1PEF    153  0.60   0.64   0.76
1PEN    109  0.34   0.24   0.21
1Q9B    303  0.41   0.67   0.75
1RJU    257  0.71   0.75   0.73
1U06    432  0.55   0.68   0.61
1UOY    452  0.55   0.56   0.55
1USE    290  0.25   0.50   0.68
1VRZ    66   0.38   -0.17  0.09
1XY2    62   0.16   0.27   0.55
1YJO    55   0.36   0.12   0.02
1YZM    361  0.51   0.60   0.56
2DSX    386  0.36   0.44   0.56
2JKU    229  0.57   0.63   0.35
2NLS    269  0.45   0.49   0.70
2OL9    51   0.65   0.51   0.84
6RXN    345  0.56   0.71   0.82
Table 7:
Pearson correlation coefficients for cross-protein blind heavy-atom B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the small-sized protein set. Reported results use heavy atoms in both training and prediction.
Table 8:
Pearson correlation coefficients for cross-protein blind heavy-atom B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the medium-sized protein set. Reported results use heavy atoms in both training and prediction.
Table 9: Pearson correlation coefficients for cross-protein heavy atom blind B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the large-sized protein set. Results reported use heavy atoms in both training and prediction.
Table 10: Pearson correlation coefficients for cross-protein heavy atom blind B-factor prediction obtained by random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) for the superset. Results reported use heavy atoms in both training and prediction.
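For reference, the Pearson correlation coefficient reported in these tables compares predicted and experimental B-factors atom by atom; in the standard notation used in earlier FRI fitting work [12], it is

```latex
\mathrm{PCC} =
\frac{\sum_{i=1}^{N}\bigl(B_i^{\mathrm{t}}-\bar{B}^{\mathrm{t}}\bigr)
      \bigl(B_i^{\mathrm{e}}-\bar{B}^{\mathrm{e}}\bigr)}
     {\Bigl[\sum_{i=1}^{N}\bigl(B_i^{\mathrm{t}}-\bar{B}^{\mathrm{t}}\bigr)^{2}
      \sum_{i=1}^{N}\bigl(B_i^{\mathrm{e}}-\bar{B}^{\mathrm{e}}\bigr)^{2}\Bigr]^{1/2}},
```

where $B_i^{\mathrm{t}}$ and $B_i^{\mathrm{e}}$ are the predicted and experimental B-factors of the $i$-th atom and bars denote averages over the $N$ atoms considered.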
Some low Pearson correlation coefficients indicate poor predictions by a given model. However, for almost every protein where one model performs poorly, another model performs satisfactorily. When the maximum correlation coefficient among the three methods is taken for each protein, the average all-heavy-atom correlation coefficient increases to 0.73 and the average Cα-only correlation coefficient increases to 0.72. This result is similar to that of the parameter-optimized FRI (opFRI) reported in our earlier work [12].

Both the random forest and gradient boosted tree methods can rank relative feature importance, which helps us understand the significant features in the model. Figure 4 shows the individual feature importance for the random forest averaged over the data set.

Figure 4: Individual feature importance for the random forest model averaged over the data set. Reported feature selection includes the use of heavy atoms in the model.
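The per-protein maximum over the three methods, used to obtain the 0.73 average, can be computed directly from the tables. The rows below are three entries copied from Table 7; the full computation simply runs over every protein.

```python
# Best-of-three PCC per protein; the rows are sample entries from Table 7.
pcc = {
    "1ETL": {"RF": 0.27, "GBT": 0.03, "CNN": 0.48},
    "1NOT": {"RF": -0.18, "GBT": 0.81, "CNN": 0.63},
    "2OL9": {"RF": 0.65, "GBT": 0.51, "CNN": 0.84},
}
best = {pdb: max(scores.values()) for pdb, scores in pcc.items()}
avg_best = sum(best.values()) / len(best)  # average of the per-protein maxima
```

Note how 1NOT, where the random forest fails outright (PCC of -0.18), is rescued by the gradient boosted tree (0.81), which is exactly the complementarity described above.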
We also include aggregated feature importance in Figure 5. In this figure, we sum the importance of the individual angle, secondary structure, MWCG, atom type, protein size, amino acid, and packing density features.
Figure 5: Average feature importance for the random forest model with the angle, secondary structure, MWCG, atom type, protein size, amino acid, and packing density features aggregated. Reported feature selection includes the use of heavy atoms in the model.
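The aggregation used for Figure 5 amounts to summing the importances of related individual features. In the sketch below, the feature names and importance values are hypothetical, chosen only to illustrate the grouping.

```python
# Group individual random-forest importances into aggregate categories
# (feature names and values are hypothetical illustrations).
importances = {
    "mwcg_cc": 0.08, "mwcg_co": 0.05, "mwcg_cn": 0.04,
    "packing_short": 0.02, "packing_medium": 0.04, "packing_long": 0.02,
    "resolution": 0.12, "r_value": 0.05,
}
groups = {
    "MWCG": ["mwcg_cc", "mwcg_co", "mwcg_cn"],
    "packing density": ["packing_short", "packing_medium", "packing_long"],
    "global": ["resolution", "r_value"],
}
# Each group's importance is the sum over its member features.
aggregated = {g: sum(importances[f] for f in feats) for g, feats in groups.items()}
```

Because the aggregation is a plain sum, the total importance is conserved; only its attribution changes from individual features to feature families.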
Figure 4 shows that the most important MWCG feature is the carbon-carbon interaction. This MWCG feature uses a Lorentz radial basis function with η = 16 and ν = 3, as detailed in Section 2.2. The remaining eight MWCG features all rank similarly, with the carbon-oxygen interaction ranked as the second most significant MWCG feature. This result validates that the model benefits from the multiscale property of the MWCG feature, which uses three different kernels to capture interactions at various length scales. Since all MWCG features have significance in the feature ranking, it follows that the element-specific property of the MWCG method is also a meaningful model feature.

Figure 4 also shows that the individual MWCG, amino acid type, and packing density features have low relative importance; however, considering their aggregate importance in Figure 5, we see that they contribute to the model. Figure 5 shows that the medium packing density feature was twice as important to the model as the short and long packing density features. The medium packing density may capture semi-local side chain interactions, which are important in protein flexibility. The short packing density likely captures only adjacent backbone information, while the long packing density adds only weak atomic interaction information to the model. Protein resolution is the most significant feature overall, followed by the MWCG features and the STRIDE-generated residue solvent accessible area feature. This highlights the importance of the quality of X-ray crystal structures and the difficulty of cross-protein B-factor prediction. Protein angles, secondary structure, and size play a less significant role in the model compared to the other features. Atom type has the lowest significance relative to the other features implemented in the model. Not surprisingly, we see that global features such as resolution and R-value are important components of the ensemble models.
The global feature of protein size plays a small role in the model.

Care must be taken when using feature ranking to understand feature importance. The ranking provided by these models is a relative ordering of the features that the models find most important, so highly correlated features may appear redundant, giving one of them a lower rank even though it may have significant predictive power. For example, R-value correlates strongly with resolution, so it is likely a meaningful feature; however, the use of resolution reduces the relative importance ranking of R-value in the model.

Among the three methods considered in this work, the convolutional neural network outperforms the gradient boosted tree and random forest by 10% and 17%, respectively. As reported in Table 10, no single machine learning method outperforms the others for every protein. Results for all machine learning methods could undoubtedly be improved by refining features, exploring new features, and further tuning hyperparameters.

In general, ensemble methods do not require as much parameter tuning as the CNN does. The random forest is the simplest and most robust method. To balance cost, time, and quality, only 500 trees were used for the random forest and 1000 trees for the gradient boosted method in this work. This may account for the increased performance of the gradient boosted tree method compared to the random forest. Ensemble methods are quite robust against overfitting, so adding more features would likely improve their results [18]. The gradient boosted trees use several hyperparameters, so these methods could benefit from further tuning.

The additional data, in the form of the MWCG images used in the convolutional neural network, likely explains its improved performance compared to the ensemble methods. More refined images and other novel image types could further improve results. Using the dropout strategy, CNNs are also robust against overfitting.
Since the CNN method involves a number of hyperparameters, it would likely benefit from more detailed parameter tuning. Additionally, a larger data set and more features would also improve the CNN performance. For example, including persistent homology [19] and differential geometry features might lead to a better CNN prediction.
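As a concrete illustration of the dropout strategy mentioned above, here is a minimal numpy sketch of inverted dropout; the rate and tensor shapes are illustrative and are not the paper's CNN settings.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero out a fraction `rate` of activations during
    training and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep  # keep each unit with probability `keep`
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones((4, 10))
y_train = dropout(x, rate=0.5, rng=rng, training=True)   # some units zeroed, rest rescaled
y_eval = dropout(x, rate=0.5, rng=rng, training=False)   # identity at evaluation time
```

The random masking prevents units from co-adapting during training, which is why dropout acts as a regularizer, while the rescaling by `1/keep` means no correction is needed at evaluation time.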
Conclusion

Protein flexibility is known to correlate strongly with protein function, and its prediction is important for our understanding of protein dynamics and transport. Our quantitative understanding of protein flexibility and function is greatly impeded by structural complexity and a large number of degrees of freedom. Many time-independent methods exist, such as NMA [4, 20, 5, 3], ENM [6], GNM [8, 9, 21], and FRI [11–13, 22], that dramatically simplify protein structural complexity and are able to analyze protein B-factors, which reflect protein flexibility among other things. Based on the hypothesis that the intrinsic physics lies in a low-dimensional space embedded in a high-dimensional data space, we introduced multiscale weighted colored graphs (MWCGs) to effectively reduce protein structural complexity and efficiently describe protein flexibility. However, none of the aforementioned methods is able to blindly predict the B-factors of an unknown protein. This work integrates advanced machine learning algorithms with two sets of features, i.e., global and local ones, to blindly predict protein flexibility and B-factors.

A few standard data sets involving more than 300 proteins (more than 600,000 B-factors) have been utilized to test the proposed method. We use the leave-one-protein-out scheme to blindly predict protein B-factors of both all heavy atoms and Cα atoms only. Extensive numerical experiments demonstrate that the present blind prediction is more accurate than least squares fitting using GNM or NMA in terms of Pearson correlation coefficients for Cα B-factor prediction. Further, we demonstrate the ability to effectively and blindly predict the B-factors of any heavy atom in a given protein.

Three standard machine learning algorithms, namely random forests, gradient boosted trees, and convolutional neural networks, are employed in the present study. Among them, convolutional neural networks perform best in B-factor prediction.
A variety of features were considered for these models, including local, semi-local, and global features. Local features, such as MWCGs, are designed to capture structural properties associated with intrinsic flexibility, while global features, such as X-ray crystal resolution, are used to enable cross-protein comparison and analysis. The proposed method is very efficient. However, there is still much room for novel and interesting features in future work. For example, many algebraic topology tools have been found very useful for protein analysis [23, 18, 19] and will likely pair well with machine learning approaches for protein flexibility prediction.

This work is a first step in using recent advances in machine learning techniques to blindly predict protein B-factors. To the authors' knowledge, this is the first work demonstrating that this is a feasible and robust prediction approach. It provides clear evidence that machine learning algorithms are useful in protein flexibility analysis. Results for all methods could undoubtedly be improved by a better mathematical description of intrinsic flexibility, larger data sets, and more advanced machine learning algorithms.

The proposed methods could be applied to a variety of interesting problems related to protein flexibility and function, including hinge detection, hot spot identification, allosteric site detection, pose prediction, protein folding, and computer-aided drug design.
Acknowledgment
This work was supported in part by NSF grants DMS-1721024 and DMS-1761320 and NIH grant 1R01GM126189-01A1.

References

[1] J. P. Ma, "Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes," Structure, vol. 13, pp. 373–380, 2005.

[2] H. Frauenfelder, S. G. Sligar, and P. G. Wolynes, "The energy landscapes and motions of proteins," Science, vol. 254, pp. 1598–1603, 1991.

[3] M. Tasumi, H. Takeuchi, S. Ataka, A. M. Dwivedi, and S. Krimm, "Normal vibrations of proteins: Glucagon," Biopolymers, vol. 21, pp. 711–714, 1982.

[4] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. States, S. Swaminathan, and M. Karplus, "CHARMM: A program for macromolecular energy, minimization, and dynamics calculations," J. Comput. Chem., vol. 4, pp. 187–217, 1983.

[5] M. Levitt, C. Sander, and P. S. Stern, "Protein normal-mode dynamics: Trypsin inhibitor, crambin, ribonuclease and lysozyme," J. Mol. Biol., vol. 181, no. 3, pp. 423–447, 1985.

[6] M. M. Tirion, "Large amplitude elastic motions in proteins from a single-parameter, atomic analysis," Phys. Rev. Lett., vol. 77, pp. 1905–1908, 1996.

[7] A. R. Atilgan, S. R. Durrell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar, "Anisotropy of fluctuation dynamics of proteins with an elastic network model," Biophys. J., vol. 80, pp. 505–515, 2001.

[8] I. Bahar, A. R. Atilgan, and B. Erman, "Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential," Folding and Design, vol. 2, pp. 173–181, 1997.

[9] I. Bahar, A. R. Atilgan, M. C. Demirel, and B. Erman, "Vibrational dynamics of proteins: Significance of slow and fast modes in relation to function and stability," Phys. Rev. Lett., vol. 80, pp. 2733–2736, 1998.

[10] T. Haliloglu, I. Bahar, and B. Erman, "Gaussian dynamics of folded proteins," Phys. Rev. Lett., vol. 79, no. 16, p. 3090, 1997.

[11] K. L. Xia and G. W. Wei, "A stochastic model for protein flexibility analysis," Phys. Rev. E, vol. 88, p. 062709, 2013.

[12] K. Opron, K. L. Xia, and G. W. Wei, "Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis," J. Chem. Phys., vol. 140, p. 234105, 2014.

[13] K. Opron, K. L. Xia, and G. W. Wei, "Communication: Capturing protein multiscale thermal fluctuations," J. Chem. Phys., vol. 142, p. 211101, 2015.

[14] D. D. Nguyen, K. L. Xia, and G. W. Wei, "Generalized flexibility-rigidity index," J. Chem. Phys., vol. 144, p. 234106, 2016.

[15] D. Bramer and G.-W. Wei, "Multiscale weighted colored graphs for protein flexibility and rigidity analysis," J. Chem. Phys., vol. 148, no. 5, p. 054103, 2018.

[16] M. Heinig and D. Frishman, "STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins," Nucleic Acids Res., vol. 32, no. suppl. 2, pp. W500–W502, 2004.

[17] J. K. Park, R. Jernigan, and Z. Wu, "Coarse grained normal mode analysis vs. refined Gaussian network model for protein residue-level structural fluctuations," Bull. Math. Biol., vol. 75, pp. 124–160, 2013.

[18] Z. X. Cang and G. W. Wei, "Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology," Bioinformatics, vol. 33, pp. 3549–3557, 2017.

[19] Z. X. Cang and G. W. Wei, "Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction," Int. J. Numer. Methods Biomed. Eng., vol. 34, p. e2914, 2018.

[20] N. Go, T. Noguti, and T. Nishikawa, "Dynamics of a small globular protein in terms of low-frequency vibrational modes," Proc. Natl. Acad. Sci., vol. 80, pp. 3696–3700, 1983.

[21] B. Brooks and M. Karplus, "Harmonic dynamics of proteins: normal modes and fluctuations in bovine pancreatic trypsin inhibitor," Proc. Natl. Acad. Sci., vol. 80, no. 21, pp. 6571–6575, 1983.

[22] K. Opron, K. L. Xia, Z. Burton, and G. W. Wei, "Flexibility-rigidity index for protein-nucleic acid flexibility and fluctuation analysis," J. Comput. Chem., vol. 37, pp. 1283–1295, 2016.

[23] K. L. Xia and G. W. Wei, "Persistent homology analysis of protein structure, flexibility and folding,"