Atom-specific persistent homology and its application to protein flexibility analysis
David Bramer and Guo-Wei Wei∗

Department of Mathematics, Michigan State University, MI 48824, USA
Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA

March 27, 2019
Abstract
Recently, persistent homology has had tremendous success in biomolecular data analysis. It works by examining the topological relationship or connectivity of a group of atoms in a molecule at a variety of scales, then rendering a family of topological representations of the molecule. However, persistent homology is rarely employed for the analysis of atomic properties, such as biomolecular flexibility analysis or B factor prediction. This work introduces atom-specific persistent homology to provide a local atomic level representation of a molecule via a global topological tool. This is achieved through the construction of a pair of conjugated sets of atoms and corresponding conjugated simplicial complexes, as well as conjugated topological spaces. The difference between the topological invariants of the pair of conjugated sets is measured by Bottleneck and Wasserstein metrics and leads to an atom-specific topological representation of individual atomic properties in a molecule. Atom-specific topological features are integrated with various machine learning algorithms, including gradient boosting trees and convolutional neural networks, for protein thermal fluctuation analysis and B factor prediction. Extensive numerical results indicate the proposed method provides a powerful topological tool for analyzing and predicting localized information.
Keywords:
Atom-specific topology, Element-specific persistent homology, Protein flexibility, Gradient boosting tree, Convolutional neural network.

∗ Address correspondence to Guo-Wei Wei. E-mail: [email protected]
1 Introduction
In recent years, tools from topology have been successfully applied to protein analysis [1-6]. Topology offers one of the highest levels of abstraction of geometric data and allows one to infer high dimensional structure from low dimensional representations. However, conventional topology oversimplifies geometry and thus lacks descriptive power for most real world problems. Persistent homology (PH) overcomes this difficulty by introducing a filtration parameter that describes the geometry in terms of a family of Betti numbers at various scales, known as a barcode [7-10]. Indeed, three dimensional (3D) protein spatial information from a protein data bank (PDB) file can be converted into a family of simplicial complexes. One can apply tools from algebraic topology to convert structural information into global topological invariants that provide a useful representation of biomolecular properties [11]. However, for quantitative biomolecular analysis and prediction, persistent homology alone neglects chemical and biological information. Element-specific persistent homology has been introduced to incorporate chemical and biological information into topological invariants [12,13]. Similarities and differences between barcodes from different molecules can be measured by Wasserstein [14] and/or Bottleneck [15] distances. However, the previous applications of persistent homology and element-specific persistent homology are for the modeling and prediction of molecule-level thermodynamical or structural properties, such as protein-ligand binding affinities [13], protein folding free energy changes upon mutation [12,16], drug toxicity [17], solubility, partition coefficient [18], and drug virtual screening (ligand and decoy classification) [19]. Essentially, topology is a global tool that examines the connectivity and relationships among many atoms in a neighborhood as a whole. High dimensional topological invariants, such as Betti-1 and Betti-2, describe the collective behavior of many atoms.
Therefore, it is not clear how to represent an atomic-level property, such as the B factor of an atom, by persistent homology.

In proteins, the beta factor (B factor), or Debye-Waller factor, is a measure of the attenuation of X-ray scattering caused by thermal motion. The strength of the thermal motion of an atom is theoretically proportional to its B factor during structure determination from X-ray diffraction data. It is well known that biomolecular flexibility provides an important link between structure and function. In particular, it has been shown that intrinsic structural flexibility correlates with meaningful protein conformational variations, reactivity, and enzymatic function [20]. As such, the accurate prediction of protein B factors is essential to our understanding of protein structure, function, and dynamics [21].

Early methods used to predict protein B factors were derived from Hooke's law and are known as elastic mass-and-spring networks. In these models, the alpha carbons (Cα) of biological macromolecules are treated as a mass-and-spring network and motions are predicted based on a harmonic potential. Given a protein, each Cα is represented as a node in the network and edges are weighted based on a potential function. Nodes are connected by an edge if they fall within a pre-defined Euclidean cutoff distance. This captures the local covalent and non-covalent interactions between an individual atom and nearby atoms. One of the first mass-and-spring methods used for protein B factor prediction is normal mode analysis (NMA). Like most B factor prediction methods, NMA is independent of time and uses a Hamiltonian interaction matrix. Eigenvalues of the matrix system correspond to characteristic frequencies of the protein, and these frequencies correlate with protein B factors. Low-frequency modes correlate with cooperative motion and can be useful for hinge detection and domain motion.
NMA has also been successfully implemented to understand the deformation of supramolecular complexes [20,22-24]. The elastic network model (ENM) was introduced as a more efficient model that significantly reduces computational cost compared to NMA through the use of a simplified spring network [25]. A specific example is the anisotropic network model (ANM) [26]. The Gaussian network model (GNM) further reduces the computational cost by ignoring anisotropic motion, rendering a more accurate method for protein Cα B factor analysis [27-29].

All of the aforementioned methods depend on matrix diagonalization, which has a computational complexity of O(N^3), where N is the number of atoms involved in the analysis. Recently, flexibility and rigidity index (FRI) methods have been proposed as a geometric graph approach to further reduce the computational cost. FRI methods rely on constructing a distance matrix using radial basis functions to scale atom-to-atom distances non-linearly [30]. All versions of FRI produce a flexibility index, which correlates to the B factor, for each Cα. Several versions of FRI have been developed. Among them, fast FRI (fFRI) is of O(N) computational complexity [31]. FRI methods are also more accurate than all of the earlier algebraic graph-based methods. Additionally, anisotropic FRI (aFRI) provides high quality anisotropic motion analysis [31]. Moreover, using several radial basis functions with different parametrizations, the multiscale flexibility rigidity index (mFRI) can successfully capture multiscale atomic interactions [32].

More recently, the authors introduced a multiscale weighted colored graph (MWCG) model. The MWCG is another geometric graph theory model that has been shown to be the best B factor prediction model to date. First, element-specific interaction subgraphs are constructed based on selected atomic interactions between certain element types.
Atoms are represented as graph nodes, and subgraphs are generated using pairs of atoms of certain elements (e.g., carbon, nitrogen, oxygen, sulfur). A centrality metric that uses radial basis functions is applied to pairwise interactions in each subgraph. By varying the parametrization of the radial basis functions, the MWCG model can capture multiple protein interaction scales. MWCG is unique in its ability to utilize both element-specific and multiscale interactions for improved B factor prediction [33]. Most recently, MWCG has been incorporated with machine learning algorithms for across-protein blind predictions of protein B factors [34].

The objective of the present work is to extend the utility of persistent homology to atomic-level property modeling and prediction. To this end, we introduce atom-specific persistent homology (ASPH) to create a local atomic representation of an atom using a global topological tool in a novel way. Specifically, ASPH constructs a pair of conjugated sets of point clouds, or atoms, centered around the atom of interest. The first set of the pair for a given atom is selected by a local sphere of radius r_c around the atom of interest. The second set of atoms is defined by excluding the atom of interest from the first set. Conjugated simplicial complexes, conjugated chain groups, conjugated homology groups, as well as conjugated persistence barcodes or diagrams, are induced by an identical filtration. Conjugated persistence barcodes are compared with Bottleneck and Wasserstein metrics. The resulting distance provides a global topological representation of a localized atomic property, such as protein flexibility and atomic-level protein B-factor information.
Obviously, the proposed atom-specific topology can be applied to a wide variety of chemical and biological problems where atomic properties are measured, such as the chemical shifts of nuclear magnetic resonance (NMR), the B factors of X-ray structure determination, and the shift and line broadening of other atomic spectroscopies.

We focus on protein Cα B factor prediction, but the approach provided in this work is a general framework that can be used to predict the B factors of any atom in a protein. First, we use the generated atom-specific persistent homology features to fit B factors within a given protein using linear least squares minimization. Then the atom-specific persistent homology features are combined with other local and global protein features to construct machine learning models for the blind prediction of protein B factors across different proteins. Additionally, image-like multiscale atom-specific persistent homology features are generated using an earlier technique [35]. These image-like features, together with other features, are fed into convolutional neural networks (CNNs). Training and validation are carried out using a large and diverse set of proteins from the Protein Data Bank (PDB). The proposed method offers some of the best results for blind B factor prediction on a set of 364 proteins.
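The pair of conjugated point clouds described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `conjugated_clouds`, its argument names, and the plain-tuple atom representation are illustrative choices.

```python
import math

def conjugated_clouds(atoms, center_idx, r_c):
    """Build the pair of conjugated point clouds used in ASPH (sketch).

    `atoms` is a list of (x, y, z) coordinates, `center_idx` indexes the
    atom of interest, and `r_c` is the cutoff radius.  Returns (R, R_hat):
    R contains every atom within r_c of the center (center included);
    R_hat is the same set with the atom of interest excluded.
    """
    center = atoms[center_idx]
    R = [p for p in atoms if math.dist(p, center) < r_c]
    R_hat = [p for i, p in enumerate(atoms)
             if math.dist(p, center) < r_c and i != center_idx]
    return R, R_hat
```

Persistent homology is then run on both sets, and the distance between the two resulting diagrams becomes the topological signature of the chosen atom.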
2 Methods and algorithms

Topology describes (continuous) objects in terms of topological invariants, i.e., Betti numbers. Betti-0, Betti-1, and Betti-2 can be interpreted as the numbers of connected components, rings, and cavities, respectively. Table 1 provides examples of the Betti numbers of a point, circle, sphere, and torus.
Table 1:
Topological invariants displayed as Betti numbers. Betti-0 represents the number of connected components, Betti-1 the number of tunnels or circles, and Betti-2 the number of cavities or voids. Two auxiliary rings are added to the torus to illustrate that its Betti-1 = 2.
Example   Point  Circle  Sphere  Torus
Betti-0     1      1       1       1
Betti-1     0      1       0       2
Betti-2     0      0       1       1
Figure 1:
From left to right an example of a 0-simplex, 1-simplex, 2-simplex, and 3-simplex.
Given discrete data points, such as a point cloud or the set of atoms in a molecule, we use simplicial complexes to describe the topological relationship, or connectivity, of the point cloud and to systematically identify topological invariants. Simplicial complexes, as shown in Figure 1, are made up of vertices, edges, triangles, and tetrahedrons, denoted 0-simplex, 1-simplex, 2-simplex, and 3-simplex, respectively. Homology groups constructed from simplicial complexes give rise to topological invariants. Given a discrete dataset, or a set of protein atoms, nontrivial topological information is generated by persistent homology, which introduces a filtration parameter to create a family of simplexes, leading to a family of simplicial complexes, homology groups, and associated topological invariants. By continuously varying the filtration parameter over an interval, the topological relationship among a given set of atoms is systematically reset, rendering a family of homology groups and corresponding topological invariants, which can be plotted as a persistence diagram or a set of barcodes. Both persistence diagrams and barcodes record the birth and death (appearance and cessation) of Betti numbers during the filtration process. Many simplicial complex definitions, which determine the rules of the corresponding topological relationship, have been proposed. Commonly used definitions include the Vietoris-Rips (VR) complex, the Čech complex, and the alpha complex.

Persistent homology allows the extraction of topological invariants that are embedded in the high dimensional data space of biomolecules. The resulting topological invariants over the filtration, i.e., the persistence diagrams or persistence barcodes of different molecules, can be compared using Bottleneck and Wasserstein distances. The goal of atom-specific persistent homology is to extract the topological information of a given atom in a molecule.
To embed local atomic information into a global topological description, we construct a pair of conjugated sets of point clouds, namely the original dataset and a dataset excluding the atom of interest. The Bottleneck and Wasserstein distances between the two resulting persistence diagrams reveal the desired topological information of the given atom.
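As a concrete illustration of the filtration picture above, the Betti-0 part of a Vietoris-Rips barcode can be computed with a union-find pass over sorted pairwise distances (Kruskal's algorithm). This is a minimal stdlib sketch for intuition only; the paper generates its barcodes with the R library TDA.

```python
import math
from itertools import combinations

def betti0_barcodes(points):
    """Betti-0 persistence barcodes of a Vietoris-Rips filtration (sketch).

    Every point is born at filtration value 0; a connected component dies
    when the edge that merges it into another component appears.  Sort the
    pairwise distances, union components, and record a (birth, death) bar
    at each merge.  The last surviving component never dies (death = inf).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, d))     # a component dies at distance d
    bars.append((0.0, math.inf))      # the surviving component
    return bars
```

Higher Betti numbers require the full boundary-matrix reduction, which dedicated libraries provide.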
A (geometric) simplex is a generalization of a triangle or tetrahedron to arbitrary dimensions. A $k$-simplex is the convex hull of $k+1$ affinely independent points,
$$\sigma = \{\lambda_0 u_0 + \lambda_1 u_1 + \cdots + \lambda_k u_k \mid \textstyle\sum_i \lambda_i = 1,\ \lambda_i \ge 0,\ i = 0, 1, \ldots, k\}, \qquad (1)$$
where $\{u_0, u_1, \ldots, u_k\} \subset \mathbb{R}^d$ with $d \ge k$ is the set of points, $\sigma$ is the $k$-simplex, and the constraints on the $\lambda_i$ ensure the formation of a convex hull. An affinely independent combination of points can have at most $k+1$ points in $\mathbb{R}^k$. For example, a 1-simplex is a line segment, a 2-simplex a triangle, and a 3-simplex a tetrahedron. A subset of $m+1$ of the $k+1$ vertices of a $k$-simplex forms a convex hull in a lower dimension and is called an $m$-face of the $k$-simplex. An $m$-face is proper if $m < k$. The boundary of a $k$-simplex $\sigma$ is defined as the formal sum of its $k+1$ faces of dimension $k-1$,
$$\partial_k \sigma = \sum_{i=0}^{k} (-1)^i [u_0, \ldots, \hat{u}_i, \ldots, u_k], \qquad (2)$$
where $[u_0, \ldots, \hat{u}_i, \ldots, u_k]$ denotes the convex hull formed by the vertices of $\sigma$ with the vertex $u_i$ excluded, and $\partial_k$ is called the boundary operator. A collection of finitely many simplices forms a simplicial complex, denoted $K$. All simplicial complexes satisfy the following conditions.
1. Faces of any simplex in $K$ are also simplices in $K$.
2. The intersection of any two simplices $\sigma_1, \sigma_2 \in K$ is a face of both $\sigma_1$ and $\sigma_2$.

Given a simplicial complex $K$, a $k$-chain $c_k$ of $K$ is a formal sum of the $k$-simplices in $K$, defined as $c_k = \sum_i a_i \sigma_i$, where the $\sigma_i$ are $k$-simplices and the $a_i$ are coefficients. Generally, the $a_i$ are elements of a field such as $\mathbb{R}$, $\mathbb{Q}$, or $\mathbb{Z}_n$. Computationally, it is common to choose $a_i \in \mathbb{Z}_2$. The group of $k$-chains in $K$, denoted $C_k$, forms an Abelian group under addition modulo two. This allows us to extend the definition of the boundary operator introduced in Eq. (2) to chains. The boundary operator applied to a $k$-chain $c_k$ is defined as
$$\partial_k c_k = \sum_i a_i \partial_k \sigma_i, \qquad (3)$$
where the $\sigma_i$ are $k$-simplices. The boundary operator is a map from $C_k$ to $C_{k-1}$, also known as a boundary map for chains. Note that over $\mathbb{Z}_2$ the boundary operator satisfies $\partial_k \circ \partial_{k+1} \sigma = 0$ for any $(k+1)$-simplex $\sigma$, following from the fact that any $(k-1)$-face of $\sigma$ is contained in exactly two $k$-faces of $\sigma$. The chain complex is defined as a sequence of chains connected by boundary maps of decreasing dimension, denoted
$$\cdots \xrightarrow{\partial_{n+1}} C_n(K) \xrightarrow{\partial_n} C_{n-1}(K) \xrightarrow{\partial_{n-1}} \cdots \xrightarrow{\partial_1} C_0(K) \xrightarrow{\partial_0} 0. \qquad (4)$$
The $k$-cycle group and $k$-boundary group are then defined as the kernel of $\partial_k$ and the image of $\partial_{k+1}$, respectively:
$$Z_k = \mathrm{Ker}\,\partial_k = \{c \in C_k \mid \partial_k c = 0\}, \qquad (5)$$
$$B_k = \mathrm{Im}\,\partial_{k+1} = \{c \in C_k \mid \exists d \in C_{k+1} : c = \partial_{k+1} d\}, \qquad (6)$$
where $Z_k$ is the $k$-cycle group and $B_k$ is the $k$-boundary group. Since $\partial_k \circ \partial_{k+1} = 0$, we have $B_k \subset Z_k \subset C_k$. The $k$-homology group is then defined to be the quotient of the $k$-cycle group by the $k$-boundary group,
$$H_k = Z_k / B_k, \qquad (7)$$
where $H_k$ is the $k$-homology group. The $k$th Betti number is defined to be the rank of the $k$-homology group, $\beta_k = \mathrm{rank}(H_k)$.

For a simplicial complex $K$, we define a filtration of $K$ as a nested sequence of subcomplexes of $K$,
$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_n = K. \qquad (8)$$
In persistent homology, the nested sequence of subcomplexes usually depends on a filtration parameter. The persistence of a topological feature is denoted graphically by its life span with respect to the filtration parameter. Subcomplexes corresponding to various filtration parameters offer topological fingerprints over multiple scales. The $k$th persistent Betti numbers $\beta_k^{i,j}$ are given by the ranks of the $k$th homology groups of $K_i$ that are still alive in $K_j$, defined as
$$\beta_k^{i,j} = \mathrm{rank}(H_k^{i,j}) = \mathrm{rank}\big(Z_k(K_i) / (B_k(K_j) \cap Z_k(K_i))\big). \qquad (9)$$
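The boundary operator and the identity $\partial_k \circ \partial_{k+1} = 0$ can be checked directly over $\mathbb{Z}_2$, where the signs in Eq. (2) vanish and chain addition becomes symmetric difference of simplex sets. The tuple-of-vertices representation below is a toy encoding for illustration only.

```python
def boundary(chain):
    """Boundary of a Z2 chain (sketch).

    A chain is a set of simplices, each a sorted tuple of vertices.  Over
    Z2 the boundary is the symmetric difference of all faces obtained by
    dropping one vertex, i.e. Eq. (2) with the signs reduced mod 2.
    """
    out = set()
    for simplex in chain:
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]
            out ^= {face}          # Z2 addition: appearing twice cancels
    return out

# The fundamental property for the 2-simplex [0, 1, 2]:
triangle = {(0, 1, 2)}
edges = boundary(triangle)         # its three boundary edges
assert boundary(edges) == set()    # each vertex appears twice and cancels
```

Each vertex of the triangle lies on exactly two edges, which is precisely why the double boundary vanishes over $\mathbb{Z}_2$.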
The persistence of Betti numbers over the filtration interval can be recorded in many different ways. The most commonly used are persistence barcodes and persistence diagrams. An example of barcodes is provided in Figure 2.

Figure 2: (a) An example of 5 points in $\mathbb{R}^2$ and (b) the corresponding persistence barcodes. The length of each barcode corresponds to the persistence of each topological object ($\beta_0$, $\beta_1$, $\beta_2$, etc.) over the Vietoris-Rips (VR) complex filtration.

In this work, we use Bottleneck and Wasserstein distances to extract atom-specific topological information and facilitate atom-specific persistent homology. Let $X$ and $Y$ be multisets of data points. The Bottleneck and Wasserstein distances between $X$ and $Y$ are given by [15]
$$d_B(X, Y) = \inf_{\gamma \in B(X,Y)} \sup_{x \in X} \|x - \gamma(x)\|_\infty, \qquad (10)$$
and [14]
$$d_W^p(X, Y) = \left(\inf_{\gamma \in B(X,Y)} \sum_{x \in X} \|x - \gamma(x)\|_\infty^p\right)^{1/p}, \qquad (11)$$
respectively. Here $B(X, Y)$ is the collection of all bijections from $X$ to $Y$. Note that in our work, topological invariants of different dimensions are compared separately.

Given a metric space $M$ and a cutoff distance $d$, a simplex is formed if all of its points have pairwise distances no greater than $d$. All such simplices form the Vietoris-Rips (VR) complex. The abstract nature of the VR complex allows the construction of simplicial complexes from a correlation function, which models the pairwise interaction of atoms using a radial basis function rather than more standard distance metrics. The R library TDA is used to generate persistence barcodes [36].

Element-specific persistent homology was introduced to embed chemical and biological information into topological invariants [12,19]. Its essential idea is to construct topological representations from subsets of atoms of various element types in a protein.
For example, if one selects all carbon atoms in a protein, the resulting persistence barcodes will represent the strength and network of hydrophobicity in the protein.
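The Bottleneck distance of Eq. (10) can be made concrete with a tiny brute-force search over bijections of two equal-size diagrams. This is a sketch for intuition: it is exponential in the diagram size, and real implementations also allow matching points to the diagonal, which this version omits.

```python
import math
from itertools import permutations

def bottleneck(X, Y):
    """Brute-force Bottleneck distance between equal-size diagrams.

    Follows Eq. (10) literally: minimise, over all bijections from X
    to Y, the largest L-infinity displacement of any matched point.
    """
    assert len(X) == len(Y)
    best = math.inf
    for perm in permutations(Y):
        worst = max(max(abs(a - b) for a, b in zip(x, y))
                    for x, y in zip(X, perm))
        best = min(best, worst)
    return best
```

Replacing `sup`/`max` by a $p$-th power sum gives the Wasserstein distance of Eq. (11) in the same style.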
Figure 3:
Illustration of atom-specific persistent homology point clouds. Top: the original point cloud; the atom of interest is at the center of the circle. Second row: a pair of conjugated sets of point clouds for atom-specific persistent homology. Remaining rows: four pairs of conjugated point clouds for atom-specific and element-specific persistent homology.
In contrast, atom-specific persistent homology is designed to highlight the topological information of a given atom in a biomolecule. It creates two conjugated subsets of atoms centered around the atom of interest, one with and one without the specific atom. Conjugated simplicial complexes, conjugated homology groups, and conjugated topological invariants are generated for the conjugated sets of point clouds. The difference between the conjugated topological invariants, measured by both Wasserstein and Bottleneck distances, offers a topological representation of the atom of interest. As shown in Figure 3, atom-specific and element-specific conjugated point clouds can be constructed for a given dataset.

In this work, we focus on Cα B factor prediction. We use element-specific persistent homology to enhance the topological representation of each Cα neighborhood. Meanwhile, we develop atom-specific persistent homology to pinpoint the topological representation at each Cα atom. With these selections of subsets, Vietoris-Rips complexes are constructed by contact maps or matrix filtration [1]. To capture element-specific interactions we consider three subsets of carbon-carbon, carbon-nitrogen, and carbon-oxygen point clouds. This gives the following element-specific pairs:
$$P = \{\mathrm{CC}, \mathrm{CN}, \mathrm{CO}\}. \qquad (12)$$
For a given Protein Data Bank (PDB) file, persistence barcodes are calculated as follows. Given a specific $C_\alpha$ of interest, say $r_i^k \in P_k$ in an element-specific set $P_k$ ($P_1 = \mathrm{CC}$, $P_2 = \mathrm{CN}$, $P_3 = \mathrm{CO}$), a point cloud consisting of all atoms within a pre-defined cutoff radius $r_c$ is selected:
$$R_i^k = \{r_j^k \mid \|r_i^k - r_j^k\| < r_c,\ r_i^k, r_j^k \in P_k,\ \forall j \in 1, 2, \ldots, N\}, \qquad (13)$$
where $N$ is the number of atoms in the $k$th element pair $P_k$. A conjugated point cloud, $\hat{R}_i^k$, includes the same set of atoms except for $r_i^k$.
For a given pair of conjugated point clouds $R_i^k$ and $\hat{R}_i^k$, conjugated simplicial complexes, conjugated homology groups, and conjugated persistence barcodes are computed via persistent homology. We compute a Euclidean distance based filtration using the Vietoris-Rips complex. Additionally, for a given set of atoms selected according to the atom-specific and element-specific constructions, we generate a family of multiresolution persistence barcodes by a resolution-controlled filtration matrix [1],
$$M_{nm}(\vartheta) = 1 - \Phi(\|\mathbf{r}_n - \mathbf{r}_m\|; \vartheta), \qquad (14)$$
where $\vartheta$ denotes a set of kernel parameters. We have used both the exponential kernel
$$\Phi(\|\mathbf{r}_n - \mathbf{r}_m\|; \eta, \kappa) = e^{-(\|\mathbf{r}_n - \mathbf{r}_m\|/\eta)^{\kappa}}, \quad \kappa > 0, \qquad (15)$$
and the Lorentz kernel
$$\Phi(\|\mathbf{r}_n - \mathbf{r}_m\|; \eta, \nu) = \frac{1}{1 + (\|\mathbf{r}_n - \mathbf{r}_m\|/\eta)^{\nu}}, \quad \nu > 0, \qquad (16)$$
where $\eta$, $\kappa$, and $\nu$ are pre-defined constants. This filtration matrix is used in association with the Vietoris-Rips complex to generate persistence barcodes or persistence diagrams. These topological invariants are then compared using both Bottleneck and Wasserstein distances. An example of the conjugated persistence barcode pair generated for a $C_\alpha$ atom is illustrated in Figure 4.

Figure 4: Illustration of residue 338 Cα atom-specific persistent homology in the CC element-specific point cloud of protein PDB ID 1AIE. For this example residues 332-339 are used and are shown on the left. The Cα location used to generate the barcodes (right) is highlighted in red in the left chart. Conjugated persistence barcodes are generated with and without the selected Cα.

Topological features are used for the prediction of protein B factors using both least squares fitting and machine learning, as described in the following subsections.
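Eq. (14), with the kernels of Eqs. (15) and (16), can be sketched directly. The function name and default parameter values below are placeholders; the parametrizations actually used for feature generation are listed in Table 4.

```python
import math

def filtration_matrix(points, kernel="exp", eta=10.0, kappa=1.0, nu=3.0):
    """Kernel-based filtration matrix M_nm = 1 - Phi(||r_n - r_m||) (sketch).

    Implements Eq. (14) with either the exponential kernel
    Phi = exp(-(d/eta)^kappa) of Eq. (15) or the Lorentz kernel
    Phi = 1 / (1 + (d/eta)^nu) of Eq. (16).
    """
    n = len(points)
    M = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            d = math.dist(points[a], points[b])
            if kernel == "exp":
                phi = math.exp(-((d / eta) ** kappa))
            else:  # Lorentz kernel
                phi = 1.0 / (1.0 + (d / eta) ** nu)
            M[a][b] = 1.0 - phi
    return M
```

Because $\Phi$ decays with distance, nearby atoms get small matrix entries and connect early in the matrix filtration, while distant atoms connect late, which is what makes the barcodes resolution-controlled.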
Gradient boosting is an ensemble method that uses a number of "weak learners" to construct a prediction model in an iterative manner. The method is optimized via gradient descent, which minimizes the residuals of a loss function. At each step of the gradient boosting, gradient boosting trees (GBTs) incorporate decision trees to improve their predictive power. Ensemble methods like GBTs are useful because they can handle a diverse feature set, have strong predictive power, and are typically robust to outliers and against overfitting.

In this work, we optimize the GBT hyper-parameters using the standard practice of a grid search. The parameters used for testing are provided in Table 2. Any hyper-parameters not listed in the table were taken to be the default values provided by the python scikit-learn package.
Table 2:
Boosted gradient tree hyper-parameters used for testing. Parameters were determined using a grid search. Any hyper-parameters that are not listed were taken to be the default values provided by the python scikit-learn package.
Parameter          Setting
Loss Function      Quantile
Alpha              0.975
Estimators         500
Learning Rate      0.25
Max Depth          4
Min Samples Leaf   9
Min Samples Split  9
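A minimal sketch of wiring up the Table 2 settings with scikit-learn's `GradientBoostingRegressor`; the synthetic `X` and `y` below are stand-ins for the topological feature matrix and experimental B factors, not the paper's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # stand-in topological features
y = X[:, 0] * 2.0 + rng.normal(size=200)    # stand-in B factors

# Hyper-parameters taken from Table 2 (everything else left at defaults).
gbt = GradientBoostingRegressor(
    loss="quantile", alpha=0.975,
    n_estimators=500, learning_rate=0.25,
    max_depth=4, min_samples_leaf=9, min_samples_split=9,
)
gbt.fit(X, y)
pred = gbt.predict(X)
```

Note that the quantile loss with alpha = 0.975 fits a high conditional quantile rather than the conditional mean, which is an unusual but deliberate choice reported in Table 2.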
Neural networks are modeled after the function of neurons in the brain. A neural network applies activation functions, in units called perceptrons, to its inputs. The weights of the network are trained to minimize a loss function over many epochs, or passes over an entire training dataset. When a neural network has several layers of perceptrons we call it a deep neural network (DNN), and the intermediate layers are known as hidden layers.

Convolutional neural networks (CNNs) have recently had great success in image classification. Using convolutions with a pre-defined filter size and number of filters, CNNs can automatically extract high-level features from input images. CNNs are advantageous because they can perform as well as other models without training as many parameters as a densely connected deep neural network. In this work we generate an image-like heat map using a range of kernel parameters for atom-specific and element-specific persistent homology. The CNN output is then flattened and fed as input to a DNN along with global and local protein features. This allows us to use the same feature set as the gradient boosting method as well as the generated PH image data. A diagram of the CNN architecture is provided in Figure 5.
Figure 5:
The deep learning architecture using a convolutional neural network combined with a deep neural network. The plus symbol represents the concatenation of features.
For each Cα of the training set, the CNN is passed a three-channel persistent homology image of dimension (8,10,3). The model takes the input image data and applies two convolutional layers with 2x2 filters, followed by a dropout of 0.5. The image data is passed through a dense layer, flattened, then joined with the other global and local features to form a dense layer of 218 neurons. This is followed by a dropout layer of 0.5, another dense layer of 100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons, and a final dense output layer. Figure 5 provides an illustration of the deep CNN used in this work.

The deep convolutional neural network has several hyper-parameters that can be tuned. As with the GBT, the deep convolutional neural network hyper-parameters are optimized using a basic grid search. Table 3 provides the parameters used for testing. Any hyper-parameters that are not listed below were taken to be the default values provided by the python Keras package.

Table 3:
Convolutional neural network (CNN) parameters used for testing. Parameters were determined using a grid search. Any hyper-parameters not listed below were taken to be the default values provided by the python Keras package.
Parameter      Setting
Learning Rate  0.001
Epoch          1000
Batch Size     1000
Loss           Mean Squared Error
Optimizer      Adam
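The basic operation of the convolutional layers, a small filter slid across the (8, 10) persistent-homology image, can be illustrated with a hand-rolled "valid" cross-correlation. This is only a sketch of the mechanism; the actual model uses trainable 2x2 filters in Keras.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of a single channel (sketch).

    At each position, multiply the kernel element-wise with the image
    patch beneath it and sum, producing one output value per position.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

ph_image = np.arange(80, dtype=float).reshape(8, 10)  # one channel stand-in
feat = conv2d_valid(ph_image, np.ones((2, 2)))        # one fixed 2x2 filter
```

With a 2x2 filter and no padding, an (8, 10) channel shrinks to a (7, 9) feature map, which is why a CNN trains far fewer parameters than a dense layer over the same image.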
In this work, we combine the predictions of two machine learning models to construct a simple consensus model. The consensus prediction used in this work is the average of the Cα B factor values predicted by the GBT and deep CNN models.
A variety of element-specific and atom-specific persistence barcodes were generated using the techniques discussed in Sec. 2.1.7. In this work, we include 60 topological features. These features are generated in several ways by varying: kernels (Lorentz and exponential), element-specific pairs (CC, CN, CO), and distance metrics (Wasserstein-0 and Wasserstein-1, Bottleneck-0 and Bottleneck-1). For this work all persistent homology features were generated with a cutoff of 11 Å.
The distances obtained from Wasserstein and Bottleneck evaluations of persistence diagrams depend on the boundary of the diagrams. Specifically, when two persistence diagrams are compared, the extra events on one diagram that do not match any events on the other diagram may contribute to the final distance by their distances from the boundary. For this reason, we create two additional persistence diagrams in which the y-axis is rotated clockwise by 30° or 60°, respectively; see Figure 6. This modification changes the Bottleneck and Wasserstein distances and allows the model to recognize elements that have a short persistence (i.e., a short lifespan). Lastly, we modified the persistence diagram by reflecting it around the diagonal axis. An example of this modification is illustrated in Figure 6. Table 4 provides the list of kernels, kernel parameters, y-axis changes, distance metrics, and element-specific pairs used to generate features for the machine learning models.

Figure 6:
Illustration of modified persistence diagrams used in distance calculations. Left: unchanged. Middle: rotated 30°. Right: rotated 60°. Black dots are Betti-0 events and triangles are Betti-1 events.

3 Results
Table 4:
Parameters used for topological feature generation. All features used a cutoff of 11 Å. Both Lorentz (Lor) and exponential (Exp) kernels and Bottleneck (B) and Wasserstein (W) distance metrics were used.
No. features  Kernel  Kernel parameter  Diagram              Distance metric  Element-specific pair
12            Lor     η = 21, ν = 5     Unchanged            B, W             CC, CN, CO
12            Exp     η = 10, κ = 1     Unchanged            B, W             CC, CN, CO
12            Exp     η = 2, κ = 1      Diagonal reflection  B, W             CC, CN, CO
12            Exp     η = 2, κ = 1      Rotated 30°          B, W             CC, CN, CO
12            Exp     η = 2, κ = 1      Rotated 60°          B, W             CC, CN, CO

Other features include global features from PDB files, i.e., the R-value, protein resolution, and number of heavy atoms. Additional local features include packing density, amino acid type, occupancy, and secondary structure information generated by the STRIDE software [37].
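The diagram modifications listed in Table 4 amount to coordinate transforms of the (birth, death) points. The clockwise rotation below is one plausible reading of the y-axis rotation described above, not necessarily the authors' exact convention, so treat it as an assumption.

```python
import math

def rotate_diagram(diagram, degrees):
    """Rotate persistence-diagram points clockwise by `degrees` (sketch).

    Each (birth, death) pair is treated as a point in the plane and
    rotated about the origin.
    """
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * b + s * d, -s * b + c * d) for b, d in diagram]

def reflect_diagram(diagram):
    """Reflect points across the diagonal birth = death."""
    return [(d, b) for b, d in diagram]
```

Because Bottleneck and Wasserstein distances are computed on point coordinates, either transform changes how short-lived events near the diagonal contribute to the distance.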
Using the process described in Section 2.1.7, we generate 2D image-like persistent homology features, $F_i^k = \{f_i^k(\eta, \kappa)\}$, for each $C_\alpha$ of the proteins in the dataset by varying the values of $\eta$ and $\kappa$ in the kernel function. A cutoff of 11 Å with an exponential kernel and different values of $\eta$ and $\kappa$ are used to capture a wide variety of scales. In particular, we use 8 values of $\eta$ and 10 values of $\kappa$. The image-like matrix is given by $F_i^k$ in Eq. (17), where each entry represents the PH feature of the $i$th $C_\alpha$ atom and the $k$th atom interaction (C, N, or O):
$$F_i^k = \begin{pmatrix} f_i^k(\eta_1, \kappa_1) & f_i^k(\eta_1, \kappa_2) & \cdots & f_i^k(\eta_1, \kappa_{10}) \\ f_i^k(\eta_2, \kappa_1) & f_i^k(\eta_2, \kappa_2) & \cdots & f_i^k(\eta_2, \kappa_{10}) \\ \vdots & \vdots & \ddots & \vdots \\ f_i^k(\eta_8, \kappa_1) & f_i^k(\eta_8, \kappa_2) & \cdots & f_i^k(\eta_8, \kappa_{10}) \end{pmatrix} \qquad (17)$$
This results in 2D PH images of dimension (8,10). Images are created for element-specific $C_\alpha$ interactions with carbon, nitrogen, and oxygen atoms, giving each image three channels and a final image dimension of (8,10,3) for each $C_\alpha$ atom.

In this work, we use two data sets, one from Refs. [31,32] and the other from Park, Jernigan, and Wu [38]. The first contains 364 proteins [31,32] and the second contains three subsets of small, medium, and large proteins [38]. All sequences have a resolution of 3 Å or better and an average resolution of 1.3 Å, and the sets include proteins that range from 4 to 3912 residues [38]. For all testing, we exclude protein 1AGN due to known problems with this protein's data [32]. Proteins 1NKO, 2OCT, and 3FVA are also excluded because these proteins have residues with B factors reported as zero, which is unphysical.
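Assembling the (8, 10, 3) image of Eq. (17) from per-parametrization features can be sketched as below. `feature_fn` is a hypothetical callable standing in for the Wasserstein/Bottleneck feature computation of one Cα for one element pair and one kernel parametrization; it is not part of the paper's code.

```python
import numpy as np

def ph_image(feature_fn, etas, kappas, pairs=("CC", "CN", "CO")):
    """Assemble the image-like multiscale feature F_i^k (sketch).

    Stacking one scalar feature per (eta, kappa) pair over the three
    element-specific pairs yields the (8, 10, 3) image fed to the CNN.
    """
    img = np.empty((len(etas), len(kappas), len(pairs)))
    for c, pair in enumerate(pairs):
        for m, eta in enumerate(etas):
            for n, kappa in enumerate(kappas):
                img[m, n, c] = feature_fn(pair, eta, kappa)
    return img
```

The two kernel parameters play the role of the two image axes, so each channel of the "image" is really a grid of multiscale topological signatures rather than pixel intensities.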
For the machine learning results, proteins 1OB4, 1OB7, 2OLX, and 3MD5 are excluded because the STRIDE software is unable to provide secondary structure features for these proteins. The image-like features used in all convolutional neural networks were standardized to mean 0 and variance 1.

3.2 Evaluation metric
We use the proposed methods to predict the B factors of all Cα atoms present in a protein. Linear least squares fitting was done using only topological features. The machine learning models were executed using a leave-one-(protein)-out method to blindly predict the B factors of all Cα atoms in each protein. The machine learning models were trained using the data and features described in Sections 2.1.7, 2.2, and 2.3. For comparison, we include previously existing Cα B factor prediction fitting methods.

To quantitatively assess our method for B factor prediction we use the Pearson correlation coefficient, given by

\begin{equation}
\mathrm{PCC} = \frac{\sum_{i=1}^{N} (B_i^e - \bar{B}^e)(B_i^t - \bar{B}^t)}
{\left[\sum_{i=1}^{N} (B_i^e - \bar{B}^e)^2 \sum_{i=1}^{N} (B_i^t - \bar{B}^t)^2\right]^{1/2}},
\tag{18}
\end{equation}

where $B_i^t$ and $B_i^e$, $i = 1, 2, \ldots, N$, are the i-th predicted (theoretical) and experimental B factors, respectively, the latter taken from the PDB file, and $\bar{B}^t$ and $\bar{B}^e$ are the corresponding averaged B factors.
Table 5: Parameters used for the persistent homology element-specific features with a cutoff of 11 Å.

Kernel               ν   η    κ
Lorentz (n = 1)      5   21   -
Exponential (n = 2)  -   10   1

In this work, the optimal cutoff of r_c = 11 Å was found via a grid search over various cutoff distances. Figure 7 displays the average Pearson correlation coefficient, obtained via fitting, over the entire dataset of 364 proteins using all persistent homology metrics with various point cloud distance cutoffs.

Figure 7:
Average Pearson correlation coefficient over the entire protein dataset when fitting all 24 persistent homology features, for various cutoff distances.
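The cutoff grid search described above can be sketched as follows. Here `average_pcc` is a hypothetical stand-in for the full procedure of fitting all PH features at a given cutoff and averaging the Pearson correlation over the dataset; it is a toy curve, not the paper's measured values.

```python
import numpy as np

def average_pcc(cutoff):
    """Hypothetical stand-in: fit all PH features at this cutoff and average
    the Pearson correlation over the dataset (toy curve peaking at 11 A)."""
    return 1.0 - abs(cutoff - 11.0) / 20.0

# Grid search over candidate point-cloud distance cutoffs (in Angstroms).
cutoffs = np.arange(4.0, 20.5, 0.5)
best_cutoff = max(cutoffs, key=average_pcc)
```

With the real scoring routine in place of the toy curve, the same loop selects the cutoff reported in the text.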
For each protein we use the parameters listed in Table 5, determined following the standard practice of a grid search.

3.4 Least squares fitting within proteins
Table 6:
Average Pearson correlation coefficients of least squares fitting Cα B factor prediction for the small, medium, large, and superset protein sets using an 11 Å cutoff. Both the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included. Results for pfFRI are taken from Opron et al. [31]. GNM and NMA values are taken from the coarse-grained Cα results reported in Park et al. [38].

            B & W               B                   W
            Exp   Lor   Both    Exp   Lor   Both    Exp   Lor   Both    pfFRI   GNM    NMA
Small       0.87  0.84  0.94    0.74  0.72  0.85    0.74  0.73  0.86    0.59    0.54   0.48
Medium      0.68  0.68  0.78    0.62  0.61  0.69    0.60  0.63  0.69    0.61    0.55   0.48
Large       0.61  0.60  0.70    0.54  0.54  0.61    0.51  0.55  0.62    0.59    0.53   0.49
Superset    0.65  0.64  0.73    0.58  0.58  0.65    0.55  0.59  0.66    0.63    0.57   NA
The Pearson correlation coefficients using least squares fitting for Cα B factor prediction of the small, medium, and large protein subsets are provided in Tables 12, 13, and 14, respectively. Results for all proteins in the dataset are provided in Table 15, and the average Pearson correlation coefficients for the small, medium, large, and superset data sets are provided in Table 6. Table 6 includes fitting results using only the Bottleneck metric, only the Wasserstein metric, and both metrics together. We also include results using only an exponential kernel, only a Lorentz kernel, or both kernels for fitting. All results reported here use PH features generated with a cutoff of 11 Å and include three element-specific subsets (carbon-carbon, carbon-nitrogen, carbon-oxygen). Overall, fitting methods using the various persistent homology features performed similarly. The best results came from using features generated by both kernels together with both the Bottleneck and Wasserstein distances, which yielded an average correlation coefficient of 0.73 for the superset.
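A minimal sketch of the within-protein least squares fit and the Pearson correlation of Eq. (18). The feature matrix and B factors below are synthetic stand-ins, not data from the paper's protein sets.

```python
import numpy as np

def pearson_cc(b_exp, b_pred):
    """Pearson correlation coefficient of Eq. (18)."""
    de = np.asarray(b_exp, dtype=float) - np.mean(b_exp)
    dt = np.asarray(b_pred, dtype=float) - np.mean(b_pred)
    return float(np.sum(de * dt) / np.sqrt(np.sum(de**2) * np.sum(dt**2)))

# Toy stand-ins for one protein: 24 PH features per C-alpha atom (as in the
# fitting above) and "experimental" B factors with a small noise term.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 24))                    # 120 residues x 24 features
b_exp = X @ rng.normal(size=24) + rng.normal(scale=0.1, size=120)

# Linear least squares fit with an intercept column, then the fitting PCC.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, b_exp, rcond=None)
pcc = pearson_cc(b_exp, A @ coef)
```

Because the fit is performed within a single protein, this measures how well the topological features can describe known B factors, not blind predictive power.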
The aforementioned least squares fitting methods cannot predict the B factors of unknown proteins. Machine learning methods enable us to blindly predict B factors across proteins. In this section, we utilize both gradient boosted tree and convolutional neural network algorithms for the blind prediction of B factors across different proteins. Taken together, the entire dataset contains more than 620,000 atoms. We use leave-one-protein-out cross validation in our prediction: for each protein, the data from the protein whose B factors will be predicted is excluded from the training data. This gives rise to a training set of roughly 600,000 data points (i.e., atoms and associated B factors) for each protein. The Pearson correlation coefficients using the gradient boosted tree (GBT), convolutional neural network (CNN), and consensus method (CON) for Cα B factor prediction of the small, medium, and large protein subsets are provided in Tables 8, 9, and 10, respectively. Parameters for the GBT and CNN methods can be found in Tables 2 and 3, and the global and local features used for training and testing are described in Section 2.3. Results for all proteins are provided in Table 11, and the average Pearson correlation coefficients for the small, medium, large, and superset data sets are provided in Table 7. All results reported here use a cutoff of 11 Å and include three element-specific subsets (carbon-carbon, carbon-nitrogen, carbon-oxygen). Kernel parameters for both exponential and Lorentz kernels are provided in Table 5. Results from previously existing Cα B factor prediction methods are included for comparison in Table 7. Overall, the GBT and CNN algorithms perform similarly, with the CNN slightly outperforming the GBT at average correlation coefficients over the superset of 0.60 and 0.59, respectively. The consensus method improves upon both results with an average Pearson correlation coefficient of 0.61 over the superset.
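The leave-one-protein-out protocol with a consensus of two learners can be sketched as follows. The data here are synthetic, and simple ridge regressors stand in for the GBT and CNN; only the cross-validation and averaging structure mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: three "proteins", each a block of atoms with 5 features per atom
# (a stand-in for the PH + local/global features above) and B factors.
w_true = rng.normal(size=5)
data = {}
for name, n_atoms in [("1ABA", 40), ("1CYO", 50), ("2FQ3", 30)]:
    X = rng.normal(size=(n_atoms, 5))
    data[name] = (X, X @ w_true + rng.normal(scale=0.1, size=n_atoms))

def fit_ridge(X, y, lam):
    """Ridge regression, used here as a stand-in for the GBT and CNN."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

pcc = {}
for held_out in data:
    # Leave-one-protein-out: train on every protein except the one predicted.
    X_tr = np.vstack([data[p][0] for p in data if p != held_out])
    y_tr = np.concatenate([data[p][1] for p in data if p != held_out])
    X_te, y_te = data[held_out]
    # Two different models, then a consensus by averaging their predictions.
    pred_a = X_te @ fit_ridge(X_tr, y_tr, lam=1e-3)
    pred_b = X_te @ fit_ridge(X_tr, y_tr, lam=1.0)
    consensus = 0.5 * (pred_a + pred_b)
    pcc[held_out] = float(np.corrcoef(y_te, consensus)[0, 1])
```

Averaging predictions from dissimilar models tends to cancel uncorrelated errors, which is why the consensus can edge out both of its constituents, as observed above.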
Table 7 shows that the blind-prediction machine learning models perform better than the GNM and NMA fitting models and similarly to the pfFRI fitting model.
Table 7:
Average Pearson correlation coefficients of Cα B factor predictions for the small-, medium-, and large-sized protein sets along with the entire superset of the 364-protein dataset. Gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) results are obtained by leave-one-protein-out (blind) prediction. The results of the parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM), and normal mode analysis (NMA) were obtained via least squares fitting of individual proteins.
            CNN    GBT    CON    pfFRI   GNM    NMA
Small       0.63   0.58   0.62   0.59    0.54   0.48
Medium      0.60   0.58   0.61   0.61    0.55   0.48
Large       0.58   0.59   0.58   0.59    0.53   0.49
Superset    0.60   0.59   0.61   0.63    0.57   NA
4 Conclusion

An essential component of the paradigm of protein dynamics is the correlation between protein flexibility and protein function. The sheer complexity and large number of degrees of freedom make a quantitative understanding of flexibility and function an inherently difficult problem. Several time-independent methods for predicting protein B factors exist, including NMA [23, 39, 24, 22], ENM [25], GNM [27, 28, 40], and FRI methods [30-32, 41]. None of these methods is able to blindly predict the B factors of an unknown protein. We hypothesize that the intrinsic physics of proteins lies in a low-dimensional space embedded in a high-dimensional data space. Based on this hypothesis, the authors previously introduced the graph theory based multiscale weighted colored graph (MWCG) [33, 34] and showed that MWCGs are able to successfully blindly predict cross-protein B factors.

In this work we explore this hypothesis further by creating a B factor predictor built on tools from algebraic topology. To construct localized topological representations of individual atoms from global topological tools, we propose atom-specific topology and atom-specific persistent homology. This approach creates two conjugated sets of atoms: the first set is centered around a given atom of interest, while the other set is identical except that it excludes the atom of interest. Element-specific selections are further implemented to embed biological information into atom-specific persistent homology. The distance between the topological invariants generated from these conjugated sets of atoms is used to represent the atom of interest. Both Bottleneck and Wasserstein metrics are utilized to estimate the topological distances between conjugated barcodes, and the Vietoris-Rips complex is employed for topological barcode generation. To test the proposed method we use over 300 proteins, comprising more than 600,000 B factors.
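The conjugated-set construction can be illustrated in a self-contained way for dimension-zero topology, where the Vietoris-Rips barcode is computable from a minimum spanning tree (all components are born at filtration value 0 and die at the MST edge lengths). The sorted-bar comparison below is a simple proxy for the Bottleneck metric on these barcodes, not an exact implementation, and the function names are ours.

```python
import numpy as np

def h0_deaths(points):
    """Death times of the finite 0-dimensional Vietoris-Rips bars: the edge
    lengths of a minimum spanning tree (Prim's algorithm)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    in_tree, deaths = {0}, []
    while len(in_tree) < n:
        d, j = min((dist[i, j], j) for i in in_tree
                   for j in range(n) if j not in in_tree)
        deaths.append(d)
        in_tree.add(j)
    return np.sort(np.array(deaths))

def conjugate_distance(points, atom_index):
    """Fingerprint of one atom: compare the H0 barcode of the full set with
    that of its conjugate (the set minus the atom). Bars are paired largest
    to largest; leftover bars are charged half their length, a crude proxy
    for the Bottleneck distance between the conjugated barcodes."""
    full = np.sort(h0_deaths(points))[::-1]
    conj = np.sort(h0_deaths(np.delete(points, atom_index, axis=0)))[::-1]
    matched = np.abs(full[:len(conj)] - conj)
    unmatched = full[len(conj):] / 2.0  # bars paired with the diagonal
    return float(max(matched.max(initial=0.0), unmatched.max(initial=0.0)))
```

An atom whose removal barely changes the barcode (a tightly packed, rigid environment) gets a small distance, while a structurally important atom gets a large one, which is the intuition behind using these distances as flexibility features.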
Atom-specific persistent homology features are generated using several element-specific interactions, kernel choices, parametrizations, and barcode distance metrics. First, we employ topological features to fit protein B factors using linear least squares. Using topological features, our fitting model outperformed previous fitting models with an average Pearson correlation coefficient of 0.73 over the superset of proteins. Next, we used the topological features to blindly predict the B factors of Cα atoms. We generated two machine learning models, a gradient boosted tree (GBT) and a deep convolutional neural network (CNN), and additionally averaged the Cα predictions of the two models to obtain a more robust consensus model. A variety of local and global features were included alongside the generated topological features. Our blind-prediction consensus model outperformed both the GNM and NMA fitting models and produced results similar to those of the pfFRI fitting model.

To the authors' knowledge, this work is the first time persistent homology has been used to predict the B factors of atoms in a protein. This approach is novel because topology is a global property and on its own cannot describe local atomic information. Our approach creates local topological representations with a variety of customizable parameters using a global mathematical tool, allowing the model to account for multiple spatial interaction scales and element-specific interactions. Our results demonstrate that this is an accurate and robust approach. Moreover, the results could be improved further by including a larger dataset, fine-tuning parameters, and exploring different machine learning approaches.

This method can be applied to a variety of interesting applications related to protein dynamics. Examples include allosteric site detection, computer-aided drug design, hinge detection, hot spot identification, and protein folding stability changes upon mutation.
More generally, this method may be amenable to problems beyond proteins, such as network dynamics and social network centrality measures.
Acknowledgment
This work was supported in part by NSF Grants DMS-1721024 and DMS-1761320, and NIH Grant GM126189.
Table 8:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) for the small-sized protein set.

PDB ID   N    GBT    CNN    CON
1AIE     31   0.75   0.7    0.78
1AKG     16   0.27   0.32   0.29
1BX7     51   0.74   0.74   0.76
1ETL     12   0.37   0.82   0.55
1ETM     12   0.37   0.63   0.43
1ETN     12   0.07   0.48   0.13
1FF4     65   0.61   0.66   0.64
1GK7     39   0.77   0.9    0.82
1GVD     56   0.71   0.55   0.69
1HJE     13   0.84   0.75   0.9
1KYC     15   0.62   0.69   0.66
1NOT     13   0.69   0.96   0.8
1O06     22   0.94   0.93   0.95
1P9I     29   0.73   0.73   0.74
1PEF     18   0.79   0.82   0.82
1PEN     16   0.36   0.74   0.44
1Q9B     44   0.59   0.85   0.67
1RJU     36   0.6    0.46   0.58
1U06     55   0.44   0.4    0.45
1UOY     64   0.72   0.7    0.76
1USE     47   0.05   0.32   0.12
1VRZ     13   0.54   0.34   0.54
1XY2     8    0.79   0.82   0.81
1YJO     6    0.7    -0.06  0.57
1YZM     46   0.69   0.64   0.7
2DSX     52   0.34   0.34   0.36
2JKU     38   0.57   0.71   0.66
2NLS     36   0.23   0.47   0.29
2OL9     6    0.94   0.85   0.94
6RXN     45   0.59   0.6    0.61
APPENDIX
Table 9:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) for the medium-sized protein set.

PDB ID   N     GBT    CNN    CON
1ABA     87    0.73   0.71   0.74
1CYO     88    0.64   0.7    0.68
1FK5     93    0.59   0.6    0.61
1GXU     89    0.67   0.68   0.69
1I71     83    0.53   0.58   0.56
1LR7     73    0.62   0.61   0.64
1N7E     95    0.63   0.58   0.65
1NNX     93    0.78   0.79   0.8
1NOA     113   0.55   0.53   0.56
1OPD     85    0.42   0.34   0.41
1QAU     112   0.51   0.59   0.57
1R7J     90    0.71   0.77   0.75
1UHA     82    0.71   0.74   0.73
1ULR     87    0.54   0.53   0.56
1USM     77    0.73   0.72   0.75
1V05     96    0.6    0.64   0.63
1W2L     97    0.43   0.5    0.47
1X3O     80    0.41   0.43   0.44
1Z21     96    0.68   0.65   0.69
1ZVA     75    0.7    0.7    0.71
2BF9     35    0.48   0.79   0.58
2BRF     103   0.72   0.77   0.75
2CE0     109   0.6    0.66   0.64
2E3H     81    0.65   0.68   0.67
2EAQ     89    0.57   0.63   0.61
2EHS     75    0.62   0.67   0.65
2FQ3     85    0.77   0.82   0.81
2IP6     87    0.6    0.66   0.63
2MCM     112   0.71   0.77   0.75
2NUH     104   0.72   0.56   0.7
2PKT     93    0.01   -0.04  -0.01
2PLT     98    0.52   0.53   0.54
2QJL     107   0.54   0.57   0.56
2RB8     93    0.67   0.7    0.7
3BZQ     99    0.45   0.53   0.49
5CYT     103   0.39   0.34   0.39
APPENDIX
Table 10:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) for the large-sized protein set.

PDB ID   N     GBT    CNN    CON
1AHO     66    0.66   0.66   0.7
1ATG     231   0.55   0.51   0.55
1BYI     238   0.61   0.5    0.6
1CCR     109   0.55   0.6    0.59
1E5K     188   0.74   0.72   0.74
1EW4     106   0.59   0.6    0.61
1IFR     113   0.7    0.64   0.7
1NLS     238   0.55   0.57   0.57
1O08     221   0.49   0.47   0.49
1PMY     123   0.59   0.7    0.65
1PZ4     113   0.72   0.8    0.77
1QTO     122   0.53   0.48   0.54
1RRO     108   0.4    0.45   0.43
1UKU     102   0.75   0.76   0.77
1V70     105   0.63   0.62   0.64
1WBE     206   0.6    0.56   0.6
1WHI     122   0.59   0.56   0.6
1WPA     107   0.65   0.65   0.67
2AGK     233   0.67   0.63   0.67
2C71     225   0.57   0.6    0.6
2CG7     110   0.3    0.32   0.32
2CWS     235   0.61   0.47   0.6
2HQK     232   0.77   0.77   0.78
2HYK     237   0.65   0.63   0.65
2I24     113   0.44   0.46   0.46
2IMF     203   0.53   0.58   0.56
2PPN     122   0.64   0.54   0.63
2R16     185   0.44   0.49   0.46
2V9V     149   0.53   0.52   0.54
2VIM     114   0.44   0.47   0.47
2VPA     217   0.66   0.75   0.71
2VYO     207   0.6    0.63   0.63
3SEB     238   0.63   0.6    0.63
3VUB     101   0.59   0.55   0.59
Table 11:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus method (CON) for the superset.

PDB ID N GBT CNN CON   PDB ID N GBT CNN CON
Table 15:
Pearson correlation coefficients of least squares fitting Cα B factor prediction of all proteins using an 11 Å cutoff. Both the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
               B & W             B                 W
PDB ID   N     Exp  Lor  Both    Exp  Lor  Both    Exp  Lor  Both
1ABA     87    0.67 0.67 0.76    0.54 0.62 0.68    0.56 0.63 0.70
1AHO     66    0.75 0.78 0.88    0.72 0.73 0.79    0.53 0.65 0.75
1AIE     31    0.97 0.88 0.99    0.78 0.64 0.90    0.90 0.77 0.96
1AKG     16    0.82 0.66 1.00    0.60 0.53 0.72    0.53 0.56 0.87
1ATG     231   0.50 0.50 0.61    0.45 0.47 0.53    0.38 0.48 0.51
1BGF     124   0.75 0.70 0.82    0.64 0.54 0.75    0.68 0.61 0.75
1BX7     51    0.86 0.74 0.89    0.79 0.68 0.82    0.81 0.69 0.82
1BYI     238   0.50 0.51 0.58    0.41 0.46 0.49    0.44 0.48 0.54
1CCR     109   0.65 0.66 0.71    0.53 0.56 0.65    0.43 0.58 0.63
1CYO     88    0.71 0.69 0.78    0.66 0.58 0.68    0.65 0.59 0.67
1DF4     57    0.93 0.92 0.97    0.92 0.89 0.95    0.88 0.91 0.94
1E5K     188   0.67 0.68 0.74    0.66 0.67 0.68    0.63 0.67 0.69
1ES5     260    0.58 0.57 0.65    0.51 0.55 0.58    0.44 0.56 0.60
1ETL     12     1.00 1.00 1.00    0.68 0.87 1.00    0.95 0.98 1.00
1ETM     12     1.00 1.00 1.00    0.45 0.74 0.86    0.70 0.83 1.00
1ETN     12     1.00 1.00 1.00    0.96 0.92 0.99    0.70 0.92 1.00
1EW4     106    0.58 0.60 0.73    0.52 0.51 0.55    0.55 0.55 0.62
1F8R     1932   0.61 0.63 0.70    0.59 0.62 0.63    0.50 0.62 0.65
1FF4     65     0.77 0.72 0.80    0.70 0.65 0.75    0.68 0.68 0.76
1FK5     93     0.53 0.59 0.71    0.49 0.50 0.58    0.49 0.50 0.55
1GCO     1044   0.63 0.64 0.66    0.59 0.63 0.63    0.53 0.63 0.65
1GK7     39     0.95 0.94 0.98    0.91 0.93 0.95    0.88 0.92 0.94
1GVD     56     0.75 0.68 0.84    0.67 0.63 0.69    0.61 0.62 0.66
1GXU     89     0.75 0.78 0.82    0.72 0.61 0.75    0.69 0.72 0.77
1H6V     2927   0.29 0.31 0.33    0.28 0.29 0.30    0.23 0.29 0.30
1HJE     13     1.00 1.00 1.00    0.72 0.79 1.00    0.67 0.57 1.00
1I71     83     0.44 0.66 0.76    0.41 0.46 0.56    0.38 0.58 0.59
1IDP     441    0.48 0.47 0.55    0.43 0.45 0.47    0.39 0.46 0.48
1IFR     113    0.65 0.59 0.73    0.56 0.54 0.65    0.47 0.53 0.62
1K8U     87     0.72 0.74 0.85    0.67 0.64 0.71    0.65 0.67 0.75
1KMM     1499   0.57 0.54 0.59    0.49 0.53 0.54    0.36 0.53 0.57
1KNG     144    0.52 0.51 0.61    0.43 0.47 0.51    0.43 0.50 0.53
1KR4     107    0.57 0.48 0.60    0.39 0.47 0.53    0.45 0.45 0.54
1KYC     15     0.96 0.99 1.00    0.92 0.93 0.99    0.88 0.88 1.00
1LR7     73     0.61 0.62 0.71    0.57 0.55 0.63    0.46 0.56 0.58
1MF7     194    0.56 0.59 0.67    0.55 0.57 0.59    0.50 0.58 0.59
1N7E     95     0.67 0.71 0.80    0.54 0.68 0.72    0.54 0.63 0.73
1NKD     59     0.73 0.69 0.89    0.56 0.58 0.63    0.55 0.65 0.75
1NLS     238    0.81 0.78 0.86    0.75 0.65 0.83    0.80 0.72 0.82
1NNX     93     0.84 0.84 0.88    0.81 0.79 0.83    0.81 0.81 0.86
1NOA     113    0.63 0.65 0.72    0.60 0.57 0.63    0.53 0.57 0.59
1NOT     13     1.00 1.00 1.00    0.82 0.86 1.00    0.86 0.81 1.00
1O06     22     0.98 0.97 1.00    0.96 0.92 0.97    0.97 0.94 0.98
1O08     221    0.46 0.48 0.56    0.44 0.42 0.50    0.37 0.45 0.48
1OB4     5      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
1OB7     5      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
1OPD     85     0.35 0.29 0.57    0.25 0.21 0.36    0.29 0.19 0.36
1P9I     29     0.89 0.88 0.98    0.87 0.82 0.92    0.87 0.84 0.89
1PEF     18     0.96 0.97 1.00    0.88 0.94 0.96    0.92 0.94 0.96
1PEN     16     0.96 0.90 1.00    0.60 0.67 0.83    0.47 0.73 0.94
1PMY     123    0.71 0.70 0.76    0.62 0.59 0.67    0.68 0.69 0.71
1PZ4     113    0.88 0.82 0.93    0.86 0.74 0.89    0.85 0.76 0.88
1Q9B     44     0.79 0.76 0.94    0.58 0.59 0.69    0.69 0.57 0.71
1QAU     112    0.59 0.61 0.66    0.57 0.55 0.58    0.55 0.57 0.58
1QKI     3912   0.38 0.42 0.45    0.34 0.38 0.41    0.32 0.38 0.40
1QTO     122    0.59 0.59 0.65    0.48 0.46 0.53    0.55 0.52 0.56
1R29     122    0.71 0.56 0.76    0.55 0.35 0.69    0.69 0.43 0.72
1R7J     90     0.88 0.86 0.91    0.83 0.76 0.87    0.81 0.79 0.86
1RJU     36     0.81 0.74 0.91    0.75 0.69 0.81    0.62 0.65 0.72
1RRO     108    0.39 0.35 0.56    0.31 0.23 0.45    0.33 0.19 0.45
1SAU     123    0.76 0.75 0.81    0.70 0.73 0.75    0.68 0.74 0.76
1TGR     111    0.77 0.76 0.83    0.72 0.70 0.74    0.74 0.73 0.75
1TZV     157    0.76 0.78 0.83    0.73 0.71 0.77    0.69 0.70 0.74
1U06     55     0.50 0.52 0.72    0.37 0.36 0.52    0.46 0.39 0.55
1U7I     259    0.71 0.71 0.73    0.62 0.68 0.70    0.53 0.67 0.71
1U9C     220    0.66 0.65 0.74    0.61 0.57 0.64    0.61 0.60 0.67
1UHA     82     0.70 0.75 0.82    0.69 0.68 0.74    0.67 0.69 0.73
1UKU     102    0.80 0.81 0.84    0.78 0.80 0.80    0.74 0.80 0.80
1ULR     87     0.56 0.53 0.68    0.49 0.50 0.59    0.44 0.50 0.61
1UOY     64     0.73 0.72 0.83    0.65 0.66 0.69    0.65 0.69 0.73
1USE     47     0.66 0.75 0.91    0.50 0.52 0.72    0.46 0.53 0.64
1USM     77     0.62 0.61 0.81    0.57 0.53 0.66    0.61 0.58 0.65
1UTG     70     0.57 0.53 0.68    0.51 0.49 0.60    0.49 0.49 0.56
1V05     96     0.67 0.66 0.72    0.60 0.61 0.65    0.52 0.61 0.65
1V70     105    0.64 0.65 0.75    0.56 0.60 0.66    0.51 0.58 0.62
1VRZ     13     1.00 1.00 1.00    0.92 0.92 1.00    0.77 0.85 1.00
1W2L     97     0.72 0.72 0.79    0.60 0.63 0.69    0.56 0.61 0.69
1WBE     206    0.53 0.47 0.63    0.43 0.38 0.55    0.36 0.42 0.48
1WHI     122    0.57 0.55 0.63    0.42 0.44 0.57    0.34 0.43 0.55
1WLY     322    0.62 0.64 0.67    0.59 0.62 0.63    0.54 0.62 0.64
1WPA     107    0.70 0.69 0.79    0.61 0.52 0.71    0.66 0.56 0.70
1X3O     80     0.66 0.66 0.72    0.62 0.60 0.65    0.62 0.64 0.67
1XY1     16     0.97 0.96 1.00    0.73 0.66 0.87    0.81 0.89 0.99
1XY2     8      1.00 1.00 1.00    0.99 0.95 1.00    0.91 0.91 1.00
1Y6X     86     0.56 0.53 0.62    0.50 0.49 0.59    0.50 0.52 0.56
1YJO     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
1YZM     46     0.87 0.90 0.95    0.82 0.72 0.88    0.86 0.84 0.90
1Z21     96     0.70 0.73 0.82    0.61 0.63 0.64    0.64 0.69 0.72
1ZCE     139    0.84 0.83 0.88    0.83 0.77 0.85    0.81 0.78 0.82
1ZVA     75     0.85 0.85 0.94    0.84 0.78 0.92    0.83 0.81 0.86
2A50     469    0.64 0.63 0.70    0.54 0.60 0.67    0.41 0.58 0.67
2AGK     233    0.65 0.65 0.69    0.61 0.64 0.65    0.55 0.63 0.67
2AH1     939    0.45 0.47 0.49    0.42 0.45 0.46    0.33 0.46 0.48
2B0A     191    0.59 0.60 0.69    0.50 0.58 0.62    0.48 0.59 0.63
2BCM     415    0.46 0.41 0.50    0.39 0.39 0.40    0.35 0.39 0.45
2BF9     35     0.94 0.73 0.97    0.70 0.65 0.78    0.89 0.71 0.92
2BRF     103    0.74 0.73 0.76    0.74 0.71 0.74    0.72 0.72 0.75
2C71     225    0.45 0.38 0.56    0.29 0.33 0.42    0.23 0.30 0.48
2CE0     109    0.77 0.79 0.86    0.75 0.73 0.80    0.71 0.77 0.79
2CG7     110    0.32 0.44 0.63    0.29 0.31 0.36    0.30 0.33 0.41
2COV     534    0.66 0.64 0.70    0.63 0.64 0.67    0.57 0.64 0.67
2CWS     235    0.59 0.55 0.66    0.53 0.52 0.54    0.40 0.52 0.55
2D5W     1214   0.52 0.52 0.54    0.49 0.52 0.52    0.41 0.52 0.53
2DKO     253    0.75 0.72 0.79    0.72 0.69 0.75    0.68 0.69 0.72
2DPL     565    0.35 0.36 0.41    0.30 0.32 0.35    0.24 0.33 0.37
2DSX     52     0.54 0.50 0.78    0.37 0.30 0.56    0.41 0.36 0.55
2E10     439    0.60 0.59 0.65    0.51 0.58 0.61    0.43 0.57 0.62
2E3H     81     0.66 0.71 0.82    0.62 0.69 0.76    0.56 0.69 0.78
2EAQ     89     0.81 0.77 0.86    0.78 0.72 0.81    0.77 0.76 0.82
2EHP     246    0.63 0.65 0.71    0.58 0.62 0.65    0.52 0.62 0.64
2EHS     75     0.75 0.73 0.81    0.72 0.71 0.74    0.69 0.71 0.73
2ERW     53     0.62 0.41 0.84    0.33 0.26 0.60    0.31 0.28 0.49
2ETX     390    0.54 0.54 0.57    0.52 0.53 0.56    0.47 0.51 0.54
2FB6     129    0.71 0.66 0.76    0.67 0.63 0.69    0.65 0.63 0.74
2FG1     176    0.55 0.56 0.62    0.54 0.52 0.58    0.52 0.54 0.57
2FN9     560    0.51 0.49 0.62    0.44 0.47 0.55    0.41 0.46 0.55
2FQ3     85     0.78 0.76 0.82    0.75 0.75 0.79    0.68 0.75 0.78
2G69     99     0.59 0.65 0.76    0.42 0.50 0.66    0.47 0.45 0.60
2G7O     68     0.89 0.91 0.95    0.85 0.79 0.88    0.76 0.82 0.87
2G7S     206    0.63 0.60 0.66    0.59 0.58 0.63    0.54 0.59 0.63
2GKG     150    0.77 0.71 0.83    0.74 0.65 0.78    0.76 0.67 0.78
2GOM     121    0.47 0.52 0.64    0.42 0.42 0.45    0.44 0.47 0.53
2GXG     140    0.74 0.72 0.79    0.71 0.68 0.72    0.69 0.68 0.73
2GZQ     203    0.45 0.40 0.60    0.38 0.34 0.48    0.24 0.29 0.31
2HQK     232    0.80 0.79 0.83    0.70 0.74 0.80    0.68 0.76 0.81
2HYK     237    0.59 0.58 0.63    0.51 0.55 0.59    0.43 0.54 0.60
2I24     113    0.47 0.44 0.69    0.40 0.40 0.48    0.45 0.40 0.49
2I49     399    0.54 0.53 0.62    0.43 0.51 0.56    0.41 0.49 0.58
2IBL     108    0.69 0.71 0.75    0.66 0.67 0.70    0.65 0.68 0.71
2IGD     61     0.67 0.72 0.84    0.61 0.64 0.74    0.61 0.66 0.74
2IMF     203    0.61 0.65 0.71    0.59 0.56 0.60    0.59 0.59 0.64
2IP6     87     0.72 0.66 0.82    0.66 0.58 0.73    0.64 0.64 0.78
2IVY     89     0.43 0.53 0.69    0.35 0.45 0.48    0.34 0.42 0.57
2J32     244    0.77 0.72 0.85    0.73 0.68 0.77    0.73 0.68 0.77
2J9W     203    0.59 0.60 0.70    0.55 0.59 0.64    0.51 0.59 0.62
2JKU     38     0.89 0.75 0.95    0.85 0.65 0.88    0.83 0.60 0.88
2JLI     112    0.87 0.81 0.90    0.82 0.70 0.85    0.85 0.78 0.86
2JLJ     121    0.78 0.75 0.80    0.71 0.65 0.74    0.74 0.71 0.76
2MCM     112    0.80 0.80 0.85    0.78 0.77 0.81    0.75 0.77 0.82
2NLS     36     0.75 0.66 0.88    0.61 0.32 0.76    0.49 0.47 0.69
2NR7     193    0.75 0.75 0.79    0.74 0.72 0.76    0.71 0.73 0.77
2NUH     104    0.77 0.74 0.85    0.73 0.63 0.81    0.75 0.66 0.80
2O6X     309    0.74 0.75 0.78    0.70 0.73 0.75    0.65 0.73 0.75
2OA2     140    0.63 0.64 0.70    0.55 0.49 0.60    0.60 0.63 0.67
2OHW     257    0.35 0.39 0.48    0.29 0.32 0.35    0.27 0.34 0.38
2OKT     377    0.43 0.37 0.49    0.31 0.36 0.40    0.22 0.33 0.46
2OL9     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
2OLX     4      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
2PKT     93     0.44 0.39 0.69    0.40 0.35 0.55    0.36 0.36 0.43
2PLT     98     0.66 0.63 0.72    0.57 0.59 0.67    0.52 0.59 0.66
2PMR     83     0.69 0.68 0.80    0.59 0.62 0.68    0.65 0.65 0.69
2POF     428    0.62 0.56 0.66    0.48 0.55 0.60    0.44 0.54 0.63
2PPN     122    0.57 0.61 0.74    0.51 0.59 0.63    0.44 0.57 0.63
2PSF     608    0.43 0.45 0.53    0.41 0.44 0.45    0.37 0.42 0.44
2PTH     193    0.71 0.71 0.77    0.65 0.70 0.73    0.61 0.69 0.72
2Q4N     1208   0.65 0.62 0.68    0.58 0.55 0.59    0.55 0.57 0.61
2Q52     3296   0.65 0.66 0.70    0.62 0.56 0.64    0.63 0.57 0.65
2QJL     107    0.45 0.52 0.63    0.42 0.46 0.50    0.41 0.49 0.51
2R16     185    0.50 0.51 0.66    0.46 0.45 0.51    0.45 0.46 0.52
2R6Q     149    0.71 0.72 0.76    0.66 0.68 0.70    0.62 0.65 0.67
2RB8     93     0.81 0.78 0.84    0.78 0.75 0.80    0.74 0.76 0.81
2RE2     249    0.64 0.65 0.70    0.57 0.59 0.61    0.59 0.60 0.63
2RFR     166    0.73 0.66 0.80    0.68 0.57 0.74    0.72 0.59 0.74
2V9V     149    0.60 0.51 0.66    0.53 0.48 0.56    0.55 0.50 0.62
2VE8     515    0.46 0.48 0.55    0.42 0.41 0.44    0.40 0.43 0.47
2VH7     94     0.59 0.54 0.68    0.52 0.49 0.63    0.42 0.49 0.54
2VIM     114    0.38 0.33 0.52    0.29 0.28 0.41    0.24 0.31 0.40
2VPA     217    0.73 0.75 0.78    0.72 0.71 0.73    0.68 0.73 0.74
2VQ4     106    0.56 0.54 0.64    0.43 0.49 0.56    0.35 0.46 0.58
2VY8     162    0.47 0.46 0.58    0.38 0.42 0.46    0.38 0.42 0.49
2VYO     207    0.68 0.70 0.77    0.64 0.66 0.72    0.59 0.68 0.70
2W1V     551    0.69 0.67 0.77    0.63 0.63 0.70    0.56 0.64 0.68
2W2A     350    0.60 0.59 0.65    0.57 0.56 0.59    0.54 0.57 0.60
2W6A     139    0.59 0.59 0.64    0.51 0.52 0.54    0.52 0.56 0.60
2WJ5     110    0.63 0.55 0.79    0.59 0.52 0.68    0.59 0.53 0.64
2WUJ     103    0.69 0.68 0.79    0.62 0.52 0.65    0.67 0.59 0.71
2WW7     161    0.44 0.48 0.60    0.40 0.42 0.50    0.33 0.43 0.49
2WWE     120    0.71 0.71 0.83    0.62 0.62 0.75    0.61 0.58 0.73
2X1Q     240    0.48 0.44 0.54    0.38 0.39 0.46    0.34 0.37 0.47
2X25     167    0.62 0.61 0.73    0.56 0.57 0.64    0.57 0.57 0.64
2X3M     175    0.61 0.61 0.69    0.60 0.55 0.64    0.57 0.57 0.60
2X5Y     185    0.67 0.63 0.71    0.60 0.59 0.64    0.53 0.58 0.69
2X9Z     266    0.50 0.42 0.54    0.37 0.38 0.42    0.38 0.39 0.51
2XHF     310    0.62 0.62 0.67    0.58 0.56 0.60    0.55 0.62 0.63
2Y0T     111    0.69 0.68 0.83    0.60 0.61 0.68    0.56 0.64 0.70
2Y72     183    0.71 0.71 0.78    0.69 0.69 0.72    0.66 0.70 0.71
2Y7L     323    0.68 0.70 0.72    0.66 0.68 0.69    0.58 0.69 0.69
2Y9F     149    0.75 0.72 0.78    0.65 0.69 0.71    0.58 0.70 0.74
2YLB     418    0.55 0.52 0.63    0.46 0.49 0.52    0.34 0.49 0.59
2YNY     326    0.63 0.67 0.75    0.60 0.62 0.63    0.56 0.63 0.66
2ZCM     348    0.42 0.39 0.49    0.34 0.35 0.40    0.24 0.32 0.43
2ZU1     360    0.61 0.61 0.68    0.53 0.58 0.63    0.45 0.58 0.63
3A0M     146    0.74 0.76 0.84    0.68 0.70 0.72    0.61 0.73 0.78
3A7L     128    0.69 0.61 0.78    0.52 0.45 0.59    0.62 0.54 0.67
3AMC     614    0.54 0.53 0.64    0.47 0.50 0.54    0.37 0.51 0.57
3AUB     124    0.36 0.41 0.53    0.31 0.26 0.41    0.32 0.32 0.37
3B5O     249    0.55 0.58 0.66    0.52 0.56 0.63    0.46 0.55 0.57
3BA1     312    0.67 0.66 0.72    0.64 0.65 0.68    0.60 0.65 0.70
3BED     262    0.61 0.55 0.67    0.53 0.53 0.56    0.44 0.53 0.61
3BQX     136    0.52 0.50 0.54    0.47 0.48 0.51    0.41 0.46 0.51
3BZQ     99     0.57 0.62 0.69    0.50 0.55 0.61    0.47 0.55 0.59
3BZZ     103    0.60 0.63 0.68    0.51 0.58 0.61    0.45 0.50 0.59
3DRF     567    0.32 0.32 0.38    0.27 0.29 0.33    0.22 0.30 0.34
3DWV     359    0.67 0.63 0.69    0.62 0.62 0.66    0.54 0.62 0.65
3E5T     268    0.55 0.52 0.60    0.51 0.51 0.56    0.38 0.50 0.55
3E7R     40     0.81 0.86 0.96    0.78 0.77 0.81    0.73 0.82 0.88
3EUR     150    0.49 0.46 0.53    0.39 0.43 0.47    0.31 0.42 0.47
3F2Z     148    0.76 0.78 0.84    0.75 0.76 0.78    0.69 0.77 0.78
3F7E     261    0.66 0.65 0.71    0.61 0.64 0.65    0.47 0.63 0.69
3FCN     185    0.60 0.65 0.75    0.56 0.59 0.64    0.54 0.59 0.67
3FE7     89     0.69 0.65 0.76    0.58 0.60 0.67    0.54 0.63 0.70
3FKE     250    0.47 0.42 0.52    0.40 0.36 0.49    0.34 0.36 0.45
3FMY     75     0.71 0.69 0.79    0.66 0.64 0.70    0.66 0.66 0.71
3FOD     48     0.48 0.47 0.82    0.42 0.33 0.55    0.38 0.35 0.48
3FSO     238    0.82 0.82 0.85    0.77 0.74 0.77    0.77 0.81 0.82
3FTD     257    0.60 0.57 0.67    0.49 0.52 0.59    0.41 0.52 0.60
3G1S     418    0.44 0.51 0.68    0.41 0.45 0.51    0.38 0.45 0.49
3GBW     170    0.77 0.78 0.84    0.64 0.74 0.79    0.51 0.71 0.81
3GHJ     129    0.71 0.71 0.81    0.65 0.67 0.72    0.65 0.68 0.72
3HFO     216    0.75 0.72 0.82    0.70 0.63 0.75    0.65 0.69 0.74
3HHP     1314   0.61 0.62 0.68    0.57 0.59 0.62    0.52 0.59 0.63
3HNY     170    0.59 0.56 0.64    0.47 0.52 0.57    0.42 0.49 0.56
3HP4     201    0.60 0.61 0.72    0.57 0.54 0.64    0.43 0.56 0.62
3HWU     155    0.60 0.69 0.81    0.57 0.61 0.63    0.50 0.61 0.68
3HYD     8      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
3HZ8     200    0.58 0.59 0.66    0.55 0.53 0.56    0.52 0.54 0.58
3I2V     127    0.57 0.58 0.66    0.51 0.53 0.61    0.40 0.48 0.53
3I2Z     140    0.58 0.59 0.65    0.52 0.54 0.56    0.56 0.57 0.61
3I4O     154    0.63 0.64 0.73    0.58 0.59 0.60    0.56 0.63 0.66
3I7M     145    0.58 0.62 0.71    0.53 0.55 0.58    0.49 0.58 0.64
3IHS     173    0.62 0.67 0.74    0.58 0.54 0.60    0.58 0.60 0.62
3IVV     168    0.80 0.80 0.89    0.75 0.76 0.83    0.68 0.74 0.79
3K6Y     227    0.53 0.53 0.60    0.48 0.49 0.52    0.42 0.50 0.55
3KBE     166    0.62 0.61 0.65    0.57 0.60 0.62    0.52 0.60 0.61
3KGK     190    0.79 0.80 0.84    0.77 0.79 0.81    0.68 0.79 0.80
3KZD     94     0.79 0.72 0.83    0.55 0.68 0.77    0.47 0.66 0.78
3L41     219    0.61 0.62 0.71    0.59 0.60 0.66    0.57 0.59 0.67
3LAA     176    0.70 0.66 0.80    0.68 0.56 0.76    0.69 0.60 0.77
3LAX     118    0.81 0.81 0.86    0.80 0.76 0.83    0.77 0.78 0.82
3LG3     846    0.40 0.38 0.41    0.36 0.37 0.40    0.32 0.37 0.41
3LJI     270    0.53 0.53 0.62    0.47 0.52 0.58    0.45 0.52 0.56
3M3P     244    0.47 0.44 0.69    0.40 0.40 0.58    0.25 0.35 0.48
3M8J     178    0.74 0.72 0.75    0.69 0.69 0.73    0.67 0.70 0.73
3M9J     250    0.57 0.56 0.59    0.53 0.54 0.56    0.39 0.53 0.56
3M9Q     190    0.53 0.52 0.59    0.50 0.51 0.53    0.46 0.50 0.51
3MAB     180    0.57 0.56 0.62    0.52 0.47 0.55    0.56 0.51 0.56
3MD4     13     1.00 1.00 1.00    0.91 0.94 1.00    0.93 0.99 1.00
3MD5     14     1.00 1.00 1.00    0.98 0.93 1.00    0.94 0.92 1.00
3MEA     170    0.58 0.58 0.68    0.57 0.57 0.64    0.48 0.57 0.59
3MGN     277    0.33 0.32 0.47    0.26 0.28 0.30    0.16 0.29 0.39
3MRE     446    0.40 0.38 0.45    0.32 0.36 0.40    0.24 0.35 0.41
3N11     325    0.43 0.45 0.51    0.42 0.44 0.45    0.38 0.44 0.45
3NE0     208    0.77 0.79 0.84    0.75 0.70 0.77    0.70 0.76 0.82
3NGG     97     0.80 0.81 0.85    0.72 0.74 0.78    0.74 0.76 0.80
3NPV     500    0.44 0.44 0.50    0.40 0.42 0.44    0.36 0.43 0.47
3NVG     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
3NZL     70     0.68 0.61 0.84    0.53 0.49 0.66    0.59 0.55 0.67
3O0P     197    0.62 0.64 0.71    0.59 0.62 0.64    0.53 0.62 0.64
3O5P     147    0.64 0.60 0.71    0.55 0.57 0.60    0.53 0.56 0.64
3OBQ     150    0.59 0.59 0.66    0.46 0.49 0.58    0.53 0.56 0.58
3OQY     236    0.71 0.66 0.73    0.63 0.64 0.70    0.60 0.64 0.72
3P6J     145    0.75 0.73 0.81    0.69 0.71 0.73    0.61 0.71 0.75
3PD7     216    0.65 0.66 0.72    0.62 0.60 0.65    0.60 0.61 0.65
3PES     166    0.70 0.72 0.79    0.58 0.63 0.70    0.52 0.60 0.66
3PID     387    0.50 0.49 0.56    0.44 0.48 0.53    0.37 0.46 0.51
3PIW     161    0.66 0.67 0.78    0.60 0.63 0.70    0.56 0.63 0.72
3PKV     229    0.50 0.52 0.63    0.43 0.48 0.53    0.35 0.50 0.57
3PSM     94     0.83 0.78 0.88    0.79 0.77 0.83    0.68 0.76 0.79
3PTL     289    0.50 0.50 0.53    0.49 0.49 0.50    0.43 0.49 0.50
3PVE     363    0.45 0.45 0.59    0.37 0.39 0.44    0.41 0.42 0.45
3PZ9     357    0.51 0.45 0.57    0.36 0.38 0.42    0.34 0.39 0.50
3PZZ     12     1.00 1.00 1.00    0.95 0.90 1.00    0.94 0.80 1.00
3Q2X     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
3Q6L     131    0.39 0.44 0.56    0.33 0.31 0.37    0.34 0.37 0.42
3QDS     284    0.63 0.62 0.69    0.59 0.59 0.65    0.51 0.59 0.64
3QPA     212    0.68 0.66 0.78    0.45 0.45 0.47    0.59 0.59 0.65
3R6D     222    0.65 0.66 0.73    0.62 0.63 0.65    0.53 0.64 0.69
3R87     148    0.48 0.47 0.55    0.41 0.44 0.48    0.40 0.45 0.47
3RQ9     165    0.51 0.47 0.61    0.41 0.44 0.52    0.39 0.45 0.56
3RY0     128    0.44 0.45 0.54    0.40 0.40 0.47    0.41 0.42 0.47
3RZY     151    0.65 0.65 0.84    0.59 0.54 0.65    0.57 0.51 0.59
3S0A     132    0.39 0.43 0.52    0.33 0.34 0.38    0.32 0.31 0.37
3SD2     100    0.65 0.67 0.77    0.64 0.63 0.69    0.56 0.63 0.67
3SEB     238    0.63 0.66 0.77    0.62 0.61 0.68    0.61 0.62 0.67
3SED     126    0.39 0.45 0.55    0.28 0.29 0.38    0.33 0.33 0.40
3SO6     157    0.67 0.71 0.78    0.63 0.69 0.73    0.55 0.64 0.70
3SR3     657    0.45 0.44 0.48    0.43 0.41 0.45    0.39 0.43 0.44
3SUK     254    0.53 0.54 0.64    0.46 0.48 0.54    0.47 0.49 0.57
3SZH     753    0.53 0.53 0.57    0.51 0.51 0.52    0.45 0.52 0.53
3T0H     209    0.76 0.73 0.78    0.72 0.69 0.74    0.68 0.71 0.76
3T3K     122    0.66 0.66 0.72    0.55 0.62 0.68    0.48 0.60 0.68
3T47     145    0.54 0.54 0.78    0.45 0.45 0.62    0.43 0.47 0.54
3TDN     359    0.47 0.43 0.53    0.43 0.42 0.44    0.38 0.43 0.49
3TOW     155    0.66 0.65 0.74    0.58 0.61 0.66    0.53 0.60 0.65
3TUA     226    0.57 0.55 0.63    0.52 0.50 0.55    0.45 0.52 0.54
3TYS     78     0.78 0.58 0.86    0.67 0.48 0.73    0.70 0.46 0.75
3U6G     276    0.44 0.39 0.54    0.39 0.37 0.45    0.27 0.35 0.48
3U97     85     0.78 0.78 0.84    0.77 0.73 0.80    0.77 0.76 0.80
3UCI     72     0.67 0.64 0.72    0.48 0.53 0.57    0.55 0.56 0.63
3UR8     637    0.52 0.53 0.60    0.49 0.51 0.55    0.45 0.52 0.53
3US6     159    0.60 0.56 0.67    0.55 0.49 0.62    0.53 0.46 0.59
3V1A     59     0.74 0.57 0.95    0.51 0.53 0.77    0.39 0.46 0.68
3V75     294    0.50 0.49 0.57    0.48 0.46 0.53    0.47 0.47 0.53
3VN0     193    0.87 0.88 0.90    0.86 0.87 0.88    0.79 0.88 0.89
3VOR     219    0.64 0.58 0.70    0.56 0.52 0.63    0.53 0.55 0.63
3VUB     101    0.65 0.60 0.71    0.60 0.56 0.61    0.61 0.57 0.64
3VVV     112    0.64 0.64 0.79    0.55 0.48 0.65    0.57 0.49 0.58
3VZ9     163    0.65 0.64 0.70    0.60 0.55 0.63    0.60 0.60 0.67
3W4Q     826    0.61 0.60 0.68    0.56 0.59 0.61    0.47 0.60 0.64
3ZBD     213    0.36 0.47 0.74    0.24 0.28 0.34    0.25 0.31 0.36
3ZIT     157    0.51 0.47 0.59    0.36 0.39 0.47    0.47 0.41 0.52
3ZRX     241    0.56 0.56 0.63    0.49 0.52 0.53    0.46 0.52 0.56
3ZSL     165    0.39 0.39 0.54    0.28 0.22 0.40    0.31 0.24 0.37
3ZZP     74     0.40 0.30 0.47    0.19 0.27 0.31    0.12 0.22 0.40
3ZZY     226    0.65 0.67 0.69    0.63 0.63 0.64    0.59 0.63 0.64
4A02     169    0.61 0.56 0.66    0.49 0.52 0.57    0.31 0.51 0.60
4ACJ     182    0.55 0.59 0.75    0.55 0.58 0.61    0.51 0.59 0.60
4AE7     189    0.69 0.67 0.74    0.63 0.61 0.65    0.63 0.65 0.69
4AM1     359    0.57 0.54 0.59    0.53 0.52 0.53    0.46 0.53 0.55
4ANN     210    0.50 0.48 0.57    0.42 0.43 0.48    0.36 0.42 0.47
4AVR     189    0.57 0.57 0.70    0.53 0.51 0.59    0.49 0.53 0.57
4AXY     56     0.55 0.60 0.76    0.47 0.48 0.63    0.47 0.50 0.62
4B6G     559    0.70 0.71 0.75    0.67 0.69 0.72    0.60 0.69 0.73
4B9G     292    0.81 0.82 0.85    0.78 0.80 0.81    0.71 0.82 0.83
4DD5     412    0.60 0.63 0.71    0.57 0.59 0.63    0.51 0.61 0.66
4DKN     423    0.59 0.58 0.63    0.52 0.54 0.56    0.42 0.55 0.61
4DND     93     0.75 0.66 0.82    0.67 0.64 0.75    0.61 0.64 0.74
4DPZ     113    0.68 0.70 0.79    0.65 0.64 0.67    0.62 0.64 0.69
4DQ7     338    0.45 0.46 0.51    0.37 0.44 0.49    0.29 0.40 0.46
4DT4     170    0.76 0.74 0.78    0.70 0.68 0.72    0.70 0.70 0.73
4EK3     313    0.58 0.63 0.65    0.55 0.56 0.58    0.53 0.59 0.60
4ERY     318    0.61 0.60 0.67    0.59 0.59 0.64    0.52 0.59 0.65
4ES1     96     0.76 0.77 0.86    0.69 0.73 0.78    0.57 0.74 0.83
4EUG     225    0.61 0.61 0.67    0.54 0.60 0.62    0.51 0.58 0.62
4F01     459    0.38 0.37 0.47    0.32 0.34 0.37    0.22 0.34 0.39
4F3J     143    0.57 0.63 0.66    0.52 0.59 0.61    0.47 0.58 0.60
4FR9     145    0.65 0.62 0.78    0.63 0.58 0.70    0.58 0.57 0.64
4G14     5      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
4G2E     155    0.75 0.64 0.85    0.59 0.61 0.74    0.68 0.61 0.80
4G5X     584    0.71 0.69 0.80    0.69 0.64 0.74    0.64 0.67 0.72
4G6C     676    0.43 0.44 0.50    0.40 0.44 0.46    0.24 0.43 0.45
4G7X     216    0.53 0.47 0.61    0.41 0.31 0.47    0.51 0.37 0.53
4GA2     183    0.55 0.56 0.70    0.52 0.53 0.57    0.49 0.53 0.60
4GMQ     94     0.73 0.77 0.84    0.68 0.66 0.72    0.67 0.63 0.72
4GS3     90     0.65 0.68 0.74    0.60 0.64 0.68    0.51 0.66 0.70
4H4J     278    0.67 0.67 0.82    0.63 0.64 0.75    0.57 0.66 0.69
4H89     175    0.39 0.50 0.67    0.33 0.37 0.39    0.35 0.40 0.42
4HDE     167    0.63 0.55 0.75    0.59 0.52 0.69    0.59 0.51 0.67
APPENDIX
Table 15 – continued from previous page
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
4HJP   308 | 0.62 0.61 0.65 | 0.57 0.55 0.59 | 0.58 0.58 0.62
4HWM   129 | 0.69 0.66 0.71 | 0.66 0.60 0.68 | 0.68 0.63 0.70
4IL7    99 | 0.63 0.63 0.65 | 0.60 0.59 0.62 | 0.57 0.61 0.62
4J11   377 | 0.66 0.63 0.68 | 0.62 0.61 0.63 | 0.63 0.61 0.66
4J5O   268 | 0.77 0.76 0.82 | 0.71 0.62 0.77 | 0.75 0.66 0.77
4J5Q   162 | 0.65 0.63 0.75 | 0.57 0.56 0.66 | 0.59 0.57 0.64
4J78   305 | 0.48 0.48 0.56 | 0.43 0.44 0.50 | 0.38 0.47 0.53
4JG2   202 | 0.63 0.63 0.74 | 0.61 0.61 0.64 | 0.58 0.60 0.63
4JVU   207 | 0.67 0.64 0.75 | 0.57 0.58 0.66 | 0.59 0.60 0.67
4JYP   550 | 0.59 0.60 0.69 | 0.52 0.57 0.61 | 0.38 0.58 0.61
4KEF   145 | 0.52 0.49 0.65 | 0.40 0.42 0.49 | 0.27 0.45 0.56
5CYT   103 | 0.53 0.52 0.65 | 0.49 0.46 0.54 | 0.43 0.48 0.50
6RXN    45 | 0.74 0.63 0.86 | 0.59 0.48 0.76 | 0.49 0.49 0.76
Table 12:
Pearson correlation coefficients of least-squares-fitted Cα B-factor prediction for small proteins using an 11 Å cutoff. Results for the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
1AIE    31 | 0.97 0.88 0.99 | 0.78 0.64 0.90 | 0.90 0.77 0.96
1AKG    16 | 0.82 0.66 1.00 | 0.60 0.53 0.72 | 0.53 0.56 0.87
1BX7    51 | 0.86 0.74 0.89 | 0.79 0.68 0.82 | 0.81 0.69 0.82
1ETL    12 | 1.00 1.00 1.00 | 0.68 0.87 1.00 | 0.95 0.98 1.00
1ETM    12 | 1.00 1.00 1.00 | 0.45 0.74 0.86 | 0.70 0.83 1.00
1ETN    12 | 1.00 1.00 1.00 | 0.96 0.92 0.99 | 0.70 0.92 1.00
1FF4    65 | 0.77 0.72 0.80 | 0.70 0.65 0.75 | 0.68 0.68 0.76
1GK7    39 | 0.95 0.94 0.98 | 0.91 0.93 0.95 | 0.88 0.92 0.94
1GVD    56 | 0.75 0.68 0.84 | 0.67 0.63 0.69 | 0.61 0.62 0.66
1HJE    13 | 1.00 1.00 1.00 | 0.72 0.79 1.00 | 0.67 0.57 1.00
1KYC    15 | 0.96 0.99 1.00 | 0.92 0.93 0.99 | 0.88 0.88 1.00
1NOT    13 | 1.00 1.00 1.00 | 0.82 0.86 1.00 | 0.86 0.81 1.00
1O06    22 | 0.98 0.97 1.00 | 0.96 0.92 0.97 | 0.97 0.94 0.98
1P9I    29 | 0.89 0.88 0.98 | 0.87 0.82 0.92 | 0.87 0.84 0.89
1PEF    18 | 0.96 0.97 1.00 | 0.88 0.94 0.96 | 0.92 0.94 0.96
1PEN    16 | 0.96 0.90 1.00 | 0.60 0.67 0.83 | 0.47 0.73 0.94
1Q9B    44 | 0.79 0.76 0.94 | 0.58 0.59 0.69 | 0.69 0.57 0.71
1RJU    36 | 0.81 0.74 0.91 | 0.75 0.69 0.81 | 0.62 0.65 0.72
1U06    55 | 0.50 0.52 0.72 | 0.37 0.36 0.52 | 0.46 0.39 0.55
1UOY    64 | 0.73 0.72 0.83 | 0.65 0.66 0.69 | 0.65 0.69 0.73
1USE    47 | 0.66 0.75 0.91 | 0.50 0.52 0.72 | 0.46 0.53 0.64
1VRZ    13 | 1.00 1.00 1.00 | 0.92 0.92 1.00 | 0.77 0.85 1.00
1XY2     8 | 1.00 1.00 1.00 | 0.99 0.95 1.00 | 0.91 0.91 1.00
1YJO     6 | 1.00 1.00 1.00 | 1.00 1.00 1.00 | 1.00 1.00 1.00
1YZM    46 | 0.87 0.90 0.95 | 0.82 0.72 0.88 | 0.86 0.84 0.90
2DSX    52 | 0.54 0.50 0.78 | 0.37 0.30 0.56 | 0.41 0.36 0.55
2JKU    38 | 0.89 0.75 0.95 | 0.85 0.65 0.88 | 0.83 0.60 0.88
2NLS    36 | 0.75 0.66 0.88 | 0.61 0.32 0.76 | 0.49 0.47 0.69
2OL9     6 | 1.00 1.00 1.00 | 1.00 1.00 1.00 | 1.00 1.00 1.00
6RXN    45 | 0.74 0.63 0.86 | 0.59 0.48 0.76 | 0.49 0.49 0.76
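The correlation coefficients in these tables come from a least-squares fit of per-residue topological features to experimental B factors, scored by the Pearson correlation between fitted and experimental values. A minimal sketch of that evaluation step (the helper name `fit_and_correlate` and the toy data below are illustrative, not from the paper):

```python
import numpy as np

def fit_and_correlate(features, b_factors):
    """Least-squares fit of per-residue features to experimental B factors;
    returns the Pearson correlation between fitted and experimental values."""
    # Append a constant column so the linear model includes an intercept.
    A = np.column_stack([features, np.ones(len(b_factors))])
    coeffs, *_ = np.linalg.lstsq(A, b_factors, rcond=None)
    fitted = A @ coeffs
    # Pearson correlation coefficient between fitted and observed values.
    return float(np.corrcoef(fitted, b_factors)[0, 1])

# Toy example: 8 residues with 2 synthetic features (values illustrative only).
rng = np.random.default_rng(0)
X = rng.random((8, 2))
b = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.05, 8)
r = fit_and_correlate(X, b)  # near 1 for this almost-linear toy data
```

In the tables, the features would be the Bottleneck or Wasserstein distances between conjugated persistence diagrams computed with the chosen kernel; here they are random numbers purely to exercise the fitting step.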
Table 13:
Pearson correlation coefficients of least-squares-fitted Cα B-factor prediction for medium proteins using an 11 Å cutoff. Results for the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
1ABA    87 | 0.67 0.67 0.76 | 0.54 0.62 0.68 | 0.56 0.63 0.70
1CYO    88 | 0.71 0.69 0.78 | 0.66 0.58 0.68 | 0.65 0.59 0.67
1FK5    93 | 0.53 0.59 0.71 | 0.49 0.50 0.58 | 0.49 0.50 0.55
1GXU    89 | 0.75 0.78 0.82 | 0.72 0.61 0.75 | 0.69 0.72 0.77
1I71    83 | 0.44 0.66 0.76 | 0.41 0.46 0.56 | 0.38 0.58 0.59
1LR7    73 | 0.61 0.62 0.71 | 0.57 0.55 0.63 | 0.46 0.56 0.58
1N7E    95 | 0.67 0.71 0.80 | 0.54 0.68 0.72 | 0.54 0.63 0.73
1NNX    93 | 0.84 0.84 0.88 | 0.81 0.79 0.83 | 0.81 0.81 0.86
1NOA   113 | 0.63 0.65 0.72 | 0.60 0.57 0.63 | 0.53 0.57 0.59
1OPD    85 | 0.35 0.29 0.57 | 0.26 0.21 0.36 | 0.29 0.19 0.36
1QAU   112 | 0.59 0.61 0.66 | 0.57 0.55 0.58 | 0.55 0.57 0.58
1R7J    90 | 0.88 0.86 0.91 | 0.83 0.76 0.87 | 0.81 0.79 0.86
1UHA    82 | 0.70 0.75 0.82 | 0.69 0.68 0.74 | 0.67 0.69 0.73
1ULR    87 | 0.56 0.53 0.68 | 0.49 0.50 0.59 | 0.44 0.50 0.61
1USM    77 | 0.62 0.61 0.81 | 0.57 0.53 0.66 | 0.61 0.58 0.65
1V05    96 | 0.67 0.66 0.72 | 0.60 0.61 0.65 | 0.52 0.61 0.65
1W2L    97 | 0.72 0.72 0.79 | 0.60 0.63 0.69 | 0.56 0.61 0.69
1X3O    80 | 0.66 0.66 0.72 | 0.62 0.60 0.65 | 0.62 0.64 0.67
1Z21    96 | 0.70 0.73 0.82 | 0.61 0.63 0.64 | 0.64 0.69 0.72
1ZVA    75 | 0.85 0.85 0.94 | 0.84 0.78 0.92 | 0.83 0.81 0.86
2BF9    35 | 0.94 0.73 0.97 | 0.70 0.65 0.78 | 0.89 0.71 0.92
2BRF   103 | 0.74 0.73 0.76 | 0.74 0.71 0.74 | 0.72 0.72 0.75
2CE0   109 | 0.77 0.79 0.86 | 0.75 0.73 0.80 | 0.71 0.77 0.79
2E3H    81 | 0.66 0.71 0.82 | 0.62 0.69 0.76 | 0.56 0.69 0.78
2EAQ    89 | 0.81 0.77 0.86 | 0.79 0.72 0.81 | 0.77 0.76 0.82
2EHS    75 | 0.75 0.73 0.81 | 0.72 0.71 0.74 | 0.69 0.71 0.73
2FQ3    85 | 0.78 0.76 0.82 | 0.75 0.75 0.79 | 0.68 0.75 0.78
2IP6    87 | 0.72 0.66 0.82 | 0.67 0.58 0.73 | 0.64 0.64 0.78
2MCM   112 | 0.80 0.80 0.85 | 0.78 0.77 0.81 | 0.75 0.77 0.82
2NUH   104 | 0.77 0.74 0.85 | 0.73 0.63 0.81 | 0.75 0.66 0.80
2PKT    93 | 0.44 0.39 0.69 | 0.39 0.35 0.55 | 0.36 0.36 0.43
2PLT    98 | 0.66 0.63 0.72 | 0.57 0.59 0.67 | 0.52 0.59 0.66
2QJL   107 | 0.45 0.52 0.63 | 0.42 0.46 0.50 | 0.41 0.49 0.51
2RB8    93 | 0.81 0.78 0.84 | 0.78 0.75 0.80 | 0.74 0.76 0.81
3BZQ    99 | 0.57 0.62 0.69 | 0.50 0.55 0.61 | 0.47 0.55 0.59
5CYT   103 | 0.53 0.52 0.65 | 0.49 0.46 0.54 | 0.43 0.48 0.50
Table 14:
Pearson correlation coefficients of least-squares-fitted Cα B-factor prediction for large proteins using an 11 Å cutoff. Results for the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
1AHO    66 | 0.75 0.78 0.88 | 0.72 0.73 0.79 | 0.53 0.65 0.75
1ATG   231 | 0.50 0.50 0.61 | 0.45 0.47 0.53 | 0.38 0.48 0.51
1BYI   238 | 0.50 0.51 0.58 | 0.41 0.46 0.49 | 0.44 0.48 0.54
1CCR   109 | 0.65 0.66 0.71 | 0.53 0.56 0.65 | 0.43 0.58 0.63
1E5K   188 | 0.67 0.68 0.74 | 0.66 0.67 0.68 | 0.63 0.67 0.69
1EW4   106 | 0.58 0.60 0.73 | 0.52 0.51 0.55 | 0.55 0.55 0.62
1IFR   113 | 0.65 0.59 0.73 | 0.56 0.54 0.65 | 0.47 0.53 0.62
1NLS   238 | 0.81 0.78 0.86 | 0.75 0.65 0.83 | 0.80 0.72 0.82
1O08   221 | 0.46 0.48 0.56 | 0.44 0.42 0.50 | 0.37 0.45 0.48
1PMY   123 | 0.71 0.70 0.76 | 0.62 0.59 0.67 | 0.68 0.69 0.71
1PZ4   113 | 0.88 0.82 0.93 | 0.86 0.74 0.89 | 0.85 0.76 0.88
1QTO   122 | 0.59 0.59 0.65 | 0.48 0.46 0.53 | 0.55 0.52 0.56
1RRO   108 | 0.39 0.35 0.56 | 0.31 0.23 0.45 | 0.33 0.19 0.45
1UKU   102 | 0.80 0.81 0.84 | 0.78 0.80 0.80 | 0.74 0.80 0.80
1V70   105 | 0.64 0.65 0.75 | 0.56 0.60 0.66 | 0.51 0.58 0.62
1WBE   206 | 0.53 0.47 0.63 | 0.43 0.38 0.55 | 0.36 0.42 0.48
1WHI   122 | 0.57 0.55 0.63 | 0.42 0.44 0.57 | 0.34 0.43 0.55
1WPA   107 | 0.70 0.69 0.79 | 0.61 0.52 0.71 | 0.66 0.56 0.70
2AGK   233 | 0.65 0.65 0.69 | 0.61 0.64 0.65 | 0.55 0.63 0.67
2C71   225 | 0.45 0.38 0.56 | 0.29 0.33 0.42 | 0.23 0.30 0.48
2CG7   110 | 0.32 0.44 0.63 | 0.29 0.31 0.36 | 0.30 0.33 0.41
2CWS   235 | 0.59 0.55 0.66 | 0.53 0.52 0.54 | 0.40 0.52 0.55
2HQK   232 | 0.80 0.79 0.83 | 0.70 0.74 0.80 | 0.68 0.76 0.81
2HYK   237 | 0.59 0.58 0.63 | 0.51 0.55 0.59 | 0.43 0.54 0.60
2I24   113 | 0.47 0.44 0.69 | 0.40 0.40 0.48 | 0.45 0.40 0.49
2IMF   203 | 0.61 0.65 0.71 | 0.59 0.56 0.60 | 0.59 0.59 0.64
2PPN   122 | 0.57 0.61 0.74 | 0.51 0.59 0.63 | 0.44 0.57 0.63
2R16   185 | 0.50 0.51 0.66 | 0.46 0.45 0.51 | 0.45 0.46 0.52
2V9V   149 | 0.60 0.51 0.66 | 0.53 0.48 0.56 | 0.55 0.50 0.62
2VIM   114 | 0.38 0.33 0.52 | 0.29 0.28 0.41 | 0.24 0.31 0.40
2VPA   217 | 0.73 0.75 0.78 | 0.72 0.71 0.73 | 0.68 0.73 0.74
2VYO   207 | 0.68 0.70 0.77 | 0.64 0.66 0.72 | 0.59 0.68 0.70
3SEB   238 | 0.63 0.66 0.77 | 0.62 0.61 0.68 | 0.61 0.62 0.67
3VUB   101 | 0.65 0.60 0.71 | 0.60 0.56 0.61 | 0.61 0.57 0.64
References

[1] K. L. Xia and G. W. Wei, "Persistent homology analysis of protein structure, flexibility and folding," International Journal for Numerical Methods in Biomedical Engineering, vol. 30, pp. 814–844, 2014.
[2] M. Gameiro, Y. Hiraoka, S. Izumi, M. Kramar, K. Mischaikow, and V. Nanda, "Topological measurement of protein compressibility via persistence diagrams," Japan Journal of Industrial and Applied Mathematics, vol. 32, pp. 1–17, 2014.
[3] K. L. Xia and G. W. Wei, "Persistent topology for cryo-EM data analysis," International Journal for Numerical Methods in Biomedical Engineering, vol. 31, p. e02719, 2015.
[4] Z. X. Cang, L. Mu, K. Wu, K. Opron, K. Xia, and G.-W. Wei, "A topological approach to protein classification," Molecular Based Mathematical Biology, vol. 3, pp. 140–162, 2015.
[5] V. Kovacev-Nikolic, P. Bubenik, D. Nikolić, and G. Heo, "Using persistent homology and dynamical distances to analyze protein binding," Stat. Appl. Genet. Mol. Biol., vol. 15, no. 1, pp. 19–38, 2016.
[6] K. Xia, "Persistent homology analysis of ion aggregations and hydrogen-bonding networks," Physical Chemistry Chemical Physics, vol. 20, no. 19, pp. 13448–13460, 2018.
[7] P. Frosini and C. Landi, "Size theory as a topological tool for computer vision," Pattern Recognition and Image Analysis, vol. 9, no. 4, pp. 596–603, 1999.
[8] H. Edelsbrunner, D. Letscher, and A. Zomorodian, "Topological persistence and simplification," Discrete Comput. Geom., vol. 28, pp. 511–533, 2002.
[9] A. Zomorodian and G. Carlsson, "Computing persistent homology," Discrete Comput. Geom., vol. 33, pp. 249–274, 2005.
[10] A. Zomorodian and G. Carlsson, "Localized homology," Computational Geometry: Theory and Applications, vol. 41, no. 3, pp. 126–148, 2008.
[11] Y. Yao, J. Sun, X. Huang, G. R. Bowman, G. Singh, M. Lesnick, L. J. Guibas, V. S. Pande, and G. Carlsson, "Topological methods for exploring low-density states in biomolecular folding pathways," The Journal of Chemical Physics, vol. 130, no. 14, p. 04B614, 2009.
[12] Z. X. Cang and G. W. Wei, "Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology," Bioinformatics, vol. 33, pp. 3549–3557, 2017.
[13] Z. X. Cang and G. W. Wei, "Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction," International Journal for Numerical Methods in Biomedical Engineering, vol. 34, no. 2, DOI: 10.1002/cnm.2914, 2018.
[14] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko, "Lipschitz functions have L_p-stable persistence," Foundations of Computational Mathematics, vol. 10, no. 2, pp. 127–139, 2010.
[15] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, "Stability of persistence diagrams," Discrete & Computational Geometry, vol. 37, no. 1, pp. 103–120, 2007.
[16] Z. X. Cang and G. W. Wei, "TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions," PLOS Computational Biology, vol. 13, no. 7, p. e1005690, 2017.
[17] K. Wu and G. W. Wei, "Quantitative toxicity prediction using topology based multitask deep neural networks," Journal of Chemical Information and Modeling, vol. 58, pp. 520–531, 2018.
[18] K. Wu, Z. Zhao, R. Wang, and G. W. Wei, "TopP-S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility," Journal of Computational Chemistry, vol. 39, pp. 1444–1454, 2018.
[19] Z. X. Cang, L. Mu, and G. W. Wei, "Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening," PLOS Computational Biology, vol. 14, no. 1, p. e1005929, https://doi.org/10.1371/journal.pcbi.1005929, 2018.
[20] J. P. Ma, "Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes," Structure, vol. 13, pp. 373–380, 2005.
[21] H. Frauenfelder, S. G. Sligar, and P. G. Wolynes, "The energy landscapes and motions of proteins," Science, vol. 254, pp. 1598–1603, 1991.
[22] M. Tasumi, H. Takeuchi, S. Ataka, A. M. Dwivedi, and S. Krimm, "Normal vibrations of proteins: Glucagon," Biopolymers, vol. 21, pp. 711–714, 1982.
[23] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. States, S. Swaminathan, and M. Karplus, "CHARMM: A program for macromolecular energy, minimization, and dynamics calculations," J. Comput. Chem., vol. 4, pp. 187–217, 1983.
[24] M. Levitt, C. Sander, and P. S. Stern, "Protein normal-mode dynamics: Trypsin inhibitor, crambin, ribonuclease and lysozyme," J. Mol. Biol., vol. 181, no. 3, pp. 423–447, 1985.
[25] M. M. Tirion, "Large amplitude elastic motions in proteins from a single-parameter, atomic analysis," Phys. Rev. Lett., vol. 77, pp. 1905–1908, 1996.
[26] A. R. Atilgan, S. R. Durell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar, "Anisotropy of fluctuation dynamics of proteins with an elastic network model," Biophys. J., vol. 80, pp. 505–515, 2001.
[27] I. Bahar, A. R. Atilgan, and B. Erman, "Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential," Folding and Design, vol. 2, pp. 173–181, 1997.
[28] I. Bahar, A. R. Atilgan, M. C. Demirel, and B. Erman, "Vibrational dynamics of proteins: Significance of slow and fast modes in relation to function and stability," Phys. Rev. Lett., vol. 80, pp. 2733–2736, 1998.
[29] T. Haliloglu, I. Bahar, and B. Erman, "Gaussian dynamics of folded proteins," Physical Review Letters, vol. 79, no. 16, p. 3090, 1997.
[30] K. L. Xia and G. W. Wei, "A stochastic model for protein flexibility analysis," Physical Review E, vol. 88, p. 062709, 2013.
[31] K. Opron, K. L. Xia, and G. W. Wei, "Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis," Journal of Chemical Physics, vol. 140, p. 234105, 2014.
[32] K. Opron, K. L. Xia, and G. W. Wei, "Communication: Capturing protein multiscale thermal fluctuations," Journal of Chemical Physics, vol. 142, p. 211101, 2015.
[33] D. Bramer and G. W. Wei, "Weighted multiscale colored graphs for protein flexibility and rigidity analysis," Journal of Chemical Physics, vol. 148, p. 054103, 2018.
[34] D. Bramer and G. W. Wei, "Blind prediction of protein B-factor and flexibility," Journal of Chemical Physics, vol. 149, p. 021837, 2018.
[35] K. L. Xia and G. W. Wei, "Multidimensional persistence in biomolecular data," Journal of Computational Chemistry, vol. 36, pp. 1502–1520, 2015.