Atom-specific persistent homology and its application to protein flexibility analysis
David Bramer and Guo-Wei Wei∗

Department of Mathematics, Michigan State University, MI 48824, USA
Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA

March 27, 2019
Abstract
Recently, persistent homology has had tremendous success in biomolecular data analysis. It works by examining the topological relationship or connectivity of a group of atoms in a molecule at a variety of scales, then rendering a family of topological representations of the molecule. However, persistent homology is rarely employed for the analysis of atomic properties, such as biomolecular flexibility analysis or B factor prediction. This work introduces atom-specific persistent homology to provide a local atomic level representation of a molecule via a global topological tool. This is achieved through the construction of a pair of conjugated sets of atoms and corresponding conjugated simplicial complexes, as well as conjugated topological spaces. The difference between the topological invariants of the pair of conjugated sets is measured by Bottleneck and Wasserstein metrics and leads to an atom-specific topological representation of individual atomic properties in a molecule. Atom-specific topological features are integrated with various machine learning algorithms, including gradient boosting trees and convolutional neural networks, for protein thermal fluctuation analysis and B factor prediction. Extensive numerical results indicate the proposed method provides a powerful topological tool for analyzing and predicting localized information.
Keywords:
Atom-specific topology, Element-specific persistent homology, Protein flexibility, Gradient boosting tree, Convolutional neural network.

∗ Address correspondence to Guo-Wei Wei. E-mail: [email protected]
1 Introduction
In recent years, tools from topology have been successfully applied to protein analysis [1-6]. Topology offers one of the highest levels of abstraction of geometric data and allows one to infer high dimensional structure from low dimensional representations. However, conventional topology oversimplifies geometry and thus lacks descriptive power for most real world problems. Persistent homology (PH) overcomes this difficulty by introducing a filtration parameter that describes the geometry in terms of a family of Betti numbers at various scales, known as a barcode [7-10]. Indeed, three dimensional (3D) protein spatial information from a protein data bank (PDB) file can be converted into a family of simplicial complexes. One can apply tools from algebraic topology to convert structural information into global topological invariants that provide a useful representation of biomolecular properties [11]. However, for quantitative biomolecular analysis and prediction, persistent homology alone neglects chemical and biological information. Element-specific persistent homology has been introduced to incorporate chemical and biological information into topological invariants [12,13]. Similarities and differences between barcodes from different molecules can be measured by Wasserstein [14] and/or Bottleneck [15] distances. However, the previous applications of persistent homology and element-specific persistent homology are for the modeling and prediction of molecule-level thermodynamical or structural properties, such as protein-ligand binding affinities [13], protein folding free energy changes upon mutation [12,16], drug toxicity [17], solubility, partition coefficient [18], and drug virtual screening (ligand and decoy classification) [19]. Essentially, topology is a global tool that examines the connectivity and relationships among many atoms in a neighborhood as a whole. High dimensional topological invariants, such as Betti-1 and Betti-2, describe the collective behavior of many atoms.
Therefore, it is not clear how to represent an atomic-level property, such as the B factor of an atom, by persistent homology.

In proteins, the beta factor (B factor), or Debye-Waller factor, is a measure of the attenuation of X-ray scattering caused by thermal motion. The strength of the thermal motion of an atom is theoretically proportional to its B factor during structure determination from X-ray diffraction data. It is well known that biomolecular flexibility provides an important link between structure and function. In particular, it has been shown that intrinsic structural flexibility correlates with meaningful protein conformational variations, reactivity, and enzymatic function [20]. As such, the accurate prediction of protein B factors is essential to our understanding of protein structure, function, and dynamics [21].

Early methods used to predict protein B factors were derived from Hooke's law and are known as elastic mass-and-spring networks. In these models, the alpha carbons (Cα) of biological macromolecules are treated as a mass-and-spring network and motions are predicted based on a harmonic potential. Given a protein, each Cα is represented as a node in the network and edges are weighted based on a potential function. Nodes are connected by an edge if they fall within a pre-defined Euclidean cutoff distance. This captures the local covalent and non-covalent interactions between an individual atom and nearby atoms. One of the first mass-and-spring methods used for protein B factor prediction is normal mode analysis (NMA). Like most B factor prediction methods, NMA is independent of time and uses a Hamiltonian interaction matrix. Eigenvalues of the matrix system correspond to characteristic frequencies of the protein, and these frequencies correlate with protein B factors. Low-frequency modes correlate with cooperative motion and can be useful for hinge detection and domain motion.
NMA has also been successfully implemented to understand the deformation of supramolecular complexes [20,22-24]. The elastic network model (ENM) was introduced as a more efficient model that significantly reduces computational cost compared to NMA through the use of a simplified spring network [25]. A specific example is the anisotropic network model (ANM) [26]. The Gaussian network model (GNM) further reduces the computational cost by ignoring anisotropic motion, rendering a more accurate method for protein Cα B factor analysis [27-29].

All of the aforementioned methods depend on matrix diagonalization, which has a computational complexity of O(N^3), where N is the number of atoms involved in the analysis. Recently, flexibility and rigidity index (FRI) methods have been proposed as a geometric graph approach to further reduce the computational cost. FRI methods rely on constructing a distance matrix using radial basis functions to scale atom-to-atom distances non-linearly [30]. All versions of FRI produce a flexibility index, which correlates to the B factor, for each Cα. Several versions of FRI have been developed. Among them, fast FRI (fFRI) is of O(N) computational complexity [31]. FRI methods are also more accurate than all of the earlier algebraic graph-based methods. Additionally, anisotropic FRI (aFRI) provides high quality anisotropic motion analysis [31]. Moreover, using several radial basis functions with different parametrizations, the multiscale flexibility rigidity index (mFRI) can successfully capture multiscale atomic interactions [32].

More recently, the authors introduced a multiscale weighted colored graph (MWCG) model. The MWCG is another geometric graph theory model that has been shown to be the best B factor prediction model to date. First, element-specific interaction subgraphs are constructed based on selected atomic interactions between certain element types.
Atoms are represented as graph nodes, and subgraphs are generated using pairs of atoms of certain elements (e.g., carbon, nitrogen, oxygen, sulfur). A centrality metric that uses radial basis functions is applied to pairwise interactions in each subgraph. By varying the parametrization of the radial basis functions, the MWCG model can capture multiple protein interaction scales. MWCG is unique in its ability to utilize both element-specific and multiscale interactions for improved B factor prediction [33]. Most recently, MWCG has been incorporated with machine learning algorithms for across-protein blind predictions of protein B factors [34].

The objective of the present work is to extend the utility of persistent homology to atomic-level property modeling and prediction. To this end, we introduce atom-specific persistent homology (ASPH) to create a local atomic representation of an atom using a global topological tool in a novel way. Specifically, ASPH constructs a pair of conjugated sets of point clouds, or atoms, centered around the atom of interest. The first set of the pair for a given atom is selected by a local sphere of radius r_c around the atom of interest. The second set of atoms is defined by excluding the atom of interest from the first set. Conjugated simplicial complexes, conjugated chain groups, conjugated homology groups, as well as conjugated persistence barcodes or diagrams, are induced by an identical filtration. Conjugated persistence barcodes are compared with Bottleneck and Wasserstein metrics. The resulting distance provides a global topological representation of a localized atomic property, such as protein flexibility and atomic-level protein B-factor information.
Obviously, the proposed atom-specific topology can be applied to a wide variety of chemical and biological problems where atomic properties are measured, such as the chemical shifts of nuclear magnetic resonance (NMR), the B factors of X-ray structure determination, and the shift and line broadening of other atomic spectroscopies.

We focus on protein Cα B factor prediction, but the approach provided in this work is a general framework that can be used to predict the B factors of any atom in a protein. First, we use the generated atom-specific persistent homology features to fit B factors within a given protein using linear least squares minimization. Then the atom-specific persistent homology features are combined with other local and global protein features to construct machine learning models for the blind prediction of protein B factors across different proteins. Additionally, image-like multiscale atom-specific persistent homology features are generated using an earlier technique [35]. These image-like features, together with other features, are fed into convolutional neural networks (CNNs). Training and validation are carried out using a large and diverse set of proteins from the Protein Data Bank (PDB). The proposed method offers some of the best results for blind B factor prediction on a set of 364 proteins.
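The pair of conjugated point clouds described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `conjugated_clouds`, its argument names, and the plain-tuple atom representation are illustrative choices.

```python
import math

def conjugated_clouds(atoms, center_idx, r_c):
    """Build the pair of conjugated point clouds used in ASPH (sketch).

    `atoms` is a list of (x, y, z) coordinates, `center_idx` indexes the
    atom of interest, and `r_c` is the cutoff radius.  Returns (R, R_hat):
    R contains every atom within r_c of the center (center included);
    R_hat is the same set with the atom of interest excluded.
    """
    center = atoms[center_idx]
    R = [p for p in atoms if math.dist(p, center) < r_c]
    R_hat = [p for i, p in enumerate(atoms)
             if math.dist(p, center) < r_c and i != center_idx]
    return R, R_hat
```

Persistent homology is then run on both sets, and the distance between the two resulting diagrams becomes the topological signature of the chosen atom.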
2 Methods and algorithms

Topology describes (continuous) objects in terms of topological invariants, i.e., Betti numbers. Betti-0, Betti-1, and Betti-2 can be interpreted as the numbers of connected components, rings, and cavities, respectively. Table 1 provides examples of the Betti numbers of a point, circle, sphere, and torus.
Table 1:
Topological invariants displayed as Betti numbers. Betti-0 represents the number of connected components, Betti-1 the number of tunnels or circles, and Betti-2 the number of cavities or voids. Two auxiliary rings are added to the torus to illustrate that its Betti-1 = 2.
Example   Point  Circle  Sphere  Torus
Betti-0     1      1       1       1
Betti-1     0      1       0       2
Betti-2     0      0       1       1
Figure 1:
From left to right an example of a 0-simplex, 1-simplex, 2-simplex, and 3-simplex.
Given discrete data points, such as a point cloud or the set of atoms in a molecule, we use simplicial complexes to describe the topological relationship, or connectivity, of the point cloud and to systematically identify topological invariants. Simplicial complexes, as shown in Figure 1, are made up of vertices, edges, triangles, and tetrahedrons, denoted 0-simplex, 1-simplex, 2-simplex, and 3-simplex, respectively. Homology groups constructed from simplicial complexes give rise to topological invariants. Given a discrete dataset, or a set of protein atoms, nontrivial topological information is generated by persistent homology, which introduces a filtration parameter to create a family of simplexes, leading to a family of simplicial complexes, homology groups, and associated topological invariants. By continuously varying the filtration parameter over an interval, the topological relationship among a given set of atoms is systematically reset, rendering a family of homology groups and corresponding topological invariants, which can be plotted as a persistence diagram or a set of barcodes. Both persistence diagrams and barcodes record the birth and death (appearance and cessation) of Betti numbers during the filtration process. Many simplicial complex definitions, which determine the rules of the corresponding topological relationship, have been proposed. Commonly used definitions include the Vietoris-Rips (VR) complex, the Čech complex, and the alpha complex.

Persistent homology allows the extraction of topological invariants that are embedded in the high dimensional data space of biomolecules. The resulting topological invariants over the filtration, i.e., the persistence diagrams or persistence barcodes of different molecules, can be compared using Bottleneck and Wasserstein distances. The goal of atom-specific persistent homology is to extract the topological information of a given atom in a molecule.
To embed local atomic information into a global topological description, we construct a pair of conjugated sets of point clouds, namely the original dataset and a dataset excluding the atom of interest. The Bottleneck and Wasserstein distances between the two resulting persistence diagrams reveal the desired topological information of the given atom.
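As a concrete illustration of the filtration picture above, the Betti-0 part of a Vietoris-Rips barcode can be computed with a union-find pass over sorted pairwise distances (Kruskal's algorithm). This is a minimal stdlib sketch for intuition only; the paper generates its barcodes with the R library TDA.

```python
import math
from itertools import combinations

def betti0_barcodes(points):
    """Betti-0 persistence barcodes of a Vietoris-Rips filtration (sketch).

    Every point is born at filtration value 0; a connected component dies
    when the edge that merges it into another component appears.  Sort the
    pairwise distances, union components, and record a (birth, death) bar
    at each merge.  The last surviving component never dies (death = inf).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, d))     # a component dies at distance d
    bars.append((0.0, math.inf))      # the surviving component
    return bars
```

Higher Betti numbers require the full boundary-matrix reduction, which dedicated libraries provide.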
A (geometric) simplex is a generalization of a triangle or tetrahedron to arbitrary dimensions. A $k$-simplex is the convex hull of $k+1$ affinely independent points,
$$\sigma = \{\lambda_0 u_0 + \lambda_1 u_1 + \cdots + \lambda_k u_k \mid \textstyle\sum_i \lambda_i = 1,\ \lambda_i \ge 0,\ i = 0, 1, \ldots, k\}, \qquad (1)$$
where $\{u_0, u_1, \ldots, u_k\} \subset \mathbb{R}^d$ with $d \ge k$ is the set of points, $\sigma$ is the $k$-simplex, and the constraints on the $\lambda_i$ ensure the formation of a convex hull. An affinely independent combination of points can have at most $k+1$ points in $\mathbb{R}^k$. For example, a 1-simplex is a line segment, a 2-simplex a triangle, and a 3-simplex a tetrahedron. A subset of $m+1$ of the $k+1$ vertices of a $k$-simplex forms a convex hull in a lower dimension and is called an $m$-face of the $k$-simplex. An $m$-face is proper if $m < k$. The boundary of a $k$-simplex $\sigma$ is defined as the formal sum of its $k+1$ faces of dimension $k-1$,
$$\partial_k \sigma = \sum_{i=0}^{k} (-1)^i [u_0, \ldots, \hat{u}_i, \ldots, u_k], \qquad (2)$$
where $[u_0, \ldots, \hat{u}_i, \ldots, u_k]$ denotes the convex hull formed by the vertices of $\sigma$ with the vertex $u_i$ excluded, and $\partial_k$ is called the boundary operator. A collection of finitely many simplices forms a simplicial complex, denoted $K$. All simplicial complexes satisfy the following conditions.
1. Faces of any simplex in $K$ are also simplices in $K$.
2. The intersection of any two simplices $\sigma_1, \sigma_2 \in K$ is a face of both $\sigma_1$ and $\sigma_2$.

Given a simplicial complex $K$, a $k$-chain $c_k$ of $K$ is a formal sum of the $k$-simplices in $K$, defined as $c_k = \sum_i a_i \sigma_i$, where the $\sigma_i$ are $k$-simplices and the $a_i$ are coefficients. Generally, the $a_i$ are elements of a field such as $\mathbb{R}$, $\mathbb{Q}$, or $\mathbb{Z}_n$. Computationally, it is common to choose $a_i \in \mathbb{Z}_2$. The group of $k$-chains in $K$, denoted $C_k$, forms an Abelian group under addition modulo two. This allows us to extend the definition of the boundary operator introduced in Eq. (2) to chains. The boundary operator applied to a $k$-chain $c_k$ is defined as
$$\partial_k c_k = \sum_i a_i \partial_k \sigma_i, \qquad (3)$$
where the $\sigma_i$ are $k$-simplices. The boundary operator is a map from $C_k$ to $C_{k-1}$, also known as a boundary map for chains. Note that over $\mathbb{Z}_2$ the boundary operator satisfies $\partial_k \circ \partial_{k+1} \sigma = 0$ for any $(k+1)$-simplex $\sigma$, following from the fact that any $(k-1)$-face of $\sigma$ is contained in exactly two $k$-faces of $\sigma$. The chain complex is defined as a sequence of chains connected by boundary maps of decreasing dimension, denoted
$$\cdots \xrightarrow{\partial_{n+1}} C_n(K) \xrightarrow{\partial_n} C_{n-1}(K) \xrightarrow{\partial_{n-1}} \cdots \xrightarrow{\partial_1} C_0(K) \xrightarrow{\partial_0} 0. \qquad (4)$$
The $k$-cycle group and $k$-boundary group are then defined as the kernel of $\partial_k$ and the image of $\partial_{k+1}$, respectively:
$$Z_k = \mathrm{Ker}\,\partial_k = \{c \in C_k \mid \partial_k c = 0\}, \qquad (5)$$
$$B_k = \mathrm{Im}\,\partial_{k+1} = \{c \in C_k \mid \exists d \in C_{k+1} : c = \partial_{k+1} d\}, \qquad (6)$$
where $Z_k$ is the $k$-cycle group and $B_k$ is the $k$-boundary group. Since $\partial_k \circ \partial_{k+1} = 0$, we have $B_k \subset Z_k \subset C_k$. The $k$-homology group is then defined to be the quotient of the $k$-cycle group by the $k$-boundary group,
$$H_k = Z_k / B_k, \qquad (7)$$
where $H_k$ is the $k$-homology group. The $k$th Betti number is defined to be the rank of the $k$-homology group, $\beta_k = \mathrm{rank}(H_k)$.

For a simplicial complex $K$, we define a filtration of $K$ as a nested sequence of subcomplexes of $K$,
$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_n = K. \qquad (8)$$
In persistent homology, the nested sequence of subcomplexes usually depends on a filtration parameter. The persistence of a topological feature is denoted graphically by its life span with respect to the filtration parameter. Subcomplexes corresponding to various filtration parameters offer topological fingerprints over multiple scales. The $k$th persistent Betti numbers $\beta_k^{i,j}$ are given by the ranks of the $k$th homology groups of $K_i$ that are still alive in $K_j$, defined as
$$\beta_k^{i,j} = \mathrm{rank}(H_k^{i,j}) = \mathrm{rank}\big(Z_k(K_i) / (B_k(K_j) \cap Z_k(K_i))\big). \qquad (9)$$
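The boundary operator and the identity $\partial_k \circ \partial_{k+1} = 0$ can be checked directly over $\mathbb{Z}_2$, where the signs in Eq. (2) vanish and chain addition becomes symmetric difference of simplex sets. The tuple-of-vertices representation below is a toy encoding for illustration only.

```python
def boundary(chain):
    """Boundary of a Z2 chain (sketch).

    A chain is a set of simplices, each a sorted tuple of vertices.  Over
    Z2 the boundary is the symmetric difference of all faces obtained by
    dropping one vertex, i.e. Eq. (2) with the signs reduced mod 2.
    """
    out = set()
    for simplex in chain:
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]
            out ^= {face}          # Z2 addition: appearing twice cancels
    return out

# The fundamental property for the 2-simplex [0, 1, 2]:
triangle = {(0, 1, 2)}
edges = boundary(triangle)         # its three boundary edges
assert boundary(edges) == set()    # each vertex appears twice and cancels
```

Each vertex of the triangle lies on exactly two edges, which is precisely why the double boundary vanishes over $\mathbb{Z}_2$.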
The persistence of Betti numbers over the filtration interval can be recorded in many different ways. The most commonly used are persistence barcodes and persistence diagrams. An example of barcodes is provided in Figure 2.

Figure 2: (a) An example of 5 points in $\mathbb{R}^2$ and (b) the corresponding persistence barcodes. The length of each barcode corresponds to the persistence of each topological object ($\beta_0$, $\beta_1$, $\beta_2$, etc.) over the Vietoris-Rips (VR) complex filtration.

In this work, we use Bottleneck and Wasserstein distances to extract atom-specific topological information and facilitate atom-specific persistent homology. Let $X$ and $Y$ be multisets of data points. The Bottleneck and Wasserstein distances between $X$ and $Y$ are given by [15]
$$d_B(X, Y) = \inf_{\gamma \in B(X,Y)} \sup_{x \in X} \|x - \gamma(x)\|_\infty, \qquad (10)$$
and [14]
$$d_W^p(X, Y) = \left(\inf_{\gamma \in B(X,Y)} \sum_{x \in X} \|x - \gamma(x)\|_\infty^p\right)^{1/p}, \qquad (11)$$
respectively. Here $B(X, Y)$ is the collection of all bijections from $X$ to $Y$. Note that in our work, topological invariants of different dimensions are compared separately.

Given a metric space $M$ and a cutoff distance $d$, a simplex is formed if all of its points have pairwise distances no greater than $d$. All such simplices form the Vietoris-Rips (VR) complex. The abstract nature of the VR complex allows the construction of simplicial complexes from a correlation function, which models the pairwise interaction of atoms using a radial basis function rather than more standard distance metrics. The R library TDA is used to generate persistence barcodes [36].

Element-specific persistent homology was introduced to embed chemical and biological information into topological invariants [12,19]. Its essential idea is to construct topological representations from subsets of atoms of various element types in a protein.
For example, if one selects all carbon atoms in a protein, the resulting persistence barcodes will represent the strength and network of hydrophobicity in the protein.
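The Bottleneck distance of Eq. (10) can be made concrete with a tiny brute-force search over bijections of two equal-size diagrams. This is a sketch for intuition: it is exponential in the diagram size, and real implementations also allow matching points to the diagonal, which this version omits.

```python
import math
from itertools import permutations

def bottleneck(X, Y):
    """Brute-force Bottleneck distance between equal-size diagrams.

    Follows Eq. (10) literally: minimise, over all bijections from X
    to Y, the largest L-infinity displacement of any matched point.
    """
    assert len(X) == len(Y)
    best = math.inf
    for perm in permutations(Y):
        worst = max(max(abs(a - b) for a, b in zip(x, y))
                    for x, y in zip(X, perm))
        best = min(best, worst)
    return best
```

Replacing `sup`/`max` by a $p$-th power sum gives the Wasserstein distance of Eq. (11) in the same style.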
Figure 3:
Illustration of atom-specific persistent homology point clouds. Top: the original point cloud; the atom of interest is at the center of the circle. Second row: a pair of conjugated sets of point clouds for atom-specific persistent homology. Remaining rows: four pairs of conjugated point clouds for atom-specific and element-specific persistent homology.
In contrast, atom-specific persistent homology is designed to highlight the topological information of a given atom in a biomolecule. It creates two conjugated subsets of atoms centered around the atom of interest, one with and one without the specific atom. Conjugated simplicial complexes, conjugated homology groups, and conjugated topological invariants are generated for the conjugated sets of point clouds. The difference between the conjugated topological invariants, measured by both Wasserstein and Bottleneck distances, offers a topological representation of the atom of interest. As shown in Figure 3, atom-specific and element-specific conjugated point clouds can be constructed for a given dataset.

In this work, we focus on Cα B factor prediction. We use element-specific persistent homology to enhance the topological representation of each Cα neighborhood. Meanwhile, we develop atom-specific persistent homology to pinpoint the topological representation at each Cα atom. With these selections of subsets, Vietoris-Rips complexes are constructed by contact maps or matrix filtration [1]. To capture element-specific interactions we consider three subsets of carbon-carbon, carbon-nitrogen, and carbon-oxygen point clouds. This gives the following element-specific pairs:
$$P = \{\mathrm{CC}, \mathrm{CN}, \mathrm{CO}\}. \qquad (12)$$
For a given Protein Data Bank (PDB) file, persistence barcodes are calculated as follows. Given a specific $C_\alpha$ of interest, say $r_i^k \in P_k$ in an element-specific set $P_k$ ($P_1 = \mathrm{CC}$, $P_2 = \mathrm{CN}$, $P_3 = \mathrm{CO}$), a point cloud consisting of all atoms within a pre-defined cutoff radius $r_c$ is selected:
$$R_i^k = \{r_j^k \mid \|r_i^k - r_j^k\| < r_c,\ r_i^k, r_j^k \in P_k,\ \forall j \in 1, 2, \ldots, N\}, \qquad (13)$$
where $N$ is the number of atoms in the $k$th element pair $P_k$. A conjugated point cloud, $\hat{R}_i^k$, includes the same set of atoms except for $r_i^k$.
For a given pair of conjugated point clouds $R_i^k$ and $\hat{R}_i^k$, conjugated simplicial complexes, conjugated homology groups, and conjugated persistence barcodes are computed via persistent homology. We compute a Euclidean distance based filtration using the Vietoris-Rips complex. Additionally, for a given set of atoms selected according to the atom-specific and element-specific constructions, we generate a family of multiresolution persistence barcodes by a resolution-controlled filtration matrix [1],
$$M_{nm}(\vartheta) = 1 - \Phi(\|\mathbf{r}_n - \mathbf{r}_m\|; \vartheta), \qquad (14)$$
where $\vartheta$ denotes a set of kernel parameters. We have used both the exponential kernel
$$\Phi(\|\mathbf{r}_n - \mathbf{r}_m\|; \eta, \kappa) = e^{-(\|\mathbf{r}_n - \mathbf{r}_m\|/\eta)^{\kappa}}, \quad \kappa > 0, \qquad (15)$$
and the Lorentz kernel
$$\Phi(\|\mathbf{r}_n - \mathbf{r}_m\|; \eta, \nu) = \frac{1}{1 + (\|\mathbf{r}_n - \mathbf{r}_m\|/\eta)^{\nu}}, \quad \nu > 0, \qquad (16)$$
where $\eta$, $\kappa$, and $\nu$ are pre-defined constants. This filtration matrix is used in association with the Vietoris-Rips complex to generate persistence barcodes or persistence diagrams. These topological invariants are then compared using both Bottleneck and Wasserstein distances. An example of the conjugated persistence barcode pair generated for a $C_\alpha$ atom is illustrated in Figure 4.

Figure 4: Illustration of residue 338 Cα atom-specific persistent homology in the CC element-specific point cloud of protein PDB ID 1AIE. For this example residues 332-339 are used and are shown on the left. The Cα location used to generate the barcodes (right) is highlighted in red in the left chart. Conjugated persistence barcodes are generated with and without the selected Cα.

Topological features are used for the prediction of protein B factors using both least squares fitting and machine learning, as described in the following subsections.
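Eq. (14), with the kernels of Eqs. (15) and (16), can be sketched directly. The function name and default parameter values below are placeholders; the parametrizations actually used for feature generation are listed in Table 4.

```python
import math

def filtration_matrix(points, kernel="exp", eta=10.0, kappa=1.0, nu=3.0):
    """Kernel-based filtration matrix M_nm = 1 - Phi(||r_n - r_m||) (sketch).

    Implements Eq. (14) with either the exponential kernel
    Phi = exp(-(d/eta)^kappa) of Eq. (15) or the Lorentz kernel
    Phi = 1 / (1 + (d/eta)^nu) of Eq. (16).
    """
    n = len(points)
    M = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            d = math.dist(points[a], points[b])
            if kernel == "exp":
                phi = math.exp(-((d / eta) ** kappa))
            else:  # Lorentz kernel
                phi = 1.0 / (1.0 + (d / eta) ** nu)
            M[a][b] = 1.0 - phi
    return M
```

Because $\Phi$ decays with distance, nearby atoms get small matrix entries and connect early in the matrix filtration, while distant atoms connect late, which is what makes the barcodes resolution-controlled.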
Gradient boosting is an ensemble method that uses a number of "weak learners" to construct a prediction model in an iterative manner. The method is optimized via gradient descent, which minimizes the residuals of a loss function. At each step of the gradient boosting, gradient boosting trees (GBTs) incorporate decision trees to improve their predictive power. Ensemble methods like GBTs are useful because they can handle a diverse feature set, have strong predictive power, and are typically robust to outliers and against overfitting.

In this work, we optimize the GBT hyper-parameters using the standard practice of a grid search. The parameters used for testing are provided in Table 2. Any hyper-parameters not listed in the table were taken to be the default values provided by the python scikit-learn package.
Table 2:
Boosted gradient tree hyper-parameters used for testing. Parameters were determined using a grid search. Any hyper-parameters that are not listed were taken to be the default values provided by the python scikit-learn package.
Parameter          Setting
Loss Function      Quantile
Alpha              0.975
Estimators         500
Learning Rate      0.25
Max Depth          4
Min Samples Leaf   9
Min Samples Split  9
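A minimal sketch of wiring up the Table 2 settings with scikit-learn's `GradientBoostingRegressor`; the synthetic `X` and `y` below are stand-ins for the topological feature matrix and experimental B factors, not the paper's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # stand-in topological features
y = X[:, 0] * 2.0 + rng.normal(size=200)    # stand-in B factors

# Hyper-parameters taken from Table 2 (everything else left at defaults).
gbt = GradientBoostingRegressor(
    loss="quantile", alpha=0.975,
    n_estimators=500, learning_rate=0.25,
    max_depth=4, min_samples_leaf=9, min_samples_split=9,
)
gbt.fit(X, y)
pred = gbt.predict(X)
```

Note that the quantile loss with alpha = 0.975 fits a high conditional quantile rather than the conditional mean, which is an unusual but deliberate choice reported in Table 2.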
Neural networks are modeled after the function of neurons in the brain. A neural network applies activation functions, in units called perceptrons, to its inputs. The weights of the network are trained to minimize a loss function over many epochs, or passes over an entire training dataset. When a neural network has several layers of perceptrons we call it a deep neural network (DNN), and the intermediate layers are known as hidden layers.

Convolutional neural networks (CNNs) have recently had great success in image classification. Using convolutions with a pre-defined filter size and number of filters, CNNs can automatically extract high-level features from input images. CNNs are advantageous because they can perform as well as other models without training as many parameters as a densely connected deep neural network. In this work we generate an image-like heat map using a range of kernel parameters for atom-specific and element-specific persistent homology. The CNN output is then flattened and fed as input to a DNN along with global and local protein features. This allows us to use the same feature set as the gradient boosting method as well as the generated PH image data. A diagram of the CNN architecture is provided in Figure 5.
Figure 5:
The deep learning architecture using a convolutional neural network combined with a deep neural network. The plus symbol represents the concatenation of features.
For each Cα of the training set, the CNN is passed a three-channel persistent homology image of dimension (8,10,3). The model takes the input image data and applies two convolutional layers with 2x2 filters, followed by a dropout of 0.5. The image data is passed through a dense layer, flattened, then joined with the other global and local features to form a dense layer of 218 neurons. This is followed by a dropout layer of 0.5, another dense layer of 100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons, and a final dense output layer. Figure 5 provides an illustration of the deep CNN used in this work.

The deep convolutional neural network has several hyper-parameters that can be tuned. As with the GBT, the deep convolutional neural network hyper-parameters are optimized using a basic grid search. Table 3 provides the parameters used for testing. Any hyper-parameters that are not listed below were taken to be the default values provided by the python Keras package.

Table 3:
Convolutional neural network (CNN) parameters used for testing. Parameters were determined using a grid search. Any hyper-parameters not listed below were taken to be the default values provided by the python Keras package.
Parameter      Setting
Learning Rate  0.001
Epoch          1000
Batch Size     1000
Loss           Mean Squared Error
Optimizer      Adam
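The basic operation of the convolutional layers, a small filter slid across the (8, 10) persistent-homology image, can be illustrated with a hand-rolled "valid" cross-correlation. This is only a sketch of the mechanism; the actual model uses trainable 2x2 filters in Keras.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of a single channel (sketch).

    At each position, multiply the kernel element-wise with the image
    patch beneath it and sum, producing one output value per position.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

ph_image = np.arange(80, dtype=float).reshape(8, 10)  # one channel stand-in
feat = conv2d_valid(ph_image, np.ones((2, 2)))        # one fixed 2x2 filter
```

With a 2x2 filter and no padding, an (8, 10) channel shrinks to a (7, 9) feature map, which is why a CNN trains far fewer parameters than a dense layer over the same image.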
In this work, we combine the predictions of two machine learning models to construct a simple consensus model. The consensus prediction used in this work is the average of the Cα B factor values predicted by the GBT and deep CNN models.
A variety of element-specific and atom-specific persistence barcodes were generated using the techniques discussed in Sec. 2.1.7. In this work, we include 60 topological features. These features are generated in several ways by varying: kernels (Lorentz and exponential), element-specific pairs (CC, CN, CO), and distance metrics (Wasserstein-0 and Wasserstein-1, Bottleneck-0 and Bottleneck-1). For this work all persistent homology features were generated with a cutoff of 11 Å.
The distances obtained from Wasserstein and Bottleneck evaluations of persistence diagrams depend on the boundary of the diagrams. Specifically, when two persistence diagrams are compared, the extra events on one diagram that do not match any events on the other diagram may contribute to the final distance by their distances from the boundary. For this reason, we create two additional persistence diagrams in which the y-axis is rotated clockwise by 30° or 60°, respectively; see Figure 6. This modification changes the Bottleneck and Wasserstein distances and allows the model to recognize elements that have a short persistence (i.e., a short lifespan). Lastly, we modified the persistence diagram by reflecting it around the diagonal axis. An example of this modification is illustrated in Figure 6. Table 4 provides the list of kernels, kernel parameters, y-axis changes, distance metrics, and element-specific pairs used to generate features for the machine learning models.

Figure 6:
Illustration of modified persistence diagrams used in distance calculations. Left: unchanged. Middle: rotated 30°. Right: rotated 60°. Black dots are Betti-0 events and triangles are Betti-1 events.

3 Results
Table 4:
Parameters used for topological feature generation. All features used a cutoff of 11 Å. Both Lorentz (Lor) and exponential (Exp) kernels and Bottleneck (B) and Wasserstein (W) distance metrics were used.
No. features  Kernel  Kernel parameter  Diagram              Distance metric  Element-specific pair
12            Lor     η = 21, ν = 5     Unchanged            B, W             CC, CN, CO
12            Exp     η = 10, κ = 1     Unchanged            B, W             CC, CN, CO
12            Exp     η = 2, κ = 1      Diagonal reflection  B, W             CC, CN, CO
12            Exp     η = 2, κ = 1      Rotated 30°          B, W             CC, CN, CO
12            Exp     η = 2, κ = 1      Rotated 60°          B, W             CC, CN, CO

Other features include global features from PDB files, i.e., the R-value, protein resolution, and number of heavy atoms. Additional local features include packing density, amino acid type, occupancy, and secondary structure information generated by the STRIDE software [37].
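The diagram modifications listed in Table 4 amount to coordinate transforms of the (birth, death) points. The clockwise rotation below is one plausible reading of the y-axis rotation described above, not necessarily the authors' exact convention, so treat it as an assumption.

```python
import math

def rotate_diagram(diagram, degrees):
    """Rotate persistence-diagram points clockwise by `degrees` (sketch).

    Each (birth, death) pair is treated as a point in the plane and
    rotated about the origin.
    """
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * b + s * d, -s * b + c * d) for b, d in diagram]

def reflect_diagram(diagram):
    """Reflect points across the diagonal birth = death."""
    return [(d, b) for b, d in diagram]
```

Because Bottleneck and Wasserstein distances are computed on point coordinates, either transform changes how short-lived events near the diagonal contribute to the distance.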
Using the process described in Section 2.1.7, we generate 2D image-like persistent homology features, $F_i^k = \{f_i^k(\eta, \kappa)\}$, for each $C_\alpha$ of the proteins in the dataset by varying the values of $\eta$ and $\kappa$ in the kernel function. A cutoff of 11 Å with an exponential kernel and different values of $\eta$ and $\kappa$ are used to capture a wide variety of scales. In particular, we use 8 values of $\eta$ and 10 values of $\kappa$. The image-like matrix is given by $F_i^k$ in Eq. (17), where each entry represents the PH feature of the $i$th $C_\alpha$ atom and the $k$th atom interaction (C, N, or O):
$$F_i^k = \begin{pmatrix} f_i^k(\eta_1, \kappa_1) & f_i^k(\eta_1, \kappa_2) & \cdots & f_i^k(\eta_1, \kappa_{10}) \\ f_i^k(\eta_2, \kappa_1) & f_i^k(\eta_2, \kappa_2) & \cdots & f_i^k(\eta_2, \kappa_{10}) \\ \vdots & \vdots & \ddots & \vdots \\ f_i^k(\eta_8, \kappa_1) & f_i^k(\eta_8, \kappa_2) & \cdots & f_i^k(\eta_8, \kappa_{10}) \end{pmatrix} \qquad (17)$$
This results in 2D PH images of dimension (8,10). Images are created for element-specific $C_\alpha$ interactions with carbon, nitrogen, and oxygen atoms, giving each image three channels and a final image dimension of (8,10,3) for each $C_\alpha$ atom.

In this work, we use two data sets, one from Refs. [31,32] and the other from Park, Jernigan, and Wu [38]. The first contains 364 proteins [31,32] and the second contains three subsets of small, medium, and large proteins [38]. All sequences have a resolution of 3 Å or better and an average resolution of 1.3 Å, and the sets include proteins that range from 4 to 3912 residues [38]. For all testing, we exclude protein 1AGN due to known problems with this protein's data [32]. Proteins 1NKO, 2OCT, and 3FVA are also excluded because these proteins have residues with B factors reported as zero, which is unphysical.
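Assembling the (8, 10, 3) image of Eq. (17) from per-parametrization features can be sketched as below. `feature_fn` is a hypothetical callable standing in for the Wasserstein/Bottleneck feature computation of one Cα for one element pair and one kernel parametrization; it is not part of the paper's code.

```python
import numpy as np

def ph_image(feature_fn, etas, kappas, pairs=("CC", "CN", "CO")):
    """Assemble the image-like multiscale feature F_i^k (sketch).

    Stacking one scalar feature per (eta, kappa) pair over the three
    element-specific pairs yields the (8, 10, 3) image fed to the CNN.
    """
    img = np.empty((len(etas), len(kappas), len(pairs)))
    for c, pair in enumerate(pairs):
        for m, eta in enumerate(etas):
            for n, kappa in enumerate(kappas):
                img[m, n, c] = feature_fn(pair, eta, kappa)
    return img
```

The two kernel parameters play the role of the two image axes, so each channel of the "image" is really a grid of multiscale topological signatures rather than pixel intensities.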
For the machine learning results, proteins 1OB4, 1OB7, 2OLX, and 3MD5 are excluded because the STRIDE software is unable to provide secondary structure features for these proteins. The image-like features used in all convolutional neural networks were standardized to mean 0 and variance 1.

3.2 Evaluation metric
We use the proposed methods to predict the B factors of all Cα atoms present in a protein. Linear least squares fitting was done using only topological features. The machine learning models were executed using a leave-one-(protein)-out method to blindly predict the B factors of all Cα atoms in each protein. The machine learning models were trained using the data and features described in Sections 2.1.7, 2.2, and 2.3. For comparison, we include previously existing Cα B factor prediction fitting methods.

To quantitatively assess our method for B factor prediction we use the Pearson correlation coefficient, given by

\begin{equation}
\mathrm{PCC} = \frac{\sum_{i=1}^{N} (B_i^e - \bar{B}^e)(B_i^t - \bar{B}^t)}
{\left[\sum_{i=1}^{N} (B_i^e - \bar{B}^e)^2 \sum_{i=1}^{N} (B_i^t - \bar{B}^t)^2\right]^{1/2}},
\tag{18}
\end{equation}

where $B_i^t$ and $B_i^e$, $i = 1, 2, \ldots, N$, are the i-th predicted (theoretical) and experimental B factors, respectively, the latter taken from the PDB file, and $\bar{B}^t$ and $\bar{B}^e$ are the corresponding averaged B factors.
Table 5: Parameters used for the persistent homology element-specific features with a cutoff of 11 Å.

Kernel               ν   η    κ
Lorentz (n = 1)      5   21   -
Exponential (n = 2)  -   10   1

In this work, the optimal cutoff of r_c = 11 Å was found via a grid search over various cutoff distances. Figure 7 displays the average Pearson correlation coefficient, obtained via fitting, over the entire dataset of 364 proteins using all persistent homology metrics with various point cloud distance cutoffs.

Figure 7:
Average Pearson correlation coefficient over the entire protein dataset when fitting all 24 persistent homology features, for various cutoff distances.
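The cutoff grid search described above can be sketched as follows. Here `average_pcc` is a hypothetical stand-in for the full procedure of fitting all PH features at a given cutoff and averaging the Pearson correlation over the dataset; it is a toy curve, not the paper's measured values.

```python
import numpy as np

def average_pcc(cutoff):
    """Hypothetical stand-in: fit all PH features at this cutoff and average
    the Pearson correlation over the dataset (toy curve peaking at 11 A)."""
    return 1.0 - abs(cutoff - 11.0) / 20.0

# Grid search over candidate point-cloud distance cutoffs (in Angstroms).
cutoffs = np.arange(4.0, 20.5, 0.5)
best_cutoff = max(cutoffs, key=average_pcc)
```

With the real scoring routine in place of the toy curve, the same loop selects the cutoff reported in the text.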
For each protein we use the parameters listed in Table 5, determined following the standard practice of a grid search.

3.4 Least squares fitting within proteins
Table 6:
Average Pearson correlation coefficients of least squares fitting Cα B factor prediction for the small, medium, large, and superset protein sets using an 11 Å cutoff. Both the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included. Results for pfFRI are taken from Opron et al. [31]. GNM and NMA values are taken from the coarse-grained Cα results reported in Park et al. [38].

            B & W               B                   W
            Exp   Lor   Both    Exp   Lor   Both    Exp   Lor   Both    pfFRI   GNM    NMA
Small       0.87  0.84  0.94    0.74  0.72  0.85    0.74  0.73  0.86    0.59    0.54   0.48
Medium      0.68  0.68  0.78    0.62  0.61  0.69    0.60  0.63  0.69    0.61    0.55   0.48
Large       0.61  0.60  0.70    0.54  0.54  0.61    0.51  0.55  0.62    0.59    0.53   0.49
Superset    0.65  0.64  0.73    0.58  0.58  0.65    0.55  0.59  0.66    0.63    0.57   NA
The Pearson correlation coefficients using least squares fitting for Cα B factor prediction of the small, medium, and large protein subsets are provided in Tables 12, 13, and 14, respectively. Results for all proteins in the dataset are provided in Table 15, and the average Pearson correlation coefficients for the small, medium, large, and superset data sets are provided in Table 6. Table 6 includes fitting results using only the Bottleneck metric, only the Wasserstein metric, and both metrics together. We also include results using only an exponential kernel, only a Lorentz kernel, or both kernels for fitting. All results reported here use PH features generated with a cutoff of 11 Å and include three element-specific subsets (carbon-carbon, carbon-nitrogen, carbon-oxygen). Overall, fitting methods using the various persistent homology features performed similarly. The best results came from using features generated by both kernels together with both the Bottleneck and Wasserstein distances, which yielded an average correlation coefficient of 0.73 for the superset.
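A minimal sketch of the within-protein least squares fit and the Pearson correlation of Eq. (18). The feature matrix and B factors below are synthetic stand-ins, not data from the paper's protein sets.

```python
import numpy as np

def pearson_cc(b_exp, b_pred):
    """Pearson correlation coefficient of Eq. (18)."""
    de = np.asarray(b_exp, dtype=float) - np.mean(b_exp)
    dt = np.asarray(b_pred, dtype=float) - np.mean(b_pred)
    return float(np.sum(de * dt) / np.sqrt(np.sum(de**2) * np.sum(dt**2)))

# Toy stand-ins for one protein: 24 PH features per C-alpha atom (as in the
# fitting above) and "experimental" B factors with a small noise term.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 24))                    # 120 residues x 24 features
b_exp = X @ rng.normal(size=24) + rng.normal(scale=0.1, size=120)

# Linear least squares fit with an intercept column, then the fitting PCC.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, b_exp, rcond=None)
pcc = pearson_cc(b_exp, A @ coef)
```

Because the fit is performed within a single protein, this measures how well the topological features can describe known B factors, not blind predictive power.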
The aforementioned least squares fitting methods cannot predict the B factors of unknown proteins. Machine learning methods enable us to blindly predict B factors across proteins. In this section, we utilize both gradient boosted tree and convolutional neural network algorithms for the blind prediction of B factors across different proteins. Taken together, the entire dataset contains more than 620,000 atoms. We use leave-one-protein-out cross validation in our prediction: for each protein, the data from the protein whose B factors will be predicted is excluded from the training data. This gives rise to a training set of roughly 600,000 data points (i.e., atoms and associated B factors) for each protein. The Pearson correlation coefficients using the gradient boosted tree (GBT), convolutional neural network (CNN), and consensus method (CON) for Cα B factor prediction of the small, medium, and large protein subsets are provided in Tables 8, 9, and 10, respectively. Parameters for the GBT and CNN methods can be found in Tables 2 and 3, and the global and local features used for training and testing are described in Section 2.3. Results for all proteins are provided in Table 11, and the average Pearson correlation coefficients for the small, medium, large, and superset data sets are provided in Table 7. All results reported here use a cutoff of 11 Å and include three element-specific subsets (carbon-carbon, carbon-nitrogen, carbon-oxygen). Kernel parameters for both exponential and Lorentz kernels are provided in Table 5. Results from previously existing Cα B factor prediction methods are included for comparison in Table 7. Overall, the GBT and CNN algorithms perform similarly, with the CNN slightly outperforming the GBT at average correlation coefficients over the superset of 0.60 and 0.59, respectively. The consensus method improves upon both results with an average Pearson correlation coefficient of 0.61 over the superset.
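The leave-one-protein-out protocol with a consensus of two learners can be sketched as follows. The data here are synthetic, and simple ridge regressors stand in for the GBT and CNN; only the cross-validation and averaging structure mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: three "proteins", each a block of atoms with 5 features per atom
# (a stand-in for the PH + local/global features above) and B factors.
w_true = rng.normal(size=5)
data = {}
for name, n_atoms in [("1ABA", 40), ("1CYO", 50), ("2FQ3", 30)]:
    X = rng.normal(size=(n_atoms, 5))
    data[name] = (X, X @ w_true + rng.normal(scale=0.1, size=n_atoms))

def fit_ridge(X, y, lam):
    """Ridge regression, used here as a stand-in for the GBT and CNN."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

pcc = {}
for held_out in data:
    # Leave-one-protein-out: train on every protein except the one predicted.
    X_tr = np.vstack([data[p][0] for p in data if p != held_out])
    y_tr = np.concatenate([data[p][1] for p in data if p != held_out])
    X_te, y_te = data[held_out]
    # Two different models, then a consensus by averaging their predictions.
    pred_a = X_te @ fit_ridge(X_tr, y_tr, lam=1e-3)
    pred_b = X_te @ fit_ridge(X_tr, y_tr, lam=1.0)
    consensus = 0.5 * (pred_a + pred_b)
    pcc[held_out] = float(np.corrcoef(y_te, consensus)[0, 1])
```

Averaging predictions from dissimilar models tends to cancel uncorrelated errors, which is why the consensus can edge out both of its constituents, as observed above.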
Table 7 shows that the blind-prediction machine learning models perform better than the GNM and NMA fitting models and similarly to the pfFRI fitting model.
Table 7:
Average Pearson correlation coefficients of Cα B factor predictions for the small-, medium-, and large-sized protein sets along with the entire superset of the 364-protein dataset. Gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) results are obtained by leave-one-protein-out (blind) prediction. The results of the parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM), and normal mode analysis (NMA) were obtained via least squares fitting of individual proteins.
            CNN    GBT    CON    pfFRI   GNM    NMA
Small       0.63   0.58   0.62   0.59    0.54   0.48
Medium      0.60   0.58   0.61   0.61    0.55   0.48
Large       0.58   0.59   0.58   0.59    0.53   0.49
Superset    0.60   0.59   0.61   0.63    0.57   NA
4 Conclusion

An essential component of the paradigm of protein dynamics is the correlation between protein flexibility and protein function. The sheer complexity and large number of degrees of freedom make a quantitative understanding of flexibility and function an inherently difficult problem. Several time-independent methods for predicting protein B factors exist, including NMA [23, 39, 24, 22], ENM [25], GNM [27, 28, 40], and FRI methods [30-32, 41]. None of these methods is able to blindly predict the B factors of an unknown protein. We hypothesize that the intrinsic physics of proteins lies in a low-dimensional space embedded in a high-dimensional data space. Based on this hypothesis, the authors previously introduced the graph theory based multiscale weighted colored graph (MWCG) [33, 34] and showed that MWCGs are able to successfully blindly predict cross-protein B factors.

In this work we explore this hypothesis further by creating a B factor predictor built on tools from algebraic topology. To construct localized topological representations of individual atoms from global topological tools, we propose atom-specific topology and atom-specific persistent homology. This approach creates two conjugated sets of atoms: the first set is centered around a given atom of interest, while the other set is identical except that it excludes the atom of interest. Element-specific selections are further implemented to embed biological information into atom-specific persistent homology. The distance between the topological invariants generated from these conjugated sets of atoms is used to represent the atom of interest. Both Bottleneck and Wasserstein metrics are utilized to estimate the topological distances between conjugated barcodes, and the Vietoris-Rips complex is employed for topological barcode generation. To test the proposed method we use over 300 proteins, comprising more than 600,000 B factors.
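The conjugated-set construction can be illustrated in a self-contained way for dimension-zero topology, where the Vietoris-Rips barcode is computable from a minimum spanning tree (all components are born at filtration value 0 and die at the MST edge lengths). The sorted-bar comparison below is a simple proxy for the Bottleneck metric on these barcodes, not an exact implementation, and the function names are ours.

```python
import numpy as np

def h0_deaths(points):
    """Death times of the finite 0-dimensional Vietoris-Rips bars: the edge
    lengths of a minimum spanning tree (Prim's algorithm)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    in_tree, deaths = {0}, []
    while len(in_tree) < n:
        d, j = min((dist[i, j], j) for i in in_tree
                   for j in range(n) if j not in in_tree)
        deaths.append(d)
        in_tree.add(j)
    return np.sort(np.array(deaths))

def conjugate_distance(points, atom_index):
    """Fingerprint of one atom: compare the H0 barcode of the full set with
    that of its conjugate (the set minus the atom). Bars are paired largest
    to largest; leftover bars are charged half their length, a crude proxy
    for the Bottleneck distance between the conjugated barcodes."""
    full = np.sort(h0_deaths(points))[::-1]
    conj = np.sort(h0_deaths(np.delete(points, atom_index, axis=0)))[::-1]
    matched = np.abs(full[:len(conj)] - conj)
    unmatched = full[len(conj):] / 2.0  # bars paired with the diagonal
    return float(max(matched.max(initial=0.0), unmatched.max(initial=0.0)))
```

An atom whose removal barely changes the barcode (a tightly packed, rigid environment) gets a small distance, while a structurally important atom gets a large one, which is the intuition behind using these distances as flexibility features.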
Atom-specific persistent homology features are generated using several element-specific interactions, kernel choices, parametrizations, and barcode distance metrics. First, we employ topological features to fit protein B factors using linear least squares. Using topological features, our fitting model outperformed previous fitting models with an average Pearson correlation coefficient of 0.73 over the superset of proteins. Next, we used the topological features to blindly predict the B factors of Cα atoms. We generated two machine learning models, a gradient boosted tree (GBT) and a deep convolutional neural network (CNN), and additionally averaged the Cα predictions of the two models to obtain a more robust consensus model. A variety of local and global features were included alongside the generated topological features. Our blind-prediction consensus model outperformed both the GNM and NMA fitting models and produced results similar to those of the pfFRI fitting model.

To the authors' knowledge, this work is the first time persistent homology has been used to predict the B factors of atoms in a protein. This approach is novel because topology is a global property and on its own cannot describe local atomic information. Our approach creates local topological representations with a variety of customizable parameters using a global mathematical tool, allowing the model to account for multiple spatial interaction scales and element-specific interactions. Our results demonstrate that this is an accurate and robust approach. Moreover, the results could be improved further by including a larger dataset, fine-tuning parameters, and exploring different machine learning approaches.

This method can be applied to a variety of interesting applications related to protein dynamics. Examples include allosteric site detection, computer-aided drug design, hinge detection, hot spot identification, and protein folding stability changes upon mutation.
More generally, this method may be amenable to problems beyond proteins, such as network dynamics and social network centrality measures.
Acknowledgment
This work was supported in part by NSF Grants DMS-1721024 and DMS-1761320, and NIH Grant GM126189.
Table 8:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) for the small-sized protein set.

PDB ID   N    GBT    CNN    CON
1AIE     31   0.75   0.7    0.78
1AKG     16   0.27   0.32   0.29
1BX7     51   0.74   0.74   0.76
1ETL     12   0.37   0.82   0.55
1ETM     12   0.37   0.63   0.43
1ETN     12   0.07   0.48   0.13
1FF4     65   0.61   0.66   0.64
1GK7     39   0.77   0.9    0.82
1GVD     56   0.71   0.55   0.69
1HJE     13   0.84   0.75   0.9
1KYC     15   0.62   0.69   0.66
1NOT     13   0.69   0.96   0.8
1O06     22   0.94   0.93   0.95
1P9I     29   0.73   0.73   0.74
1PEF     18   0.79   0.82   0.82
1PEN     16   0.36   0.74   0.44
1Q9B     44   0.59   0.85   0.67
1RJU     36   0.6    0.46   0.58
1U06     55   0.44   0.4    0.45
1UOY     64   0.72   0.7    0.76
1USE     47   0.05   0.32   0.12
1VRZ     13   0.54   0.34   0.54
1XY2     8    0.79   0.82   0.81
1YJO     6    0.7    -0.06  0.57
1YZM     46   0.69   0.64   0.7
2DSX     52   0.34   0.34   0.36
2JKU     38   0.57   0.71   0.66
2NLS     36   0.23   0.47   0.29
2OL9     6    0.94   0.85   0.94
6RXN     45   0.59   0.6    0.61
APPENDIX
Table 9:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) for the medium-sized protein set.

PDB ID   N     GBT    CNN    CON
1ABA     87    0.73   0.71   0.74
1CYO     88    0.64   0.7    0.68
1FK5     93    0.59   0.6    0.61
1GXU     89    0.67   0.68   0.69
1I71     83    0.53   0.58   0.56
1LR7     73    0.62   0.61   0.64
1N7E     95    0.63   0.58   0.65
1NNX     93    0.78   0.79   0.8
1NOA     113   0.55   0.53   0.56
1OPD     85    0.42   0.34   0.41
1QAU     112   0.51   0.59   0.57
1R7J     90    0.71   0.77   0.75
1UHA     82    0.71   0.74   0.73
1ULR     87    0.54   0.53   0.56
1USM     77    0.73   0.72   0.75
1V05     96    0.6    0.64   0.63
1W2L     97    0.43   0.5    0.47
1X3O     80    0.41   0.43   0.44
1Z21     96    0.68   0.65   0.69
1ZVA     75    0.7    0.7    0.71
2BF9     35    0.48   0.79   0.58
2BRF     103   0.72   0.77   0.75
2CE0     109   0.6    0.66   0.64
2E3H     81    0.65   0.68   0.67
2EAQ     89    0.57   0.63   0.61
2EHS     75    0.62   0.67   0.65
2FQ3     85    0.77   0.82   0.81
2IP6     87    0.6    0.66   0.63
2MCM     112   0.71   0.77   0.75
2NUH     104   0.72   0.56   0.7
2PKT     93    0.01   -0.04  -0.01
2PLT     98    0.52   0.53   0.54
2QJL     107   0.54   0.57   0.56
2RB8     93    0.67   0.7    0.7
3BZQ     99    0.45   0.53   0.49
5CYT     103   0.39   0.34   0.39
APPENDIX
Table 10:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus (CON) for the large-sized protein set.

PDB ID   N     GBT    CNN    CON
1AHO     66    0.66   0.66   0.7
1ATG     231   0.55   0.51   0.55
1BYI     238   0.61   0.5    0.6
1CCR     109   0.55   0.6    0.59
1E5K     188   0.74   0.72   0.74
1EW4     106   0.59   0.6    0.61
1IFR     113   0.7    0.64   0.7
1NLS     238   0.55   0.57   0.57
1O08     221   0.49   0.47   0.49
1PMY     123   0.59   0.7    0.65
1PZ4     113   0.72   0.8    0.77
1QTO     122   0.53   0.48   0.54
1RRO     108   0.4    0.45   0.43
1UKU     102   0.75   0.76   0.77
1V70     105   0.63   0.62   0.64
1WBE     206   0.6    0.56   0.6
1WHI     122   0.59   0.56   0.6
1WPA     107   0.65   0.65   0.67
2AGK     233   0.67   0.63   0.67
2C71     225   0.57   0.6    0.6
2CG7     110   0.3    0.32   0.32
2CWS     235   0.61   0.47   0.6
2HQK     232   0.77   0.77   0.78
2HYK     237   0.65   0.63   0.65
2I24     113   0.44   0.46   0.46
2IMF     203   0.53   0.58   0.56
2PPN     122   0.64   0.54   0.63
2R16     185   0.44   0.49   0.46
2V9V     149   0.53   0.52   0.54
2VIM     114   0.44   0.47   0.47
2VPA     217   0.66   0.75   0.71
2VYO     207   0.6    0.63   0.63
3SEB     238   0.63   0.6    0.63
3VUB     101   0.59   0.55   0.59
Table 11:
Pearson correlation coefficients for cross-protein blind Cα B factor prediction obtained by gradient boosted tree (GBT), convolutional neural network (CNN), and consensus method (CON) for the superset.

PDB ID N GBT CNN CON   PDB ID N GBT CNN CON
Table 15:
Pearson correlation coefficients of least squares fitting Cα B factor prediction of all proteins using an 11 Å cutoff. Both the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
               B & W             B                 W
PDB ID   N     Exp  Lor  Both    Exp  Lor  Both    Exp  Lor  Both
1ABA     87    0.67 0.67 0.76    0.54 0.62 0.68    0.56 0.63 0.70
1AHO     66    0.75 0.78 0.88    0.72 0.73 0.79    0.53 0.65 0.75
1AIE     31    0.97 0.88 0.99    0.78 0.64 0.90    0.90 0.77 0.96
1AKG     16    0.82 0.66 1.00    0.60 0.53 0.72    0.53 0.56 0.87
1ATG     231   0.50 0.50 0.61    0.45 0.47 0.53    0.38 0.48 0.51
1BGF     124   0.75 0.70 0.82    0.64 0.54 0.75    0.68 0.61 0.75
1BX7     51    0.86 0.74 0.89    0.79 0.68 0.82    0.81 0.69 0.82
1BYI     238   0.50 0.51 0.58    0.41 0.46 0.49    0.44 0.48 0.54
1CCR     109   0.65 0.66 0.71    0.53 0.56 0.65    0.43 0.58 0.63
1CYO     88    0.71 0.69 0.78    0.66 0.58 0.68    0.65 0.59 0.67
1DF4     57    0.93 0.92 0.97    0.92 0.89 0.95    0.88 0.91 0.94
1E5K     188   0.67 0.68 0.74    0.66 0.67 0.68    0.63 0.67 0.69
1ES5     260    0.58 0.57 0.65    0.51 0.55 0.58    0.44 0.56 0.60
1ETL     12     1.00 1.00 1.00    0.68 0.87 1.00    0.95 0.98 1.00
1ETM     12     1.00 1.00 1.00    0.45 0.74 0.86    0.70 0.83 1.00
1ETN     12     1.00 1.00 1.00    0.96 0.92 0.99    0.70 0.92 1.00
1EW4     106    0.58 0.60 0.73    0.52 0.51 0.55    0.55 0.55 0.62
1F8R     1932   0.61 0.63 0.70    0.59 0.62 0.63    0.50 0.62 0.65
1FF4     65     0.77 0.72 0.80    0.70 0.65 0.75    0.68 0.68 0.76
1FK5     93     0.53 0.59 0.71    0.49 0.50 0.58    0.49 0.50 0.55
1GCO     1044   0.63 0.64 0.66    0.59 0.63 0.63    0.53 0.63 0.65
1GK7     39     0.95 0.94 0.98    0.91 0.93 0.95    0.88 0.92 0.94
1GVD     56     0.75 0.68 0.84    0.67 0.63 0.69    0.61 0.62 0.66
1GXU     89     0.75 0.78 0.82    0.72 0.61 0.75    0.69 0.72 0.77
1H6V     2927   0.29 0.31 0.33    0.28 0.29 0.30    0.23 0.29 0.30
1HJE     13     1.00 1.00 1.00    0.72 0.79 1.00    0.67 0.57 1.00
1I71     83     0.44 0.66 0.76    0.41 0.46 0.56    0.38 0.58 0.59
1IDP     441    0.48 0.47 0.55    0.43 0.45 0.47    0.39 0.46 0.48
1IFR     113    0.65 0.59 0.73    0.56 0.54 0.65    0.47 0.53 0.62
1K8U     87     0.72 0.74 0.85    0.67 0.64 0.71    0.65 0.67 0.75
1KMM     1499   0.57 0.54 0.59    0.49 0.53 0.54    0.36 0.53 0.57
1KNG     144    0.52 0.51 0.61    0.43 0.47 0.51    0.43 0.50 0.53
1KR4     107    0.57 0.48 0.60    0.39 0.47 0.53    0.45 0.45 0.54
1KYC     15     0.96 0.99 1.00    0.92 0.93 0.99    0.88 0.88 1.00
1LR7     73     0.61 0.62 0.71    0.57 0.55 0.63    0.46 0.56 0.58
1MF7     194    0.56 0.59 0.67    0.55 0.57 0.59    0.50 0.58 0.59
1N7E     95     0.67 0.71 0.80    0.54 0.68 0.72    0.54 0.63 0.73
1NKD     59     0.73 0.69 0.89    0.56 0.58 0.63    0.55 0.65 0.75
1NLS     238    0.81 0.78 0.86    0.75 0.65 0.83    0.80 0.72 0.82
1NNX     93     0.84 0.84 0.88    0.81 0.79 0.83    0.81 0.81 0.86
1NOA     113    0.63 0.65 0.72    0.60 0.57 0.63    0.53 0.57 0.59
1NOT     13     1.00 1.00 1.00    0.82 0.86 1.00    0.86 0.81 1.00
1O06     22     0.98 0.97 1.00    0.96 0.92 0.97    0.97 0.94 0.98
1O08     221    0.46 0.48 0.56    0.44 0.42 0.50    0.37 0.45 0.48
1OB4     5      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
1OB7     5      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
1OPD     85     0.35 0.29 0.57    0.25 0.21 0.36    0.29 0.19 0.36
1P9I     29     0.89 0.88 0.98    0.87 0.82 0.92    0.87 0.84 0.89
1PEF     18     0.96 0.97 1.00    0.88 0.94 0.96    0.92 0.94 0.96
1PEN     16     0.96 0.90 1.00    0.60 0.67 0.83    0.47 0.73 0.94
1PMY     123    0.71 0.70 0.76    0.62 0.59 0.67    0.68 0.69 0.71
1PZ4     113    0.88 0.82 0.93    0.86 0.74 0.89    0.85 0.76 0.88
1Q9B     44     0.79 0.76 0.94    0.58 0.59 0.69    0.69 0.57 0.71
1QAU     112    0.59 0.61 0.66    0.57 0.55 0.58    0.55 0.57 0.58
1QKI     3912   0.38 0.42 0.45    0.34 0.38 0.41    0.32 0.38 0.40
1QTO     122    0.59 0.59 0.65    0.48 0.46 0.53    0.55 0.52 0.56
1R29     122    0.71 0.56 0.76    0.55 0.35 0.69    0.69 0.43 0.72
1R7J     90     0.88 0.86 0.91    0.83 0.76 0.87    0.81 0.79 0.86
1RJU     36     0.81 0.74 0.91    0.75 0.69 0.81    0.62 0.65 0.72
1RRO     108    0.39 0.35 0.56    0.31 0.23 0.45    0.33 0.19 0.45
1SAU     123    0.76 0.75 0.81    0.70 0.73 0.75    0.68 0.74 0.76
1TGR     111    0.77 0.76 0.83    0.72 0.70 0.74    0.74 0.73 0.75
1TZV     157    0.76 0.78 0.83    0.73 0.71 0.77    0.69 0.70 0.74
1U06     55     0.50 0.52 0.72    0.37 0.36 0.52    0.46 0.39 0.55
1U7I     259    0.71 0.71 0.73    0.62 0.68 0.70    0.53 0.67 0.71
1U9C     220    0.66 0.65 0.74    0.61 0.57 0.64    0.61 0.60 0.67
1UHA     82     0.70 0.75 0.82    0.69 0.68 0.74    0.67 0.69 0.73
1UKU     102    0.80 0.81 0.84    0.78 0.80 0.80    0.74 0.80 0.80
1ULR     87     0.56 0.53 0.68    0.49 0.50 0.59    0.44 0.50 0.61
1UOY     64     0.73 0.72 0.83    0.65 0.66 0.69    0.65 0.69 0.73
1USE     47     0.66 0.75 0.91    0.50 0.52 0.72    0.46 0.53 0.64
1USM     77     0.62 0.61 0.81    0.57 0.53 0.66    0.61 0.58 0.65
1UTG     70     0.57 0.53 0.68    0.51 0.49 0.60    0.49 0.49 0.56
1V05     96     0.67 0.66 0.72    0.60 0.61 0.65    0.52 0.61 0.65
1V70     105    0.64 0.65 0.75    0.56 0.60 0.66    0.51 0.58 0.62
1VRZ     13     1.00 1.00 1.00    0.92 0.92 1.00    0.77 0.85 1.00
1W2L     97     0.72 0.72 0.79    0.60 0.63 0.69    0.56 0.61 0.69
1WBE     206    0.53 0.47 0.63    0.43 0.38 0.55    0.36 0.42 0.48
1WHI     122    0.57 0.55 0.63    0.42 0.44 0.57    0.34 0.43 0.55
1WLY     322    0.62 0.64 0.67    0.59 0.62 0.63    0.54 0.62 0.64
1WPA     107    0.70 0.69 0.79    0.61 0.52 0.71    0.66 0.56 0.70
1X3O     80     0.66 0.66 0.72    0.62 0.60 0.65    0.62 0.64 0.67
1XY1     16     0.97 0.96 1.00    0.73 0.66 0.87    0.81 0.89 0.99
1XY2     8      1.00 1.00 1.00    0.99 0.95 1.00    0.91 0.91 1.00
1Y6X     86     0.56 0.53 0.62    0.50 0.49 0.59    0.50 0.52 0.56
1YJO     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
1YZM     46     0.87 0.90 0.95    0.82 0.72 0.88    0.86 0.84 0.90
1Z21     96     0.70 0.73 0.82    0.61 0.63 0.64    0.64 0.69 0.72
1ZCE     139    0.84 0.83 0.88    0.83 0.77 0.85    0.81 0.78 0.82
1ZVA     75     0.85 0.85 0.94    0.84 0.78 0.92    0.83 0.81 0.86
2A50     469    0.64 0.63 0.70    0.54 0.60 0.67    0.41 0.58 0.67
2AGK     233    0.65 0.65 0.69    0.61 0.64 0.65    0.55 0.63 0.67
2AH1     939    0.45 0.47 0.49    0.42 0.45 0.46    0.33 0.46 0.48
2B0A     191    0.59 0.60 0.69    0.50 0.58 0.62    0.48 0.59 0.63
2BCM     415    0.46 0.41 0.50    0.39 0.39 0.40    0.35 0.39 0.45
2BF9     35     0.94 0.73 0.97    0.70 0.65 0.78    0.89 0.71 0.92
2BRF     103    0.74 0.73 0.76    0.74 0.71 0.74    0.72 0.72 0.75
2C71     225    0.45 0.38 0.56    0.29 0.33 0.42    0.23 0.30 0.48
2CE0     109    0.77 0.79 0.86    0.75 0.73 0.80    0.71 0.77 0.79
2CG7     110    0.32 0.44 0.63    0.29 0.31 0.36    0.30 0.33 0.41
2COV     534    0.66 0.64 0.70    0.63 0.64 0.67    0.57 0.64 0.67
2CWS     235    0.59 0.55 0.66    0.53 0.52 0.54    0.40 0.52 0.55
2D5W     1214   0.52 0.52 0.54    0.49 0.52 0.52    0.41 0.52 0.53
2DKO     253    0.75 0.72 0.79    0.72 0.69 0.75    0.68 0.69 0.72
2DPL     565    0.35 0.36 0.41    0.30 0.32 0.35    0.24 0.33 0.37
2DSX     52     0.54 0.50 0.78    0.37 0.30 0.56    0.41 0.36 0.55
2E10     439    0.60 0.59 0.65    0.51 0.58 0.61    0.43 0.57 0.62
2E3H     81     0.66 0.71 0.82    0.62 0.69 0.76    0.56 0.69 0.78
2EAQ     89     0.81 0.77 0.86    0.78 0.72 0.81    0.77 0.76 0.82
2EHP     246    0.63 0.65 0.71    0.58 0.62 0.65    0.52 0.62 0.64
2EHS     75     0.75 0.73 0.81    0.72 0.71 0.74    0.69 0.71 0.73
2ERW     53     0.62 0.41 0.84    0.33 0.26 0.60    0.31 0.28 0.49
2ETX     390    0.54 0.54 0.57    0.52 0.53 0.56    0.47 0.51 0.54
2FB6     129    0.71 0.66 0.76    0.67 0.63 0.69    0.65 0.63 0.74
2FG1     176    0.55 0.56 0.62    0.54 0.52 0.58    0.52 0.54 0.57
2FN9     560    0.51 0.49 0.62    0.44 0.47 0.55    0.41 0.46 0.55
2FQ3     85     0.78 0.76 0.82    0.75 0.75 0.79    0.68 0.75 0.78
2G69     99     0.59 0.65 0.76    0.42 0.50 0.66    0.47 0.45 0.60
2G7O     68     0.89 0.91 0.95    0.85 0.79 0.88    0.76 0.82 0.87
2G7S     206    0.63 0.60 0.66    0.59 0.58 0.63    0.54 0.59 0.63
2GKG     150    0.77 0.71 0.83    0.74 0.65 0.78    0.76 0.67 0.78
2GOM     121    0.47 0.52 0.64    0.42 0.42 0.45    0.44 0.47 0.53
2GXG     140    0.74 0.72 0.79    0.71 0.68 0.72    0.69 0.68 0.73
2GZQ     203    0.45 0.40 0.60    0.38 0.34 0.48    0.24 0.29 0.31
2HQK     232    0.80 0.79 0.83    0.70 0.74 0.80    0.68 0.76 0.81
2HYK     237    0.59 0.58 0.63    0.51 0.55 0.59    0.43 0.54 0.60
2I24     113    0.47 0.44 0.69    0.40 0.40 0.48    0.45 0.40 0.49
2I49     399    0.54 0.53 0.62    0.43 0.51 0.56    0.41 0.49 0.58
2IBL     108    0.69 0.71 0.75    0.66 0.67 0.70    0.65 0.68 0.71
2IGD     61     0.67 0.72 0.84    0.61 0.64 0.74    0.61 0.66 0.74
2IMF     203    0.61 0.65 0.71    0.59 0.56 0.60    0.59 0.59 0.64
2IP6     87     0.72 0.66 0.82    0.66 0.58 0.73    0.64 0.64 0.78
2IVY     89     0.43 0.53 0.69    0.35 0.45 0.48    0.34 0.42 0.57
2J32     244    0.77 0.72 0.85    0.73 0.68 0.77    0.73 0.68 0.77
2J9W     203    0.59 0.60 0.70    0.55 0.59 0.64    0.51 0.59 0.62
2JKU     38     0.89 0.75 0.95    0.85 0.65 0.88    0.83 0.60 0.88
2JLI     112    0.87 0.81 0.90    0.82 0.70 0.85    0.85 0.78 0.86
2JLJ     121    0.78 0.75 0.80    0.71 0.65 0.74    0.74 0.71 0.76
2MCM     112    0.80 0.80 0.85    0.78 0.77 0.81    0.75 0.77 0.82
2NLS     36     0.75 0.66 0.88    0.61 0.32 0.76    0.49 0.47 0.69
2NR7     193    0.75 0.75 0.79    0.74 0.72 0.76    0.71 0.73 0.77
2NUH     104    0.77 0.74 0.85    0.73 0.63 0.81    0.75 0.66 0.80
2O6X     309    0.74 0.75 0.78    0.70 0.73 0.75    0.65 0.73 0.75
2OA2     140    0.63 0.64 0.70    0.55 0.49 0.60    0.60 0.63 0.67
2OHW     257    0.35 0.39 0.48    0.29 0.32 0.35    0.27 0.34 0.38
2OKT     377    0.43 0.37 0.49    0.31 0.36 0.40    0.22 0.33 0.46
2OL9     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
2OLX     4      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
2PKT     93     0.44 0.39 0.69    0.40 0.35 0.55    0.36 0.36 0.43
2PLT     98     0.66 0.63 0.72    0.57 0.59 0.67    0.52 0.59 0.66
2PMR     83     0.69 0.68 0.80    0.59 0.62 0.68    0.65 0.65 0.69
2POF     428    0.62 0.56 0.66    0.48 0.55 0.60    0.44 0.54 0.63
2PPN     122    0.57 0.61 0.74    0.51 0.59 0.63    0.44 0.57 0.63
2PSF     608    0.43 0.45 0.53    0.41 0.44 0.45    0.37 0.42 0.44
2PTH     193    0.71 0.71 0.77    0.65 0.70 0.73    0.61 0.69 0.72
2Q4N     1208   0.65 0.62 0.68    0.58 0.55 0.59    0.55 0.57 0.61
2Q52     3296   0.65 0.66 0.70    0.62 0.56 0.64    0.63 0.57 0.65
2QJL     107    0.45 0.52 0.63    0.42 0.46 0.50    0.41 0.49 0.51
2R16     185    0.50 0.51 0.66    0.46 0.45 0.51    0.45 0.46 0.52
2R6Q     149    0.71 0.72 0.76    0.66 0.68 0.70    0.62 0.65 0.67
2RB8     93     0.81 0.78 0.84    0.78 0.75 0.80    0.74 0.76 0.81
2RE2     249    0.64 0.65 0.70    0.57 0.59 0.61    0.59 0.60 0.63
2RFR     166    0.73 0.66 0.80    0.68 0.57 0.74    0.72 0.59 0.74
2V9V     149    0.60 0.51 0.66    0.53 0.48 0.56    0.55 0.50 0.62
2VE8     515    0.46 0.48 0.55    0.42 0.41 0.44    0.40 0.43 0.47
2VH7     94     0.59 0.54 0.68    0.52 0.49 0.63    0.42 0.49 0.54
2VIM     114    0.38 0.33 0.52    0.29 0.28 0.41    0.24 0.31 0.40
2VPA     217    0.73 0.75 0.78    0.72 0.71 0.73    0.68 0.73 0.74
2VQ4     106    0.56 0.54 0.64    0.43 0.49 0.56    0.35 0.46 0.58
2VY8     162    0.47 0.46 0.58    0.38 0.42 0.46    0.38 0.42 0.49
2VYO     207    0.68 0.70 0.77    0.64 0.66 0.72    0.59 0.68 0.70
2W1V     551    0.69 0.67 0.77    0.63 0.63 0.70    0.56 0.64 0.68
2W2A     350    0.60 0.59 0.65    0.57 0.56 0.59    0.54 0.57 0.60
2W6A     139    0.59 0.59 0.64    0.51 0.52 0.54    0.52 0.56 0.60
2WJ5     110    0.63 0.55 0.79    0.59 0.52 0.68    0.59 0.53 0.64
2WUJ     103    0.69 0.68 0.79    0.62 0.52 0.65    0.67 0.59 0.71
2WW7     161    0.44 0.48 0.60    0.40 0.42 0.50    0.33 0.43 0.49
2WWE     120    0.71 0.71 0.83    0.62 0.62 0.75    0.61 0.58 0.73
2X1Q     240    0.48 0.44 0.54    0.38 0.39 0.46    0.34 0.37 0.47
2X25     167    0.62 0.61 0.73    0.56 0.57 0.64    0.57 0.57 0.64
2X3M     175    0.61 0.61 0.69    0.60 0.55 0.64    0.57 0.57 0.60
2X5Y     185    0.67 0.63 0.71    0.60 0.59 0.64    0.53 0.58 0.69
2X9Z     266    0.50 0.42 0.54    0.37 0.38 0.42    0.38 0.39 0.51
2XHF     310    0.62 0.62 0.67    0.58 0.56 0.60    0.55 0.62 0.63
2Y0T     111    0.69 0.68 0.83    0.60 0.61 0.68    0.56 0.64 0.70
2Y72     183    0.71 0.71 0.78    0.69 0.69 0.72    0.66 0.70 0.71
2Y7L     323    0.68 0.70 0.72    0.66 0.68 0.69    0.58 0.69 0.69
2Y9F     149    0.75 0.72 0.78    0.65 0.69 0.71    0.58 0.70 0.74
2YLB     418    0.55 0.52 0.63    0.46 0.49 0.52    0.34 0.49 0.59
2YNY     326    0.63 0.67 0.75    0.60 0.62 0.63    0.56 0.63 0.66
2ZCM     348    0.42 0.39 0.49    0.34 0.35 0.40    0.24 0.32 0.43
2ZU1     360    0.61 0.61 0.68    0.53 0.58 0.63    0.45 0.58 0.63
3A0M     146    0.74 0.76 0.84    0.68 0.70 0.72    0.61 0.73 0.78
3A7L     128    0.69 0.61 0.78    0.52 0.45 0.59    0.62 0.54 0.67
3AMC     614    0.54 0.53 0.64    0.47 0.50 0.54    0.37 0.51 0.57
3AUB     124    0.36 0.41 0.53    0.31 0.26 0.41    0.32 0.32 0.37
3B5O     249    0.55 0.58 0.66    0.52 0.56 0.63    0.46 0.55 0.57
3BA1     312    0.67 0.66 0.72    0.64 0.65 0.68    0.60 0.65 0.70
3BED     262    0.61 0.55 0.67    0.53 0.53 0.56    0.44 0.53 0.61
3BQX     136    0.52 0.50 0.54    0.47 0.48 0.51    0.41 0.46 0.51
3BZQ     99     0.57 0.62 0.69    0.50 0.55 0.61    0.47 0.55 0.59
3BZZ     103    0.60 0.63 0.68    0.51 0.58 0.61    0.45 0.50 0.59
3DRF     567    0.32 0.32 0.38    0.27 0.29 0.33    0.22 0.30 0.34
3DWV     359    0.67 0.63 0.69    0.62 0.62 0.66    0.54 0.62 0.65
3E5T     268    0.55 0.52 0.60    0.51 0.51 0.56    0.38 0.50 0.55
3E7R     40     0.81 0.86 0.96    0.78 0.77 0.81    0.73 0.82 0.88
3EUR     150    0.49 0.46 0.53    0.39 0.43 0.47    0.31 0.42 0.47
3F2Z     148    0.76 0.78 0.84    0.75 0.76 0.78    0.69 0.77 0.78
3F7E     261    0.66 0.65 0.71    0.61 0.64 0.65    0.47 0.63 0.69
3FCN     185    0.60 0.65 0.75    0.56 0.59 0.64    0.54 0.59 0.67
3FE7     89     0.69 0.65 0.76    0.58 0.60 0.67    0.54 0.63 0.70
3FKE     250    0.47 0.42 0.52    0.40 0.36 0.49    0.34 0.36 0.45
3FMY     75     0.71 0.69 0.79    0.66 0.64 0.70    0.66 0.66 0.71
3FOD     48     0.48 0.47 0.82    0.42 0.33 0.55    0.38 0.35 0.48
3FSO     238    0.82 0.82 0.85    0.77 0.74 0.77    0.77 0.81 0.82
3FTD     257    0.60 0.57 0.67    0.49 0.52 0.59    0.41 0.52 0.60
3G1S     418    0.44 0.51 0.68    0.41 0.45 0.51    0.38 0.45 0.49
3GBW     170    0.77 0.78 0.84    0.64 0.74 0.79    0.51 0.71 0.81
3GHJ     129    0.71 0.71 0.81    0.65 0.67 0.72    0.65 0.68 0.72
3HFO     216    0.75 0.72 0.82    0.70 0.63 0.75    0.65 0.69 0.74
3HHP     1314   0.61 0.62 0.68    0.57 0.59 0.62    0.52 0.59 0.63
3HNY     170    0.59 0.56 0.64    0.47 0.52 0.57    0.42 0.49 0.56
3HP4     201    0.60 0.61 0.72    0.57 0.54 0.64    0.43 0.56 0.62
3HWU     155    0.60 0.69 0.81    0.57 0.61 0.63    0.50 0.61 0.68
3HYD     8      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
3HZ8     200    0.58 0.59 0.66    0.55 0.53 0.56    0.52 0.54 0.58
3I2V     127    0.57 0.58 0.66    0.51 0.53 0.61    0.40 0.48 0.53
3I2Z     140    0.58 0.59 0.65    0.52 0.54 0.56    0.56 0.57 0.61
3I4O     154    0.63 0.64 0.73    0.58 0.59 0.60    0.56 0.63 0.66
3I7M     145    0.58 0.62 0.71    0.53 0.55 0.58    0.49 0.58 0.64
3IHS     173    0.62 0.67 0.74    0.58 0.54 0.60    0.58 0.60 0.62
3IVV     168    0.80 0.80 0.89    0.75 0.76 0.83    0.68 0.74 0.79
3K6Y     227    0.53 0.53 0.60    0.48 0.49 0.52    0.42 0.50 0.55
3KBE     166    0.62 0.61 0.65    0.57 0.60 0.62    0.52 0.60 0.61
3KGK     190    0.79 0.80 0.84    0.77 0.79 0.81    0.68 0.79 0.80
3KZD     94     0.79 0.72 0.83    0.55 0.68 0.77    0.47 0.66 0.78
3L41     219    0.61 0.62 0.71    0.59 0.60 0.66    0.57 0.59 0.67
3LAA     176    0.70 0.66 0.80    0.68 0.56 0.76    0.69 0.60 0.77
3LAX     118    0.81 0.81 0.86    0.80 0.76 0.83    0.77 0.78 0.82
3LG3     846    0.40 0.38 0.41    0.36 0.37 0.40    0.32 0.37 0.41
3LJI     270    0.53 0.53 0.62    0.47 0.52 0.58    0.45 0.52 0.56
3M3P     244    0.47 0.44 0.69    0.40 0.40 0.58    0.25 0.35 0.48
3M8J     178    0.74 0.72 0.75    0.69 0.69 0.73    0.67 0.70 0.73
3M9J     250    0.57 0.56 0.59    0.53 0.54 0.56    0.39 0.53 0.56
3M9Q     190    0.53 0.52 0.59    0.50 0.51 0.53    0.46 0.50 0.51
3MAB     180    0.57 0.56 0.62    0.52 0.47 0.55    0.56 0.51 0.56
3MD4     13     1.00 1.00 1.00    0.91 0.94 1.00    0.93 0.99 1.00
3MD5     14     1.00 1.00 1.00    0.98 0.93 1.00    0.94 0.92 1.00
3MEA     170    0.58 0.58 0.68    0.57 0.57 0.64    0.48 0.57 0.59
3MGN     277    0.33 0.32 0.47    0.26 0.28 0.30    0.16 0.29 0.39
3MRE     446    0.40 0.38 0.45    0.32 0.36 0.40    0.24 0.35 0.41
3N11     325    0.43 0.45 0.51    0.42 0.44 0.45    0.38 0.44 0.45
3NE0     208    0.77 0.79 0.84    0.75 0.70 0.77    0.70 0.76 0.82
3NGG     97     0.80 0.81 0.85    0.72 0.74 0.78    0.74 0.76 0.80
3NPV     500    0.44 0.44 0.50    0.40 0.42 0.44    0.36 0.43 0.47
3NVG     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
3NZL     70     0.68 0.61 0.84    0.53 0.49 0.66    0.59 0.55 0.67
3O0P     197    0.62 0.64 0.71    0.59 0.62 0.64    0.53 0.62 0.64
3O5P     147    0.64 0.60 0.71    0.55 0.57 0.60    0.53 0.56 0.64
3OBQ     150    0.59 0.59 0.66    0.46 0.49 0.58    0.53 0.56 0.58
3OQY     236    0.71 0.66 0.73    0.63 0.64 0.70    0.60 0.64 0.72
3P6J     145    0.75 0.73 0.81    0.69 0.71 0.73    0.61 0.71 0.75
3PD7     216    0.65 0.66 0.72    0.62 0.60 0.65    0.60 0.61 0.65
3PES     166    0.70 0.72 0.79    0.58 0.63 0.70    0.52 0.60 0.66
3PID     387    0.50 0.49 0.56    0.44 0.48 0.53    0.37 0.46 0.51
3PIW     161    0.66 0.67 0.78    0.60 0.63 0.70    0.56 0.63 0.72
3PKV     229    0.50 0.52 0.63    0.43 0.48 0.53    0.35 0.50 0.57
3PSM     94     0.83 0.78 0.88    0.79 0.77 0.83    0.68 0.76 0.79
3PTL     289    0.50 0.50 0.53    0.49 0.49 0.50    0.43 0.49 0.50
3PVE     363    0.45 0.45 0.59    0.37 0.39 0.44    0.41 0.42 0.45
3PZ9     357    0.51 0.45 0.57    0.36 0.38 0.42    0.34 0.39 0.50
3PZZ     12     1.00 1.00 1.00    0.95 0.90 1.00    0.94 0.80 1.00
3Q2X     6      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
3Q6L     131    0.39 0.44 0.56    0.33 0.31 0.37    0.34 0.37 0.42
3QDS     284    0.63 0.62 0.69    0.59 0.59 0.65    0.51 0.59 0.64
3QPA     212    0.68 0.66 0.78    0.45 0.45 0.47    0.59 0.59 0.65
3R6D     222    0.65 0.66 0.73    0.62 0.63 0.65    0.53 0.64 0.69
3R87     148    0.48 0.47 0.55    0.41 0.44 0.48    0.40 0.45 0.47
3RQ9     165    0.51 0.47 0.61    0.41 0.44 0.52    0.39 0.45 0.56
3RY0     128    0.44 0.45 0.54    0.40 0.40 0.47    0.41 0.42 0.47
3RZY     151    0.65 0.65 0.84    0.59 0.54 0.65    0.57 0.51 0.59
3S0A     132    0.39 0.43 0.52    0.33 0.34 0.38    0.32 0.31 0.37
3SD2     100    0.65 0.67 0.77    0.64 0.63 0.69    0.56 0.63 0.67
3SEB     238    0.63 0.66 0.77    0.62 0.61 0.68    0.61 0.62 0.67
3SED     126    0.39 0.45 0.55    0.28 0.29 0.38    0.33 0.33 0.40
3SO6     157    0.67 0.71 0.78    0.63 0.69 0.73    0.55 0.64 0.70
3SR3     657    0.45 0.44 0.48    0.43 0.41 0.45    0.39 0.43 0.44
3SUK     254    0.53 0.54 0.64    0.46 0.48 0.54    0.47 0.49 0.57
3SZH     753    0.53 0.53 0.57    0.51 0.51 0.52    0.45 0.52 0.53
3T0H     209    0.76 0.73 0.78    0.72 0.69 0.74    0.68 0.71 0.76
3T3K     122    0.66 0.66 0.72    0.55 0.62 0.68    0.48 0.60 0.68
3T47     145    0.54 0.54 0.78    0.45 0.45 0.62    0.43 0.47 0.54
3TDN     359    0.47 0.43 0.53    0.43 0.42 0.44    0.38 0.43 0.49
3TOW     155    0.66 0.65 0.74    0.58 0.61 0.66    0.53 0.60 0.65
3TUA     226    0.57 0.55 0.63    0.52 0.50 0.55    0.45 0.52 0.54
3TYS     78     0.78 0.58 0.86    0.67 0.48 0.73    0.70 0.46 0.75
3U6G     276    0.44 0.39 0.54    0.39 0.37 0.45    0.27 0.35 0.48
3U97     85     0.78 0.78 0.84    0.77 0.73 0.80    0.77 0.76 0.80
3UCI     72     0.67 0.64 0.72    0.48 0.53 0.57    0.55 0.56 0.63
3UR8     637    0.52 0.53 0.60    0.49 0.51 0.55    0.45 0.52 0.53
3US6     159    0.60 0.56 0.67    0.55 0.49 0.62    0.53 0.46 0.59
3V1A     59     0.74 0.57 0.95    0.51 0.53 0.77    0.39 0.46 0.68
3V75     294    0.50 0.49 0.57    0.48 0.46 0.53    0.47 0.47 0.53
3VN0     193    0.87 0.88 0.90    0.86 0.87 0.88    0.79 0.88 0.89
3VOR     219    0.64 0.58 0.70    0.56 0.52 0.63    0.53 0.55 0.63
3VUB     101    0.65 0.60 0.71    0.60 0.56 0.61    0.61 0.57 0.64
3VVV     112    0.64 0.64 0.79    0.55 0.48 0.65    0.57 0.49 0.58
3VZ9     163    0.65 0.64 0.70    0.60 0.55 0.63    0.60 0.60 0.67
3W4Q     826    0.61 0.60 0.68    0.56 0.59 0.61    0.47 0.60 0.64
3ZBD     213    0.36 0.47 0.74    0.24 0.28 0.34    0.25 0.31 0.36
3ZIT     157    0.51 0.47 0.59    0.36 0.39 0.47    0.47 0.41 0.52
3ZRX     241    0.56 0.56 0.63    0.49 0.52 0.53    0.46 0.52 0.56
3ZSL     165    0.39 0.39 0.54    0.28 0.22 0.40    0.31 0.24 0.37
3ZZP     74     0.40 0.30 0.47    0.19 0.27 0.31    0.12 0.22 0.40
3ZZY     226    0.65 0.67 0.69    0.63 0.63 0.64    0.59 0.63 0.64
4A02     169    0.61 0.56 0.66    0.49 0.52 0.57    0.31 0.51 0.60
4ACJ     182    0.55 0.59 0.75    0.55 0.58 0.61    0.51 0.59 0.60
4AE7     189    0.69 0.67 0.74    0.63 0.61 0.65    0.63 0.65 0.69
4AM1     359    0.57 0.54 0.59    0.53 0.52 0.53    0.46 0.53 0.55
4ANN     210    0.50 0.48 0.57    0.42 0.43 0.48    0.36 0.42 0.47
4AVR     189    0.57 0.57 0.70    0.53 0.51 0.59    0.49 0.53 0.57
4AXY     56     0.55 0.60 0.76    0.47 0.48 0.63    0.47 0.50 0.62
4B6G     559    0.70 0.71 0.75    0.67 0.69 0.72    0.60 0.69 0.73
4B9G     292    0.81 0.82 0.85    0.78 0.80 0.81    0.71 0.82 0.83
4DD5     412    0.60 0.63 0.71    0.57 0.59 0.63    0.51 0.61 0.66
4DKN     423    0.59 0.58 0.63    0.52 0.54 0.56    0.42 0.55 0.61
4DND     93     0.75 0.66 0.82    0.67 0.64 0.75    0.61 0.64 0.74
4DPZ     113    0.68 0.70 0.79    0.65 0.64 0.67    0.62 0.64 0.69
4DQ7     338    0.45 0.46 0.51    0.37 0.44 0.49    0.29 0.40 0.46
4DT4     170    0.76 0.74 0.78    0.70 0.68 0.72    0.70 0.70 0.73
4EK3     313    0.58 0.63 0.65    0.55 0.56 0.58    0.53 0.59 0.60
4ERY     318    0.61 0.60 0.67    0.59 0.59 0.64    0.52 0.59 0.65
4ES1     96     0.76 0.77 0.86    0.69 0.73 0.78    0.57 0.74 0.83
4EUG     225    0.61 0.61 0.67    0.54 0.60 0.62    0.51 0.58 0.62
4F01     459    0.38 0.37 0.47    0.32 0.34 0.37    0.22 0.34 0.39
4F3J     143    0.57 0.63 0.66    0.52 0.59 0.61    0.47 0.58 0.60
4FR9     145    0.65 0.62 0.78    0.63 0.58 0.70    0.58 0.57 0.64
4G14     5      1.00 1.00 1.00    1.00 1.00 1.00    1.00 1.00 1.00
4G2E     155    0.75 0.64 0.85    0.59 0.61 0.74    0.68 0.61 0.80
4G5X     584    0.71 0.69 0.80    0.69 0.64 0.74    0.64 0.67 0.72
4G6C     676    0.43 0.44 0.50    0.40 0.44 0.46    0.24 0.43 0.45
4G7X     216    0.53 0.47 0.61    0.41 0.31 0.47    0.51 0.37 0.53
4GA2     183    0.55 0.56 0.70    0.52 0.53 0.57    0.49 0.53 0.60
4GMQ     94     0.73 0.77 0.84    0.68 0.66 0.72    0.67 0.63 0.72
4GS3     90     0.65 0.68 0.74    0.60 0.64 0.68    0.51 0.66 0.70
4H4J     278    0.67 0.67 0.82    0.63 0.64 0.75    0.57 0.66 0.69
4H89     175    0.39 0.50 0.67    0.33 0.37 0.39    0.35 0.40 0.42
4HDE     167    0.63 0.55 0.75    0.59 0.52 0.69    0.59 0.51 0.67
APPENDIX
Table 15 – continued from previous page
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
4HJP   308 | 0.62 0.61 0.65 | 0.57 0.55 0.59 | 0.58 0.58 0.62
4HWM   129 | 0.69 0.66 0.71 | 0.66 0.60 0.68 | 0.68 0.63 0.70
4IL7    99 | 0.63 0.63 0.65 | 0.60 0.59 0.62 | 0.57 0.61 0.62
4J11   377 | 0.66 0.63 0.68 | 0.62 0.61 0.63 | 0.63 0.61 0.66
4J5O   268 | 0.77 0.76 0.82 | 0.71 0.62 0.77 | 0.75 0.66 0.77
4J5Q   162 | 0.65 0.63 0.75 | 0.57 0.56 0.66 | 0.59 0.57 0.64
4J78   305 | 0.48 0.48 0.56 | 0.43 0.44 0.50 | 0.38 0.47 0.53
4JG2   202 | 0.63 0.63 0.74 | 0.61 0.61 0.64 | 0.58 0.60 0.63
4JVU   207 | 0.67 0.64 0.75 | 0.57 0.58 0.66 | 0.59 0.60 0.67
4JYP   550 | 0.59 0.60 0.69 | 0.52 0.57 0.61 | 0.38 0.58 0.61
4KEF   145 | 0.52 0.49 0.65 | 0.40 0.42 0.49 | 0.27 0.45 0.56
5CYT   103 | 0.53 0.52 0.65 | 0.49 0.46 0.54 | 0.43 0.48 0.50
6RXN    45 | 0.74 0.63 0.86 | 0.59 0.48 0.76 | 0.49 0.49 0.76
Table 12:
Pearson correlation coefficients of least-squares-fitted Cα B-factor prediction for small proteins using an 11 Å cutoff. Results for the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
1AIE    31 | 0.97 0.88 0.99 | 0.78 0.64 0.90 | 0.90 0.77 0.96
1AKG    16 | 0.82 0.66 1.00 | 0.60 0.53 0.72 | 0.53 0.56 0.87
1BX7    51 | 0.86 0.74 0.89 | 0.79 0.68 0.82 | 0.81 0.69 0.82
1ETL    12 | 1.00 1.00 1.00 | 0.68 0.87 1.00 | 0.95 0.98 1.00
1ETM    12 | 1.00 1.00 1.00 | 0.45 0.74 0.86 | 0.70 0.83 1.00
1ETN    12 | 1.00 1.00 1.00 | 0.96 0.92 0.99 | 0.70 0.92 1.00
1FF4    65 | 0.77 0.72 0.80 | 0.70 0.65 0.75 | 0.68 0.68 0.76
1GK7    39 | 0.95 0.94 0.98 | 0.91 0.93 0.95 | 0.88 0.92 0.94
1GVD    56 | 0.75 0.68 0.84 | 0.67 0.63 0.69 | 0.61 0.62 0.66
1HJE    13 | 1.00 1.00 1.00 | 0.72 0.79 1.00 | 0.67 0.57 1.00
1KYC    15 | 0.96 0.99 1.00 | 0.92 0.93 0.99 | 0.88 0.88 1.00
1NOT    13 | 1.00 1.00 1.00 | 0.82 0.86 1.00 | 0.86 0.81 1.00
1O06    22 | 0.98 0.97 1.00 | 0.96 0.92 0.97 | 0.97 0.94 0.98
1P9I    29 | 0.89 0.88 0.98 | 0.87 0.82 0.92 | 0.87 0.84 0.89
1PEF    18 | 0.96 0.97 1.00 | 0.88 0.94 0.96 | 0.92 0.94 0.96
1PEN    16 | 0.96 0.90 1.00 | 0.60 0.67 0.83 | 0.47 0.73 0.94
1Q9B    44 | 0.79 0.76 0.94 | 0.58 0.59 0.69 | 0.69 0.57 0.71
1RJU    36 | 0.81 0.74 0.91 | 0.75 0.69 0.81 | 0.62 0.65 0.72
1U06    55 | 0.50 0.52 0.72 | 0.37 0.36 0.52 | 0.46 0.39 0.55
1UOY    64 | 0.73 0.72 0.83 | 0.65 0.66 0.69 | 0.65 0.69 0.73
1USE    47 | 0.66 0.75 0.91 | 0.50 0.52 0.72 | 0.46 0.53 0.64
1VRZ    13 | 1.00 1.00 1.00 | 0.92 0.92 1.00 | 0.77 0.85 1.00
1XY2     8 | 1.00 1.00 1.00 | 0.99 0.95 1.00 | 0.91 0.91 1.00
1YJO     6 | 1.00 1.00 1.00 | 1.00 1.00 1.00 | 1.00 1.00 1.00
1YZM    46 | 0.87 0.90 0.95 | 0.82 0.72 0.88 | 0.86 0.84 0.90
2DSX    52 | 0.54 0.50 0.78 | 0.37 0.30 0.56 | 0.41 0.36 0.55
2JKU    38 | 0.89 0.75 0.95 | 0.85 0.65 0.88 | 0.83 0.60 0.88
2NLS    36 | 0.75 0.66 0.88 | 0.61 0.32 0.76 | 0.49 0.47 0.69
2OL9     6 | 1.00 1.00 1.00 | 1.00 1.00 1.00 | 1.00 1.00 1.00
6RXN    45 | 0.74 0.63 0.86 | 0.59 0.48 0.76 | 0.49 0.49 0.76
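The correlation coefficients in these tables come from a least-squares fit of per-residue topological features to experimental B factors, scored by the Pearson correlation between fitted and experimental values. A minimal sketch of that evaluation step (the helper name `fit_and_correlate` and the toy data below are illustrative, not from the paper):

```python
import numpy as np

def fit_and_correlate(features, b_factors):
    """Least-squares fit of per-residue features to experimental B factors;
    returns the Pearson correlation between fitted and experimental values."""
    # Append a constant column so the linear model includes an intercept.
    A = np.column_stack([features, np.ones(len(b_factors))])
    coeffs, *_ = np.linalg.lstsq(A, b_factors, rcond=None)
    fitted = A @ coeffs
    # Pearson correlation coefficient between fitted and observed values.
    return float(np.corrcoef(fitted, b_factors)[0, 1])

# Toy example: 8 residues with 2 synthetic features (values illustrative only).
rng = np.random.default_rng(0)
X = rng.random((8, 2))
b = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.05, 8)
r = fit_and_correlate(X, b)  # near 1 for this almost-linear toy data
```

In the tables, the features would be the Bottleneck or Wasserstein distances between conjugated persistence diagrams computed with the chosen kernel; here they are random numbers purely to exercise the fitting step.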
Table 13:
Pearson correlation coefficients of least-squares-fitted Cα B-factor prediction for medium proteins using an 11 Å cutoff. Results for the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
1ABA    87 | 0.67 0.67 0.76 | 0.54 0.62 0.68 | 0.56 0.63 0.70
1CYO    88 | 0.71 0.69 0.78 | 0.66 0.58 0.68 | 0.65 0.59 0.67
1FK5    93 | 0.53 0.59 0.71 | 0.49 0.50 0.58 | 0.49 0.50 0.55
1GXU    89 | 0.75 0.78 0.82 | 0.72 0.61 0.75 | 0.69 0.72 0.77
1I71    83 | 0.44 0.66 0.76 | 0.41 0.46 0.56 | 0.38 0.58 0.59
1LR7    73 | 0.61 0.62 0.71 | 0.57 0.55 0.63 | 0.46 0.56 0.58
1N7E    95 | 0.67 0.71 0.80 | 0.54 0.68 0.72 | 0.54 0.63 0.73
1NNX    93 | 0.84 0.84 0.88 | 0.81 0.79 0.83 | 0.81 0.81 0.86
1NOA   113 | 0.63 0.65 0.72 | 0.60 0.57 0.63 | 0.53 0.57 0.59
1OPD    85 | 0.35 0.29 0.57 | 0.26 0.21 0.36 | 0.29 0.19 0.36
1QAU   112 | 0.59 0.61 0.66 | 0.57 0.55 0.58 | 0.55 0.57 0.58
1R7J    90 | 0.88 0.86 0.91 | 0.83 0.76 0.87 | 0.81 0.79 0.86
1UHA    82 | 0.70 0.75 0.82 | 0.69 0.68 0.74 | 0.67 0.69 0.73
1ULR    87 | 0.56 0.53 0.68 | 0.49 0.50 0.59 | 0.44 0.50 0.61
1USM    77 | 0.62 0.61 0.81 | 0.57 0.53 0.66 | 0.61 0.58 0.65
1V05    96 | 0.67 0.66 0.72 | 0.60 0.61 0.65 | 0.52 0.61 0.65
1W2L    97 | 0.72 0.72 0.79 | 0.60 0.63 0.69 | 0.56 0.61 0.69
1X3O    80 | 0.66 0.66 0.72 | 0.62 0.60 0.65 | 0.62 0.64 0.67
1Z21    96 | 0.70 0.73 0.82 | 0.61 0.63 0.64 | 0.64 0.69 0.72
1ZVA    75 | 0.85 0.85 0.94 | 0.84 0.78 0.92 | 0.83 0.81 0.86
2BF9    35 | 0.94 0.73 0.97 | 0.70 0.65 0.78 | 0.89 0.71 0.92
2BRF   103 | 0.74 0.73 0.76 | 0.74 0.71 0.74 | 0.72 0.72 0.75
2CE0   109 | 0.77 0.79 0.86 | 0.75 0.73 0.80 | 0.71 0.77 0.79
2E3H    81 | 0.66 0.71 0.82 | 0.62 0.69 0.76 | 0.56 0.69 0.78
2EAQ    89 | 0.81 0.77 0.86 | 0.79 0.72 0.81 | 0.77 0.76 0.82
2EHS    75 | 0.75 0.73 0.81 | 0.72 0.71 0.74 | 0.69 0.71 0.73
2FQ3    85 | 0.78 0.76 0.82 | 0.75 0.75 0.79 | 0.68 0.75 0.78
2IP6    87 | 0.72 0.66 0.82 | 0.67 0.58 0.73 | 0.64 0.64 0.78
2MCM   112 | 0.80 0.80 0.85 | 0.78 0.77 0.81 | 0.75 0.77 0.82
2NUH   104 | 0.77 0.74 0.85 | 0.73 0.63 0.81 | 0.75 0.66 0.80
2PKT    93 | 0.44 0.39 0.69 | 0.39 0.35 0.55 | 0.36 0.36 0.43
2PLT    98 | 0.66 0.63 0.72 | 0.57 0.59 0.67 | 0.52 0.59 0.66
2QJL   107 | 0.45 0.52 0.63 | 0.42 0.46 0.50 | 0.41 0.49 0.51
2RB8    93 | 0.81 0.78 0.84 | 0.78 0.75 0.80 | 0.74 0.76 0.81
3BZQ    99 | 0.57 0.62 0.69 | 0.50 0.55 0.61 | 0.47 0.55 0.59
5CYT   103 | 0.53 0.52 0.65 | 0.49 0.46 0.54 | 0.43 0.48 0.50
Table 14:
Pearson correlation coefficients of least-squares-fitted Cα B-factor prediction for large proteins using an 11 Å cutoff. Results for the Bottleneck (B) and Wasserstein (W) metrics with various kernel choices are included.
           |      B & W     |        B       |        W
PDB ID   N | Exp  Lor  Both | Exp  Lor  Both | Exp  Lor  Both
1AHO    66 | 0.75 0.78 0.88 | 0.72 0.73 0.79 | 0.53 0.65 0.75
1ATG   231 | 0.50 0.50 0.61 | 0.45 0.47 0.53 | 0.38 0.48 0.51
1BYI   238 | 0.50 0.51 0.58 | 0.41 0.46 0.49 | 0.44 0.48 0.54
1CCR   109 | 0.65 0.66 0.71 | 0.53 0.56 0.65 | 0.43 0.58 0.63
1E5K   188 | 0.67 0.68 0.74 | 0.66 0.67 0.68 | 0.63 0.67 0.69
1EW4   106 | 0.58 0.60 0.73 | 0.52 0.51 0.55 | 0.55 0.55 0.62
1IFR   113 | 0.65 0.59 0.73 | 0.56 0.54 0.65 | 0.47 0.53 0.62
1NLS   238 | 0.81 0.78 0.86 | 0.75 0.65 0.83 | 0.80 0.72 0.82
1O08   221 | 0.46 0.48 0.56 | 0.44 0.42 0.50 | 0.37 0.45 0.48
1PMY   123 | 0.71 0.70 0.76 | 0.62 0.59 0.67 | 0.68 0.69 0.71
1PZ4   113 | 0.88 0.82 0.93 | 0.86 0.74 0.89 | 0.85 0.76 0.88
1QTO   122 | 0.59 0.59 0.65 | 0.48 0.46 0.53 | 0.55 0.52 0.56
1RRO   108 | 0.39 0.35 0.56 | 0.31 0.23 0.45 | 0.33 0.19 0.45
1UKU   102 | 0.80 0.81 0.84 | 0.78 0.80 0.80 | 0.74 0.80 0.80
1V70   105 | 0.64 0.65 0.75 | 0.56 0.60 0.66 | 0.51 0.58 0.62
1WBE   206 | 0.53 0.47 0.63 | 0.43 0.38 0.55 | 0.36 0.42 0.48
1WHI   122 | 0.57 0.55 0.63 | 0.42 0.44 0.57 | 0.34 0.43 0.55
1WPA   107 | 0.70 0.69 0.79 | 0.61 0.52 0.71 | 0.66 0.56 0.70
2AGK   233 | 0.65 0.65 0.69 | 0.61 0.64 0.65 | 0.55 0.63 0.67
2C71   225 | 0.45 0.38 0.56 | 0.29 0.33 0.42 | 0.23 0.30 0.48
2CG7   110 | 0.32 0.44 0.63 | 0.29 0.31 0.36 | 0.30 0.33 0.41
2CWS   235 | 0.59 0.55 0.66 | 0.53 0.52 0.54 | 0.40 0.52 0.55
2HQK   232 | 0.80 0.79 0.83 | 0.70 0.74 0.80 | 0.68 0.76 0.81
2HYK   237 | 0.59 0.58 0.63 | 0.51 0.55 0.59 | 0.43 0.54 0.60
2I24   113 | 0.47 0.44 0.69 | 0.40 0.40 0.48 | 0.45 0.40 0.49
2IMF   203 | 0.61 0.65 0.71 | 0.59 0.56 0.60 | 0.59 0.59 0.64
2PPN   122 | 0.57 0.61 0.74 | 0.51 0.59 0.63 | 0.44 0.57 0.63
2R16   185 | 0.50 0.51 0.66 | 0.46 0.45 0.51 | 0.45 0.46 0.52
2V9V   149 | 0.60 0.51 0.66 | 0.53 0.48 0.56 | 0.55 0.50 0.62
2VIM   114 | 0.38 0.33 0.52 | 0.29 0.28 0.41 | 0.24 0.31 0.40
2VPA   217 | 0.73 0.75 0.78 | 0.72 0.71 0.73 | 0.68 0.73 0.74
2VYO   207 | 0.68 0.70 0.77 | 0.64 0.66 0.72 | 0.59 0.68 0.70
3SEB   238 | 0.63 0.66 0.77 | 0.62 0.61 0.68 | 0.61 0.62 0.67
3VUB   101 | 0.65 0.60 0.71 | 0.60 0.56 0.61 | 0.61 0.57 0.64
References

[1] K. L. Xia and G. W. Wei, "Persistent homology analysis of protein structure, flexibility and folding," International Journal for Numerical Methods in Biomedical Engineering, vol. 30, pp. 814–844, 2014.
[2] M. Gameiro, Y. Hiraoka, S. Izumi, M. Kramar, K. Mischaikow, and V. Nanda, "Topological measurement of protein compressibility via persistence diagrams," Japan Journal of Industrial and Applied Mathematics, vol. 32, pp. 1–17, 2014.
[3] K. L. Xia and G. W. Wei, "Persistent topology for cryo-EM data analysis," International Journal for Numerical Methods in Biomedical Engineering, vol. 31, p. e02719, 2015.
[4] Z. X. Cang, L. Mu, K. Wu, K. Opron, K. Xia, and G.-W. Wei, "A topological approach to protein classification," Molecular Based Mathematical Biology, vol. 3, pp. 140–162, 2015.
[5] V. Kovacev-Nikolic, P. Bubenik, D. Nikolić, and G. Heo, "Using persistent homology and dynamical distances to analyze protein binding," Stat. Appl. Genet. Mol. Biol., vol. 15, no. 1, pp. 19–38, 2016.
[6] K. Xia, "Persistent homology analysis of ion aggregations and hydrogen-bonding networks," Physical Chemistry Chemical Physics, vol. 20, no. 19, pp. 13448–13460, 2018.
[7] P. Frosini and C. Landi, "Size theory as a topological tool for computer vision," Pattern Recognition and Image Analysis, vol. 9, no. 4, pp. 596–603, 1999.
[8] H. Edelsbrunner, D. Letscher, and A. Zomorodian, "Topological persistence and simplification," Discrete Comput. Geom., vol. 28, pp. 511–533, 2002.
[9] A. Zomorodian and G. Carlsson, "Computing persistent homology," Discrete Comput. Geom., vol. 33, pp. 249–274, 2005.
[10] A. Zomorodian and G. Carlsson, "Localized homology," Computational Geometry: Theory and Applications, vol. 41, no. 3, pp. 126–148, 2008.
[11] Y. Yao, J. Sun, X. Huang, G. R. Bowman, G. Singh, M. Lesnick, L. J. Guibas, V. S. Pande, and G. Carlsson, "Topological methods for exploring low-density states in biomolecular folding pathways," The Journal of Chemical Physics, vol. 130, no. 14, p. 04B614, 2009.
[12] Z. X. Cang and G. W. Wei, "Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology," Bioinformatics, vol. 33, pp. 3549–3557, 2017.
[13] Z. X. Cang and G. W. Wei, "Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction," International Journal for Numerical Methods in Biomedical Engineering, vol. 34, no. 2, DOI: 10.1002/cnm.2914, 2018.
[14] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko, "Lipschitz functions have L_p-stable persistence," Foundations of Computational Mathematics, vol. 10, no. 2, pp. 127–139, 2010.
[15] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, "Stability of persistence diagrams," Discrete & Computational Geometry, vol. 37, no. 1, pp. 103–120, 2007.
[16] Z. X. Cang and G. W. Wei, "TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions," PLOS Computational Biology, vol. 13, no. 7, p. e1005690, 2017.
[17] K. Wu and G. W. Wei, "Quantitative toxicity prediction using topology based multitask deep neural networks," Journal of Chemical Information and Modeling, vol. 58, pp. 520–531, 2018.
[18] K. Wu, Z. Zhao, R. Wang, and G. W. Wei, "TopP-S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility," Journal of Computational Chemistry, vol. 39, pp. 1444–1454, 2018.
[19] Z. X. Cang, L. Mu, and G. W. Wei, "Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening," PLOS Computational Biology, vol. 14, no. 1, p. e1005929, https://doi.org/10.1371/journal.pcbi.1005929, 2018.
[20] J. P. Ma, "Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes," Structure, vol. 13, pp. 373–380, 2005.
[21] H. Frauenfelder, S. G. Sligar, and P. G. Wolynes, "The energy landscapes and motions of proteins," Science, vol. 254, pp. 1598–1603, 1991.
[22] M. Tasumi, H. Takeuchi, S. Ataka, A. M. Dwivedi, and S. Krimm, "Normal vibrations of proteins: Glucagon," Biopolymers, vol. 21, pp. 711–714, 1982.
[23] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. States, S. Swaminathan, and M. Karplus, "CHARMM: A program for macromolecular energy, minimization, and dynamics calculations," J. Comput. Chem., vol. 4, pp. 187–217, 1983.
[24] M. Levitt, C. Sander, and P. S. Stern, "Protein normal-mode dynamics: Trypsin inhibitor, crambin, ribonuclease and lysozyme," J. Mol. Biol., vol. 181, no. 3, pp. 423–447, 1985.
[25] M. M. Tirion, "Large amplitude elastic motions in proteins from a single-parameter, atomic analysis," Phys. Rev. Lett., vol. 77, pp. 1905–1908, 1996.
[26] A. R. Atilgan, S. R. Durell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar, "Anisotropy of fluctuation dynamics of proteins with an elastic network model," Biophys. J., vol. 80, pp. 505–515, 2001.
[27] I. Bahar, A. R. Atilgan, and B. Erman, "Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential," Folding and Design, vol. 2, pp. 173–181, 1997.
[28] I. Bahar, A. R. Atilgan, M. C. Demirel, and B. Erman, "Vibrational dynamics of proteins: Significance of slow and fast modes in relation to function and stability," Phys. Rev. Lett., vol. 80, pp. 2733–2736, 1998.
[29] T. Haliloglu, I. Bahar, and B. Erman, "Gaussian dynamics of folded proteins," Physical Review Letters, vol. 79, no. 16, p. 3090, 1997.
[30] K. L. Xia and G. W. Wei, "A stochastic model for protein flexibility analysis," Physical Review E, vol. 88, p. 062709, 2013.
[31] K. Opron, K. L. Xia, and G. W. Wei, "Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis," Journal of Chemical Physics, vol. 140, p. 234105, 2014.
[32] K. Opron, K. L. Xia, and G. W. Wei, "Communication: Capturing protein multiscale thermal fluctuations," Journal of Chemical Physics, vol. 142, p. 211101, 2015.
[33] D. Bramer and G. W. Wei, "Weighted multiscale colored graphs for protein flexibility and rigidity analysis," Journal of Chemical Physics, vol. 148, p. 054103, 2018.
[34] D. Bramer and G. W. Wei, "Blind prediction of protein B-factor and flexibility," Journal of Chemical Physics, vol. 149, p. 021837, 2018.
[35] K. L. Xia and G. W. Wei, "Multidimensional persistence in biomolecular data," Journal of Computational Chemistry, vol. 36, pp. 1502–1520, 2015.