Deep Learning of Protein Structural Classes: Any Evidence for an 'Urfold'?
Menuka Jaiswal, Saad Saleem, Yonghyeon Kweon, Eli J Draizen, Stella Veretnik, Cameron Mura, Philip E. Bourne
School of Data Science, University of Virginia: Menuka Jaiswal, Saad Saleem, Yonghyeon Kweon, Philip E. Bourne
Department of Biomedical Engineering, University of Virginia: Eli J. Draizen, Stella Veretnik, Cameron Mura
Abstract—Recent computational advances in the accurate prediction of protein three-dimensional (3D) structures from amino acid sequences now present a unique opportunity to decipher the interrelationships between proteins. This task entails, but is not equivalent to, a problem of 3D structure comparison and classification. Historically, protein domain classification has been a largely manual and subjective activity, relying upon various heuristics. Databases such as CATH represent significant steps towards a more systematic (and automatable) approach, yet there still remains much room for the development of more scalable and quantitative classification methods, grounded in machine learning. We suspect that re-examining these relationships via a Deep Learning (DL) approach may entail a large-scale restructuring of classification schemes, improved with respect to the interpretability of distant relationships between proteins. Here, we describe our training of DL models on protein domain structures (and their associated physicochemical properties) in order to evaluate classification properties at CATH's homologous superfamily (SF) level. To achieve this, we have devised and applied an extension of image-classification and image-segmentation techniques, utilizing a convolutional autoencoder model architecture. Our DL architecture allows models to learn structural features that, in a sense, 'define' different homologous SFs. We evaluate and quantify pairwise 'distances' between SFs by building one model per SF and comparing the loss functions of the models. Hierarchical clustering on these distance matrices provides a new view of protein interrelationships, a view that extends beyond simple structural/geometric similarity and towards the realm of structure/function properties.
Index Terms—Autoencoders; CNNs; CATH; deep learning; protein domain classification; protein structure
I. INTRODUCTION
Proteins are key biological macromolecules that consist of long, unbranched chains of amino acids (AAs) linked via peptide bonds [1]; they perform most of the physiological functions of cellular life (enzymes, receptors, etc.). Different sequences of the 20 naturally-occurring AAs can fold into tertiary structures with variable degrees of geometric similarity. At the sequence level, the differences between any two proteins can range from relatively modest single-point edits ("point mutations" or 'substitutions') to larger-scale changes such as reorganization of entire segments of a polypeptide chain. Such changes are critical because they influence 3D structure, and protein function stems from 3D structure [2]. Indeed, our ability to elucidate protein function and evolution is intimately linked to our knowledge of protein structure. Equally important, interrelationships between structures define a map of the protein universe [3]. Thus, it is paramount to have a robust classification system for categorically organizing protein structures based upon their similarities. Even more basically, what does 'similarity' mean in this context (geometrically, chemically, etc.)? And, are there particular formulations of 'similarity' that are more salient than others? Ideally, any system of comparison would also take into account functional properties, i.e., not just raw geometric shape, but also properties such as acidity/basicity, physicochemical features of the surface, "binding pockets", and so on. An unprecedented volume of protein structural and functional data is now available, largely because of exponential growth in genome sequencing and efforts such as structural genomics [4]; alongside these data, novel computational approaches can now discover subtle signals in sets of sequences [5].
These advances have yielded a vast trove of potential protein sequences and structures, but the utility of the information has been limited because the mapping from sequence to structure (or 'fold') has remained enigmatic; this decades-long grand challenge is known as the "protein folding problem" [6]. As computational approaches to the folding problem continually improve, increasingly we will compute 3D structures from protein sequences de novo. Therefore, we expect to see the demand for identifying and cataloging new protein structures grow at an ever-increasing pace with the rise in the number of 3D structures, both experimentally determined as well as computationally predicted (via physics-based simulations [7] or via Deep Learning approaches [8]). Historically, efforts to catalog protein structures have been a largely manual and painstaking process, fraught with heuristics. There has been a shift in the paradigm since the introduction of methodical hierarchical database structures, such as CATH, which engender more robust classification schemes into which new protein structures can be incorporated [9]. This is critical as we expand our knowledge of known protein structures. The CATH database has seen phenomenal growth, going from 53M protein domains classified into 2737 homologous SFs in 2016 [10] to a current 95M protein domains classified into 6119 SFs. Despite being one of the most comprehensive resources, the CATH database (like any) is not without its limitations, in terms of its underlying assumptions about the relationships between entities.
For instance, it was recently argued that there exists an intermediate level of structural granularity that lies between the CATH hierarchy's architecture (A) and topology (T) strata; dubbed the Urfold [11], this representational level is thought to capture the phenomenon of "architectural similarity despite topological variability", as exhibited by the structure/function similarity in a deeply-varying collection of proteins that contain a characteristic small β-barrel (SBB) domain [12]. With recent advances in computing power, Deep Learning methods have begun to be applied to protein structures, in terms of predictions, similarity assessment and classification [13], [14]. However, deep neural networks (DNNs), and specifically 3D CNNs, have not yet seen widespread use with protein structures. This is likely the case because: (1) there is no single, 'standard'/canonical orientation across the set of all protein structures [15], which is problematic for CNNs (which are not rotationally invariant), and (2) the computational demands to train such models are exorbitant [16]. In this paper, we present a new Deep Learning method to quantify similarities between protein domains using a 3D-CNN-based autoencoder architecture. Our approach treats protein structures as 3D images, with any associated properties (physicochemical, evolutionary, etc.) incorporated as regional attributes (on a voxel-by-voxel basis). To obviate the problem of angular dependence, we apply random rotations to a given protein, yielding multiple copies of the protein domain; note that these geometric transformations are essentially a form of data augmentation [17]. In this work, we adapted current deep learning architectures, such as the CNNs used in image segmentation and classification tasks [18], for application to our 3D protein classification problem.
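As a concrete illustration of the rotational augmentation described above, a Haar-uniform rotation of a structure's atomic coordinates can be sketched as follows; the function name and array shapes are our own, not taken from the paper's codebase:

```python
import numpy as np
from scipy.stats import special_ortho_group

def random_rotation(coords, seed=None):
    """Rotate an (N, 3) array of atom coordinates about its centroid by a
    matrix drawn uniformly (Haar measure) from the rotation group SO(3)."""
    R = special_ortho_group.rvs(3, random_state=seed)  # Haar-uniform rotation
    center = coords.mean(axis=0)
    return (coords - center) @ R.T + center
```

Each call yields a fresh orientation, so repeated application to the same domain produces the rotated copies used for data augmentation.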
A benefit of our approach, as it pertains to proteins, is that 3D protein structures are rather sparse, in terms of the fractional occupancy of voxels in a region of 3D space to which the CNN is applied; this feature can be leveraged for rapid computation via so-called sparse CNNs [19]. Past work with 3D medical images has shown the viability of using sparse CNN architectures for classification and cellular morphological analysis [20].

II. METHODS
A. Datasets and Initial Featurization
The primary source of data for this project was the CATH protein structure classification database [9]. CATH is a hierarchical classification scheme that organizes all known protein 3D structures (from the Protein Data Bank [PDB; [21]]) by their similarity, with implicit inclusion of some evolutionary information. The PDB houses nearly 180,000 biomolecular structures, determined via experimental means such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). CATH uses both automated and manual methods to parse each polypeptide chain in a PDB structure into individual domains. Domain-level entities are then classified within a structural hierarchy: Class, Architecture, Topology and Homologous superfamily (see also Fig. 1 in [11] for more on this). If there is compelling evidence that two domains are evolutionarily related (i.e., homologous, based on sequence similarity), then they are classified within the same superfamily. For each domain, we obtain 3D structures and corresponding homologous superfamily labels from CATH. Next, we compute a host of derived properties for each domain in CATH (Draizen et al., in prep), including (i) purely geometric/structural quantities, e.g. secondary structure [21], solvent accessibility [22]; (ii) physicochemical properties, e.g. hydrophobicity, partial charges, electrostatic potentials [23]; and (iii) basic chemical descriptors (atom and residue types). As the initial use-cases reported here, we examined three homologous superfamilies of particular interest to us: namely, immunoglobulins (2.60.40.10), the SH3 domain (2.30.30.100), and the OB fold (2.40.50.140). Our models were built using 5583, 2834 and 585 domain structures of Ig, SH3 and OB, respectively.
B. Preprocessing; Further Data Engineering
We began by considering the aforementioned features for each atom in the primary sequence of each domain. Our first step was to 'clean' the data by eliminating the features that contained more than 25 percent missing values. Next, we converted continuous-valued numerical features into binary ones. For example, hydrophobicity values were mapped to 1 if positive and 0 otherwise. Next, we examined potential correlations among features and eliminated from further consideration any which were redundant (i.e., highly correlated with an existing feature). At this point, our main concern was the computational expense of the problem at hand, and our consideration that it would be cost-ineffective to train convolutional models that incorporate detailed protein structural features which are redundant (e.g., expected training times of several days on a cloud infrastructure). At the end of the preprocessing step, we were left with 38 features that included one-hot encoded representations of (i) atom type (C, CA, N, O, OH), (ii) element type (C_elem, N_elem, O_elem, S_elem), and (iii) residue type (ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN, PRO, GLN, ARG, SER, THR, VAL, TRP, TYR). We also included physicochemical, secondary structural, and residue-associated binary properties at the atomic level (e.g., is_hydrophobic, is_electronegative, positive_charge, atom_is_buried, residue_is_buried, is_helix, is_sheet). Next, we represented protein domains as voxels (3D volumetric pixels) using an in-house discretization approach (Draizen et al., in prep). Briefly, our method centers protein domains in a 256³ cube (to allow for large domains), and each atom is mapped to a 1×1×1 voxel using a k-d tree data structure (the search-ball radius is set to the van der Waals radius of the atom). If two atoms share the space in a given voxel, the element-wise maximum of their feature vectors is used, since all features are binary-valued.
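A minimal sketch of this discretization step follows; it is our own simplification of the in-house method (the exact grid handling and k-d tree usage may differ), but it captures the centering, the van der Waals search ball, and the binary max-merge:

```python
import numpy as np

def voxelize(coords, features, radii, grid=256):
    """Center a domain in a grid^3 cube and map each atom to the 1x1x1 voxels
    inside its van der Waals sphere; overlapping atoms are combined by an
    element-wise maximum (all features are binary). Simplified sketch."""
    coords = coords - coords.mean(axis=0) + grid / 2.0  # center in the cube
    voxels = {}  # sparse representation: (i, j, k) -> feature vector
    for xyz, feat, r in zip(coords, features, radii):
        lo = np.floor(xyz - r).astype(int)
        hi = np.ceil(xyz + r).astype(int)
        for idx in np.ndindex(*(hi - lo)):
            ijk = tuple(lo + np.array(idx))
            center = np.array(ijk) + 0.5
            if np.linalg.norm(center - xyz) <= r:       # inside the vdW sphere
                prev = voxels.get(ijk, np.zeros_like(feat))
                voxels[ijk] = np.maximum(prev, feat)    # binary max-merge
    return voxels
```

Because only occupied voxels are stored, the dictionary doubles as the sparse encoding consumed by the sparse convolutional network described below.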
Because a significant fraction of voxels do not contain any atoms (proteins are not cube-shaped!), protein domain structures can be encoded via a sparse representation; this substantially mitigates the computational costs of our deep learning workflow.

C. Model Design and Training
A 3D-CNN autoencoder was built for each of the three homologous superfamilies considered here (Ig, SH3, OB). Our network architecture is inspired by U-Net [18], a convolutional network used in biomedical image segmentation. The U-Net, which attempts to recreate the input after passing it through the network, consists of a contractive path and an expansive path, giving the eponymous U-shaped architecture. In the contractive path, each hidden layer contains two 3×3×3 convolutions, each followed by a rectified linear unit (ReLU), and then a 2×2×2 max-pooling layer with strides of two in each dimension. During the contractive path, spatial information is reduced (down-sampled) while feature information is increased. In the expansive path, each layer consists of a 2×2×2 up-convolution, with strides of two in each dimension, followed by two 3×3×3 convolutions, each of which is followed by ReLU nodes. In the convolutional layers, we utilized submanifold sparse convolution operations [24], an approach that can exploit sparsity of the data (the case for protein domains) in building computationally efficient sparse networks. The sparse implementation of U-Net replaces the max-pooling operation with another convolutional operation. Our network has 32 filters in the first layer, and we double the number of filters each time the data is downsampled; there are five layers of downsampling (Figure 1). We use a linear activation function in the final layer. The convolutional blocks in our network utilize an approach from Oxford's Visual Geometry Group (VGG); VGG-style blocks have been shown to work well in object recognition [25]. To avoid overfitting, we performed extensive data augmentation and added dropout layers with a rate of 0.5.
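A dense PyTorch stand-in for this encoder/decoder design is sketched below. It is illustrative only: we substitute ordinary convolutions for submanifold sparse convolutions, use strided convolutions in place of max pooling (as the sparse variant does), and shrink the depth from five downsamplings to three to keep the example small:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # VGG-style block: two 3x3x3 convolutions, each followed by ReLU
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv3d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

class AutoencoderSketch(nn.Module):
    """Dense stand-in for the sparse U-Net-style autoencoder (no skip
    connections between the contracting and expansive paths)."""
    def __init__(self, n_feat=38, base=32, depth=3):
        super().__init__()
        enc, dec, c = [], [], n_feat
        for d in range(depth):
            enc += [conv_block(c, base * 2**d),
                    nn.Conv3d(base * 2**d, base * 2**d, 2, stride=2)]  # downsample
            c = base * 2**d
        for d in reversed(range(depth)):
            c_out = base * 2**(d - 1) if d > 0 else base
            dec += [nn.ConvTranspose3d(c, c, 2, stride=2),  # 2x2x2 up-convolution
                    conv_block(c, c_out)]
            c = c_out
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)
        self.head = nn.Conv3d(base, n_feat, 1)  # linear activation in final layer

    def forward(self, x):
        return self.head(self.decoder(self.encoder(x)))
```

With no skip connections, all information must pass through the bottleneck, which is what forces the encoder to learn a compact, superfamily-specific representation.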
For data augmentation, we applied random rotations (mentioned above) to each protein structure; these rotations were in the form of orthogonal transformation matrices drawn from the Haar distribution, which is the uniform distribution on the 3D rotation group (i.e., SO(3); [26]). In our implementation, we do not concatenate the high-resolution features from the contracting path with the upsampled output, as is done in [18]. This ensures that the network does not inadvertently learn to skip the lower network blocks, which would effectively short-circuit itself and contribute to overfitting.

We optimize against sum-squared errors in the output of our model. We use sum-squared errors because of the relative ease of optimization; however, this may not be ideal for our task of binary classification at the level of each voxel, and this could be a direction to pursue in future implementations of our model. We used stochastic gradient descent (SGD) as the optimization algorithm, with a momentum of 0.9 and a weight decay of 0.0001. We began with a learning rate of 0.01 and decreased its value by a factor of e^((1 - epoch) * decay) in each training epoch, using 0.04 as the learning-rate decay factor (as suggested in [24]). Our final network has around 5M parameters in total, and all the networks were trained for 100 epochs using a batch size of 16. We used the open-source PyTorch framework for all training and inference [27].

Figure 1. Illustration of our final network architecture. Dark blue boxes represent sparse convolutions; orange boxes represent size-2, stride-2 downsampling convolutions; light blue boxes represent the deconvolutions. The greyed-out arrows indicate no concatenation of features from the contracting path with the upsampled output.

D. Evaluation of Model Performance
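In PyTorch, the optimizer and per-epoch schedule just described might look like the following sketch; the epoch indexing is adjusted so that the factor matches e^((1 - epoch) * 0.04) with 1-indexed epochs:

```python
import math
import torch

def make_optimizer(params, lr0=0.01, decay=0.04):
    """SGD with momentum 0.9 and weight decay 1e-4; the learning rate shrinks
    each epoch by the factor e^((1 - epoch) * decay) for 1-indexed epochs
    (equivalently e^(-epoch * decay) for PyTorch's 0-indexed epochs)."""
    opt = torch.optim.SGD(params, lr=lr0, momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda epoch: math.exp(-epoch * decay))
    return opt, sched
```

In a training loop, `sched.step()` would be called once per epoch after the optimizer steps, so the rate decays smoothly over the 100 training epochs.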
For an autoencoder model with binary input features, the output is expected to be binary as well. Hence, we treat our problem as a binary classification task at the level of each voxel and each feature. We calculate the area under the Receiver Operating Characteristic curve (AUC) [28] as the primary measure for evaluating the performance of our deep learning models. The average AUC of the model trained and tested on the Ig superfamily was 0.81, while for SH3 and OB it was 0.88 and 0.89, respectively. This indicates that the SH3 and OB structures were more readily learned than the Ig superfamily structures.

III. RESULTS AND DISCUSSION
The approach developed and implemented in this work, as illustrated by the initial results described here, can help us validate and otherwise assess existing classification schemes (e.g., CATH, SCOP [29], ECOD [30]). Perhaps most long-term, we believe our methodology can lay a broad foundation for a robust, quantitative, and automatable/scalable mechanism for protein structure classification. This capability, in turn, would represent an advance on many fronts: for example, as a basis for improved processing pipelines for biomolecular structural data and, even more fundamentally, as regards our understanding of biomolecular evolution. Ultimately, can deep learning help us discover more 'natural' groupings of proteins? The remainder of this section describes our initial findings, using the Ig, SH3 and OB folds as intriguing use-cases.
A. Feature Importance via Analysis of ROC Curves
The determination and extraction of 'optimal' features plays a critical role in areas such as protein sequence analysis, as well as in the prediction of protein structures, functions and interactions [31]. Knowing the most 'important' features (e.g., those with the greatest predictive power) becomes especially important when there exists abundant data, but there also exist severe limitations in terms of either computational resources or else helpful models/abstractions with which to compute on the data. Therefore, our initial analyses were concerned with obtaining the most important (predictive) features for our task of protein classification. As mentioned above, we extracted four categories of features: type of atom, type of amino acid, corresponding physicochemical properties, and secondary structural properties. To evaluate the impact of each feature group on the reconstruction ability of our models, we individually utilized each set of features to build the autoencoder and evaluated the area under the ROC curve. Figure 2 shows the ROC curves and AUC values for the top six features; note that the plots in this figure are the averages of independent ROC curves obtained by submitting all protein domains of one variety (e.g., "all Ig domains") into the respective superfamily model (e.g., "the Ig-only model") that was trained on a single feature. This ROC plot reveals that the most important (accuracy-determining) features are (i) atom type, (ii) certain physicochemical properties (burial, electronegativity), and (iii) secondary structural class (is_sheet).

Figure 2. The ROC curves and AUC values for the top six features. ROC curves were obtained by training six models, one model per feature. The values in parentheses beside each feature name are the corresponding AUC values.
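The per-feature reconstruction scoring described above reduces to computing a ROC AUC over the flattened voxel-by-feature arrays; a sketch with scikit-learn (the array shapes are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def reconstruction_auc(y_true, y_score):
    """AUC of an autoencoder reconstruction, treating every (voxel, feature)
    cell as an independent binary prediction.
    y_true: binary array of shape (n_voxels, n_features)
    y_score: reconstruction scores of the same shape."""
    return roc_auc_score(y_true.ravel(), y_score.ravel())
```

Restricting the columns of `y_true` and `y_score` to a single feature group yields the per-group curves averaged in Figure 2.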
B. Reconstruction-based Clustering of Protein Superfamilies
To obtain potential clusters of similar protein domains ('similar' under our 3D-CNN/autoencoder approach), we first trained three autoencoder models, one for each of the Ig, SH3 and OB homologous superfamilies. Next, randomly selected out-of-family proteins were passed into a family-specific model (e.g., a random Ig passed into the SH3-only model), and reconstruction AUC values for each sample were generated. The basic idea is to utilize the reconstruction AUC as a metric of similarity between (i) a randomly-chosen trial structure, and (ii) the superfamily that is represented (and hopefully accurately 'captured') by a given model. In this way, 150 random samples of domains were selected (50 from each of the SH3, OB and Ig superfamilies), and the reconstruction AUCs from the three superfamily models were generated for each of these 150 domains. Using these reconstruction-AUC vectors as features, hierarchical agglomerative clustering was performed using the Python scikit-learn package [32]. The dendrogram in Figure 3 shows the clusters obtained via single-linkage ("nearest neighbor") clustering using the Euclidean distance, or else using Ward's method, as the parameters in our hierarchical clustering tasks. The dendrogram clearly indicates the existence of two dominant clusters in the data. The results of cluster assignment remained consistent with the use of k-means clustering, with a silhouette score [33] of 0.56. (The silhouette measures intra-cluster 'cohesion' versus inter-cluster 'separation'; it ranges from -1 [poor grouping] to +1 [good grouping].) Table I is the confusion matrix of superfamily vs. obtained cluster labels. As can be seen from the dendrogram and the confusion matrix, the domain clustering that we computed (i.e., {Ig, {SH3, OB}}) differs in some detail from that provided by the CATH classification system.

Figure 3. Hierarchical agglomerative clustering based on the AUC scores for various proteins from the Ig, OB and SH3 superfamilies.
Note the relatively cohesive grouping of the Ig fold in this dendrogram, while instances of the OB and SH3 folds do not segregate nearly as 'cleanly'.

Table I. Confusion matrix of superfamily vs. cluster labels.

Superfamily   Cluster 1   Cluster 2
Ig                    1          49
SH3                  19          31
OB                   24          26
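The clustering of reconstruction-AUC vectors can be sketched as follows; we use SciPy's Ward linkage here (the paper used scikit-learn), and the 150×3 feature matrix of per-superfamily-model AUCs is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def cluster_domains(auc_vectors, n_clusters=2):
    """Hierarchically cluster domains by their reconstruction-AUC vectors
    (rows = domains, columns = superfamily models) and report the
    silhouette score of the resulting partition."""
    Z = linkage(auc_vectors, method="ward")        # Euclidean by construction
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return labels, silhouette_score(auc_vectors, labels)
```

The dendrogram in Figure 3 corresponds to the linkage matrix `Z`; cutting it at two clusters gives the partition summarized in Table I.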
In particular, the SH3 and OB superfolds "cross-contaminate" one another, whereas the majority of Ig domains are cleanly separated into distinct clusters. Our finding that the SH3 and OB do not cleanly cluster (vs. the separation exhibited by Ig) is consistent with the recent proposal of an Urfold [11] level of protein structure, namely, the notion that there may exist entities that can exhibit similarity at intermediate levels of structural granularity, i.e. between the clear-cut A → T → H, etc. levels of CATH's hierarchical classification scheme. We also examined the average template modeling (TM) score [34] for pairs within the clusters and for pairs within the superfamilies. In our manual spot-checking of a few cases, we invariably found that the average TM-score computed under our clustering scheme was lower than the average TM-score computed within each of the CATH superfamilies. This is notable, as it indicates that the clusters obtained using the AUC reconstruction-based score are not driven purely by geometric properties of the domains, but instead that other properties (e.g., physicochemical) of the domains also influence the goodness of fit to our superfamily model.
C. Future directions
To further validate and establish the initial results reported here, our methodology can be applied to broader sets of protein domains, sampled across many more homologous superfamilies (and other levels in the C-A-T-H hierarchy). Our findings thus far, though limited to only the Ig, SH3 and OB superfamilies, are nevertheless quite interesting: we believe that application of our approach to greater numbers of superfamilies can yield even greater insight into the nature of protein interrelationships (groupings). To improve the performance of our 3D-CNN-based deep learning models, we suspect that probabilistic frameworks such as variational autoencoders (VAEs; [35]) can be fruitfully employed [36]. VAEs have recently emerged as a powerful approach for unsupervised learning of complicated distributions (i.e., highly intricate input → output mappings [latent spaces], with little to no causal information available). For instance, VAEs using concepts from Bayesian machine learning have proven effective in semantic segmentation and visual scene understanding tasks [37], and we anticipate that coupling our 3D-CNN approach to a VAE (versus the simple AE used here) could enhance our protein domain classification methodology. Specific benefits of applying VAEs to our protein problem could include a relaxed need to rely on labelled data (e.g., superfamily labels), and also the capacity to discover relationships between two 'groupings' of proteins, A and B, that have been hitherto entirely unknown; this type of result would have deep evolutionary implications. To try to elucidate the basis for the classification decisions of our models, i.e., to demystify the typical AI "black box" by making our DNN interpretable, we plan to implement the layer-wise relevance propagation (LRP) [38] algorithm; this method is prominent in the realm of explainable machine learning [39].
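As a toy illustration of LRP's core idea (not the paper's implementation), relevance can be propagated backwards through a single linear layer by redistributing each output's relevance to the inputs in proportion to their contributions; the epsilon term stabilizes small denominators:

```python
import numpy as np

def lrp_linear(x, W, rel_out, eps=1e-9):
    """LRP-epsilon for a single linear layer z = W @ x: each input unit
    receives relevance proportional to its contribution to each output.
    x: input activations (n_in,); W: weights (n_out, n_in);
    rel_out: relevance of the outputs (n_out,)."""
    z = W @ x                                    # forward contributions
    s = rel_out / (z + eps * np.where(z >= 0, 1.0, -1.0))
    return x * (W.T @ s)                         # relevance attributed to inputs
```

Applied layer by layer from the output back to the voxel grid, this rule would yield a per-voxel relevance map, i.e., the 'picture' of which regions of a domain drive its reconstruction.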
The LRP algorithm, by explicitly tracking the patterns of learned weights (dependencies) from layer to layer (akin to the backpropagation algorithm), effectively provides a 'picture' of what elements from the input domain map to particular features of the model's output; this, in turn, can afford immense explanatory power, particularly in the context of a geometric object like a 3D protein structure. For our model, LRP has the potential to highlight which voxels in the structure of a protein domain contribute to the reconstruction of the output, and to what extent, perhaps thereby extracting the 'Urfold'.

IV. CONCLUSION
In this paper, we proposed a new approach for measuring similarity between protein domains and assessing existing protein classification schemes. We defined an autoencoder model that incorporates residue and constituent-atom information of the 3D structures of protein domains, as well as their structural and physicochemical properties. We successfully implemented the proposed model, and we empirically demonstrated that it allows us to effectively and efficiently learn the defining properties of protein 3D structures, at least at the superfamily level. Specific results from our calculations showed that considering secondary structural and physicochemical information, in addition to pure geometric information, greatly improves our ability to cluster protein domains. Interestingly, one of the most important features identified by our models is the is_sheet feature (which defines a major structural element in proteins). Since α/β structural motifs are closely associated with protein fold, this result supports the idea that inclusion of secondary-structure motif information alone in our 3D structural similarity classification task disproportionately improved the performance of our models [40]. From our reconstruction-based clustering, the SH3 and OB superfolds "cross-contaminate" one another, whereas the majority of Ig domains are cleanly separated into distinct clusters. Based on the hierarchical classification scheme of CATH, which is predicated largely on 3D structure but also accounts for sequence similarity (H level), superfamilies which belong to the Ig, SH3, and OB folds do indeed differ at the level of architecture (A). Thus, our finding, specifically of SH3 and OB co-clustering, implies that there may well exist more important factors, beyond purely geometric and structural similarities, with which to map the relationships between protein superfamilies; this concept is epitomized by the recent notion of an Urfold [11] level of structural granularity, consisting of entities between CATH's A and T levels.
Future directions will further explore this promising finding.

ACKNOWLEDGEMENTS

We thank Loreto Peter Alonzi for assistance with high-performance computing resources, and Gerard Learmonth and Abigail Flower for feedback and guidance. Portions of this work were supported by the University of Virginia School of Data Science and NSF CAREER award MCB-1350957.

REFERENCES

[1] J. Kuriyan, B. Konforti, and D. Wemmer, The Molecules of Life: Physical and Chemical Principles, 1st edition. New York, NY: Garland Science, 2012.
[2] J. M. Berg, J. L. Tymoczko, and L. Stryer, "Protein Structure and Function," in Biochemistry, 5th edition. W. H. Freeman, 2002.
[3] W. R. Taylor, "Exploring Protein Fold Space," Biomolecules, vol. 10, no. 2, Jan. 2020, doi: 10.3390/biom10020193.
[4] T. C. Terwilliger, D. Stuart, and S. Yokoyama, "Lessons from structural genomics," Annu. Rev. Biophys., vol. 38, pp. 371-383, 2009, doi: 10.1146/annurev.biophys.050708.133740.
[5] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li, "MEME: discovering and analyzing DNA and protein sequence motifs," Nucleic Acids Res., vol. 34, no. suppl 2, pp. W369-W373, Jul. 2006, doi: 10.1093/nar/gkl198.
[6] B. Kuhlman and P. Bradley, "Advances in protein structure prediction and design," Nat. Rev. Mol. Cell Biol., vol. 20, no. 11, pp. 681-697, 2019, doi: 10.1038/s41580-019-0163-x.
[7] C. Mura and C. E. McAnany, "An introduction to biomolecular simulations and docking," Mol. Simul., vol. 40, no. 10-11, pp. 732-764, Aug. 2014, doi: 10.1080/08927022.2014.935372.
[8] A. W. Senior et al., "Improved protein structure prediction using potentials from deep learning," Nature, vol. 577, no. 7792, pp. 706-710, Jan. 2020, doi: 10.1038/s41586-019-1923-7.
[9] M. Knudsen and C. Wiuf, "The CATH database," Hum. Genomics, vol. 4, no. 3, p. 207, Feb. 2010, doi: 10.1186/1479-7364-4-3-207.
[10] N. L. Dawson et al., "CATH: an expanded resource to predict protein function through structure and sequence," Nucleic Acids Res., vol. 45, no. D1, pp. D289-D295, Jan. 2017, doi: 10.1093/nar/gkw1098.
[11] C. Mura, S. Veretnik, and P. E. Bourne, "The Urfold: Structural similarity just above the superfold level?," Protein Sci., vol. 28, no. 12, pp. 2119-2126, Dec. 2019, doi: 10.1002/pro.3742.
[12] P. Youkharibache, S. Veretnik, Q. Li, K. A. Stanek, C. Mura, and P. E. Bourne, "The Small ββ