Deep Learning of Protein Structural Classes: Any Evidence for an 'Urfold'?
Menuka Jaiswal, Saad Saleem, Yonghyeon Kweon, Eli J Draizen, Stella Veretnik, Cameron Mura, Philip E. Bourne
School of Data Science, University of Virginia: Menuka Jaiswal, Saad Saleem, Yonghyeon Kweon, Philip E. Bourne
Department of Biomedical Engineering, University of Virginia: Eli J. Draizen, Stella Veretnik, Cameron Mura
Abstract—Recent computational advances in the accurate prediction of protein three-dimensional (3D) structures from amino acid sequences now present a unique opportunity to decipher the interrelationships between proteins. This task entails, but is not equivalent to, a problem of 3D structure comparison and classification. Historically, protein domain classification has been a largely manual and subjective activity, relying upon various heuristics. Databases such as CATH represent significant steps towards a more systematic (and automatable) approach, yet there still remains much room for the development of more scalable and quantitative classification methods, grounded in machine learning. We suspect that re-examining these relationships via a Deep Learning (DL) approach may entail a large-scale restructuring of classification schemes, improved with respect to the interpretability of distant relationships between proteins. Here, we describe our training of DL models on protein domain structures (and their associated physicochemical properties) in order to evaluate classification properties at CATH's homologous superfamily (SF) level. To achieve this, we have devised and applied an extension of image-classification and image-segmentation techniques, utilizing a convolutional autoencoder model architecture. Our DL architecture allows models to learn structural features that, in a sense, 'define' different homologous SFs. We evaluate and quantify pairwise 'distances' between SFs by building one model per SF and comparing the loss functions of the models. Hierarchical clustering on these distance matrices provides a new view of protein interrelationships, a view that extends beyond simple structural/geometric similarity and towards the realm of structure/function properties.
Index Terms—Autoencoders; CNNs; CATH; deep learning; protein domain classification; protein structure
I. INTRODUCTION
Proteins are key biological macromolecules that consist of long, unbranched chains of amino acids (AAs) linked via peptide bonds [1]; they perform most of the physiological functions of cellular life (enzymes, receptors, etc.). Different sequences of the 20 naturally-occurring AAs can fold into tertiary structures with variable degrees of geometric similarity. At the sequence level, the differences between any two proteins can range from relatively modest single-point edits ("point mutations" or 'substitutions') to larger-scale changes such as reorganization of entire segments of a polypeptide chain. Such changes are critical because they influence 3D structure, and protein function stems from 3D structure [2]. Indeed, our ability to elucidate protein function and evolution is intimately linked to our knowledge of protein structure. Equally important, interrelationships between structures define a map of the protein universe [3]. Thus, it is paramount to have a robust classification system for categorically organizing protein structures based upon their similarities. Even more basically, what does 'similarity' mean in this context (geometrically, chemically, etc.)? And, are there particular formulations of 'similarity' that are more salient than others? Ideally, any system of comparison would also take into account functional properties, i.e., not just raw geometric shape, but also properties such as acidity/basicity, physicochemical features of the surface, "binding pockets", and so on. An unprecedented volume of protein structural and functional data is now available, largely because of exponential growth in genome sequencing and efforts such as structural genomics [4]; alongside these data, novel computational approaches can now discover subtle signals in sets of sequences [5].
These advances have yielded a vast trove of potential protein sequences and structures, but the utility of the information has been limited because the mapping from sequence to structure (or 'fold') has remained enigmatic; this decades-long grand challenge is known as the "protein folding problem" [6]. As computational approaches to the folding problem continually improve, increasingly we will compute 3D structures from protein sequences de novo. Therefore, we expect to see the demand for identifying and cataloging new protein structures grow at an ever-increasing pace with the rise in the number of 3D structures, both experimentally determined as well as computationally predicted (via physics-based simulations [7] or via Deep Learning approaches [8]). Historically, efforts to catalog protein structures have been a largely manual and painstaking process, fraught with heuristics. There has been a shift in the paradigm since the introduction of methodical hierarchical database structures, such as CATH, which engender more robust classification schemes into which new protein structures can be incorporated [9]. This is critical as we expand our knowledge of known protein structures. The CATH database has seen phenomenal growth, going from 53M protein domains classified into 2737 homologous SFs in 2016 [10] to a current 95M protein domains classified into 6119 SFs. Despite being one of the most comprehensive resources, the CATH database (like any) is not without its limitations, in terms of its underlying assumptions about the relationships between entities.
For instance, it was recently argued that there exists an intermediate level of structural granularity that lies between the CATH hierarchy's architecture (A) and topology (T) strata; dubbed the Urfold [11], this representational level is thought to capture the phenomenon of "architectural similarity despite topological variability", as exhibited by the structure/function similarity in a deeply-varying collection of proteins that contain a characteristic small β-barrel (SBB) domain [12]. With recent advances in computing power, Deep Learning methods have begun to be applied to protein structures, in terms of predictions, similarity assessment and classification [13], [14]. However, deep neural networks (DNNs), and specifically 3D CNNs, have not yet seen widespread use with protein structures. This is likely the case because: (1) there is no single, 'standard'/canonical orientation across the set of all protein structures [15], which is problematic for CNNs (which are not rotationally invariant), and (2) the computational demands to train such models are exorbitant [16]. In this paper, we present a new Deep Learning method to quantify similarities between protein domains using a 3D-CNN-based autoencoder architecture. Our approach treats protein structures as 3D images, with any associated properties (physicochemical, evolutionary, etc.) incorporated as regional attributes (on a voxel-by-voxel basis). To obviate the problem of angular dependence, we apply random rotations to a given protein, yielding multiple copies of the protein domain; note that these geometric transformations are essentially a form of data augmentation [17]. In this work, we adapted current deep learning architectures, such as the CNNs used in image segmentation and classification tasks [18], for application to our 3D protein classification problem.
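As a concrete illustration of the rotational augmentation described above, a Haar-uniform rotation of a structure's atomic coordinates can be sketched as follows; the function name and array shapes are our own, not taken from the paper's codebase:

```python
import numpy as np
from scipy.stats import special_ortho_group

def random_rotation(coords, seed=None):
    """Rotate an (N, 3) array of atom coordinates about its centroid by a
    matrix drawn uniformly (Haar measure) from the rotation group SO(3)."""
    R = special_ortho_group.rvs(3, random_state=seed)  # Haar-uniform rotation
    center = coords.mean(axis=0)
    return (coords - center) @ R.T + center
```

Each call yields a fresh orientation, so repeated application to the same domain produces the rotated copies used for data augmentation.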
A benefit of our approach, as it pertains to proteins, is that 3D protein structures are rather sparse, in terms of the fractional occupancy of voxels in a region of 3D space to which the CNN is applied; this feature can be leveraged for rapid computation via so-called sparse CNNs [19]. Past work with 3D medical images has shown the viability of using sparse CNN architectures for classification and cellular morphological analysis [20].

II. METHODS
A. Datasets and Initial Featurization
The primary source of data for this project was the CATH protein structure classification database [9]. CATH is a hierarchical classification scheme that organizes all known protein 3D structures (from the Protein Data Bank [PDB; [21]]) by their similarity, with implicit inclusion of some evolutionary information. The PDB houses nearly 180,000 biomolecular structures, determined via experimental means such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). CATH uses both automated and manual methods to parse each polypeptide chain in a PDB structure into individual domains. Domain-level entities are then classified within a structural hierarchy: Class, Architecture, Topology and Homologous superfamily (see also Fig. 1 in [11] for more on this). If there is compelling evidence that two domains are evolutionarily related (i.e., homologous, based on sequence similarity), then they are classified within the same superfamily. For each domain, we obtain 3D structures and corresponding homologous superfamily labels from CATH. Next, we compute a host of derived properties for each domain in CATH (Draizen et al., in prep), including (i) purely geometric/structural quantities, e.g. secondary structure [21], solvent accessibility [22]; (ii) physicochemical properties, e.g. hydrophobicity, partial charges, electrostatic potentials [23]; and (iii) basic chemical descriptors (atom and residue types). As the initial use-cases reported here, we examined three homologous superfamilies of particular interest to us: namely, immunoglobulins (2.60.40.10), the SH3 domain (2.30.30.100), and the OB fold (2.40.50.140). Our models were built using 5583, 2834 and 585 domain structures of Ig, SH3 and OB, respectively.
B. Preprocessing; Further Data Engineering
We began by considering the aforementioned features for each atom in the primary sequence of each domain. Our first step was to 'clean' the data by eliminating the features that contained more than 25 percent missing values. Next, we converted continuous-valued numerical features into binary ones. For example, hydrophobicity values were mapped to 1 if positive and 0 otherwise. Next, we examined potential correlations among features and eliminated from further consideration any which were redundant (i.e., highly correlated with an existing feature). At this point, our main concern was the computational expense of the problem at hand, and our consideration that it would be cost-ineffective to train convolutional models that incorporate detailed protein structural features which are redundant (e.g., expected training times of several days on a cloud infrastructure). At the end of the preprocessing step, we were left with 38 features that included one-hot encoded representations of (i) atom type (C, CA, N, O, OH), (ii) element type (C_elem, N_elem, O_elem, S_elem), and (iii) residue type (ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN, PRO, GLN, ARG, SER, THR, VAL, TRP, TYR). We also included physicochemical, secondary structural, and residue-associated binary properties at the atomic level (e.g., is_hydrophobic, is_electronegative, positive_charge, atom_is_buried, residue_is_buried, is_helix, is_sheet). Next, we represented protein domains as voxels (3D volumetric pixels) using an in-house discretization approach (Draizen et al., in prep). Briefly, our method centers protein domains in a 256³ cube (to allow for large domains), and each atom is mapped to a 1×1×1 voxel using a k-d tree data structure (the search-ball radius is set to the van der Waals radius of the atom). If two atoms share the space in a given voxel, the element-wise maximum of their feature vectors is used, since all features are binary-valued.
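A minimal sketch of this discretization step follows; it is our own simplification of the in-house method (the exact grid handling and k-d tree usage may differ), but it captures the centering, the van der Waals search ball, and the binary max-merge:

```python
import numpy as np

def voxelize(coords, features, radii, grid=256):
    """Center a domain in a grid^3 cube and map each atom to the 1x1x1 voxels
    inside its van der Waals sphere; overlapping atoms are combined by an
    element-wise maximum (all features are binary). Simplified sketch."""
    coords = coords - coords.mean(axis=0) + grid / 2.0  # center in the cube
    voxels = {}  # sparse representation: (i, j, k) -> feature vector
    for xyz, feat, r in zip(coords, features, radii):
        lo = np.floor(xyz - r).astype(int)
        hi = np.ceil(xyz + r).astype(int)
        for idx in np.ndindex(*(hi - lo)):
            ijk = tuple(lo + np.array(idx))
            center = np.array(ijk) + 0.5
            if np.linalg.norm(center - xyz) <= r:       # inside the vdW sphere
                prev = voxels.get(ijk, np.zeros_like(feat))
                voxels[ijk] = np.maximum(prev, feat)    # binary max-merge
    return voxels
```

Because only occupied voxels are stored, the dictionary doubles as the sparse encoding consumed by the sparse convolutional network described below.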
Because a significant fraction of voxels do not contain any atoms (proteins are not cube-shaped!), protein domain structures can be encoded via a sparse representation; this substantially mitigates the computational costs of our deep learning workflow.

C. Model Design and Training
A 3D-CNN autoencoder was built for each of the three homologous superfamilies considered here (Ig, SH3, OB). Our network architecture is inspired by U-Net [18], a convolutional network used in biomedical image segmentation. The U-Net, which attempts to recreate the input after passing it through the network, consists of a contractive path and an expansive path, giving the eponymous U-shaped architecture. In the contractive path, each hidden layer contains two 3×3×3 convolutions, each followed by a rectified linear unit (ReLU), and then a 2×2×2 max-pooling layer with strides of two in each dimension. During the contractive path, spatial information is reduced (down-sampled) while feature information is increased. In the expansive path, each layer consists of a 2×2×2 up-convolution, with strides of two in each dimension, followed by two 3×3×3 convolutions, each of which is followed by ReLU nodes. In the convolutional layers, we utilized submanifold sparse convolution operations [24], an approach that can exploit sparsity of the data (the case for protein domains) in building computationally efficient sparse networks. The sparse implementation of U-Net replaces the max-pooling operation with another convolutional operation. Our network has 32 filters in the first layer, and we double the number of filters each time the data is downsampled; there are five layers of downsampling (Figure 1). We use a linear activation function in the final layer. The convolutional blocks in our network utilize an approach from Oxford's Visual Geometry Group (VGG); VGG-style blocks have been shown to work well in object recognition [25]. To avoid overfitting, we performed extensive data augmentation and added dropout layers with a rate of 0.5.
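A dense PyTorch stand-in for this encoder/decoder design is sketched below. It is illustrative only: we substitute ordinary convolutions for submanifold sparse convolutions, use strided convolutions in place of max pooling (as the sparse variant does), and shrink the depth from five downsamplings to three to keep the example small:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # VGG-style block: two 3x3x3 convolutions, each followed by ReLU
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv3d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

class AutoencoderSketch(nn.Module):
    """Dense stand-in for the sparse U-Net-style autoencoder (no skip
    connections between the contracting and expansive paths)."""
    def __init__(self, n_feat=38, base=32, depth=3):
        super().__init__()
        enc, dec, c = [], [], n_feat
        for d in range(depth):
            enc += [conv_block(c, base * 2**d),
                    nn.Conv3d(base * 2**d, base * 2**d, 2, stride=2)]  # downsample
            c = base * 2**d
        for d in reversed(range(depth)):
            c_out = base * 2**(d - 1) if d > 0 else base
            dec += [nn.ConvTranspose3d(c, c, 2, stride=2),  # 2x2x2 up-convolution
                    conv_block(c, c_out)]
            c = c_out
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)
        self.head = nn.Conv3d(base, n_feat, 1)  # linear activation in final layer

    def forward(self, x):
        return self.head(self.decoder(self.encoder(x)))
```

With no skip connections, all information must pass through the bottleneck, which is what forces the encoder to learn a compact, superfamily-specific representation.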
For data augmentation, we applied random rotations (mentioned above) to each protein structure; these rotations were in the form of orthogonal transformation matrices drawn from the Haar distribution, which is the uniform distribution on the 3D rotation group (i.e., SO(3); [26]). In our implementation, we do not concatenate the high-resolution features from the contracting path with the upsampled output, as is done in [18]. This ensures that the network does not inadvertently learn to skip the lower network blocks, which would effectively short-circuit itself and contribute to overfitting.

We optimize against sum-squared errors in the output of our model. We use sum-squared errors because of the relative ease of optimization; however, this may not be ideal for our task of binary classification at the level of each voxel, and this could be a direction to pursue in future implementations of our model. We used stochastic gradient descent (SGD) as the optimization algorithm, with a momentum of 0.9 and a weight decay of 0.0001. We began with a learning rate of 0.01 and decreased its value by a factor of e^((1 - epoch) * decay) in each training epoch, using 0.04 as the learning-rate decay factor (as suggested in [24]). Our final network has around 5M parameters in total, and all the networks were trained for 100 epochs using a batch size of 16. We used the open-source PyTorch framework for all training and inference [27].

Figure 1. Illustration of our final network architecture. Dark blue boxes represent sparse convolutions; orange boxes represent size-2, stride-2 downsampling convolutions; light blue boxes represent the deconvolutions. The greyed-out arrows indicate no concatenation of features from the contracting path with the upsampled output.

D. Evaluation of Model Performance
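In PyTorch, the optimizer and per-epoch schedule just described might look like the following sketch; the epoch indexing is adjusted so that the factor matches e^((1 - epoch) * 0.04) with 1-indexed epochs:

```python
import math
import torch

def make_optimizer(params, lr0=0.01, decay=0.04):
    """SGD with momentum 0.9 and weight decay 1e-4; the learning rate shrinks
    each epoch by the factor e^((1 - epoch) * decay) for 1-indexed epochs
    (equivalently e^(-epoch * decay) for PyTorch's 0-indexed epochs)."""
    opt = torch.optim.SGD(params, lr=lr0, momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda epoch: math.exp(-epoch * decay))
    return opt, sched
```

In a training loop, `sched.step()` would be called once per epoch after the optimizer steps, so the rate decays smoothly over the 100 training epochs.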
For an autoencoder model with binary input features, the output is expected to be binary as well. Hence, we treat our problem as a binary classification task at the level of each voxel and each feature. We calculate the area under the Receiver Operating Characteristic curve (AUC) [28] as the primary measure for evaluating the performance of our deep learning models. The average AUC of the model trained and tested on the Ig superfamily was 0.81, while for SH3 and OB it was 0.88 and 0.89, respectively. This indicates that the SH3 and OB structures were more readily learned than the Ig superfamily structures.

III. RESULTS AND DISCUSSION
The approach developed and implemented in this work, as illustrated by the initial results described here, can help us validate and otherwise assess existing classification schemes (e.g., CATH, SCOP [29], ECOD [30]). Perhaps most long-term, we believe our methodology can lay a broad foundation for a robust, quantitative, and automatable/scalable mechanism for protein structure classification. This capability, in turn, would represent an advance on many fronts: for example, as a basis for improved processing pipelines for biomolecular structural data and, even more fundamentally, as regards our understanding of biomolecular evolution. Ultimately, can deep learning help us discover more 'natural' groupings of proteins? The remainder of this section describes our initial findings, using the Ig, SH3 and OB folds as intriguing use-cases.
A. Feature Importance via Analysis of ROC Curves
The determination and extraction of 'optimal' features plays a critical role in areas such as protein sequence analysis, as well as in the prediction of protein structures, functions and interactions [31]. Knowing the most 'important' features (e.g., those with the greatest predictive power) becomes especially important when there exists abundant data, but there also exist severe limitations in terms of either computational resources or else helpful models/abstractions with which to compute on the data. Therefore, our initial analyses were concerned with obtaining the most important (predictive) features for our task of protein classification. As mentioned above, we extracted four categories of features: type of atom, type of amino acid, corresponding physicochemical properties, and secondary structural properties. To evaluate the impact of each feature group on the reconstruction ability of our models, we individually utilized each set of features to build the autoencoder and evaluated the area under the ROC curve. Figure 2 shows the ROC curves and AUC values for the top six features; note that the plots in this figure are the averages of independent ROC curves obtained by submitting all protein domains of one variety (e.g., "all Ig domains") into the respective superfamily model (e.g., "the Ig-only model") that was trained on a single feature. This ROC plot reveals that the most important (accuracy-determining) features are (i) atom type, (ii) certain physicochemical properties (burial, electronegativity), and (iii) secondary structural class (is_sheet).

Figure 2. The ROC curves and AUC values for the top six features. ROC curves were obtained by training six models, one model per feature. The values in parentheses beside each feature name are the corresponding AUC values.
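The per-feature reconstruction scoring described above reduces to computing a ROC AUC over the flattened voxel-by-feature arrays; a sketch with scikit-learn (the array shapes are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def reconstruction_auc(y_true, y_score):
    """AUC of an autoencoder reconstruction, treating every (voxel, feature)
    cell as an independent binary prediction.
    y_true: binary array of shape (n_voxels, n_features)
    y_score: reconstruction scores of the same shape."""
    return roc_auc_score(y_true.ravel(), y_score.ravel())
```

Restricting the columns of `y_true` and `y_score` to a single feature group yields the per-group curves averaged in Figure 2.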
B. Reconstruction-based Clustering of Protein Superfamilies
To obtain potential clusters of similar protein domains ('similar' under our 3D-CNN/autoencoder approach), we first trained three autoencoder models, one for each of the Ig, SH3 and OB homologous superfamilies. Next, randomly selected out-of-family proteins were passed into a family-specific model (e.g., a random Ig passed into the SH3-only model), and reconstruction AUC values for each sample were generated. The basic idea is to utilize the reconstruction AUC as a metric of similarity between (i) a randomly-chosen trial structure, and (ii) the superfamily that is represented (and hopefully accurately 'captured') by a given model. In this way, 150 random samples of domains were selected (50 from each of the SH3, OB and Ig superfamilies), and the reconstruction AUCs from the three superfamily models were generated for each of these 150 domains. Using these reconstruction-AUC vectors as features, hierarchical agglomerative clustering was performed using the Python scikit-learn package [32]. The dendrogram in Figure 3 shows the clusters obtained via single-linkage ("nearest neighbor") clustering using the Euclidean distance, or else using Ward's method, as the parameters in our hierarchical clustering tasks. The dendrogram clearly indicates the existence of two dominant clusters in the data. The results of cluster assignment remained consistent with the use of k-means clustering, with a silhouette score [33] of 0.56. (The silhouette measures intra-cluster 'cohesion' versus inter-cluster 'separation'; it ranges from -1 [poor grouping] to +1 [good grouping].) Table I is the confusion matrix of superfamily vs. obtained cluster labels. As can be seen from the dendrogram and the confusion matrix, the domain clustering that we computed (i.e., {Ig, {SH3, OB}}) differs in some detail from that provided by the CATH classification system.

Figure 3. Hierarchical agglomerative clustering based on the AUC scores for various proteins from the Ig, OB and SH3 superfamilies.
Note the relatively cohesive grouping of the Ig fold in this dendrogram, while instances of the OB and SH3 folds do not segregate nearly as 'cleanly'.

Table I. Confusion matrix of superfamily vs. cluster labels.

Superfamily   Cluster 1   Cluster 2
Ig                    1          49
SH3                  19          31
OB                   24          26
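The clustering of reconstruction-AUC vectors can be sketched as follows; we use SciPy's Ward linkage here (the paper used scikit-learn), and the 150×3 feature matrix of per-superfamily-model AUCs is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def cluster_domains(auc_vectors, n_clusters=2):
    """Hierarchically cluster domains by their reconstruction-AUC vectors
    (rows = domains, columns = superfamily models) and report the
    silhouette score of the resulting partition."""
    Z = linkage(auc_vectors, method="ward")        # Euclidean by construction
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return labels, silhouette_score(auc_vectors, labels)
```

The dendrogram in Figure 3 corresponds to the linkage matrix `Z`; cutting it at two clusters gives the partition summarized in Table I.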
In particular, the SH3 and OB superfolds "cross-contaminate" one another, whereas the majority of Ig domains are cleanly separated into distinct clusters. Our finding that the SH3 and OB do not cleanly cluster (vs. the separation exhibited by Ig) is consistent with the recent proposal of an Urfold [11] level of protein structure, namely, the notion that there may exist entities that can exhibit similarity at intermediate levels of structural granularity, i.e. between the clear-cut A → T → H, etc. levels of CATH's hierarchical classification scheme. We also examined the average template modeling (TM) score [34] for pairs within the clusters and for pairs within the superfamilies. In our manual spot-checking of a few cases, we invariably found that the average TM-score computed under our clustering scheme was lower than the average TM-score computed within each of the CATH superfamilies. This is notable, as it indicates that the clusters obtained using the AUC reconstruction-based score are not driven purely by geometric properties of the domains, but instead that other properties (e.g., physicochemical) of the domains also influence the goodness of fit to our superfamily model.
C. Future directions
To further validate and establish the initial results reported here, our methodology can be applied to broader sets of protein domains, sampled across many more homologous superfamilies (and other levels in the C-A-T-H hierarchy). Our findings thus far, though limited to only the Ig, SH3 and OB superfamilies, are nevertheless quite interesting: we believe that application of our approach to greater numbers of superfamilies can yield even greater insight into the nature of protein interrelationships (groupings). To improve the performance of our 3D-CNN-based deep learning models, we suspect that probabilistic frameworks such as variational autoencoders (VAEs; [35]) can be fruitfully employed [36]. VAEs have recently emerged as a powerful approach for unsupervised learning of complicated distributions (i.e., highly intricate input → output mappings [latent spaces], with little to no causal information available). For instance, VAEs using concepts from Bayesian machine learning have proven effective in semantic segmentation and visual scene understanding tasks [37], and we anticipate that coupling our 3D-CNN approach to a VAE (versus the simple AE used here) could enhance our protein domain classification methodology. Specific benefits of applying VAEs to our protein problem could include a relaxed need to rely on labelled data (e.g., superfamily labels), and also the capacity to discover relationships between two 'groupings' of proteins, A and B, that have been hitherto entirely unknown; this type of result would have deep evolutionary implications. To try to elucidate the basis for the classification decisions of our models, i.e., to demystify the typical AI "black box" by making our DNN interpretable, we plan to implement the layer-wise relevance propagation (LRP) [38] algorithm; this method is prominent in the realm of explainable machine learning [39].
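As a toy illustration of LRP's core idea (not the paper's implementation), relevance can be propagated backwards through a single linear layer by redistributing each output's relevance to the inputs in proportion to their contributions; the epsilon term stabilizes small denominators:

```python
import numpy as np

def lrp_linear(x, W, rel_out, eps=1e-9):
    """LRP-epsilon for a single linear layer z = W @ x: each input unit
    receives relevance proportional to its contribution to each output.
    x: input activations (n_in,); W: weights (n_out, n_in);
    rel_out: relevance of the outputs (n_out,)."""
    z = W @ x                                    # forward contributions
    s = rel_out / (z + eps * np.where(z >= 0, 1.0, -1.0))
    return x * (W.T @ s)                         # relevance attributed to inputs
```

Applied layer by layer from the output back to the voxel grid, this rule would yield a per-voxel relevance map, i.e., the 'picture' of which regions of a domain drive its reconstruction.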
The LRP algorithm, by explicitly tracking the patterns of learned weights (dependencies) from layer to layer (akin to the backpropagation algorithm), effectively provides a 'picture' of what elements from the input domain map to particular features of the model's output; this, in turn, can afford immense explanatory power, particularly in the context of a geometric object like a 3D protein structure. For our model, LRP has the potential to highlight which voxels in the structure of a protein domain contribute to the reconstruction of the output, and to what extent, perhaps thereby extracting the 'Urfold'.

IV. CONCLUSION
In this paper, we proposed a new approach for measuring similarity between protein domains and assessing existing protein classification schemes. We defined an autoencoder model that incorporates residue and constituent-atom information of the 3D structures of protein domains, as well as their structural and physicochemical properties. We successfully implemented the proposed model, and we empirically demonstrated that it allows us to effectively and efficiently learn the defining properties of protein 3D structures, at least at the superfamily level. Specific results from our calculations showed that considering secondary structural and physicochemical information, in addition to pure geometric information, greatly improves our ability to cluster protein domains. Interestingly, one of the most important features identified by our models is the is_sheet feature (which defines a major structural element in proteins). Since α/β structural motifs are closely associated with protein fold, this result supports the idea that inclusion of secondary-structure motif information alone in our 3D structural similarity classification task disproportionately improved the performance of our models [40]. From our reconstruction-based clustering, the SH3 and OB superfolds "cross-contaminate" one another, whereas the majority of Ig domains are cleanly separated into distinct clusters. Based on the hierarchical classification scheme of CATH, which is predicated largely on 3D structure but also accounts for sequence similarity (H level), superfamilies which belong to the Ig, SH3, and OB folds do indeed differ at the level of architecture (A). Thus, our finding, specifically of SH3 and OB co-clustering, implies that there may well exist more important factors, beyond purely geometric and structural similarities, with which to map the relationships between protein superfamilies; this concept is epitomized by the recent notion of an Urfold [11] level of structural granularity, consisting of entities between CATH's A and T levels.
Future directions will further explore this promising finding.

ACKNOWLEDGEMENTS

We thank Loreto Peter Alonzi for assistance with high-performance computing resources, and Gerard Learmonth and Abigail Flower for feedback and guidance. Portions of this work were supported by the University of Virginia School of Data Science and NSF CAREER award MCB-1350957.

REFERENCES

[1] J. Kuriyan, B. Konforti, and D. Wemmer, The Molecules of Life: Physical and Chemical Principles, 1st edition. New York, NY: Garland Science, 2012.
[2] J. M. Berg, J. L. Tymoczko, and L. Stryer, "Protein Structure and Function," in Biochemistry, 5th edition. W. H. Freeman, 2002.
[3] W. R. Taylor, "Exploring Protein Fold Space," Biomolecules, vol. 10, no. 2, Jan. 2020, doi: 10.3390/biom10020193.
[4] T. C. Terwilliger, D. Stuart, and S. Yokoyama, "Lessons from structural genomics," Annu. Rev. Biophys., vol. 38, pp. 371-383, 2009, doi: 10.1146/annurev.biophys.050708.133740.
[5] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li, "MEME: discovering and analyzing DNA and protein sequence motifs," Nucleic Acids Res., vol. 34, no. suppl 2, pp. W369-W373, Jul. 2006, doi: 10.1093/nar/gkl198.
[6] B. Kuhlman and P. Bradley, "Advances in protein structure prediction and design," Nat. Rev. Mol. Cell Biol., vol. 20, no. 11, pp. 681-697, 2019, doi: 10.1038/s41580-019-0163-x.
[7] C. Mura and C. E. McAnany, "An introduction to biomolecular simulations and docking," Mol. Simul., vol. 40, no. 10-11, pp. 732-764, Aug. 2014, doi: 10.1080/08927022.2014.935372.
[8] A. W. Senior et al., "Improved protein structure prediction using potentials from deep learning," Nature, vol. 577, no. 7792, pp. 706-710, Jan. 2020, doi: 10.1038/s41586-019-1923-7.
[9] M. Knudsen and C. Wiuf, "The CATH database," Hum. Genomics, vol. 4, no. 3, p. 207, Feb. 2010, doi: 10.1186/1479-7364-4-3-207.
[10] N. L. Dawson et al., "CATH: an expanded resource to predict protein function through structure and sequence," Nucleic Acids Res., vol. 45, no. D1, pp. D289-D295, Jan. 2017, doi: 10.1093/nar/gkw1098.
[11] C. Mura, S. Veretnik, and P. E. Bourne, "The Urfold: Structural similarity just above the superfold level?," Protein Sci., vol. 28, no. 12, pp. 2119-2126, Dec. 2019, doi: 10.1002/pro.3742.
[12] P. Youkharibache, S. Veretnik, Q. Li, K. A. Stanek, C. Mura, and P. E. Bourne, "The Small ββ