VeRNAl: Mining RNA Structures for Fuzzy Base Pairing Network Motifs
Carlos Oliver, Vincent Mallet, Pericles Philippopoulos, William L. Hamilton, Jerome Waldispuhl
VeRNAl: A Tool for Mining Fuzzy Network Motifs in RNA
Carlos Oliver ∗†‡, Vincent Mallet ∗§¶, Pericles Philippopoulos ‖, William L. Hamilton †‡ and Jérôme Waldispühl §
† School of Computer Science, McGill University
‡ Montreal Institute for Learning Algorithms
§ Structural Bioinformatics Unit, Pasteur Institute
¶ CBIO, Les Mines-Paristech
‖ Department of Physics, McGill University
∗ Both authors contributed equally
ABSTRACT
Motivation: RNAs are ubiquitous molecules involved in many regulatory and catalytic processes. Their ability to form complex structures is often key to supporting these functions. Remarkably, RNA 3D structures are articulated around smaller 3D sub-units, referred to as RNA 3D motifs, that can be found in unrelated molecules. The classification of these 3D motifs is thus essential to characterize RNA structures, but current methods can only retrieve motifs with identical base interaction patterns.
Results: Here, we relax this constraint by posing the motif finding problem as a graph representation learning and clustering task. This framing takes advantage of the continuous nature of graph representations to model the flexibility of RNA motifs while retaining the convenient encoding of RNAs as graphs. We propose a set of node similarity functions, clustering methods, and motif construction algorithms to recover flexible RNA motifs. We show that our methods are able to retrieve and expand known classes of motifs, and also to identify new motifs. Our tool, VeRNAl, can be easily customized by users to desired levels of motif flexibility, abundance and size.
Availability and Implementation: The source code, data and a webserver are available at vernal.cs.mcgill.ca
Contact: [email protected]
Supplementary Information: All supplementary files are available online
1. Introduction
Non-coding functions of ribonucleic acids (RNAs) are frequently determined by their 3D structure and folding dynamics [1]. The linear chain of nucleotides (A, U, C, G) first forms canonical (Watson-Crick and Wobble) and non-canonical (all the others) base pairs [2], which serve as a scaffold for the formation of the full tertiary structure. The conservation of these base pairs is thus essential to preserve the folding properties of the RNA and offers a robust signature for the functional classification of RNAs [3]. The comparison of experimentally determined RNA structures revealed the occurrence of highly similar 3D sub-units, called RNA 3D motifs, that are characterized by similar base pair networks and repeated across unrelated RNAs [4]. A complete library of RNA 3D motifs would be a valuable source of information for evolutionary studies and would also boost structure prediction methods.

Efficient and automated methods to compare databases of RNA structures are essential to achieve this goal [5], [6], [7]. Their results contributed to the advancement of sequence-structure prediction tools [8], [9], and showed promise for interpreting function prediction algorithms based on RNA 3D networks [10].
RNA motif mining methods can be broadly classified in two categories: 3D-based and graph-based. 3D-based tools seek to identify families of related structures by performing alignments and clustering of atomic coordinates. RNA 3D Motif Atlas [7], RNABricks [11], and RNA MCS [12] illustrate this approach. Since similarity can be conveniently defined directly in Euclidean space for atomic coordinates, the notion of structural proximity of motifs identified by these tools naturally accommodates some degree of variability. However, these methods require a decomposition of RNA into rigid sub-units to be compared to each other (i.e., comparing all internal loops to each other), which limits the scope of possible motifs to be found.

On the other hand, network-based tools aim to identify similarities at the base pairing level. This approach is computationally more efficient and effective because the base pair networks provide a robust signature of the 3D structure. More formally, for any RNA 3D structure (set of atomic coordinates) we can build a multi-relational graph where nodes correspond to nucleotides and base pairing edges are labeled with one of 12 possible nucleotide pairing geometries, as described in Westhof et al. [2]. In this set of 12 geometries, we can find the standard Watson-Crick and wobble (A-U, C-G, G-U) pairs, also known as "canonical base pairs", which are the most abundant class. However, when interpreting 3D motifs, the remaining 11 geometries, also known as "non-canonical", are typically of great interest [2]. Covalent connections between nucleotides are assigned a non-base-pairing edge type. The edge labels are thus a discretization of the relative spatial orientation of the paired nucleotides and provide information close to the true 3D geometry.

Of course, identifying motifs requires a combinatorial search, and thus strong limitations on the common subgraph mining have to be imposed. One such limitation is the exclusion of variability within motifs (non-isomorphic instances of the same motif). RNA3dmotif [5] was among the first to propose a solution by searching for exact motifs only within certain known structural elements. More recently, CaRNAval [6] attempted to expand the class of motifs by considering interactions that connect multiple secondary structure elements and proposing various heuristics, again only retrieving isomorphic motif instances.

The notion of a flexible motif has been very well studied in the sequence domain [13], where certain DNA sequences are accepted to be related while their nucleotide composition can vary. Not surprisingly, the same applies in the RNA structural domain, where well-known motifs such as the A-minor are known to admit variability in their connectivity pattern [14]. Methods which rely on strict isomorphism would thus fail to identify such instances, as well as miss motifs entirely. Indeed, none of these methods conducts searches for motifs occurring in any context while at the same time allowing for flexibility in the motifs found.

In this work, we leverage the state of the art in graph representation learning to build continuous embeddings of RNA structures and identify structurally conserved yet variable neighborhoods. We then propose two algorithms leveraging these graph representations to perform graph queries and identify novel motifs. With the first, we are able to retrieve known and novel instances of existing motifs. Using our second algorithm, we are able to infer novel motifs, while also identifying established ones.
2. Datasets
We extract motifs from the set of experimentally determined RNA crystal structures [15]. To ensure that the frequency of a motif is not biased by redundant crystal structures, we use the representative set at 4 Angstroms provided by BGSU [7]. We then build RNA networks for each RNA using the FR3D annotations provided by the same framework. This results in a total of 1,297 RNAs and 671,232 nodes (nucleotides). In order to achieve approximately constant batch sizes at training time, our training set consists of graphs chopped into chunks of approximately 50 nucleotides, as detailed in Supplementary Algorithm 4. Once the model is trained, we perform all motif finding operations on the whole graphs. Our validation sets consist of motifs identified by RNA 3D Motif Atlas [7], RNA3dmotif [5], and CaRNAval [6].
3. Methods
We introduce VeRNAl, an algorithm that first decomposes RNA networks into small structural building blocks of RNA and then aggregates these blocks based on their co-occurrence in the graphs. The extraction step introduces custom structural comparison functions (Section 3.2) which are used to build a space of continuous embeddings for efficient clustering (Section 3.2.2). Finally, we introduce a custom Graph Edit Distance for RNA to use as a metric for model selection and evaluation (Section 3.2.3).

We then combine information from the embedding space and connectivity in the graph space into a meta-graph data structure (Section 3.3). We leverage this data structure to retrieve graphs similar to a query (Section 3.4), and to streamline frequent substructure searches and thus identify fuzzy motifs (Section 3.5).
We start with a set of multi-relational graphs $G = (V, E)$ (whole RNA structures) as described above. We define a motif as a set of subgraphs $\mathcal{M} = \{g_1, g_2, \ldots\}$, drawn from $G$, such that the following properties hold:
1) Similar: For any pair $(g_i, g_j) \in \mathcal{M}$, $\mathrm{SIM}(g_i, g_j) \geq \gamma$, where SIM is a similarity function on graphs. We allow the user to set $\gamma$.
2) Connected: $\forall g_i \in \mathcal{M}$, $g_i$ is a connected subgraph.
3) Frequent: the number of subgraphs of $\mathcal{M}$ should be above some user-defined threshold: $|\mathcal{M}| > \delta$.

The motif finding problem is thus to identify all motifs $\mathcal{M}$ that fit the above criteria. An exact solution to this problem would imply enumerating all subsets (search for subgraphs) of $G$ and ensuring that these criteria are satisfied (compare graphs). In the most general case, both procedures admit only exponential-time algorithms [16]. Previous works set $\gamma = 1$, so that the similarity constraint becomes another graph problem, known as the maximal graph isomorphism problem [6]. Additionally, the search step is often also limited by considering only certain substructures. Here, we allow $\gamma < 1$ (fuzziness) and remove search constraints on secondary structure context [6], [7].

Recent advances in Graph Representation Learning provide efficient tools for embedding structural objects in Euclidean space [17]. This allows us to naturally encode the notion of structural similarity and perform the efficient comparisons necessary for identifying fuzzy motifs. More formally, given a parametric function $\phi : u \to \mathbb{R}^d$ (typically a graph neural network) which maps elements $u$ of a graph (nodes, edges, subgraphs, or whole graphs) to real vectors, and a similarity function $s_G$ on these objects, we can train $\phi$ via backpropagation to approximate $s_G$ (Equation 2).

In this manner, the output of the model is an embedding (or representation) of a graph element in Euclidean space, such that distances in this space reflect distances in the graph space in which $s_G$ operates. Conveniently, while $s_G$ can be expensive to compute, the resulting feature map $\phi$, once sufficiently trained, acts inductively and can be cheaply applied to new data [18].
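The three motif criteria defined above (similar, connected, frequent) can be checked directly on a candidate set of subgraphs. The following is a minimal sketch, assuming each subgraph is encoded as a hypothetical `(nodes, edges)` pair and that `sim` is any user-supplied similarity function; it illustrates the definition and is not the search procedure used by VeRNAl.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adjacency[n] - seen)
    return seen == nodes


def is_motif(subgraphs, sim, gamma, delta):
    """Check the three criteria on a candidate motif.

    subgraphs: list of (nodes, edges) pairs (hypothetical encoding).
    sim: similarity function on two subgraphs (assumed, user-supplied).
    """
    frequent = len(subgraphs) > delta
    connected = all(is_connected(n, e) for n, e in subgraphs)
    similar = all(sim(a, b) >= gamma for a, b in combinations(subgraphs, 2))
    return frequent and connected and similar
```

The pairwise similarity check is quadratic in the number of instances, which is one reason an exact approach does not scale.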
Since motifs can be subgraphs of arbitrary size, we first decompose $G$ into fundamental units on which we apply $\phi$ and then reconstruct larger motifs. We choose to decompose $G$ as a set of rooted subgraphs centered at individual nucleotides. A rooted subgraph $g_u$ is the induced subgraph on the set of nodes $u' \in g$ such that $d(u, u') \leq r$, where $d$ is the length of the shortest path between two nodes and $r$ is a user-defined threshold. This is a natural building block of RNA structure which allows for reconstruction of any structural motif.

We note that for RNA motifs, we are only interested in considering edge type and graph structure, and ignore any node information (this can be easily introduced if needed). Notably, it is known that certain relation types (base pairing geometries) share structural similarities. Stombaugh [19] computed the geometric discrepancy between all pairs of relation types. This phenomenon is known as isostericity (shown in Figure A.8). In order to perform fast comparisons and clustering of rooted subgraphs, we introduce various similarity functions which induce a continuous notion of structural similarity.

Here, we define a similarity function between a pair of rooted subgraphs, $g_u$ and $g_v$. The function $s_G$ operates on the output of a function $f : u \to \Omega$ which decomposes a rooted subgraph into a set of objects $\Omega$. These can be a set of nodes, edges, or smaller subgraphs such as graphlets [20]. These objects can then be assigned structural and locality compatibilities. We let $C_{\omega,\omega'}$ be the structural compatibility between objects $\omega, \omega'$, for example, edge isostericity. Next, $D_{\omega,\omega'}$ assigns a cost on pairs of objects depending on the relative path distance to their respective root nodes.
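The rooted subgraph decomposition can be sketched with a breadth-first search. This is a minimal illustration, assuming edges are given as a list of `(u, v, label)` triples treated as undirected; the function name and the labels in the usage example are hypothetical.

```python
from collections import deque


def rooted_subgraph(edges, root, radius):
    """Induced subgraph on all nodes within `radius` hops of `root`.

    edges: list of (u, v, label) triples, treated as undirected.
    Returns (set of nodes, list of kept edges).
    """
    adjacency = {}
    for u, v, _ in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    kept = [(u, v, label) for u, v, label in edges if u in depth and v in depth]
    return set(depth), kept


# Illustrative chain of four nucleotides closed by one base pair.
toy_edges = [(1, 2, "B53"), (2, 3, "B53"), (3, 4, "B53"), (1, 4, "CWW")]
nodes_r1, edges_r1 = rooted_subgraph(toy_edges, root=2, radius=1)
```

With a graph library, the same operation is available off the shelf (for instance `networkx.ego_graph`).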
We propose various similarity functions, based on optimal matching of these objects, with the most general form being:

$$ s_G(g_u, g'_v) := \min_X \sum_{\omega \in \Omega} \sum_{\omega' \in \Omega'} \left( \alpha C_{\omega,\omega'} + \beta D_{\omega,\omega'} \right) X_{\omega,\omega'} \qquad (1) $$

where $X$ is a binary matrix describing a matching from the elements of $\Omega$ to $\Omega'$, and $\alpha$ and $\beta$ are user-defined weights for emphasizing locality vs structural compatibility. We solve for the optimal matching between two sets of structural objects using the Hungarian algorithm [21]. All our similarity functions are described in detail in Supplementary Section C.

In order to identify related groups of rooted subgraphs using only the similarity function, we would have to perform and store $N^2$ operations. With hundreds of thousands of nodes (671,232 in our dataset), this quickly becomes prohibitive. Once nodes in each graph are embedded into a vector space, searches and comparisons are much cheaper as they are vector operations.

We therefore approximate the $s_G$ function over all pairs of rooted subgraphs using node embeddings $Z \in \mathbb{R}^d$ with a learned feature map. We use a Relational Graph Convolutional Network (RGCN) model [22] as the parametric node embedding function $\phi(u) \to \mathbb{R}^d$ which maps nodes to a vector space. The network is implemented in PyTorch [23] and DGL [24]. Given a similarity matrix $K$ induced by $s_G$, this function is trained to minimize:

$$ \mathcal{L} = \left\lVert \langle \phi(u), \phi(v) \rangle - K(u, v) \right\rVert \qquad (2) $$

To make embeddings more focused on subgraphs that contain non-canonical nodes and to prevent the loss from being dominated by the canonical interactions (Watson-Crick pairs), we then scale this loss based on the presence of non-canonical interactions in the neighborhood of each node being compared.
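As an illustration of the optimal matching in Equation 1, the sketch below enumerates all matchings by brute force instead of running the Hungarian algorithm, which is only practical for very small object sets. The object encoding `(edge_label, distance_to_root)`, the two toy cost functions, and the assumption that both costs are symmetric are illustrative choices, not the similarity functions of Supplementary Section C.

```python
from itertools import permutations


def matching_score(objs_a, objs_b, c, d, alpha=1.0, beta=1.0):
    """Minimum of sum(alpha * c + beta * d) over one-to-one matchings.

    Brute-force stand-in for the Hungarian algorithm: every object of the
    smaller set is matched to a distinct object of the larger set.
    c and d are assumed symmetric in their two arguments.
    """
    if len(objs_a) > len(objs_b):
        objs_a, objs_b = objs_b, objs_a
    best = float("inf")
    for perm in permutations(objs_b, len(objs_a)):
        total = sum(alpha * c(a, b) + beta * d(a, b) for a, b in zip(objs_a, perm))
        best = min(best, total)
    return best


def label_cost(a, b):
    """Toy structural term: 0 for identical edge labels, 1 otherwise."""
    return 0.0 if a[0] == b[0] else 1.0


def depth_cost(a, b):
    """Toy locality term: difference in distance to the root."""
    return float(abs(a[1] - b[1]))
```

For larger sets, an assignment solver such as SciPy's `linear_sum_assignment` could replace the enumeration while returning the same minimum.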
Given the frequency of non-canonical interactions $f$, and $\mathbb{1}_u$ an indicator function that denotes the presence of non-canonical interactions in the neighborhood of node $u$, the scale $S_{u,v}$ of the $(u, v)$ term is:

$$ S_{u,v} = \left(1 + \frac{\mathbb{1}_u}{f}\right)\left(1 + \frac{\mathbb{1}_v}{f}\right), \qquad \mathcal{L}_{\text{scaled}} = S \odot \mathcal{L} \qquad (3) $$

We can then perform a clustering in the embedding space using any linear-time clustering algorithm, and this yields the aforementioned structural blocks of RNA. We denote such clusters as 1-motifs and plot them in Figure 3.

The choice of similarity function is application specific. One can use a function which maximizes performance on a downstream supervised learning task, or one can choose a similarity function which best encodes structural identity [17]. Since supervised learning data for RNA 3D structures is scarce, we opt for the latter and propose the Graph Edit Distance (GED) (or its similarity analog $\exp[-\mathrm{GED}]$) between rooted subgraphs, as it is a widely accepted yet computationally intensive gold standard for structure comparison [25]. Interestingly, GED is a generalization of the subgraph isomorphism problem [26], which is at the core of previous RNA motif works such as CaRNAval and RNA3dmotif.

In a nutshell, the GED between two graphs $g$, $h$ is the minimum cost set of modifications that can be made to $g$ in order to make it isomorphic to $h$. This naturally encodes a notion of similarity since similar pairs will require few and inexpensive modifications, and vice versa. We have adapted this algorithm to RNA data. A detailed description of the algorithm is available in Supplementary Section B. We use the isostericity matrix for edge substitutions, and do not apply a penalty to node substitutions.

Let $E(\cdot)$ be a function that returns the edge label for a given edge, and ISO the isostericity function which returns the similarity between edge types.
We define an RNA cost function over a pair of edges $p$ and $q$ as follows:

$$ c(p \to q) = \mathrm{ISO}(E(p), E(q)) $$

$$ c(p \to \emptyset) = \begin{cases} \alpha & \text{backbone} \\ \beta & \text{canonical} \\ \theta & \text{non-canonical} \end{cases} $$

We propose a simple modification to allow for comparison of rooted graphs (Algorithm 5), and use the general version of GED to validate the ultimate full subgraph-level quality of our identified motifs.
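The edge cost function above translates into a small lookup. This is a sketch under stated assumptions: the label strings `"B53"` (backbone) and `"CWW"` (canonical) are illustrative, the `iso` table is a hypothetical dictionary of substitution costs standing in for the isostericity matrix, and the values of `alpha`, `beta` and `theta` are left to the caller.

```python
BACKBONE = "B53"   # illustrative label for covalent backbone edges
CANONICAL = "CWW"  # illustrative label for canonical base pairs


def deletion_cost(label, alpha, beta, theta):
    """Cost c(p -> empty), chosen by the type of the deleted edge."""
    if label == BACKBONE:
        return alpha
    if label == CANONICAL:
        return beta
    return theta  # every other geometry counts as non-canonical


def edge_edit_cost(p, q, iso, alpha, beta, theta):
    """Cost of substituting edge label p by q, or of deleting p when q is None.

    iso: dict mapping (label, label) pairs to a substitution cost
    (hypothetical stand-in for the isostericity matrix).
    """
    if q is None:
        return deletion_cost(p, alpha, beta, theta)
    return iso[(p, q)]
```

Keeping the three deletion costs separate lets the user decide how strongly the loss of a non-canonical interaction is penalized relative to a backbone or canonical edge.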
While there is no limit to the size of a real-world motif, our rooted subgraph embeddings are currently only aware of a fixed-size neighborhood. For this reason, 1-motifs only identify motifs as large as the number of layers in the similarity function/RGCN. However, we can extend these to $k$-motifs by aggregating several clusters based on co-occurrence in the original graph. This allows us to aggregate heterogeneous rooted subgraphs into larger motifs while preserving the property of co-occurrence.

To guide this aggregation, we introduce a meta-graph data structure $\mathcal{G}$, whose meta-nodes are composed of regions of the embedding space and whose edges are based on the connectivity in the RNA graphs between those regions. Hence, the meta-graph simultaneously encodes structural proximity and locality in the graph in one object. To get the meta-nodes, we simply cluster the original node embeddings in $V$ and use the clusters as meta-nodes: $C_i = \{ n \in V, \mathrm{cluster}(n) = i \}$. The number of clusters and their spread are parameters that modulate the fuzziness and the sensitivity of the induced methods. We associate to each node its meta-node, or cluster ID, and its distance to the cluster center. Meta-edges $E_{i,j} = \{ (n_i, n_j) \in (C_i \times C_j) \cap E \}$ store the edges in RNA graphs that go from one cluster to another. This process is illustrated in Figure 1.

The meta-graph data structure enables an efficient implementation of the following algorithms as well as an easier way to describe and visualize them. Building the meta-graph requires RGCN inference on all nodes and clustering, and iterating through all edges in $G$. With linear-time clustering techniques, building the meta-graph is therefore done in time $O(|V| + |E|)$.

The first use of the meta-graph data structure is to retrieve subgraphs similar to a query subgraph.
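The meta-graph construction can be sketched in a single pass over nodes and edges. The sketch assumes cluster assignments have already been computed and are given as a hypothetical `cluster_of` dictionary; meta-edges are keyed by an unordered pair of cluster IDs, which is a simplification.

```python
from collections import defaultdict


def build_meta_graph(cluster_of, rna_edges):
    """Group nodes into meta-nodes and RNA edges into meta-edges.

    cluster_of: dict node -> cluster id (assumed precomputed by clustering).
    rna_edges: iterable of (u, v) pairs from the RNA graphs.
    Runs in O(|V| + |E|) once the clustering is available.
    """
    meta_nodes = defaultdict(set)
    for node, cluster_id in cluster_of.items():
        meta_nodes[cluster_id].add(node)
    meta_edges = defaultdict(list)
    for u, v in rna_edges:
        key = tuple(sorted((cluster_of[u], cluster_of[v])))
        meta_edges[key].append((u, v))
    return dict(meta_nodes), dict(meta_edges)


toy_clusters = {"a": 0, "b": 0, "c": 1, "d": 2}
toy_rna_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
meta_nodes, meta_edges = build_meta_graph(toy_clusters, toy_rna_edges)
```

Storing the original edges under each meta-edge is what later allows a walk in the meta-graph to be translated back into concrete subgraphs.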
Such an algorithm could identify subgraphs that resemble known motifs but which were not identified by tools imposing strict isomorphism [27].

The idea of the algorithm is to use the alignment of the RNA graphs induced by the embeddings (Fig. 1) to efficiently search for similar structures. Using the RGCN, we place the query graph in the embedding space, which creates a query multigraph $G_q$ whose nodes correspond to specific clusters and whose edges are in line with the connectivity of the query. Since each query node is assigned a specific embedding vector, we can directly obtain a "score" inversely related to the distance between a query and a hit node's embeddings. In this sense, a "hit" can be any element of the set of all possible connected subgraphs of $G_q$. The task then becomes to identify the highest scoring of these subgraphs.

To get these connected subgraphs, we start from the set $M$ of all nodes involved in the query. We then iterate through the edges of $G_q$ and try to merge any two elements of $M$ that fall along the current edge. Merging is not trivial, because the graph is not a geometric one: two meta-nodes linked to the same neighbor are not necessarily connected. We implement a merging algorithm, presented in Algorithm 1, to address this problem. Any merge operation increases the score of the resulting set by summing the scores of the merged elements. This retrieval procedure is detailed in Algorithm 2.

Algorithm 1: Merging Algorithm
  Data: S: a set of RNA subgraphs; meta-node C, meta-edge E
  Result: T, an expanded S to include C through E
  T ← ∅
  foreach e ∈ E do
    Get g, the graph e is part of.
    foreach set in g ∩ S do
      if set ∆ e = node then
        T ← T ∪ {set ∪ node}
      end
    end
  end
  return T

Algorithm 2: Motif Instances Retrieval
  Data: Meta-graph (C, E), original RNA graphs (V, E); query multi-graph G_q
  Result: M: motif instance candidates, a set of sets of nodes and their associated scores
  M ← ∪_{C ∈ G_q} C
  foreach E in G_q do
    C_1, C_2 = E
    T_1 ← merge(M, C_1, E)
    T_2 ← merge(M, C_2, E)
    M ← M ∪ T_1 ∪ T_2
  end
  return M

Finding all relevant clusters and looping through the edge list is facilitated by the meta-graph structure: we can see the successive edge merging as a walk in the meta-graph. If a hit encompasses the full query, it will have undergone the most merging operations and obtain a maximal score. However, if one node is missing or if the structure is somewhat different, we still retrieve it with a sub-optimal but high score.

The algorithm remains tractable thanks to the sparsity of the meta-graph, which allows efficient iteration through edges, efficient set operations to expand motifs, and graph-based separation of the candidate hits. A theoretical analysis of the complexity depends heavily on the topology of both the meta-graph and the query graph and is explained further in Supplementary Section F. Empirically, this algorithm runs in 10 s on average on a single core.

We can leverage a strategy similar to the retrieve procedure when mining motifs de novo. The basic intuition of our algorithm, the Motif Aggregation Algorithm (MAA), is that the set of nodes assigned to a given cluster can be considered to be a motif of cardinality 1 (a 1-motif). We can then use the meta-graph to identify clusters with connections to the current motif set to build larger motifs. Because we lack the guidance of the query, instead of merging just along one edge, we merge along all edges in the meta-graph and filter results based on a user-defined minimal frequency $\delta$. As an example, starting with a 1-motif, e.g. the set of subgraphs in cluster $A$, we can create 2-motifs by merging each other cluster in its meta-graph neighborhood, $X \in N(A)$. We then identify each new 2-motif by its constituent meta-nodes. This process can then be iterated to discover $k$-motifs. This is illustrated in Figure 2 and outlined in detail in Algorithm 3.

At each iteration $t$, a motif can be extended by a total of $O(\Delta(\mathcal{G}) \times t)$ meta-nodes, where $\Delta$ yields the maximum degree of the meta-graph. This is because each rooted subgraph in the current motif can potentially form an extending connection. Processing all motifs at a given time step takes $O\left(\binom{C}{t} \times \Delta(\mathcal{G}) \times t\right)$ in the worst case. Of course, in practice the sparsity of the meta-graph limits the growth of the first term, since some clusters do not share connections and will thus not be considered as possible extensions. Once all nodes are processed, we can repeat the same search to obtain higher-order motifs. Naturally, as we obtain larger motifs, the number of instances decreases, and the search abandons motifs with a number of instances below a user-defined threshold $\delta$. The empirical running time depends strongly on hyperparameter choices but is on average a few minutes on a single core.
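The first aggregation step, from 1-motifs to 2-motifs, can be sketched as follows. This is a deliberately simplified illustration: it assumes a hypothetical `meta_edges` dictionary mapping a pair of cluster IDs to the RNA edges between them, treats each such edge as one 2-motif instance, and applies only the frequency filter. The full procedure additionally merges subgraph sets within each RNA graph (Algorithm 1) and iterates to larger motifs.

```python
def two_motifs(meta_edges, delta):
    """Build 2-motifs from meta-edges and keep those with more than delta instances.

    meta_edges: dict (cluster_i, cluster_j) -> list of (u, v) RNA edges
    (hypothetical encoding of the meta-graph).
    Returns a dict mapping a frozenset of two cluster ids to its set of
    instances, each instance being a frozenset of two root nodes.
    """
    motifs = {}
    for (ci, cj), pairs in meta_edges.items():
        if ci == cj:
            continue  # keep only motifs that combine two different clusters
        instances = {frozenset(pair) for pair in pairs}
        if len(instances) > delta:
            motifs[frozenset((ci, cj))] = instances
    return motifs
```

Raising `delta` trades recall for confidence: fewer candidate motifs survive, but each surviving one is supported by more instances.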
4. Results
Our tool relies on graph representation methods to drastically improve the scalability of motif mining and facilitate fuzzy matching of motifs. Thus, we first evaluate the quality of our RNA-specific similarity functions and subsequent RGCN-based embedding model (Section 4.1) and show that structural information is faithfully encoded. Following this, we show that our approach can consistently retrieve existing motifs (Section 4.2) while also uncovering new fuzzy motifs (Section 4.3). Throughout the evaluation of the tools, we use GED as an external (and costly) oracle to select a similarity function and to assess embedding quality and motif consistency. We emphasize that the focus of subsequent analysis is on the soundness of the tool, and in-depth biological interpretation of discovered motifs is left for future work.

Algorithm 3: Motif Aggregation Algorithm (MAA). At each step t, the algorithm iterates through edges (m, m′) of the meta-graph, applying Algorithm 1 to construct a t+1 motif µ. The updated meta-connectivity is stored as new meta-edges.
  Data: Meta-graph G; minimum density δ; number of steps T
  Result: List of meta-graphs
  M ← list()
  E ← G.edges()
  foreach t ∈ {1, .., T} do
    E′ ← ∅
    M[t] ← list()
    while E do
      m, m′ ← E.pop()
      µ ← merge(m.subgraphs, m′, (m, m′))
      if |µ| > δ then
        M[t].append(µ)
        /* Connect new node to adjacent clusters */
        foreach c′ ∈ G.Nei(µ) do
          E′.add((µ, c′))
        end
      end
    end
    E ← E′
  end
  return M

We sample 200 rooted subgraphs of radius 1 and 2 uniformly at random from $G$. We recall that the radius of a graph is the maximum length shortest path between any two nodes in this graph. Next, we compute all-to-all GED on this sample, yielding 200,000 non-trivial values for each radius. We then compute similarities on the same set of subgraphs using various choices of $s_G$ and $\phi$. In Table 1 we summarize the resulting Pearson correlation values. Under these metrics, the best performing method is the graphlet + hungarian method, with an almost perfect correlation at a radius of one and 0.52 at a radius of two. When computing the same correlation on pairs of graphs with low GED to each other (r_threshold), we obtain a higher correlation of 0.637 on the radius-two subgraphs. Since we consider fuzzy motifs to consist of graphs with slight variations, such metrics are more relevant.

Thus we train an RGCN using the graphlet + hungarian similarity function and measure the agreement between embeddings and GED. A correlation value of 0.74 for the thresholded 2-layer RGCN embeddings enables us to claim that the dot product in the embedding space approximates the structural similarities well, especially between similar subgraphs. We even obtain better results for the 2-hop version, which can be explained by regularization of the learned model over inputs where the similarity values and the GED were very different. Moreover, we note that the run time of a comparison becomes negligible, as it amounts to a dot product. Full results are available in Supplementary Table A.5.

Method                 depth  decay  normalization  r      r_threshold  r_nc   r_nc_threshold  time (s)
edge hist.             1      0.300  None           0.633  0.627        0.320  0.334           <0.001
edge hist. + iso       1      0.500  None           0.769  0.783        0.526  0.533           <0.001
edge hungarian + iso   1      −      None           0.791  0.800        0.612  0.612           <0.001
graphlets hist.        1      0.800  None           0.967  0.973        0.948  0.962           0.029
graphlet hungarian     1      −      sqrt           0.996  0.997        0.992  0.995           0.030
graphlet hungarian     2      −      sqrt           0.568  0.637        0.437  0.518           0.433
1hop RGCN              1      −      −              −      −
TABLE 1: Correlation with the GED for different kernels and embedding settings.

Method        Success Rate  Relative ratio in the hit list
True query    0.89          0.18
Decoy query   0.15          0.92
TABLE 2: Comparison of the performance of the retrieve algorithm when used with a query instance vs. a random one.

We complete our report of the performance assessment of the embeddings with a visual representation of the results. In Figure 3, we generate a 2D projection of the local RNA structures from the learned embedding with t-SNE [28]. We draw example subgraphs corresponding to a sample of clusters.

Visually, we observe that similar subgraphs lie in the same clusters. Additional quantitative metrics are provided in Supplemental Section E. This validation provides us with the structural building blocks to assemble and retrieve motifs.
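The thresholded correlation reported in Table 1 (r_threshold) can be sketched as a Pearson correlation restricted to pairs with low GED. The cutoff `max_ged` and the function names are assumptions made for illustration; the threshold actually used is not stated in this section.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def thresholded_pearson(ged, approx, max_ged):
    """Correlation between GED and an approximation, on pairs with GED <= max_ged.

    max_ged is a hypothetical cutoff chosen by the user.
    """
    kept = [(g, a) for g, a in zip(ged, approx) if g <= max_ged]
    return pearson([g for g, _ in kept], [a for _, a in kept])
```

Restricting the comparison to low-GED pairs focuses the evaluation on near-identical subgraphs, which is the regime that matters for fuzzy motifs.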
Next, we turn to the validation of the retrieval algorithm. Given a query graph, the retrieve algorithm returns a list of subgraphs, also known as "hits", in decreasing order of compatibility to the query. We run the algorithm on a selected set of motifs in RNA3dMotifs [5], filtered for sparsity (more than 3 instances) and size (more than 4 nodes). For a given known motif, we perform a retrieve with two types of queries: a true instance of the motif, and an instance of a randomly chosen motif (decoy). We show the resulting ranks in Table 2.

We see that when queried with one instance, the algorithm retrieves other instances in about 90% of the cases. Annotation errors can sometimes result in instances having different graphs, and thus in some cases the instance is not retrieved. However, when queried with a decoy, the success rate drops to 15% with an average rank at the 92nd percentile of the hit list, indicating that only very partial solutions were retrieved.

We can go further by analyzing the structure of the retrieved hits. A first way to do so is to plot several hits with increasing ranks (Figure 4). A more quantitative way to do this is to compute the mean GED value of hits at fixed ranks compared to their respective queries. The results are presented in Table 3.

Rank       1st  10-th  100-th  1000-th  Random Other
Mean GED   ±    ±      ±       ±        ±
TABLE 3: Mean GED values between several queries and their hits at fixed ranks. We also included mean GED values to other random motifs as a control.

A visual inspection of the results indicates that the retrieved graphs differ more and more as we plot hits with decreasing scores. The quantitative experiment validates this result and indicates that on average the first ten hits are almost isomorphic, and even the first 100 are often very similar. Based on both of these results, we claim that our method is able to retrieve sets of subgraphs where the GED to the query correlates with the retrieval rank.

The average number of instances of a motif across RNA3dmotif, RNA 3D Motif Atlas, and CaRNAval is only 22.3. Interestingly, the fact that we are able to obtain up to 100 hits with a GED below 2 indicates that many of these represent an ensemble of highly similar structures that are missed by existing tools. This observation suggests that our method can be used not only to assess whether we find known instances of a motif, but also to identify fuzzy instances of these well-known motifs.
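One plausible way to compute the two quantities reported in Table 2 for a single query is sketched below. The definitions used here are assumptions, since the exact ones are not spelled out in this section: a query counts as a success if at least one known instance appears among the hits, and the relative rank is the mean position of the known instances divided by the length of the hit list.

```python
def retrieval_stats(hits, known_instances):
    """Success flag and mean relative rank of known instances in a hit list.

    hits: list of hit identifiers, ordered by decreasing score.
    known_instances: set of identifiers of the known motif instances.
    Both definitions are illustrative assumptions, not the paper's own.
    """
    positions = [i for i, hit in enumerate(hits) if hit in known_instances]
    if not positions:
        return False, None
    relative_rank = sum(positions) / (len(positions) * len(hits))
    return True, relative_rank
```

Under these definitions, a low relative rank means that the known instances are concentrated near the top of the hit list, as expected for a true query.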
Finally, we assess the quality of the MAA procedure to identify de novo motifs. Of course, there are many choices of hyperparameters which are ultimately application-dependent (fuzziness, motif frequency, size, etc.). We select the number of clusters according to the Silhouette Score and several clustering metrics (see Figure A.11) and require a minimum frequency of 100 instances per motif, as well as a maximum cluster spread of 0.4 in units of Euclidean distance. We obtain a set of 1,665 motifs up to cardinality 6. Supplementary Table A.7 shows the average number of instances and number of motifs at each cardinality. To check for internal consistency, we compute the intra- and inter-motif GED between a random sample of 20 motifs and plot the results in Figure 5. We obtain an intra-motif GED of 2.0 ± ± VeRNAl finds motifs with internal consistency.

Figure 5: Structural similarity (computed with full Graph Edit Distance) is significantly higher for subgraphs within the same motif compared to subgraphs in different motifs. (Plot legend: inter, intra; y-axis: Frequency.)

Next, we measure the degree to which our motif set agrees with existing motif databases. In Figure 6, we plot the percentage of the known motif's nodes (RNA 3D Motif Atlas (BGSU), CaRNAval, and RNA3DMotif) that can be found in any of our motifs of the same size. We find that a subset of our motifs aligns well (at least 60% overlap) with all databases (Table 4). At the same time, the VeRNAl motifs that match known motifs feature many more instances, again suggesting that we are able to expand the set of known motif instances.

Dataset          Covered  Missed
BGSU [7]         60       13
RNA3DMotif [5]   8        2
CaRNAval [6]     89       19
TABLE 4: Nearly all motifs identified by three published RNA motif tools are a subset of the motifs found by VeRNAl.

Finally, we find that 1,148 of our 1,665 motifs do not have any overlap with known motifs, indicating that our algorithm is uncovering novel motifs. An in-depth analysis of all individual instances is out of the scope of this contribution, but we plot a few examples in Figure 7. Nonetheless, all the motifs identified by VeRNAl can be browsed and downloaded on our web server vernal.cs.mcgill.ca, and are thus available to the community for further analysis.

Figure 4: Hit graphs with increasing rank to the query.
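The coverage criterion behind Table 4 can be sketched as a set overlap. This is an illustration under assumptions: motif instances are represented as plain collections of node identifiers, the function names are hypothetical, and the 60% threshold is taken from the text above.

```python
def best_overlap(known_nodes, candidate_instances):
    """Largest fraction of a known motif's nodes found in any candidate instance."""
    known = set(known_nodes)
    return max(
        (len(known & set(instance)) / len(known) for instance in candidate_instances),
        default=0.0,
    )


def is_covered(known_nodes, candidate_instances, threshold=0.6):
    """A known motif counts as covered if some instance reaches the overlap threshold."""
    return best_overlap(known_nodes, candidate_instances) >= threshold
```

Counting a motif as covered on partial overlap reflects the fuzzy setting: an instance need not reproduce every node of the known motif to be considered a match.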
5. Conclusions
We describe VeRNAl, a novel pipeline for identifying fuzzy network motifs. We develop various node structure comparison functions and approximate their feature map using an RGCN, embedding our graph dataset to a vector space for fast similarity computation between rooted subgraphs. We show that these computations correlate well with the RNA GED while being significantly faster. This enables us to find small structural building blocks of RNA and organize them into a meta-graph data structure.

Figure 7: Four instances of a VeRNAl motif.

Using this custom data structure, we introduce two algorithms to retrieve similar instances to a known query and to discover new motifs. We show that the retrieval procedure enables us to efficiently identify other instances of known motifs but also to find sets of subgraphs similar but not identical to a query. The motif extraction algorithm is also successful in mining sets of subgraphs with low intra-cluster GED, re-discovering and expanding known motifs as well as introducing new ones. Altogether, our platform VeRNAl is the first tool to propose fuzzy graph motif extraction.

There are some limitations to VeRNAl which can be addressed in the future. Since RGCNs perform convolutions of entire neighbourhoods, motifs without a wide-enough conserved core can be lost. Additionally, the motif building algorithm accepts all connected subgraphs belonging to a specific set of clusters as instances of the same motif. Because the specific manner in which root nodes are connected is not explicitly constrained, this can lead to instances of the same motif with very different topologies. Additional graph-level hashing [?] can eventually be used to distinguish these cases, or a larger embedding radius can mitigate the effects.

The main focus of this work is to build and validate the algorithm. A detailed exploration of the candidate motifs and the impact of the hyperparameters (fuzziness, density, size, etc.) is left for future work.

The algorithms introduced here are general, and the field of subgraph mining is still rapidly evolving. We believe VeRNAl could also be applied to other sources of data such as chemical compounds, protein networks, and gene expression networks to automatically mine for novel generalized structural patterns.
6. Implementation
The source code is available at vernal.cs.mcgill.ca. We also provide a flexible interface and a user-friendly webserver to browse and download our results.
7. Acknowledgements
The authors thank Vladimir Reinharz, Yann Ponty, Roman S. Gendron and Jacques Boitreaud for advice and support.
8. Funding
C.G. was funded by a PhD scholarship from Fonds de Recherche du Québec Nature et technologies. V.M. is funded by the INCEPTION project [PIA/ANR-16-CONV-0005] and benefits from support from the CRI through "Ecole Doctorale FIRE – Programme Bettencourt". J.W. is supported by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada.
References

[1] Neocles B Leontis, Aurelie Lescoute, and Eric Westhof. The building blocks and motifs of RNA architecture. Current opinion in structural biology, 16(3):279–287, 2006.
[2] Neocles B Leontis and Eric Westhof. Geometric nomenclature and classification of RNA base pairs. RNA, 7(4):499–512, 2001.
[3] Sam Griffiths-Jones, Alex Bateman, Mhairi Marshall, Ajay Khanna, and Sean R Eddy. Rfam: an RNA family database. Nucleic acids research, 31(1):439–441, 2003.
[4] Aurelie Lescoute, Neocles B Leontis, Christian Massire, and Eric Westhof. Recurrent structural RNA motifs, isostericity matrices and sequence alignments. Nucleic acids research, 33(8):2395–2409, 2005.
[5] Mahassine Djelloul. Algorithmes de graphes pour la recherche de motifs récurrents dans les structures tertiaires d’ARN. PhD thesis, 2009.
[6] Vladimir Reinharz, Antoine Soulé, Eric Westhof, Jérôme Waldispühl, and Alain Denise. Mining for recurrent long-range interactions in RNA structures reveals embedded hierarchies in network families. Nucleic acids research, 46(8):3841–3851, 2018.
[7] Anton I Petrov, Craig L Zirbel, and Neocles B Leontis. Automated classification of RNA 3D motifs and the RNA 3D motif atlas. RNA, 19(10):1327–1340, 2013.
[8] Roman Sarrazin-Gendron, Vladimir Reinharz, Carlos G Oliver, Nicolas Moitessier, and Jérôme Waldispühl. Automated, customizable and efficient identification of 3D base pair modules with bayespairing. Nucleic acids research, 47(7):3321–3332, 2019.
[9] James Roll, Craig L Zirbel, Blake Sweeney, Anton I Petrov, and Neocles Leontis. Jar3d webserver: Scoring and aligning RNA loop sequences to known 3D motifs. Nucleic acids research, 44(W1):W320–W327, 2016.
[10] Carlos Oliver, Vincent Mallet, Roman Sarrazin Gendron, Vladimir Reinharz, William L Hamilton, Nicolas Moitessier, and Jérôme Waldispühl. Augmented base pairing networks encode RNA-small molecule binding preferences. Nucleic Acids Research, 07 2020. gkaa583.
[11] Grzegorz Chojnowski, Tomasz Waleń, and Janusz M Bujnicki. RNA Bricks - a database of RNA 3D motifs and their interactions. Nucleic Acids Research, 42(D1):D123–D131, 2014.
[12] Ping Ge, Shahidul Islam, Cuncong Zhong, and Shaojie Zhang. De novo discovery of structural motifs in RNA 3D structures through clustering. Nucleic acids research, 46(9):4783–4793, 2018.
[13] Patrik D’haeseleer. What are DNA sequence motifs? Nature biotechnology, 24(4):423–425, 2006.
[14] Poul Nissen, Joseph A Ippolito, Nenad Ban, Peter B Moore, and Thomas A Steitz. RNA tertiary interactions in the large ribosomal subunit: the A-minor motif. Proceedings of the National Academy of Sciences, 98(9):4899–4903, 2001.
[15] Peter W Rose, Andreas Prlić, Ali Altunkaya, Chunxiao Bi, Anthony R Bradley, Cole H Christie, Luigi Di Costanzo, Jose M Duarte, Shuchismita Dutta, Zukang Feng, et al. The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic acids research, page gkw1000, 2016.
[16] Zhiping Zeng, Anthony K. H. Tung, Jianyong Wang, Jianhua Feng, and Lizhu Zhou. Comparing stars: On approximating graph edit distance. Proc. VLDB Endow., 2:25–36, 2009.
[17] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
[18] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.
[19] Jesse Stombaugh, Craig L Zirbel, Eric Westhof, and Neocles B Leontis. Frequency and isostericity of RNA base pairs. Nucleic acids research, 37(7):2294–2312, 2009.
[20] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.
[21] H. W. Kuhn and Bryn Yaw. The hungarian method for the assignment problem. Naval Res. Logist. Quart, pages 83–97, 1955.
[22] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
[23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[24] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. Deep graph library: Towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315, 2019.
[25] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance. Pattern Analysis and applications, 13(1):113–129, 2010.
[26] Horst Bunke and Kaspar Riesen. Graph classification based on dissimilarity space embedding. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 996–1007. Springer, 2008.
[27] Cuncong Zhong and Shaojie Zhang. RNAMotifScanX: a graph alignment approach for RNA structural motif identification. RNA, 21(3):333–346, 2015.
[28] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.

Appendix
1. RNA data
We present here the algorithm used to chop RNA into pieces of fixed maximal size. The idea of the algorithm is to recursively cut the RNA in halves until the maximum size is reached. This is detailed in Algorithm 4.

Algorithm 4: Chopper algorithm
Data: Full RNA graph g, maximum number of nodes N
Result: List of sub-structures of maximum size N
if |g| ≤ N then
    return {g}
else
    g_a, g_b ← split g in halves based on the PCA axes
    return chopper(g_a) ∪ chopper(g_b)
end

We include a representation of the notion of isostericity between edge types in Figure A.8.

Figure A.8: Isostericity matrix between relation types. (Axes: relations; colour scale: similarity.)
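The recursion of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration under assumptions that are not stated in the text: the graph is reduced to a dictionary mapping each node to a 3D coordinate (edges are ignored), the split is made at the median of the projections on the first principal axis, and that axis is estimated by power iteration. The names `principal_axis` and `chopper` are hypothetical.

```python
import math


def principal_axis(points, iters=100):
    """Return (mean, axis): axis approximates the leading eigenvector of the
    3x3 covariance matrix of the points, computed by power iteration."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 0.5, 0.25]  # arbitrary start vector (assumption)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate case: keep the current vector
            break
        v = [x / norm for x in w]
    return mean, v


def chopper(coords, max_nodes):
    """coords: dict node -> (x, y, z). Returns a list of node lists,
    each containing at most max_nodes nodes."""
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    nodes = list(coords)
    if len(nodes) <= max_nodes:
        return [nodes]
    mean, axis = principal_axis([coords[n] for n in nodes])

    def projection(n):
        return sum((coords[n][i] - mean[i]) * axis[i] for i in range(3))

    # Sorting and cutting by index guarantees two non-empty halves.
    ordered = sorted(nodes, key=projection)
    half = len(ordered) // 2
    left = {n: coords[n] for n in ordered[:half]}
    right = {n: coords[n] for n in ordered[half:]}
    return chopper(left, max_nodes) + chopper(right, max_nodes)
```

Cutting the sorted node list by index, rather than by a threshold on the projection, is a design choice of this sketch: it ensures termination even when many nodes share the same projection.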
2. Graph Edit Distance
The Graph Edit Distance (GED) between two graphs $G$ and $H$ is defined as follows:

$$\mathrm{GED}(G, H) = \min_{(o_1, \ldots, o_k) \in \Upsilon(G, H)} \sum_{i=1}^{k} c(o_i), \tag{4}$$

where $\Upsilon(G, H)$ is the set of all edit sequences which transform $G$ into $H$. Edit operations include node/edge matching, deletion, and insertion; $c(o)$ is the cost of performing edit operation $o$, and $c$ is known as the cost function. Since we will be decomposing our graphs as rooted subgraphs, we define a slight modification to the GED formulation which compares two graphs given that their respective roots must be matched to each other. This algorithm is detailed in Algorithm 5.

Algorithm 5: Rooted A* GED
Data:
  • Pair of graphs G, H (WLOG let G be the smaller of the two graphs)
  • cost function c
  • heuristic h
  • r_G ∈ N(G), root in first graph
  • r_H ∈ N(H), root in second graph
Result: Minimum cost rooted distance and alignment between two graphs.
OPEN ← priorityQueue()
V_G ← G.nodes()
V_H ← H.nodes()
v_1 ← first node in G
OPEN.add((r_G, r_H), c(r_G, r_H) + h(r_G, r_H))
while OPEN is not empty do
    v_min ← OPEN.pop()
    Let M_k ← {(v_1, v'_1), .., (v_k, v'_k)} be the partial mapping
    if |M_k| = |V_G| then
        return M_k    (mapping complete)
    end
    (add nodes at next depth)
    foreach u ∈ V_H \ v_min do
        OPEN.add(v_min ∪ (v_{k+1}, u), c(v_{k+1}, u) + h(v_{k+1}, u))
    end
end

We include in Figure A.9 an example of GED values for two pairs of graphlets, illustrating how similar graphs get lower distance values.

Figure A.9: Examples of similar and dissimilar pairs according to the GED A* algorithm.
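To make the rooted search of Algorithm 5 concrete, here is a simplified Python sketch. It is not the authors' implementation and rests on several assumptions: graphs are undirected dictionaries of the form node → {neighbour: edge label}, nodes are unlabelled, every edit (node or edge insertion, deletion, edge label substitution) has unit cost, and the heuristic h is taken to be zero, so the search reduces to a uniform-cost (best-first) search. The function name `rooted_ged` is hypothetical.

```python
import heapq
from itertools import count


def rooted_ged(G, H, root_g, root_h):
    """Best-first search over node mappings with the two roots forced to match.
    Returns (cost, mapping); a node of G mapped to None is deleted."""
    order = [root_g] + [v for v in G if v != root_g]
    tie = count()  # unique tie-breaker so the heap never compares mappings

    def edge_cost(mapping, v, u):
        # Edits on edges between the new pair (v, u) and all earlier pairs.
        cost = 0
        for w, x in mapping:
            label_g = G[v].get(w)
            both = u is not None and x is not None
            label_h = H[u].get(x) if both else None
            if label_g != label_h:
                cost += 1  # deletion, insertion or substitution
        return cost

    def completion_cost(mapping):
        # Nodes of H left unmatched must be inserted with their edges.
        used = {x for _, x in mapping if x is not None}
        missing = [x for x in H if x not in used]
        edges = {frozenset((x, y)) for x in missing for y in H[x]}
        return len(missing) + len(edges)

    start = ((root_g, root_h),)
    start_cost = completion_cost(start) if len(order) == 1 else 0
    heap = [(start_cost, next(tie), start)]
    while heap:
        cost, _, mapping = heapq.heappop(heap)
        if len(mapping) == len(order):
            return cost, dict(mapping)  # first complete mapping is optimal
        v = order[len(mapping)]
        used = {x for _, x in mapping}
        for u in [x for x in H if x not in used] + [None]:
            step = edge_cost(mapping, v, u) + (1 if u is None else 0)
            new = mapping + ((v, u),)
            if len(new) == len(order):
                step += completion_cost(new)
            heapq.heappush(heap, (cost + step, next(tie), new))
    raise ValueError("graphs must be non-empty")
```

Because all step costs are non-negative and the completion cost is added before a complete mapping is pushed, the first complete mapping popped from the queue has minimal cost. A non-trivial admissible heuristic would only reduce the number of expanded states.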
3. Similarity Functions
The first similarity functions we consider are a weighted sum of distances between the $l$-hop neighborhoods (aka rings) of two rooted subgraphs. We let $R_u^l = \{(u', w) : \delta(u, u') = l \;\; \forall (u', w) \in E\}$ be the set of edges at distance $l$ from the root node $u$. Let $d$ be a normalized similarity function between two sets of edges. Let $0 < \lambda < 1$ be a decay factor to assign higher weight to rings closer to the root nodes, and $N^{-1}$ be a normalization constant to ensure the function saturates at 1. Then we can obtain a structural similarity for the rooted subgraphs around $u$ and $v$ as:

$$k_L(u, v) := 1 - N^{-1} \sum_{l=0}^{L-1} \lambda^{l} \, d(R_u^l, R_v^l)$$

$$d(R_u, R_v) := \min_{X} \; - \sum_{e \in R_u} \sum_{e' \in R_v} S_{\mathcal{L}(e), \mathcal{L}(e')} X_{e, e'}$$

where the minimum is taken over assignments $X$ between the edges of the two rings.

The first function (R_1) simply uses a delta function to compare two different edges. This assignment problem thus reduces to computing the intersection over union score between the histograms $f_R$ of edge labels found at each ring. However, this function treats all edge types equally and ignores the isostericity relationships. The second function (R_iso) has a matching value of 1 for backbone edges matched with backbone edges, 0 for a backbone edge matched with any non-covalent bond, and the isostericity value for the similarity of two non-covalent bonds.
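As an illustration of the R_1 function, the decayed sum over rings can be sketched in Python. The sketch assumes that each rooted subgraph has already been summarised as a list of rings, where ring $l$ is the list of edge labels at distance $l$ from the root, that the per-ring distance is one minus the intersection over union of the label histograms, and that $N = \sum_l \lambda^l$. The function names and the edge labels used below are hypothetical.

```python
from collections import Counter


def histogram_iou(ring_u, ring_v):
    """Intersection over union of two multisets of edge labels."""
    counts_u, counts_v = Counter(ring_u), Counter(ring_v)
    intersection = sum((counts_u & counts_v).values())
    union = sum((counts_u | counts_v).values())
    return 1.0 if union == 0 else intersection / union


def ring_similarity(rings_u, rings_v, decay=0.5):
    """k_L(u, v) = 1 - (1/N) * sum_l decay**l * d(R_u^l, R_v^l),
    with d = 1 - IoU and N = sum_l decay**l (assumptions of this sketch)."""
    if not rings_u or len(rings_u) != len(rings_v):
        raise ValueError("both subgraphs need the same positive depth")
    weights = [decay ** l for l in range(len(rings_u))]
    distance = sum(w * (1.0 - histogram_iou(ru, rv))
                   for w, ru, rv in zip(weights, rings_u, rings_v))
    return 1.0 - distance / sum(weights)
```

With `decay=0.5`, two subgraphs that agree on the first ring and are disjoint on the second obtain a similarity of 1 − 0.5/1.5 = 2/3, which shows how the decay gives more weight to rings near the root.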
In the ring-based similarity functions, comparisons only happen between rings at the same depth from the root. This can be a harsh constraint: two rooted subgraphs can have a similar global structure, yet the choice of root node can shift the rings and yield a very low similarity. For the matching functions, we allow matches to occur across all distances from the root with an additional cost matrix $D \in \mathbb{R}^{R \times R}$, where $D_{i,j}$ contains the cost of matching elements whose distances to their roots are $i$ and $j$ ($D_{i,j} = e^{-|i-j|}$). Instead of thinking of matching only the edges within a ring, we can think of matching tuples of edge labels and distances.

$E_u = \{(\mathcal{L}(u, w), d(u, w)) \;\; \forall w \in G \mid d(u, w) \leq r\}$, where each element is a tuple composed of edge label and distance from the root for all nodes within the radius. Now the cost of matching a pair of edges is a function of their isostericity value as well as the distance between them.

$$d(E_u, E_v) := \min_{X} \sum_{e \in E_u} \sum_{e' \in E_v} \left( \alpha S_{\mathcal{L}(e), \mathcal{L}(e')} + \beta D_{e, e'} \right) X_{e, e'} \tag{5}$$

We can generalize the edge ring similarity function by considering the neighborhood of a node to consist of a multi-set of smaller graphs (known as graphlets) instead of simply edge labels. This allows us to consider more complex structural features than sets of edge labels allow for. Since the degree of our graphs is strongly bounded (max degree 5), we can define a graphlet as a rooted subgraph of radius 1 and obtain a manageable number of possible graphlets. Moreover, the rooted aspect and the small size of those graphs make the GED computation tractable. We can then make an analogy between graphlet identity and edge type to obtain the same formulation and solution as the edge similarity function.

While the GED computation is tractable for such small graphs, it is still expensive when repeated many times.
For this reason, we implement a solution caching strategy which stores the computed GED when it sees a new pair of graphlets, and looks up stored solutions when it recognizes a previously seen pair (Supplementary Algorithm 6).

We can now define $S$ from Equation 1 as $S_{ij} = \exp[-\gamma \, \mathrm{GED}(g_i, g_j)]$. We apply an exponential to the distances to bring them to the range (0, 1] and convert them to a similarity. An optional scaling parameter $\gamma$ is included to control the similarity penalty on more dissimilar graphs. We also note that the construction of $S$ can be parallelized, but we leave the implementation for future work.

We have experimented with several additional parameters. We tried including an Inverse Document Frequency (IDF) weighting to account for the higher frequency of non canonical interaction. This amounted to scaling each comparison value by the product of the IDF terms involved.

We also tried adding a re-normalization scheme to give higher values to matches of long rings. In particular, we want to express that having a match of 9 out of 10 elements is stronger than having a match of 2 out of 3. Let S be the raw matching score, S the normalized one and L be the length of the sequences. We have tried two normalization settings, the “sqrt” and “log” ones:

sqrt : S = [ S L ] √ L
log : S = [ S L ] L
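The caching strategy and the conversion $S_{ij} = \exp[-\gamma \, \mathrm{GED}(g_i, g_j)]$ can be sketched together in Python. The sketch assumes that the GED is symmetric, that a user-supplied `hash_fn` maps isomorphic graphlets to the same key, and that `ged_fn` computes the distance; these names, as well as `make_cached_similarity`, are hypothetical and stand in for whatever the actual pipeline uses.

```python
import math


def make_cached_similarity(ged_fn, hash_fn, gamma=1.0):
    """Build a similarity function exp(-gamma * GED) that stores each GED
    the first time a pair of graphlet hashes is seen (assumes symmetry)."""
    cache = {}

    def similarity(g_i, g_j):
        # Unordered key: (g_i, g_j) and (g_j, g_i) share one cache entry.
        key = frozenset((hash_fn(g_i), hash_fn(g_j)))
        if key not in cache:
            cache[key] = ged_fn(g_i, g_j)  # expensive call, done only once
        return math.exp(-gamma * cache[key])

    return similarity, cache
```

A usage example with a toy distance: with `ged_fn = lambda a, b: float(len(set(a) ^ set(b)))` and `hash_fn = lambda g: tuple(sorted(g))`, calling the returned function on `(a, b)` and then on `(b, a)` triggers a single distance computation.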
4. Graphlets hashing and distributions
To look up cached graphlet GED values, we build a hash function which maps isomorphic graphlets to the same output, while assigning different outputs to non-isomorphic ones. This is done by building a sparse representation of an explicit Weisfeiler-Lehman isomorphism kernel, with a twist that edge labels are included in the neighborhood aggregation step. The resulting hash consists of counts, over the whole graphlet, of hashed observed sequences of edge labels. We enforce the edge label hashing function to be permutation invariant by sorting the observed label sequence. In this manner, isomorphic graphs are given identical hash values regardless of node ordering. Our hashing procedure, outlined in Supplemental Algorithm 6, also allows us to study the distribution of graphlets composing RNA networks (Supplemental Fig A.10), where we can observe a characteristic power law distribution.
Algorithm 6: Weisfeiler-Lehman Edge Graphlet Hashing
Data:
  • Graphlet g
  • Maximum depth K
  • HASH, function from strings to integers
  • L, function returning the label for an edge
Result: Hash code h for graphlet g
h ← counter()
foreach k ∈ {1, .., K} do
    foreach u ∈ g_N do
        l_u^k ← HASH({L(u, v) ⊕ l_v^{k−1} ∀ v ∈ N(u)})
    end
    h ← h ∪ counter({l_u^k ∀ u ∈ g})
end
return h
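Algorithm 6 can be sketched in Python as below. The sketch makes assumptions that are not in the text: graphlets are undirected dictionaries of the form node → {neighbour: edge label}, all nodes start with the label 0, HASH is a truncated SHA-256 digest, and the final code is a frozen set of (label, count) pairs so that it can serve as a dictionary key. The names `stable_hash` and `wl_edge_hash` are hypothetical.

```python
import hashlib
from collections import Counter


def stable_hash(text):
    """HASH: deterministic map from strings to integers (truncated SHA-256)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest()[:12], 16)


def wl_edge_hash(graphlet, depth=2):
    """Weisfeiler-Lehman style hash that includes edge labels in the
    neighbourhood aggregation. Returns a hashable multiset of node labels."""
    labels = {u: 0 for u in graphlet}  # l^0_u, identical for every node
    code = Counter()
    for _ in range(depth):
        new_labels = {}
        for u in graphlet:
            # Sorting makes the aggregation invariant to neighbour ordering.
            sequence = sorted(f"{edge_label}|{labels[v]}"
                              for v, edge_label in graphlet[u].items())
            new_labels[u] = stable_hash(",".join(sequence))
        labels = new_labels
        code.update(labels.values())  # counts over the whole graphlet
    return frozenset(code.items())
```

Two graphlets that differ only by node names receive the same code. As with any Weisfeiler-Lehman scheme, distinct codes imply non-isomorphic graphs, but the converse is not guaranteed in general, so the code is best treated as a fast lookup key.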
5. Structural clusters
We present in Figure A.11 some metrics on the clustering of our structural embeddings. These metrics suggest that the structures present in RNA do fall into well-separated clusters.

Figure A.10: Graphlet distribution and examples. (a) Graphlet frequency distribution (number of graphlets against number of occurrences). (b) Most frequent graphlet. (c) Example of a rare graphlet.

Figure A.11: Clustering with K-means, k = 262. (a) Average distance to center. (b) Distance between randomly selected pairs of embeddings. (c) Number of nodes per cluster (removing clusters with more than 5000 nodes). (d) Nodes per cluster, unfiltered.
6. Retrieve complexity
In this section, we investigate the complexity of the retrieve algorithm. This complexity depends highly on the topology of the meta-graph, the query graph and the individual RNA graphs.

Let $N_p$ be the number of edges in each of the $p$ parallel edges of the query graph, and $E_p$ be the corresponding meta-edge. We consider the edges of the query graph in the order of the parallel edges, and let $p(t) = \min \{k : \sum_{j \leq k} N_j > t\}$ be the parallel edge considered at time $t$.

At each step $t$, the complexity bound depends on the number of candidate motifs inside each RNA graph at time $t-1$ as well as the number of possible additional edges to insert into those candidates. If we denote by $M_{g,t}$ the number of candidates in graph $g$ at time $t$, and by $N_{g,t}$ the number of edges in graph $g$ that belong to the meta-edge $E_{p(t),g}$, the complexity is $O(\sum_t \sum_{g \in \mathcal{G}} E_{p(t),g} M_{g,t})$. The term $E_{p(t),g}$ mostly acts as a sparsity term, as it would not exceed ten but can very often be zero if the graph does not include such an edge. Therefore, we introduce the notation $\mathcal{G}_p$, the set of RNA graphs that contain an edge in $E_p$, to omit this term. We also introduce $t_{g,t} = \sum_{l}$

7. Full results for the similarity function validations

We include in this section the full results of a grid search validation, in the absence of a better way to guide our intuition.
Method       depth  decay  normalization  r_exp  r_threshold  r_nc   r_nc_threshold  time
R_1          1      0.500  None           0.630  0.630        0.386  0.397           0.000
R_1          1      0.500  sqrt           0.630  0.630        0.386  0.397           0.000
R_1          1      0.300  None           0.630  0.630        0.386  0.397           0.000
R_1          1      0.800  sqrt           0.630  0.630        0.386  0.397           0.000
R_1          1      0.300  sqrt           0.630  0.630        0.386  0.397           0.000
R_1          1      0.800  None           0.630  0.630        0.386  0.397           0.000
R_1          1      0.800  sqrt           0.633  0.627        0.320  0.334           0.000
R_1          1      0.500  None           0.633  0.627        0.320  0.334           0.000
R_1          1      0.800  None           0.633  0.627        0.320  0.334           0.000
R_1          1      0.500  sqrt           0.633  0.627        0.320  0.334           0.000
R_1          1      0.300  sqrt           0.633  0.627        0.320  0.334           0.000
R_1          1      0.300  None           0.633  0.627        0.320  0.334           0.000
hungarian    1      NaN    sqrt           0.726  0.732        0.474  0.482           0.001
R_iso        1      0.800  sqrt           0.757  0.758        0.494  0.503           0.000
R_iso        1      0.500  sqrt           0.757  0.758        0.494  0.503           0.000
R_iso        1      0.300  sqrt           0.757  0.758        0.494  0.503           0.000
hungarian    1      NaN    sqrt           0.758  0.764        0.527  0.530           0.001
R_iso        1      0.300  sqrt           0.758  0.760        0.510  0.515           0.000
R_iso        1      0.500  sqrt           0.758  0.760        0.510  0.515           0.000
R_iso        1      0.800  sqrt           0.758  0.760        0.510  0.515           0.000
R_iso        1      0.500  None           0.768  0.777        0.467  0.483           0.000
R_iso        1      0.300  None           0.768  0.777        0.467  0.483           0.000
R_iso        1      0.800  None           0.768  0.777        0.467  0.483           0.000
R_iso        1      0.800  None           0.769  0.783        0.526  0.533           0.000
R_iso        1      0.300  None           0.769  0.783        0.526  0.533           0.000
R_iso        1      0.500  None           0.769  0.783        0.526  0.533           0.000
hungarian    1      NaN    None           0.790  0.799        0.488  0.499           0.001
hungarian    1      NaN    None           0.791  0.800        0.612  0.612           0.000
R_graphlets  1      0.300  sqrt           0.940  0.942        0.890  0.894           0.029
R_graphlets  1      0.800  sqrt           0.940  0.942        0.890  0.894           0.029
R_graphlets  1      0.500  sqrt           0.940  0.942        0.890  0.894           0.029
R_graphlets  1      0.500  None           0.967  0.973        0.948  0.962           0.029
R_graphlets  1      0.300  None           0.967  0.973        0.948  0.962           0.029
R_graphlets  1      0.800  None           0.967  0.973        0.948  0.962           0.029
graphlet     1      NaN    None           0.967  0.973        0.948  0.962           0.030
graphlet     1      NaN    sqrt           0.996  0.997        0.992  0.995           0.030

TABLE A.5: One hop correlation to GED

Method       depth  decay  normalization  r_exp  r_threshold  r_nc   r_nc_threshold  time
R_1          2      0.300  None           0.375  0.444        0.241  0.328           0.000
R_1          2      0.300  sqrt           0.375  0.444        0.241  0.328           0.000
R_1          2      0.300  sqrt           0.381  0.440        0.246  0.315           0.000
R_1          2      0.300  None           0.381  0.440        0.246  0.315           0.000
R_iso        2      0.300  None           0.385  0.469        0.236  0.336           0.000
R_iso        2      0.300  None           0.402  0.481        0.233  0.337           0.000
R_graphlets  2      0.300  None           0.405  0.462        0.314  0.378           0.352
R_1          2      0.500  sqrt           0.407  0.480        0.264  0.356           0.000
R_1          2      0.500  None           0.407  0.480        0.264  0.356           0.000
R_iso        2      0.500  None           0.412  0.511        0.259  0.376           0.000
R_iso        2      0.300  sqrt           0.414  0.490        0.259  0.343           0.000
R_1          2      0.500  None           0.416  0.477        0.276  0.351           0.000
R_1          2      0.500  sqrt           0.416  0.477        0.276  0.351           0.000
R_iso        2      0.300  sqrt           0.416  0.492        0.256  0.344           0.000
R_graphlets  2      0.500  None           0.429  0.496        0.334  0.406           0.356
R_iso        2      0.500  None           0.434  0.525        0.261  0.385           0.000
R_1          2      0.800  None           0.438  0.516        0.285  0.382           0.000
R_1          2      0.800  sqrt           0.438  0.516        0.285  0.382           0.000
R_iso        2      0.800  None           0.438  0.551        0.280  0.416           0.000
R_iso        2      0.500  sqrt           0.451  0.540        0.288  0.391           0.000
R_1          2      0.800  None           0.451  0.515        0.304  0.387           0.000
R_1          2      0.800  sqrt           0.451  0.515        0.304  0.387           0.000
R_iso        2      0.500  sqrt           0.454  0.543        0.286  0.395           0.000
hungarian    2      NaN    None           0.455  0.620        0.292  0.499           0.001
R_graphlets  2      0.800  None           0.456  0.535        0.356  0.437           0.357
R_iso        2      0.800  None           0.465  0.568        0.288  0.434           0.000
graphlet     2      NaN    None           0.470  0.574        0.347  0.462           0.430
R_iso        2      0.800  sqrt           0.488  0.589        0.316  0.440           0.000
hungarian    2      NaN    None           0.491  0.622        0.261  0.437           0.001
R_iso        2      0.800  sqrt           0.493  0.593        0.316  0.445           0.000
R_graphlets  2      0.300  sqrt           0.505  0.533        0.405  0.417           0.353
hungarian    2      NaN    sqrt           0.533  0.666        0.320  0.521           0.001
R_graphlets  2      0.500  sqrt           0.543  0.576        0.444  0.459           0.356
hungarian    2      NaN    sqrt           0.554  0.686        0.277  0.472           0.001
graphlet     2      NaN    sqrt           0.568  0.637        0.437  0.518           0.433
R_graphlets  2      0.800  sqrt           0.587  0.626        0.487  0.506           0.356

TABLE A.6: Two hop correlation to GED