AligNet: Alignment of Protein-Protein Interaction Networks


R. Alberich, A. Alcalá, M. Llabrés, F. Rosselló and G. Valiente

February 20, 2019

Abstract

One of the most difficult problems in systems biology is to discover protein-protein interactions as well as their associated functions. The analysis and alignment of protein-protein interaction networks (PPIN), which are the standard model to describe protein-protein interactions, have become a key ingredient to obtain functional orthologs as well as evolutionarily conserved assembly pathways and protein complexes. Several methods have been proposed to solve the PPIN alignment problem, aimed at matching conserved subnetworks or functionally related proteins. However, the right balance between considering network topology and biological information is one of the most difficult and key points in any PPIN alignment algorithm which, unfortunately, remains unsolved. Therefore, in this work, we propose AligNet, a new method and software tool for the pairwise global alignment of PPIN that produces biologically meaningful alignments and more efficient computations than state-of-the-art methods and tools, by achieving a good balance between structural matching and protein function conservation as well as reasonable running times.

Background

The regulation of cellular processes is one of the most enigmatic topics in cell biology [17]. The activity of cellular life relies on the proper functioning of the extremely complex networks of interactions among numerous intracellular constituents. It is clear that proteins are the most active participants in those processes [14], and this is the reason why researchers invest so much effort in the study of proteins, from their shape classification to their interaction networks, trying to unveil their function.

The increasing amount of available data in almost all biological research areas and, in particular, on protein structure and protein-protein interaction networks (PPIN), encourages data analysis as an important tool to derive biological meaning. To understand the mechanisms of protein-protein recognition at the molecular level and to unravel the global picture of their interactions in the cell, many experimental techniques have been developed so far. Some methods characterize individual protein interactions, while others screen for interactions at a genome-wide scale. Some good recent surveys on this topic are [31, 34, 16].

In order to analyze and contrast the available data on PPIN, several pairwise alignment algorithms have been defined in the last 15 years. The early alignment algorithms detected similarities between small subnetworks [15, 19, 21, 22, 25]. PathBLAST [15] is a tool developed to search for specific pathways in a PPIN. In contrast to PathBLAST, NetAlign [22] is a web-based tool designed to identify the conserved network substructures between two PPIN. The method used in this tool and in the algorithm presented in [25] is the matching of isomorphic subgraphs.
The pairwise alignment method introduced in [19], MaWISh, produces a local alignment of two PPIN by evaluating the similarity of their graph structures through a scoring function that accounts for evolutionary events, while the algorithm introduced in [21] to align PPIN is based on both protein sequence similarity and network topology similarity, and it uses integer quadratic programming. In addition, a new and efficient approach to obtain multiple local alignments is described in [9], based on conserved functional modules. All these methods and algorithms are able to detect and obtain similarities between subnetworks. Actually, the aim of local alignment algorithms is to find regions with the same network structure in the networks under comparison. For every region in one network, an alignment with some region in the other network may be obtained, but it may happen that these local alignments are mutually inconsistent, because the same protein in one network may be matched by different local alignments to different proteins in the other network: then, as a final result, it may happen that these local alignments cannot be extended to a global alignment of the pair of PPIN.

In contrast to these local alignment methods, a global alignment algorithm is aimed at finding the best overall alignment between whole PPIN [8]. A global alignment is then a matching between the sets of proteins of two PPIN, in such a way that each protein in one network is matched to one, and only one, protein in the other network. The motivation to perform a global network alignment is to compare interactomes, and to understand cross-species variations [11]. Global network alignment is also related to the detection of functional orthologs [35]. The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences.
Nevertheless, there are often proteins in a network that have no biologically meaningful correspondence in another network and, thus, a meaningful alignment between a pair of networks should not necessarily cover all of them.

The first algorithm for the global alignment of PPIN was IsoRANK [35]. This algorithm produces a matching between a pair of input networks, based on the idea that a protein in one network should be matched to a protein in the other network if, and only if, the neighbors of the two proteins can also be matched. In order to obtain the matching, the algorithm associates a score with each possible pair of nodes of the two networks, capturing the similarity of their neighborhoods. Then, the highest scoring matching is obtained. Thus, the final result of the IsoRANK algorithm is a global alignment of two networks, but the idea behind this algorithm is that two networks are similar if the network topologies are similar. However, since nodes in these networks correspond to proteins, the protein similarity should also be taken into account when matching two proteins; for instance, if two proteins share a similar sequence and have a similar topology in the corresponding networks, then their matching probability should be higher than for those with very different sequences.

The right balance between network topology and biological information is one of the most difficult and key points in any PPIN alignment algorithm. In fact, IsoRANK considers the possibility of taking biological information of the nodes (proteins) into account, by tuning a parameter in the score values. Several other algorithms for the global alignment of PPIN have been proposed based on the idea that "two nodes are similar if their corresponding neighbors are so," and hence considering mainly network topology but also some biological features [22, 1, 26, 28, 13]. As a result, some of them obtain a high number of conserved interactions but a very low functional consistency between the matched proteins [26, 28].
Indeed, as stated in [7], after performing a comparison of existing algorithms for the pairwise alignment of PPIN, the evaluated algorithms have dramatic differences in the quality of the alignments they produce, either yielding good topological or good biological matchings, but few of them do well in both aspects. In addition, they are not efficient from the computational point of view: for some of them, the software system is not well organized and they can run out of memory or spend a lot of time. Moreover, even when they do produce alignments, these tend to be meaningless, since the coincidences among them are very poor.

Consider also the analysis of eight recent aligners (NATALIE [18], SPINAL [1], PISwap [5], MAGNA [36], HubAlign [13], L-GRAAL [24], OPTNET [6], and ModuleAlign [12]) performed in [23], where several topological scores (node coverage, topological coherence, induced conserved substructure and symmetric substructure) and different biological coherence scores (KEGG pathway annotations, Gene Ontology annotation) are considered. The authors of this study conclude that the agreement between the alignments produced by any two different aligners is very low (around 20%) and also that the topological scores are not in agreement with the biological coherence of the alignments. Moreover, when the alignment process is guided by topological information only, the aligners produce alignments with the highest topological coherence but the lowest biological coherence. In contrast, when alignments are guided by sequence information only, they produce alignments with the highest biological coherence but the lowest topological coherence. This becomes extremely inconvenient in those aligners where the user has to choose the value of a parameter in order to specify the desired balance between the topological and the sequence similarities.

Therefore, the choice of the right alignment tool depends on the purpose of the alignment itself.
If the aim of the alignment is to infer biological information about relationships between the proteins in the networks, then the aligner with the highest functional coherence score must be considered. If the user is interested in finding conserved network substructures, then the aligner with the highest topological score must be considered. However, it should be noticed that when we try to obtain a global alignment between two topologically different networks, like for instance a dense network with a high number of interactions and a sparse network with a low number of interactions, the maximum expected value of any topological score is very low. Thus, the topological score as a measure of alignment correctness should be considered mainly to detect small conserved subnetworks, while, instead, the functional coherence score should be considered as the best measure of correctness in a global network alignment, whose goal is the detection of functional orthologs.

Motivated by the lack of well-balanced and efficient algorithms, we have designed AligNet, a parameter-free PPIN alignment algorithm aimed at filling the gap between efficient topologically and biologically meaningful matchings. The overall idea of the algorithm is to obtain many local alignments that are combined and extended into a meaningful global alignment. The final alignment captures the benefits of considering both categories of alignments. With the local alignments we capture the topological similarity between the networks and we speed up the running time of the algorithm, while with the final global alignment we solve the inconsistencies among the local alignments and yield an overall alignment of the pair of input PPIN. The results obtained with AligNet and with the best aligners compared in [7] and in [23] show that AligNet indeed achieves a good balance between topological and biological matching.
In the tests reported in this paper, AligNet obtained the highest functional coherence scores in most of the alignments, which means that it maximized the functional consistency between aligned proteins, and also a reasonable fraction of conserved interactions. In addition, HubAlign and AligNet had the best running times among all the aligners considered in the aforementioned tests.

Methods

Protein-protein interaction networks as graphs

A protein-protein interaction network (PPIN) is modelled in a natural way as a graph, with its nodes representing the network's proteins and its edges, the interactions between them. Moreover, the interaction between two proteins is considered a symmetric property, that is, if a protein $p_1$ interacts with another protein $p_2$, then it is tacitly understood that $p_2$ also interacts with $p_1$. Hence, PPIN are specifically modelled by means of undirected graphs. In this way, the problem of aligning pairs of PPIN is translated into the problem of aligning pairs of undirected graphs with their nodes injectively labelled by proteins.

Formally, an (undirected) graph is a structure $G = (V, E)$ with $V$ a finite set of nodes and $E$ a family of 2-element subsets $\{u, v\}$ of $V$, called the edges of the graph; recall that, as sets, $\{u, v\} = \{v, u\}$. We say that an edge $e = \{u, v\}$ connects the nodes $u$ and $v$, and also that $e$ is incident to $u$ and $v$. The nodes $v$ such that $\{u, v\} \in E$ are the neighbors of $u$. We shall denote by $N_G(u)$ the set of neighbors of $u$ in $G$.

We introduce now some further definitions and notation that will be used throughout this paper. Let $G = (V, E)$ be an undirected graph.

• The degree of a node $u \in V$ is the number of edges incident to it; that is, the cardinality of $N_G(u)$. We denote it by $\deg(u)$.

• A path between two nodes $u, v \in V$ is a sequence of pairwise different edges $\{u, u_1\}, \{u_1, u_2\}, \ldots, \{u_{k-1}, u_k\}, \{u_k, v\}$ such that the first and last edges are incident to $u$ and $v$, respectively, and every pair of consecutive edges share a node (different from $u$ and $v$, in the case of the first and last edges, respectively). Two nodes are connected when there exists a path between them. The length of a path is the number of edges forming it, and its intermediate nodes are $u_1, \ldots, u_k$.

• For every pair of connected nodes $u, v \in V$, their distance in $G$ is the length of a shortest path connecting them. We denote it by $d_G(u, v)$.
• The diameter of $G$ is the maximum distance between any two connected nodes in $G$. We denote it by $D(G)$.

Figure 1 displays two toy PPIN that will be used as a running example throughout this section. The first network consists of 8 nodes and 9 edges, while the second network consists of 9 nodes and 17 edges.

[Figure 1 image. Nodes of the first network: dm247, dm6389, dm11070, dm11454, dm2171, dm11644, dm10450, dm8158. Nodes of the second network: hs59, hs399, hs3857, hs5638, hs553, hs5433, hs6992, hs12566, hs12206.]

Figure 1: This figure shows two small pieces of PPINs that we shall use to visualize the performance of AligNet. The subnetworks belong to the Drosophila melanogaster (dme) and the Homo sapiens (hsa) PPINs contained in the IsoBase database. The first network has 8 proteins and 9 interactions, and the second network has 9 proteins and 17 interactions. The diameter of the first network is 4 and the diameter of the second network is 3.
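The graph notions defined above (neighbors, degree, distance and diameter) can be computed with a breadth-first search. The following Python lines are only an illustrative sketch, not part of the AligNet implementation (which is written in R): the function names and the four-node toy graph are hypothetical, and nodes that take part in no edge are simply not represented.

```python
from collections import deque

def neighbors(edges):
    """Build the map u -> N_G(u) from a list of undirected edges {u, v}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def distances_from(adj, source):
    """Breadth-first search: d_G(source, v) for every v connected to source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """D(G): the maximum distance over all pairs of connected nodes."""
    return max(max(distances_from(adj, u).values()) for u in adj)

# Hypothetical toy graph: a path a-b-c-d plus the extra edge b-d.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]
toy_adj = neighbors(toy_edges)
toy_degree_b = len(toy_adj["b"])       # deg(b)
toy_diameter = diameter(toy_adj)       # D(G)
```

Because unreachable nodes never enter the dictionary returned by `distances_from`, the maximum in `diameter` ranges over connected pairs only, as in the definition of $D(G)$.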

The structure of the AligNet algorithm

AligNet receives as input two graphs $G$ and $G'$ representing two PPIN (in particular, each node of them is labeled with a protein, in such a way that different nodes in a graph correspond to different proteins) and produces, as output, a similarity score for them and a local and a global alignment between them. AligNet has been implemented in R [30], and the implementation is freely available from http://bioinfo.uib.es/~recerca/AligNet/ .

The main steps in AligNet are:

1. The computation of overlapping clusterings $\mathcal{C}(G)$ and $\mathcal{C}(G')$, respectively, of the input networks $G$ and $G'$.

2. The computation of alignments between pairs of clusters in $\mathcal{C}(G)$ and $\mathcal{C}(G')$.

3. The computation of a matching between $\mathcal{C}(G)$ and $\mathcal{C}(G')$.

4. The computation of a local alignment of the input networks $G$ and $G'$.

5. The extension of this local alignment to a meaningful global alignment.

Throughout this section, $G = (V, E)$ and $G' = (V', E')$ will denote two graphs representing the input PPIN. We shall identify each node in any of these graphs with the protein it represents.

Step 1. Overlapping clusterings

The first step in AligNet consists in computing an overlapping clustering of each input network. These clusterings are based on a specific similarity score $s(u, v)$ between pairs of proteins (nodes) $u, v$ in a PPIN, which is defined as follows: for every pair of connected nodes $u, v$ in a graph $G$ representing a PPIN,

$$s(u, v) = B(u, v) + \frac{D(G) + 1 - d_G(u, v)}{D(G) + 1},$$

where:

• $D(G)$ is the diameter of $G$ and $d_G(u, v)$ is the distance between $u$ and $v$.
• $B(u, v)$ is the normalized bit score of the proteins associated to the nodes $u$ and $v$, that is, a rescaled version of their alignment score obtained with BLAST+, which is independent of the size of the search space [4].

If $u, v$ are not connected by a path, then $s(u, v) = 0$.

The intuition behind this similarity score is that two proteins are similar if they have similar sequences and they are relatively close to each other in the graph. Recall that two proteins interact when there is an edge connecting them, that is, when their distance is 1. Therefore, the plausibility that two proteins have a related biological function increases when they are close to each other in the graph.

To obtain the overlapping clustering of an input network, we define a cluster centered at every node of the graph as follows. Let $\alpha$ be the third quartile of the distribution of the similarity score values of pairs of nodes: that is, $\alpha$ is the value for which only 25% of the pairs of nodes $(u, v)$ are such that $s(u, v) > \alpha$. Then, for every node $u \in V$, the cluster $C_u$ in $G$ centered at $u$ is

$$C_u = \{ v \in V \mid s(u, v) > \alpha \}.$$

We denote by $\mathcal{C}(G)$ the set of clusters of a PPIN $G$.

So, the first step of AligNet computes the overlapping clusterings $\mathcal{C}(G)$ and $\mathcal{C}(G')$ of the input networks $G$ and $G'$. As a running example throughout this section we will consider two small networks. Figure 2 displays the first PPI network considered as a running example as well as its overlapping clustering. This first network consists of 8 nodes and 9 edges, so there are 8 clusters. Figure 2 also displays the second PPI network, which consists of 9 nodes and 17 edges, and whose overlapping clustering has 9 clusters.

Step 2. Alignments between pairs of clusters
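The clusters aligned in this step are the ones built in Step 1 from the score $s(u, v)$ and the threshold $\alpha$. As a reminder of that construction, here is a minimal Python sketch under several assumptions that are not taken from the paper: bit scores are supplied in a hypothetical dictionary keyed by unordered pairs (missing pairs count as $B = 0$), the quartile is taken over unordered pairs of distinct nodes with the default method of `statistics.quantiles`, and every center is placed in its own cluster.

```python
from collections import deque
from itertools import combinations
from statistics import quantiles

def bfs_distances(adj, source):
    """d_G(source, v) for every node v connected to source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def similarity_scores(adj, bit_score):
    """s(u, v) for every unordered pair of distinct nodes."""
    dist = {u: bfs_distances(adj, u) for u in adj}
    diam = max(max(d.values()) for d in dist.values())      # D(G)
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        pair = frozenset((u, v))
        if v in dist[u]:
            topo = (diam + 1 - dist[u][v]) / (diam + 1)
            scores[pair] = bit_score.get(pair, 0.0) + topo  # assumed B = 0 if absent
        else:
            scores[pair] = 0.0                              # no path between u and v
    return scores

def overlapping_clusters(adj, bit_score):
    """One cluster C_u per node u: the nodes v with s(u, v) > alpha."""
    scores = similarity_scores(adj, bit_score)
    alpha = quantiles(scores.values(), n=4)[2]              # third quartile
    clusters = {u: {u} for u in adj}   # assumption: the center is in its cluster
    for pair, s in scores.items():
        if s > alpha:
            u, v = tuple(pair)
            clusters[u].add(v)
            clusters[v].add(u)
    return clusters

# Hypothetical toy network: the path a-b-c-d, with one sequence-similar pair.
toy_adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
toy_bits = {frozenset(("a", "b")): 0.5}
toy_clusters = overlapping_clusters(toy_adj, toy_bits)
```

In this toy network only the pair (a, b), which is both adjacent and sequence-similar, exceeds the threshold, so the clusters of a and b coincide while c and d end up alone in theirs.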

In this second step, AligNet computes an alignment between every pair of clusters $C_u \in \mathcal{C}(G)$ and $C_{u'} \in \mathcal{C}(G')$ such that $B(u, u') > 0$. That is, if we assume that $B(u, u') > 0$ for every $u \in V$ and $u' \in V'$, then AligNet computes $|V| \cdot |V'|$ alignments. These alignments define an alignment score between every such pair of clusters that will be used in the third step to compute a matching between $\mathcal{C}(G)$ and $\mathcal{C}(G')$.

The general idea to obtain the alignment between a pair of clusters $C_u \in \mathcal{C}(G)$ and $C_{u'} \in \mathcal{C}(G')$ (with $B(u, u') > 0$) is the following: we first match the centers of the clusters, that is, we match $u$ with $u'$ and then, we match the neighbors of $u$ to the neighbors of $u'$. To decide the matching of the neighbors, we take into account their sequence similarity and their degrees. Thus, a neighbor of $u$ is matched to a neighbor of $u'$ provided that they have similar sequences and also similar degrees. Following the same criteria, we match the neighbors of the neighbors of $u$ with the neighbors of the neighbors of $u'$. We iterate this process until no unmatched neighbors are found. In the intermediate steps we keep the node matching in a list of pairs denoted by $L_{u,u'}$. When the algorithm terminates, $L_{u,u'}$ provides a partial mapping between the nodes in $C_u$ and the nodes in $C_{u'}$.

Figure 2: This figure shows the overlapping clusterings of the PPINs in Figure 1 obtained by AligNet: the 8 clusters of the network on the left of Figure 1, and the 9 clusters of the network on the right. The center of every cluster is highlighted in blue. Since we have considered two small pieces of a PPIN, the first cluster on the left is the entire piece of network, and so is the second cluster on the right. Notice that we obtain the whole piece of the network when we consider the cluster of a node that is in the center of the network.

Thus, the alignment between a pair of clusters $C_u \in \mathcal{C}(G)$ and $C_{u'} \in \mathcal{C}(G')$ (with $B(u, u') > 0$) can be formally defined as follows:

(i) Match $u$ with $u'$. Set $L_{u,u'} = \{(u, u')\}$, $L^{(1)}_{u,u'} = \{u\}$ and $L^{(2)}_{u,u'} = \{u'\}$.

(ii) For every $v \in C_u \cap N_G(u)$ and for every $v' \in C_{u'} \cap N_{G'}(u')$, let

$$F(v, v') = |\deg(v) - \deg(v')| - B(v, v') + 1.$$

Compute a matching $M_{u,u'} \subseteq (C_u \cap N_G(u)) \times (C_{u'} \cap N_{G'}(u'))$ that minimizes $\sum_{(v,v') \in M_{u,u'}} F(v, v')$. Sort the pairs in $M_{u,u'}$ in decreasing order of their $F$ value, and concatenate them to $L_{u,u'}$. Add their first coordinates to $L^{(1)}_{u,u'}$ and their second coordinates to $L^{(2)}_{u,u'}$.

(iii) Iterate step (ii), replacing $(u, u')$ by the rest of the pairs in $L_{u,u'}$ and removing from $C_u$ and $C_{u'}$ the nodes already aligned. More specifically, in the $k$-th iteration, take the $k$-th element $(v_0, v_0')$ of $L_{u,u'}$. For every $w \in (C_u \setminus L^{(1)}_{u,u'}) \cap N_G(v_0)$ and every $w' \in (C_{u'} \setminus L^{(2)}_{u,u'}) \cap N_{G'}(v_0')$, compute $F(w, w')$. Then, compute a matching

$$M_{v_0,v_0'} \subseteq \big( (C_u \setminus L^{(1)}_{u,u'}) \cap N_G(v_0) \big) \times \big( (C_{u'} \setminus L^{(2)}_{u,u'}) \cap N_{G'}(v_0') \big)$$

that minimizes $\sum_{(v,v') \in M_{v_0,v_0'}} F(v, v')$. Sort the pairs forming $M_{v_0,v_0'}$ in decreasing order of their $F$ value, and concatenate them to $L_{u,u'}$. Add their first coordinates to $L^{(1)}_{u,u'}$ and their second coordinates to $L^{(2)}_{u,u'}$.

The matchings in step (ii) as well as in each iteration in step (iii) are computed with the Hungarian algorithm [20].
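Steps (i) to (iii) can be traced on a toy input with the Python sketch below. It is a hedged illustration rather than the authors' R code: the Hungarian algorithm is replaced by an exhaustive search that is only usable on tiny clusters, each matching is assumed to pair as many candidates as possible (the smaller side is fully matched), bit scores come from a hypothetical dictionary keyed by (node of $G$, node of $G'$) with missing pairs read as $B = 0$, and all names are invented for the example.

```python
from itertools import permutations

def min_cost_matching(left, right, cost):
    """Exhaustive minimum-cost matching that pairs min(|left|, |right|) nodes.

    Stand-in for the Hungarian algorithm; exponential, for tiny inputs only.
    """
    left, right = sorted(left), sorted(right)
    if not left or not right:
        return []
    best, best_cost = [], float("inf")
    if len(left) <= len(right):
        for perm in permutations(right, len(left)):
            pairs = list(zip(left, perm))
            total = sum(cost(a, b) for a, b in pairs)
            if total < best_cost:
                best, best_cost = pairs, total
    else:
        for perm in permutations(left, len(right)):
            pairs = list(zip(perm, right))
            total = sum(cost(a, b) for a, b in pairs)
            if total < best_cost:
                best, best_cost = pairs, total
    return best

def align_clusters(adj1, adj2, c1, c2, u, u2, bit_score):
    """Grow the list L_{u,u'} of matched pairs outward from the centers."""
    def F(v, v2):
        # F(v, v') = |deg(v) - deg(v')| - B(v, v') + 1
        return abs(len(adj1[v]) - len(adj2[v2])) - bit_score.get((v, v2), 0.0) + 1

    pairs = [(u, u2)]              # step (i): the list L
    done1, done2 = {u}, {u2}       # L^(1) and L^(2)
    k = 0
    while k < len(pairs):          # steps (ii) and (iii)
        v, v2 = pairs[k]
        cand1 = (c1 - done1) & adj1[v]
        cand2 = (c2 - done2) & adj2[v2]
        matching = min_cost_matching(cand1, cand2, F)
        matching.sort(key=lambda p: F(*p), reverse=True)   # decreasing F
        pairs.extend(matching)
        done1.update(a for a, _ in matching)
        done2.update(b for _, b in matching)
        k += 1
    return pairs

# Hypothetical toy clusters with the same shape in both networks.
adj1 = {"u": {"a", "b"}, "a": {"u", "c"}, "b": {"u"}, "c": {"a"}}
adj2 = {"x": {"y", "z"}, "y": {"x", "w"}, "z": {"x"}, "w": {"y"}}
toy_alignment = align_clusters(adj1, adj2, set(adj1), set(adj2),
                               "u", "x", {("a", "y"): 0.9})
```

On this input the centers u and x are matched first, then their neighbors (a with y and b with z, listed in decreasing order of $F$), and finally c with w, which is reached through the already aligned pair (a, y).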
Figure 3 shows an example of the alignment of a pair of clusters: one cluster from the first network and another cluster from the second network.

The overall idea behind the algorithm described above is that a node $v$ in $C_u$ should be matched to a node $v'$ in $C_{u'}$ when they have a similar topological role in the cluster and similar sequences, provided that, furthermore, there exist paths connecting the cluster centers $u$ and $u'$ with $v$ and $v'$, respectively, such that their intermediate nodes are already aligned in sequential order along the paths. The alignment procedure gives priority to matching neighbors of nodes $x, x'$ at the shortest possible distance from the respective cluster centers and with $F(x, x')$ as large as possible among those pairs already matched at their same iterative step.

The resulting alignment $L_{u,u'}$ defines a partial injective mapping $\eta_{u,u'} : C_u \to C_{u'}$. The nodes in $C_u$ that are matched to nodes in $C_{u'}$ form the domain of the mapping $\eta_{u,u'}$, which is denoted by $\mathrm{Dom}\,\eta_{u,u'}$.

Step 3. Matching between families of clusters

Let $\mathcal{A} = \{ \eta_{u,u'} \mid u \in V,\ u' \in V',\ B(u, u') > 0 \}$ be the set of alignments obtained in step 2. The score of every alignment $\eta_{u,u'} \in \mathcal{A}$ is defined as

$$\mathrm{Score}(\eta_{u,u'}) = \frac{\sum_{v \in \mathrm{Dom}\,\eta_{u,u'}} B(v, \eta_{u,u'}(v))}{|\mathrm{Dom}\,\eta_{u,u'}|} + \frac{|\mathrm{Dom}\,\eta_{u,u'}|}{\max_{\eta_{w,w'} \in \mathcal{A}} |\mathrm{Dom}\,\eta_{w,w'}|},$$

where $|X|$ stands for the number of elements in the set $X$. This score assesses simultaneously the average similarity of the sequences of the proteins matched by $\eta_{u,u'}$ and their number.

Once all these scores have been computed, AligNet obtains a matching between $\mathcal{C}(G)$ and $\mathcal{C}(G')$ by considering a bipartite graph where the nodes are the clusters in $\mathcal{C}(G)$ and $\mathcal{C}(G')$, the edges correspond to alignments $\eta_{u,u'} \in \mathcal{A}$, and the weight of an edge connecting $C_u$ with $C_{u'}$ is the corresponding score $\mathrm{Score}(\eta_{u,u'})$. The matching between the nodes in $\mathcal{C}(G)$ and the nodes in $\mathcal{C}(G')$ is then obtained by applying the maximum weighted bipartite matching algorithm to this bipartite graph. Recall that the nodes in $\mathcal{C}(G)$ are the clusters in $G$ and the nodes in $\mathcal{C}(G')$ are the clusters in $G'$. The solution to the maximum weighted bipartite matching problem provides us with a matching between the clusters in $G$ and the clusters in $G'$. We shall denote by $\mathcal{C}$ the set of partial injective mappings $\eta_{u,u'}$ corresponding to pairs of clusters $(C_u, C_{u'})$ that are matched by this matching. Figure 4 shows the matching between the two families of clusters in Figure 2 obtained in this step.

Figure 3: This figure shows how AligNet aligns two clusters, which corresponds to Step 2 of our algorithm. The clusters in this example are, respectively, the first in the list of clusters of $G$, which are shown on the left in Figure 2, and the seventh in the list of clusters of $G'$, which are shown on the right in Figure 2. We show in the picture all the steps needed to align the cluster of $G$ with the cluster of $G'$. From top to bottom in this figure, we can see that AligNet first aligns the centers of the clusters, which are the nodes highlighted in blue. Then, AligNet aligns the neighbors of the centers (second row). Next, AligNet aligns the neighbors of the neighbors. In each step we show in dashed lines the nodes that are already aligned and in solid lines the nodes that are aligned in the present step. Notice that, in this example, there are two nodes that remain unmatched.

Step 4. Local alignment of PPIN

In this step, AligNet produces a local alignment between $G$ and $G'$ from the matching between $\mathcal{C}(G)$ and $\mathcal{C}(G')$ obtained in the previous step. The main idea is to define this alignment by merging the partial injective mappings $\eta_{u,u'} \in \mathcal{C}$. The problem is that these mappings may be inconsistent, because $\mathcal{C}(G)$ and $\mathcal{C}(G')$ are overlapping clusterings. Indeed, it may happen that a node $w$ belongs to more than one cluster $C_u$, and that the corresponding mappings $\eta_{u,u'} \in \mathcal{C}$ send $w$ to different nodes in $G'$; conversely, a node $w'$ belonging to more than one cluster $C_{u'}$ may be the image of different nodes in $G$.

Figure 4: This figure shows the final assignment between the clusters in Figure 2 produced by AligNet, which corresponds to Step 3. Each of the eight clusters obtained from $G$ is aligned to one, and only one, of the nine clusters obtained from $G'$. Hence, one cluster from $G'$ remains unmatched, which is the second cluster in the third row on the right in Figure 2. In this figure, we show the clusters from $G$ on the left and their corresponding cluster images from $G'$ on the right.
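The cluster assignment illustrated in Figure 4 is the output of Step 3: every cluster alignment receives a score, and a maximum weighted bipartite matching over those scores selects the pairs of clusters. The following Python sketch illustrates both parts under stated assumptions: alignments are given as a hypothetical dictionary from pairs of cluster centers to lists of matched node pairs, bit scores as a dictionary keyed by node pairs (missing pairs read as $B = 0$), and the matching is found by exhaustive recursion, a stand-in that is only practical for a handful of clusters.

```python
def alignment_scores(alignments, bit_score):
    """Score(eta) = mean bit score of matched pairs + relative size of Dom(eta)."""
    largest = max(len(pairs) for pairs in alignments.values())
    scores = {}
    for centers, pairs in alignments.items():
        mean_bit = sum(bit_score.get(p, 0.0) for p in pairs) / len(pairs)
        scores[centers] = mean_bit + len(pairs) / largest
    return scores

def match_clusters(scores):
    """Exhaustive maximum-weight bipartite matching between cluster centers."""
    left = sorted({u for u, _ in scores})

    def best_from(i, used):
        if i == len(left):
            return 0.0, []
        # Option 1: leave the cluster centered at left[i] unmatched.
        best_w, best_m = best_from(i + 1, used)
        # Option 2: match it to any still free cluster it was aligned with.
        for (u, u2), w in scores.items():
            if u == left[i] and u2 not in used:
                sub_w, sub_m = best_from(i + 1, used | {u2})
                if w + sub_w > best_w:
                    best_w, best_m = w + sub_w, [(u, u2)] + sub_m
        return best_w, best_m

    return best_from(0, frozenset())[1]

# Hypothetical toy input: three cluster alignments between centers {a, b} and {x, y}.
toy_alignments = {
    ("a", "x"): [("a", "x"), ("b", "y")],
    ("a", "y"): [("a", "y")],
    ("b", "x"): [("b", "x")],
}
toy_bits = {("a", "x"): 0.5, ("b", "y"): 0.5, ("a", "y"): 0.8, ("b", "x"): 0.8}
toy_scores = alignment_scores(toy_alignments, toy_bits)
toy_matching = match_clusters(toy_scores)
```

In this toy input the best-scored single alignment is the one between the clusters of a and x, but choosing it would leave the cluster of b unmatched; the optimal matching pairs a with y and b with x instead, which is why a global matching is used rather than a greedy choice.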
To overcome this problem, we consider a weighted bipartite hypergraph whose nodes are the nodes in $G$ and in $G'$, where every mapping $\eta_{u,u'}$ is a hyperarc with source its domain and target its image, and the weight of every hyperarc is the score $\mathrm{Score}(\eta_{u,u'})$. Then, the solution of the weighted bipartite hypergraph assignment problem provides a well-defined local alignment of the input networks. However, in order to decrease the computation time of AligNet, we do not consider all the mappings $\eta_{u,u'}$ together, but just a subset $\mathcal{R}$ of them that is recursively increased until all mappings $\eta_{u,u'}$ have been considered. Thus, AligNet builds recursively a subset $\mathcal{R} \subseteq \mathcal{C}$ of best-scored alignments, by choosing, at each step, a mapping $\eta_{w_0,w_0'} \in \mathcal{C}$ with $w_0$ not belonging to the union of the domains of the mappings $\eta_{w,w'}$ already in $\mathcal{R}$ and with maximum $\mathrm{Score}(\eta_{w_0,w_0'})$ among all such mappings. AligNet iterates this procedure until every node in $\bigcup_{\eta_{u,u'} \in \mathcal{C}} \mathrm{Dom}\,\eta_{u,u'}$ belongs to the domain of some mapping in $\mathcal{R}$. In Figure 5 we give the subset $\mathcal{R}$ of $\mathcal{C}$ for the networks in our running example.

Now, consider the directed hypergraph $H$ with nodes $V \cup V'$ and hyperarcs the mappings $\eta_{u,u'} \in \mathcal{R}$: each $\eta_{u,u'}$ is understood as a hyperarc with source its domain and target its image. Then, AligNet obtains from this hypergraph a well-defined local alignment between $G$ and $G'$ as a solution of the corresponding weighted bipartite hypergraph assignment problem [3]. Figure 6 shows the local alignment obtained from the hypergraph corresponding to Figure 5.

Step 5. Global meaningful alignment of PPIN

In order to extend the local alignment produced in the previous step, AligNet iterates the following procedure:

• It removes the nodes in $G$ and $G'$ that have already been aligned, and it recomputes the score of each alignment $\eta_{u,u'}$ following the same definition as in Step 3, but only taking into account the remaining nodes in its domain and image.

• It computes a new optimal matching $\mathcal{C}$ between $\mathcal{C}(G)$ and $\mathcal{C}(G')$, as in step 3, but using as edges those $\eta_{u,u'}$ whose updated score is positive, and as weights these updated scores.

• It computes a new set $\mathcal{R}$ of best-scored alignments $\eta_{u,u'}$ with $\mathrm{Score}(\eta_{u,u'}) > 0$, as in step 4.

• It defines a new directed hypergraph $H$ whose nodes are the nodes in $V \cup V'$ not yet aligned and whose hyperarcs are the mappings $\eta_{u,u'}$ in the new set $\mathcal{R}$, understood as hyperarcs with source the still unaligned nodes in their domain and target the still unaligned nodes in their image.

• It computes a local alignment between unaligned nodes in $V$ and $V'$ by solving the weighted bipartite hypergraph assignment problem for this hypergraph, and it adds this local alignment to the alignment obtained so far.

This procedure is iterated while there exist nodes not aligned belonging to the domain or the image of some alignment $\eta_{u,u'}$ with (updated) positive score. In Figure 7 we show the final global meaningful alignment obtained with AligNet for the networks in our running example.

Evaluation of alignment quality

Several methods have been proposed to evaluate the quality of an alignment and to compare the performance of PPIN aligners [7, 23]. One of the handicaps when considering biological data is that the true alignment is unknown and, therefore, to evaluate the alignment quality one cannot count the number of true and false positives and true and false negatives. However, several measures of alignment quality have already been proposed, which are divided into two categories, topological coherence and biological coherence. The topological coherence measures evaluate the topological similarity of the aligned regions considering the edge correctness, which is the percentage of conserved edges, the induced conserved substructure, which is the percentage of preserved edges, and the symmetric substructure score, which is the percentage of preserved and conserved edges. The biological coherence measures evaluate the protein function similarities of the aligned proteins by considering the KEGG pathway annotation, which measures the percentage of proteins aligned to proteins that participate in the same pathway, and the Gene Ontology annotation, which measures the similarity between the GO terms of a protein and its image. In the comparison of eight recent aligners reported in [23], it is stated that there is a strong correlation between the three topological coherence measures and also a strong correlation between the two biological coherence measures. Therefore, when evaluating the alignment quality, it is enough to consider one of the three measures for topological coherence and one of the two biological coherence measures. However, there is a low correlation between the topological and the biological coherence measures. Thus, to evaluate the alignment quality, we considered both the edge correctness and the Gene Ontology annotation measures, which are defined as follows:

Let $G = (V, E)$ and $G' = (V', E')$ be two PPIN such that $|V| \leq |V'|$. The edge correctness ratio of a mapping $\mu : G \to G'$ is the ratio of the edges that are preserved by $\mu$, and it is defined by

$$EC(\mu) = \frac{\big| \big\{ \{u, v\} \in E \mid \{\mu(u), \mu(v)\} \in E' \big\} \big|}{\min\{|E|, |E'|\}}.$$

The functional coherence value, or GO consistency, of a mapping $\mu : G \to G'$ is defined as

$$FC(\mu) = \frac{\sum_{u \in V} FS(u, \mu(u))}{|V|},$$

where the similarity score $FS$ is defined by

$$FS(u, u') = \frac{|GO(u) \cap GO(u')|}{|GO(u) \cup GO(u')|},$$

with $GO(u)$ and $GO(u')$ the sets of GO annotations of the proteins $u$ and $u'$, respectively.

Results and Discussion

In this section we report the tests performed to assess the performance of AligNet. These tests have been designed taking into account the study and results reported in [7] and [23], where a comparison of algorithms for the pairwise [...]

Organism          Nodes   Edges (with loops)   Edges (without loops)
M. musculus       623     776                  559
C. elegans
S. cerevisiae
D. melanogaster
H. sapiens

Network data

We have considered the same dataset used in [7], so that it makes sense to compare the results obtained by AligNet with the results reported therein. Thus, we have downloaded from the IsoBase database [27] the PPIN (version 1.0.2) of five organisms: M. musculus (mus), C. elegans (cel), D. melanogaster (dme), S. cerevisiae (sce), and H. sapiens (hsa), and aligned each pair of them with AligNet. The numbers of nodes and edges of these PPIN are shown in Table 1.

Comparison to other aligners

Quantitative analysis

In order to evaluate the quality of the alignments produced by AligNet, and to compare it with that of the aforementioned aligners, we have used both the edge correctness (EC) and the functional coherence (FC) metrics defined in the previous section.

As reported in [7], an edge correctness ratio of 100% should not be taken as conclusive evidence of a correct network alignment, because it is always possible that two biologically unrelated edges have been mapped to each other. In addition, it is reported in [23] that the alignment of a dense network with a sparse network produces a low edge correctness ratio. This situation is reflected, for instance, in the alignment between S. cerevisiae and D. melanogaster, since there are more edges in the source network (sce) than in the target network (dme). Thus, an edge correctness ratio of 20% should be considered evidence of an incorrect network alignment only when the number of edges of the source network is smaller than the number of edges in the target network. Therefore, given the limitations of this topological measure of alignment quality, measures of agreement derived from biological information are also popular in the literature. In particular, most papers on PPIN alignment make use of gene ontology annotations from the Gene Ontology (GO) database [2] to measure alignment accuracy, by comparing the similarity of GO annotations between aligned proteins. Hence, we have also used the functional coherence value, or GO consistency, introduced in the previous section, to assess the quality of our alignments.

Net1  Net2  AligNet  HubAlign  L-GRAAL  PINALOG  SPINAL
mus   cel   0.58     0.81      0.79     0.34     0.01
mus   sce   0.65     0.97      0.68     0.56     0.05
mus   dme   0.65     0.88      0.70     0.30     0.03
mus   hsa   0.76     0.95      0.77     0.62     0.24
cel   sce   0.24     0.83      0.38     0.30     0.06
cel   dme   0.31     0.68      0.53     0.18     0.01
cel   hsa   0.31     0.77      0.43     0.23     0.01
sce   dme   0.03     0.01      0.08     0.19     0.03
sce   hsa   0.04     0.03      0.13     0.19     0.04
dme   hsa   0.13     0.37      0.31     0.13     0.01

Table 2: Edge Correctness (EC) scores obtained by the considered aligners.

In order to compare the EC scores and the FC scores of all the considered aligners for every pair of PPIN, we ran all the aligners on the same pairs of input networks. However, as already stated in previous studies of existing aligners [7, 23], some difficulties appear when trying to do this. More precisely, there were computations that never stopped; this was the case of NATALIE. L-GRAAL matches the network with the smallest number of edges to the network with the largest number of edges, which means that it interchanges the order of the input networks in the alignments between S. cerevisiae and D. melanogaster and between S. cerevisiae and H. sapiens. Also, for most of the aligners, some parameters must be fixed. We used the default parameters of every aligner whenever it was possible. With L-GRAAL, a time limit or a maximum number of steps must also be set; we again used the values suggested by default in its implementation.

In Table 2 and Table 3 we report the results obtained with this test. We can observe there that the alignments of small networks with a low number of edges, such as M. musculus, produced high EC scores, especially when the target network has a large number of edges. However, even in this case, the EC scores obtained with the aligners PINALOG and SPINAL are not high. It is very surprising that SPINAL preserved less than 10% of the edges in the source network in all the alignments except the one between M. musculus and H. sapiens, where only 24% of the edges present in M. musculus were preserved. On the other hand, PINALOG obtained much more reasonable scores than SPINAL but, in the best scenario, that is, when the alignment is between M. musculus and the other networks, PINALOG preserved less than 40% of the edges in two alignments (M. musculus with C. elegans and D. melanogaster), while the remaining aligners (HubAlign, L-GRAAL and AligNet) preserved about 60% or more of the edges present in M. musculus. The best scores in this case are obtained by HubAlign, which preserves more than 80% of the edges, followed by L-GRAAL, which preserves about 70% of the edges, and then AligNet, which preserves about 60% of the edges. However, we can also observe here that, when the number of edges in the source network increases, the EC scores decrease dramatically, even in the case of HubAlign. When considering the alignments between S. cerevisiae and D. melanogaster or H. sapiens, we observed that all aligners, even L-GRAAL, which interchanges the input networks, obtained less than 10% of the edges matched in the target network.

Therefore, we can conclude that the analysis of the obtained results reveals that the EC score is only a measure of the topological relation between the input networks. When the source network is smaller and has far fewer edges than the target network, a high EC score should be expected. However, when the source network is similar to the target network or when it has a higher number of edges, a very low EC score should be expected. Overall, this clearly implies that another definition of alignment correctness must be considered.

As far as the functional coherence goes, we report the obtained results in Table 3. We can observe there that all the aligners obtained a very low score. However, we cannot conclude that all aligners have a low biological coherence, because it is not clear whether the low value is due to the alignment itself or to the measure of biological coherence. Therefore, we tried to obtain the maximum value of the FC score that can be expected for every pair of networks. To obtain this value we performed the following test: for every pair of input networks, we considered a complete bipartite graph where the nodes are the proteins in the two input networks, the edges are all protein pairs consisting of a protein in the source network and a protein in the target network, and the weight of each edge is the FS score of the corresponding pair of proteins. Then, we obtain the maximum FC score, $FC_{max}$, as the FC score of the solution to the maximum weighted bipartite matching problem. Hence, for every pair of networks, we can compare the FC score obtained by each aligner with the maximum score. We define the relative biological coherence as the ratio between the FC score and $FC_{max}$. That is,

\[ FC_{rel} = \frac{FC}{FC_{max}}. \]

Now, if we look at the results presented in Table 3, we can observe that the values of $FC_{max}$ range from 0.19 to 0.[...] M. musculus and H. sapiens we obtain that $FC_{max} = 0.54$, which means that, on average in most of the alignments, the best alignment from the functional coherence point of view maps correctly only 20% of the proteins. Considering now the results obtained by the different aligners, we can observe that the order from the highest to the lowest scores is almost the opposite of the order obtained when considering the EC scores. That is, the best scores are achieved by SPINAL, followed by PINALOG, AligNet, HubAlign and L-GRAAL. However, even with SPINAL only about 10% of the proteins are matched correctly, considering the GO term measure of biological coherence. In addition, the best possible alignment would be able to align correctly only 20% of the proteins. We discard the hypothesis that the GO term measure is low due to the lack of GO terms since, as we show in Table 4, for every pair of networks more than 90% of the protein pairs have their GO terms annotated, and there is no correlation between the number of annotated GO terms and the FC scores. Therefore, since the FC scores are not conclusive of meaningful biological alignments, we performed an additional test, explained below.
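As an illustration only, the EC and FC measures used in this quantitative analysis can be computed directly from their definitions. The following Python sketch is not the AligNet implementation: the function names, the representation of the mapping $\mu$ and of the GO annotations as dictionaries, the handling of unmapped or unannotated proteins, and the toy data are all assumptions made for this example.

```python
def edge_correctness(edges_g, edges_h, mu):
    """EC(mu): edges {u, v} of G whose image {mu(u), mu(v)} is an edge of G',
    divided by min(|E|, |E'|)."""
    e_g = {frozenset(e) for e in edges_g}
    e_h = {frozenset(e) for e in edges_h}
    preserved = 0
    for edge in e_g:
        # Assumption: an edge with an unmapped endpoint is not preserved.
        if all(node in mu for node in edge):
            if frozenset(mu[node] for node in edge) in e_h:
                preserved += 1
    return preserved / min(len(e_g), len(e_h))


def functional_similarity(go_u, go_v):
    """FS(u, u'): Jaccard index of the two sets of GO annotations."""
    union = go_u | go_v
    # Assumption: two proteins without annotations get similarity 0.
    return len(go_u & go_v) / len(union) if union else 0.0


def functional_coherence(nodes_g, go_g, go_h, mu):
    """FC(mu): sum of FS(u, mu(u)) over the nodes of G, divided by |V|."""
    total = sum(
        functional_similarity(go_g.get(u, set()), go_h.get(mu[u], set()))
        for u in nodes_g
        if u in mu  # assumption: unmapped nodes contribute 0
    )
    return total / len(nodes_g)


# Toy data (hypothetical protein and annotation names).
edges_g = [("a", "b"), ("b", "c")]
edges_h = [("x", "y"), ("y", "z"), ("x", "z")]
mu = {"a": "x", "b": "y", "c": "z"}
go_g = {"a": {"t1", "t2"}, "b": {"t3"}, "c": set()}
go_h = {"x": {"t1"}, "y": {"t3"}, "z": {"t4"}}

ec = edge_correctness(edges_g, edges_h, mu)
fc = functional_coherence(["a", "b", "c"], go_g, go_h, mu)
print(ec, fc)
```

In this toy case both edges of the smaller network are preserved, while only part of the annotations agree, which mirrors the low correlation between topological and biological measures discussed above.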

Qualitative analysis

As stated in [7], the evaluation of any aligner should be done considering also their quality and not only the quantity, that is, the accuracy of the alignment and not only the number of preserved edges or GO terms. The accuracy of any aligner is easily tested when there is a gold standard to compare to. With this idea in mind, we have considered the following test to study the quality of our alignments in contrast with the quality of the alignments obtained by PINALOG, HubAlign and L-GRAAL. Since the results obtained by SPINAL in the previous test were not convincing, we do not consider it in the next test.

                    AligNet        HubAlign       L-GRAAL        PINALOG        SPINAL
Net1  Net2  FC_max  FC – FC_rel    FC – FC_rel    FC – FC_rel    FC – FC_rel    FC – FC_rel
mus   cel   0.21    0.06 – 0.26    0.04 – 0.21    0.03 – 0.17    0.10 – 0.50    0.12 – 0.60
mus   sce   0.24    0.08 – 0.33    0.07 – 0.30    0.04 – 0.18    0.12 – 0.50    0.15 – 0.64
mus   dme   0.19    0.05 – 0.24    0.03 – 0.16    0.03 – 0.13    0.07 – 0.42    0.06 – 0.33
mus   hsa   0.54    0.23 – 0.42    0.26 – 0.49    0.10 – 0.18    0.48 – 0.90    0.10 – 0.20
cel   sce   0.20    0.06 – 0.33    0.03 – 0.14    0.04 – 0.21    0.13 – 0.70    0.19 – 0.99
cel   dme   0.23    0.04 – 0.18    0.02 – 0.07    0.02 – 0.09    0.09 – 0.42    0.09 – 0.42
cel   hsa   0.24    0.04 – 0.17    0.02 – 0.08    0.03 – 0.13    0.08 – 0.35    0.08 – 0.36
sce   dme   0.24    0.05 – 0.19    0.07 – 0.30    0.02 – 0.07    0.07 – 0.31    0.10 – 0.43
sce   hsa   0.26    0.06 – 0.24    0.08 – 0.30    0.02 – 0.07    0.09 – 0.29    0.11 – 0.45
dme   hsa   0.20    0.04 – 0.18    0.02 – 0.11    0.02 – 0.08    0.09 – 0.41    0.08 – 0.43

Table 3: Functional Coherence scores obtained by the considered aligners.

Net1  Net2  FC_max  NotAvailableGOpairs  AvailableGOpairs  Percentage
mus   cel   0.21    1                    598               99.83
mus   sce   0.24    0                    599               100.00
mus   dme   0.19    2                    597               99.67
mus   hsa   0.54    1                    598               99.83
cel   sce   0.20    20                   2954              99.33
cel   dme   0.23    256                  2718              91.39
cel   hsa   0.24    102                  2872              96.57
sce   dme   0.24    45                   5478              99.19
sce   hsa   0.26    14                   5509              99.75
dme   hsa   0.20    0.19                 7126              96.47

Table 4: This table shows the relation between available GO pairs for every pair of networks.
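The $FC_{max}$ and $FC_{rel}$ columns of Table 3 rely on a maximum weighted bipartite matching over FS scores. The sketch below illustrates the idea on a toy instance; it is an assumption-laden example rather than the procedure actually run for the paper. In particular, it enumerates all injective maps by brute force, which is only feasible for a handful of proteins (a polynomial method such as the Hungarian method [20] would be needed for real networks), and all function and variable names are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """FS score of two GO annotation sets (0 when both are empty, by assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fc_max(go_src, go_tgt):
    """Largest FC value over all injective maps from source to target proteins.

    Brute force over permutations: exponential, for illustration only.
    """
    src = list(go_src)
    tgt = list(go_tgt)
    if not src or len(src) > len(tgt):
        raise ValueError("the source network must be non-empty and not larger than the target")
    best = 0.0
    for image in permutations(tgt, len(src)):
        total = sum(jaccard(go_src[u], go_tgt[v]) for u, v in zip(src, image))
        best = max(best, total)
    return best / len(src)


def fc_rel(fc, fc_maximum):
    """Relative biological coherence FC / FC_max."""
    return fc / fc_maximum if fc_maximum else 0.0


# Toy annotations (hypothetical names).
go_src = {"a": {"t1", "t2"}, "b": {"t3"}}
go_tgt = {"x": {"t3"}, "y": {"t1"}, "z": {"t4"}}

best_possible = fc_max(go_src, go_tgt)
# FC of one particular alignment, a -> y and b -> z.
fc_alignment = (jaccard(go_src["a"], go_tgt["y"]) + jaccard(go_src["b"], go_tgt["z"])) / 2
print(best_possible, fc_rel(fc_alignment, best_possible))
```

Dividing the rounded entries of Table 3 in the same way gives values close to the reported ones (for instance 0.48 / 0.54 is roughly 0.89 for PINALOG on mus-hsa, against the 0.90 in the table); the small differences are presumably due to the table having been computed from unrounded scores.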

Protein complex prediction

In order to test the behavior of AligNet in the alignment of protein complexes, we also performed the protein complex prediction test reported in [29] using PINALOG. Following the procedure explained therein, we considered the database MIPS CORUM [32] for the human protein complexes and, as a gold standard for the yeast complexes, we considered the information available in [10]. In addition, we considered the functional information available in MIPS CORUM for the human complexes and in MIPS FunCat [33] for the yeast complexes. Then, we considered the overlapping score of complexes introduced in [29], and we defined a functional coherence value for the alignment of protein complexes, as explained below.

Let $G = (V, E)$ and $G' = (V', E')$ be two PPIN such that $|V| \leq |V'|$, let $\mu : G \to G'$ be a mapping, and let $c \subseteq V$ and $c' \subseteq V'$ be two protein complexes in $G$ and $G'$, respectively. The overlapping score of $c$ and $c'$ is defined as

\[ OS(c, c') = \frac{|\{ u \in c \mid \mu(u) \in c' \}|}{\min(|c|, |c'|)}. \]

To define a functional coherence value for the protein complex alignment, every protein complex $c$ in $G$ is first mapped to a protein complex $c'$ in $G'$ provided that $OS(c, c')$ lies over a threshold, fixed at 0.[...] $c$ in $G$ we consider the complex $c'$ in $G'$ such that $OS(c, c')$ is maximum.

As was the case with the edge correctness ratio, an overlapping score of pairs of complexes of 100% need not be evidence of a correct network alignment, because every protein complex is supposed to develop several biological functions, and the alignment may establish a correspondence between two complexes that are completely unrelated from the point of view of their function.

             AligNet  HubAlign  PINALOG  L-GRAAL
NotAssigned  1269     1154      945      996
NotCoherent  377      589       626      741
Coherent     128      31        203      37
CFC          25.34    5         24.48    4.75

Table 5: This table shows the number of complexes that are not assigned, assigned incorrectly (not coherent) and assigned correctly (coherent), together with the CFC value of each aligner.

The main point here is that the aim of the alignment should be clearly stated. If it only aims at matching similar topological substructures of the networks, in order to detect those substructures that appear in both networks, then maximizing the sum of the overlapping scores of pairs of complexes may be a suitable goal. However, if the alignment searches for pairs of proteins that share biological functions, then only those complexes with a common function should be matched. Since the main application of PPIN alignment is to infer biological functions of proteins and protein complexes, it is very important that the alignment does not match biologically unrelated complexes. Therefore, we define the complex functional coherence of an alignment between PPIN as follows. First, a pair of complexes, one in each network, is said to be coherent if they share some biological function; otherwise, the pair is incoherent. Then, the complex functional coherence value (CFC) of the alignment is defined by the complex alignment precision, that is, the ratio of complexes that are aligned correctly with respect to the aligned complexes. If we denote by $CP$ the number of coherent pairs and by $NCP$ the number of incoherent pairs, then

\[ CFC = \frac{CP}{CP + NCP} \times 100. \]

[...].48. However, the CFC values obtained in the alignments produced by L-GRAAL and HubAlign are lower. They are 4.[...].19 for AligNet and 0.17 for PINALOG. Therefore, the only difference that can be observed between these aligners is that PINALOG aligns more complexes than AligNet. If it aligns more complexes, and its precision is nearly the same as the precision of AligNet, then PINALOG has more incorrect alignments than AligNet and also more correct alignments. In this sense, AligNet is a more conservative aligner than PINALOG, although its precision is slightly higher than the precision of PINALOG.

               AligNet vs PINALOG  Ratio  PINALOG vs AligNet  Ratio
Not Assigned   815                 86.24  815                 64.22
Not Coherent   105                 11.11  375                 29.55
Coherent       25                  2.65   79                  6.23

Table 8: This table shows how AligNet assigned the complexes that are not assigned by PINALOG and conversely.

We present in Figure 8 a visualization of the obtained results when we contrasted AligNet with the other aligners. We present there the ratio of unaligned complexes, correctly aligned complexes (coherent pairs) and incorrectly aligned complexes (incoherent pairs). We can observe that HubAlign versus AligNet (second bar from the left) as well as L-GRAAL versus AligNet (first bar from the right) obtain a higher proportion of incoherent pairs and a lower proportion of coherent pairs. In contrast, AligNet versus PINALOG and PINALOG versus AligNet (the two bars in the center) obtain a similar proportion of correctly and incorrectly aligned pairs.

As a result of the comparison between the aligners, we obtain again, as was the case in [7] and [23], that the agreement of the alignments obtained by different aligners is very low. The majority of the global aligners achieve a high node coverage, meaning that the average of assigned nodes in the source network is high, but all of them obtain a very low biological coherence value. With respect to the topological coherence value, some aligners are able to obtain a high score, but it is always associated with a low biological coherence score.
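The overlapping score and the complex functional coherence value used in this test can be sketched as follows. This is only an illustrative Python example: the function names and the toy complexes are hypothetical, the threshold is left as a parameter, and the convention that the best score must be strictly greater than the threshold is an assumption.

```python
def overlapping_score(c, c_prime, mu):
    """OS(c, c'): proteins of c mapped by mu into c', over min(|c|, |c'|)."""
    mapped = sum(1 for u in c if u in mu and mu[u] in c_prime)
    return mapped / min(len(c), len(c_prime))


def best_match(c, target_complexes, mu, threshold):
    """Index of the target complex with maximum OS, or None if no target
    complex scores above the threshold (strict comparison, by assumption)."""
    best_index, best_score = None, threshold
    for i, c_prime in enumerate(target_complexes):
        score = overlapping_score(c, c_prime, mu)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def cfc(coherent_pairs, incoherent_pairs):
    """CFC = CP / (CP + NCP) * 100, the precision of the aligned complexes."""
    aligned = coherent_pairs + incoherent_pairs
    return 100.0 * coherent_pairs / aligned if aligned else 0.0


# Toy example (hypothetical proteins and complexes).
mu = {"a": "x", "b": "y", "c": "z"}
source_complex = {"a", "b", "c"}
targets = [{"x", "y"}, {"z", "w", "v"}]

chosen = best_match(source_complex, targets, mu, threshold=0.5)
print(chosen, overlapping_score(source_complex, targets[0], mu))

# Counts of coherent and not coherent pairs for AligNet taken from Table 5.
print(round(cfc(128, 377), 2))
```

Applied to the Coherent and NotCoherent rows of Table 5, the `cfc` function gives values that agree with the CFC row up to rounding in the second decimal.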
Overall, we can conclude that AligNet is the aligner that obtains a better balance between topological coherency (it preserves 60% of the edges) and functional coherency (relative functional coherency values between 20% and 40% and the highest complex functional coherency score, 25.[...]).

Aligners analysis

In order to study the efficiency of the considered aligners, we take into account the running time and the memory space they need to perform an alignment. We ran our implementation of AligNet on a server with 4 processors at 2.6 GHz and 20 GB of RAM, and we also ran the latest implementations of NATALIE (downloaded from ), PINALOG (downloaded from ), SPINAL (downloaded from http://code.google.com/p/spinal/ ), HubAlign (downloaded from http://ttic.uchicago.edu/hashemifar/software/HubAlign.zip ) and L-GRAAL (downloaded from http://bio-nets.doc.ic.ac.uk/L-GRAAL/ ).

As we already explained in the Background section, one of the weak points of PPIN aligners is either their running time or the memory space they use. Indeed, although NATALIE was suggested as a good aligner, it could not even align the two smallest networks, C. elegans and D. melanogaster, on a computer with 64 GB of RAM. With respect to PINALOG, SPINAL, HubAlign and L-GRAAL, we were able to complete all the alignments and we show their running times in Table 9. In order to visualize their running times, we also show the running times of every finished computation for each aligner in Figure 9. We can observe there that SPINAL is, by a wide margin, the slowest one to compute the alignments between H. sapiens and S. cerevisiae, and also between D. melanogaster and S. cerevisiae. In addition, PINALOG is the slowest one, also by a wide margin, to compute the alignment between C. elegans and H. sapiens, as well as the alignment between H. sapiens and M. musculus.

We can also observe that AligNet is considerably faster than PINALOG and SPINAL, with a running time of less than a thousand seconds in most of the alignments. Only in one computation, the alignment between D. melanogaster and H. sapiens, is AligNet slower than PINALOG and SPINAL, with a difference of less than two thousand seconds. However, it is difficult to see the running times in some alignments because SPINAL needed more than 20,000 seconds for the alignment between S. cerevisiae and H. sapiens. Thus, in order to visualize the results in the cases where the aligners consumed less than 3,500 seconds, we decided to remove SPINAL and PINALOG, and in Figure 10 we show again the results considering only AligNet, HubAlign and L-GRAAL. We can observe there that L-GRAAL is the aligner that consumed the most time in most of the computations. Concerning HubAlign and AligNet, HubAlign is faster except in the alignments between C. elegans and S. cerevisiae and between S. cerevisiae and D. melanogaster.

Furthermore, we show in Figure 11 the relation between network size and running time for all of the computations with each of the aligners. The size of a network pair is the sum of the numbers of nodes of its two networks, and the network pairs in the diagrams are positioned in increasing order of size. A perfect aligner, from the efficiency point of view, should present a linear relation between the size of the network pair and the consumed time. From top left to bottom right, we show the results for the aligners AligNet, HubAlign, SPINAL, PINALOG, and L-GRAAL. We can observe that HubAlign and AligNet present a clear relation between computation time and size of the input networks. However, this is not the case for PINALOG, SPINAL and L-GRAAL. It should be noticed here that L-GRAAL has a step parameter which may force the computation to stop.
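One simple way to quantify how close an aligner is to the linear relation between pair size and running time mentioned above is an ordinary least-squares fit together with its coefficient of determination. The sketch below is an assumption of this kind of check, not an analysis performed in the paper; the function name is hypothetical and the numbers are made-up placeholder values, not measurements.

```python
def linear_fit(sizes, times):
    """Least-squares line time = slope * size + intercept, plus R^2."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    syy = sum((y - mean_y) ** 2 for y in times)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 close to 1 means the running time grows almost linearly with size.
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


# Placeholder data: pair sizes (sum of nodes) and running times in seconds.
placeholder_sizes = [1000, 2000, 3000, 4000]
placeholder_times = [10.0, 20.0, 30.0, 40.0]

slope, intercept, r_squared = linear_fit(placeholder_sizes, placeholder_times)
print(slope, intercept, r_squared)
```

With real measurements, an $R^2$ value far from 1 would correspond to the irregular behaviour described for PINALOG, SPINAL and L-GRAAL.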

Conclusions

In this paper we present AligNet, a new method and software tool for the pairwise global alignment of PPIN aimed at producing biologically meaningful alignments by achieving a good balance between structural matching and protein function conservation. AligNet is a parameter-free algorithm that, given two PPIN, produces a consistent alignment from the smaller network, in terms of number of nodes, to the larger network. In order to assess the correctness of our AligNet aligner, we have evaluated the quality of the alignments obtained with AligNet and with the best aligners established in [7, 23], namely PINALOG, SPINAL, HubAlign, and L-GRAAL. The obtained results show that, indeed, AligNet produces biologically more meaningful alignments than state-of-the-art methods and tools, by achieving a better balance between structural matching and protein function conservation.

We have used both the edge correctness (EC) and the functional coherence (FC) metrics. The results presented in the Results and Discussion section of this paper reveal that HubAlign and L-GRAAL obtained the best EC scores when the source input network has considerably fewer edges than the target network, preserving 80% of the edges, while AligNet preserved 60% of the edges. However, all aligners obtained very low EC scores, with less than 10% of the edges preserved, when the input PPIN have similar size.

Concerning the efficiency of the considered aligners from the computational point of view, HubAlign and AligNet obtained the best running times. In addition, with both aligners the running time increases with the size of the input networks, unlike the other aligners, which are slower than HubAlign and AligNet and have a variable running time that is not related to the size of the input networks.

Alignment  Phase                              Time
cel-dme    Compute Matrices                   18.3
           Overlapping Clustering             47.5
           Clusters Alignment and Assignment  105.45
           Global Alignment                   113.173
           Total                              331.098
cel-hsa    Compute Matrices                   49.68
           Overlapping Clustering             79.28
           Clusters Alignment                 109.198
           Global Alignment                   215.246
           Total                              555.328
cel-mmu    Compute Matrices                   3.914
           Overlapping Clustering             5.436
           Clusters Alignment                 17.556
           Global Alignment                   1.774
           Total                              56.381
cel-sce    Compute Matrices                   14.422
           Overlapping Clustering             28.59
           Clusters Alignment                 29.663
           Global Alignment                   42.676
           Total                              147.788
dme-hsa    Compute Matrices                   293.08
           Overlapping Clustering             125.215
           Clusters Alignment                 277.877
           Global Alignment                   1195.752
           Total                              2108.493
dme-mmu    Compute Matrices                   10.684
           Overlapping Clustering             24.722
           Clusters Alignment                 72.988
           Global Alignment                   10.638
           Total                              176.482
dme-sce    Compute Matrices                   171.263
           Overlapping Clustering             60.953
           Clusters Alignment                 66.736
           Global Alignment                   192.9
           Total                              542.181
hsa-mmu    Compute Matrices                   21.392
           Overlapping Clustering             46.998
           Clusters Alignment                 94.407
           Global Alignment                   14.947
           Total                              288.964
hsa-sce    Compute Matrices                   56.28
           Overlapping Clustering             101
           Clusters Alignment                 380.864
           Global Alignment                   424.039
           Total                              1066.908
mmu-sce    Compute Matrices                   7.792
           Overlapping Clustering             14
           Clusters Alignment                 13.107
           Global Alignment                   1.567
           Total                              61.630

Table 9: AligNet running times in seconds.

Acknowledgements

We thank Gabriel Riera for the technical support.

References

[1] Ahmet E. Aladağ and Cesim Erten. SPINAL: Scalable protein interaction network alignment. Bioinformatics, 29(7):917–924, 2013.
[2] Michael Ashburner et al. Gene Ontology: tool for the unification of biology. Nat. Genet., 25:25–29, 2000.
[3] Ralf Borndörfer and Olga Heismann. The hypergraph assignment problem. Discrete Optim., 15:15–25, 2015.
[4] Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden. BLAST+: architecture and applications. BMC Bioinformatics, 10(1):1, 2009.
[5] L. Chindelevitch, CY. Ma, CS. Liao, and B. Berger. Optimizing a global alignment of protein interaction networks. Bioinformatics, 29(21):2765–73, 2013.
[6] C. Clark and J. Kalita. A multiobjective memetic algorithm for ppi network alignment. Bioinformatics, 31(12):1988–98, 2015.
[7] Connor Clark and Jugal Kalita. A comparison of algorithms for the pairwise alignment of biological networks. Bioinformatics, 30(16):2351–2359, 2014.
[8] Ahed Elmsallati, Connor Clark, and Jugal Kalita. Global alignment of protein-protein interaction networks: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 13(4):689–705, 2016.
[9] Jason Flannick, Antal Novak, Balaji S Srinivasan, Harley H. McAdams, and Serafim Batzoglou. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res., 16(9):1169–81, 2006.
[10] Anne-Claude Gavin, Patrick Aloy, Paola Grandi, Roland Krause, Markus Boesche, Martina Marzioch, Christina Rau, Lars Juhl Jensen, Sonja Bastuck, Birgit Dümpelfeld, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084):631–636, 2006.
[11] Pietro Hiram Guzzi and Tijana Milenković. Survey of local and global biological network alignment: the need to reconcile the two sides of the same coin. Briefings in bioinformatics, page bbw132, 2017.
[12] S. Hashemifar, J. Ma, H. Naveed, S. Canzar, and J. Xu. Modulealign: module-based global alignment of protein-protein interaction networks. Bioinformatics, 32(17):658–64, 2016.
[13] S. Hashemifar and J. Xu. HubAlign: an accurate and efficient method for global alignment of protein-protein interaction networks. Bioinformatics, 30(17):i438–i444, 2014.
[14] Nazlin K Howell. Protein-protein interactions. In Biochemistry of food proteins, pages 35–74. Springer, 1992.
[15] Brian P Kelley, Bingbing Yuan, Fran Lewitter, Roded Sharan, Brent R Stockwell, and Trey Ideker. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res., 32(Web Server issue):W83–88, July 2004.
[16] Ozlem Keskin, Nurcan Tuncbag, and Attila Gursoy. Predicting protein–protein interactions from the molecular to the proteome level. Chemical Reviews, 116(8):4884–4909, 2016.
[17] Hiroaki Kitano. Systems biology: a brief overview. Science, 295(5560):1662–1664, 2002.
[18] Gunnar W. Klau. A new graph-based method for pairwise global network alignment. BMC Bioinformatics, 10(Suppl. 1):S59, 2009.
[19] Mehmet Koyutürk, Yohan Kim, Umut Topkara, Shankar Subramaniam, Wojciech Szpankowski, and Ananth Grama. Pairwise alignment of protein interaction networks. J. Comput. Biol., 13(2):182–199, March 2006.
[20] HW Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 52(1):7–21, 2005.
[21] Zhenping Li, Yong Wang, Shihua Zhang, Xiang-Sun Zhang, and Luonan Chen. Alignment of protein interaction networks by integer quadratic programming. In Proc. 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5527–5530, 2006.
[22] Zhi Liang, Meng Xu, Maikun Teng, and Liwen Niu. NetAlign: a web-based tool for comparison of protein interaction networks. Bioinformatics, 22(17):2175–2177, September 2006.
[23] N. Malod-Dognin, K. Ban, and N. Pržulj. Unified alignment of protein-protein interaction networks. Scientific Reports, 7(953), 2017.
[24] N. Malod-Dognin and N. Pržulj. L-graal: Lagrangian graphlet-based network aligner. Bioinformatics, 31(13):2182–9, 2015.
[25] Manikandan Narayanan and Richard M. Karp. Comparing protein interaction networks via a graph match-and-split algorithm. J. Comput. Biol., 14(7):892–907, September 2007.
[26] Behnam Neyshabur, Ahmadreza Khadem, Somaye Hashemifar, and Seyed Shahriar Arab. NETAL: a new graph-based method for global alignment of protein-protein interaction networks. Bioinformatics, 29(13):1654–1662, 2013.
[27] Daniel Park, Rohit Singh, Michael Baym, Chung-Shou Liao, and Bonnie Berger. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res., 39(suppl 1):D295–D300, 2011.
[28] Rob Patro and Carl Kingsford. Global network alignment using multiscale spectral signatures. Bioinformatics, 28(23):3105–3114, 2012.
[29] Hang T. T. Phan and Michael J. E. Sternberg. PINALOG: A novel approach to align protein interaction networks - implications for complex detection and function prediction. Bioinformatics, 28(9):1239–1245, 2012.
[30] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015.
[31] V. Srinivasa Rao, K. Srinivas, G. N. Sujini, and G. N. Sunand Kumar. Protein-protein interaction detection: Methods and analysis. Int. J. Proteomics, 2014:147648, 2014.
[32] Andreas Ruepp, Barbara Brauner, Irmtraud Dunger-Kaltenbach, Goar Frishman, Corinna Montrone, Michael Stransky, Brigitte Waegele, Thorsten Schmidt, Octave Noubibou Doudieu, Volker Stümpflen, and H. Werner Mewes. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res., 36(suppl 1):D646–D650, 2008.
[33] Andreas Ruepp, Alfred Zollner, Dieter Maier, Kaj Albermann, Jean Hani, Martin Mokrejs, Igor Tetko, Ulrich Güldener, Gertrud Mannhaupt, Martin Münsterkötter, and H. Werner Mewes. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res., 32(18):5539–5545, 2004.
[34] Andreas Schmidt, Ignasi Forne, and Axel Imhof. Bioinformatic analysis of proteomics data. BMC Syst. Biol., 8(Suppl 2):S3, 2014.
[35] Rohit Singh, Jinbo Xu, and Bonnie Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences of the United States of America, 105(35):12763–12768, 2008.
[36] V. Vijayan, V. Saraph, and T. Milenković. Magna++: Maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics, 31(14):2409–11, 2015.

Figure 5: This ﬁgure shows how AligNet constructs an appropriate set of align-ments considered to obtain a ﬁnal local alignment. This corresponds to theStep 4 of our aligner. First of all, a maximum score alignment between a pairof clusters is chosen: in this case, this corresponds to the matching between theclusters in Figure 3. Both clusters are shown in the second row of this ﬁgure.The shadowed nodes are the nodes that are not aligned. Next, a maximum scorealignment of a pair of clusters with source a cluster centered at a shadowed nodeis chosen: it turns out to be the one in the second row in Figure 4 and it is shownin the third row in this ﬁgure. Finally, the last alignment to be included in theappropriate set of alignments must be the one with source cluster centered atthe remaining shadowed node: this corresponds to the alignment in the last rowin Figure 4 shown in the bottom of this ﬁgure. Notice that in the end, thatis when we consider the three alignments together, there are four nodes in thesource network with inconsistent assignments..25 m247 dm6389 dm11070dm11454 dm2171dm11644dm10450dm8158 hs59 hs399hs3857hs5638 hs553hs5433 hs6992hs12566hs12206

Figure 6: This figure shows the local alignment of the original networks obtained by AligNet in its Step 4, once the inconsistent assignments have been resolved. The coherent assignment of nodes is obtained as the solution to the weighted bipartite hypergraph assignment problem for the hypergraph associated with the appropriate set of alignments described in Figure 5. In this case, the hypergraph has three hyperarcs, corresponding to the three alignments in the appropriate set of alignments.
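To give a feeling for what resolving inconsistent assignments means, the following Python sketch turns a set of conflicting scored pairs into a one-to-one assignment. It is a deliberately simplified greedy stand-in and not the weighted bipartite hypergraph assignment problem that AligNet solves; the function name `resolve_greedy`, the triple format, and the scores are assumptions made only for illustration.

```python
def resolve_greedy(scored_pairs):
    """Build a one-to-one mapping from (source, target, score) triples.

    Greedy stand-in: pairs are visited by decreasing score and kept only if
    neither endpoint has been used. This is NOT the hypergraph assignment
    formulation of the paper, only an illustration of a coherent assignment.
    """
    mapping = {}
    used_targets = set()
    for source, target, _score in sorted(
        scored_pairs, key=lambda pair: pair[2], reverse=True
    ):
        if source not in mapping and target not in used_targets:
            mapping[source] = target
            used_targets.add(target)
    return mapping


# Hypothetical candidate pairs collected from several overlapping alignments.
pairs = [
    ("u1", "v1", 0.9),
    ("u1", "v2", 0.8),
    ("u2", "v1", 0.7),
    ("u2", "v3", 0.4),
    ("u3", "v2", 0.6),
]
coherent = resolve_greedy(pairs)
```

Each source node ends up with at most one target and no target is used twice, which is the coherence property the caption refers to. A greedy choice can be suboptimal in total score, which is one reason an exact formulation may be preferable.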

Figure 7: This figure shows the final global alignment of the original networks obtained by AligNet. Notice that, in Step 5 of AligNet, the previous alignment is extended to a global one. In this case, there were two unmatched nodes in the source network in Figure 6, which are now assigned. The assignment of these two nodes is shown with solid arrows, while the already assigned nodes are shown with dashed arrows.

[Bar chart "Complexes assignment"; vertical axis: Percent; bars: AligNet/HubAlign, HubAlign/AligNet, AligNet/PINALOG, PINALOG/AligNet, AligNet/L-GRAAL, L-GRAAL/AligNet.]

Figure 8: This figure shows the results of the protein complexes test obtained by AligNet in contrast to the other aligners, for the alignment of complexes between S. cerevisiae and H. sapiens. We show the proportions of coherent, not coherent and not assigned complexes obtained by AligNet when the other aligner does not assign them, and conversely. Thus, the first bar shows the proportions of coherent, not coherent and not assigned complexes for AligNet when HubAlign does not assign them. Conversely, the second bar shows the proportions of coherent, not coherent and not assigned complexes for HubAlign when AligNet does not assign them.

[Plot; horizontal axis: network pairs mus-cel, mus-sce, mus-dme, cel-sce, cel-dme, mus-hsa, sce-dme, cel-hsa, sce-hsa, dme-hsa; vertical axis from 0 to 20000; legend: AligNet, HubAlign, SPINAL, PINALOG, L-GRAAL.]

Figure 9: This figure shows the running times (in seconds) obtained when performing the alignments for every pair of the considered networks. The figure presents the results obtained with the aligners AligNet, PINALOG, SPINAL, HubAlign and L-GRAAL.

[Plot; horizontal axis: network pairs mus-cel, mus-sce, mus-dme, cel-sce, cel-dme, mus-hsa, sce-dme, cel-hsa, sce-hsa, dme-hsa; vertical axis from 0 to 3500; legend: AligNet, HubAlign, L-GRAAL.]

Figure 10: This figure shows the same information presented in Figure 9, considering only the aligners AligNet, HubAlign and L-GRAAL.
