[PDF] Lagrangian Relaxation Applied to Sparse Global Network Alignment

Abstract

Data on molecular interactions is increasing at a tremendous pace, while the development of solid methods for analyzing this network data is lagging behind. This holds in particular for the field of comparative network analysis, where one wants to identify commonalities between biological networks. Since biological functionality primarily operates at the network level, there is a clear need for topology-aware comparison methods. In this paper we present a method for global network alignment that is fast and robust, and can flexibly deal with various scoring schemes taking both node-to-node correspondences as well as network topologies into account. It is based on an integer linear programming formulation, generalizing the well-studied quadratic assignment problem. We obtain strong upper and lower bounds for the problem by improving a Lagrangian relaxation approach and introduce the software tool natalie 2.0, a publicly available implementation of our method. In an extensive computational study on protein interaction networks for six different species, we find that our new method outperforms alternative state-of-the-art methods with respect to quality and running time.

Full PDF

LLagrangian Relaxation Applied toSparse Global Network Alignment

Mohammed El-Kebir , , , Jaap Heringa , , , and Gunnar W. Klau , Centrum Wiskunde & Informatica, Life Sciences Group, Science Park 123, 1098 XGAmsterdam, the Netherlands, {m.el-kebir,gunnar.klau}@cwi.nl Centre for Integrative Bioinformatics VU (IBIVU), VU University Amsterdam, DeBoelelaan 1081A, 1081 HV Amsterdam, the Netherlands, [email protected] Netherlands Institute for Systems Biology, Amsterdam, the Netherlands Netherlands Bioinformatics Centre, Nijmegen, the Netherlands

Abstract.

Data on molecular interactions is increasing at a tremendouspace, while the development of solid methods for analyzing this networkdata is lagging behind. This holds in particular for the ﬁeld of compara-tive network analysis, where one wants to identify commonalities betweenbiological networks. Since biological functionality primarily operates atthe network level, there is a clear need for topology-aware comparisonmethods. In this paper we present a method for global network align-ment that is fast and robust, and can ﬂexibly deal with various scoringschemes taking both node-to-node correspondences as well as networktopologies into account. It is based on an integer linear programmingformulation, generalizing the well-studied quadratic assignment problem.We obtain strong upper and lower bounds for the problem by improv-ing a Lagrangian relaxation approach and introduce the software tool natalie 2.0 , a publicly available implementation of our method. In anextensive computational study on protein interaction networks for sixdiﬀerent species, we ﬁnd that our new method outperforms alternativestate-of-the-art methods with respect to quality and running time.

In the last decade, data on molecular interactions has increased at a tremendouspace. For instance, the STRING database [24], which contains protein proteininteraction (PPI) data, grew from 261,033 proteins in 89 organisms in 2003 to5,214,234 proteins in 1,133 organisms in May 2011, more than doubling the num-ber of proteins in the database every two years. The same trends can be observedfor other types of biological networks, including metabolic, gene-regulatory, sig-nal transduction and metagenomic networks, where the latter can incorporatethe excretion and uptake of organic compounds through, for example, a micro-bial community [12, 21]. In addition to the plethora of experimentally derivednetwork data for many species, also the structure and behavior of molecularnetworks have become intensively studied over the last few years [2], leading tothe observation of many conserved features at the network level. However, the a r X i v : . [ c s . D S ] A ug Mohammed El-Kebir, Jaap Heringa, and Gunnar W. Klau development of solid methods for analyzing network data is lagging behind, par-ticularly in the ﬁeld of comparative network analysis. Here, one wants to identifycommonalities between biological networks from diﬀerent strains or species, orderived form diﬀerent conditions. Based on the assumption that evolutionaryconservation implies functional signiﬁcance, comparative approaches may help(i) improve the accuracy of data, (ii) generate, investigate, and validate hypothe-ses, and (iii) transfer functional annotations. Until recently, the most commonway of comparing two networks has been to solely consider node-to-node cor-respondences, for example by ﬁnding homologous relationships between nodes(e.g. proteins in PPI networks) of either network, while the topology of the twonetworks has not been taken into account. Since biological functionality pri-marily operates at the network level, there is a clear need for topology-awarecomparison methods. In this paper we present a network alignment method thatis fast and robust, and can ﬂexibly deal with various scoring schemes taking bothnode-to-node correspondences as well as network topologies into account.

Previous work.

Network alignment establishes node correspondences basedon both node-to-node similarities and conserved topological information. Sim-ilar to sequence alignment, local network alignment aims at identifying one ormore shared subnetworks, whereas global network alignment addresses the over-all comparison of the complete input networks.Over the last years a number of methods have been proposed for bothglobal and local network alignment, for example

PathBlast [14],

Network-Blast [22],

MaWISh [16],

Graemlin [8],

IsoRank [23],

Graal [17], and

SubMAP [4].

PathBlast heuristically computes high-scoring similar paths intwo PPI networks. Detecting protein complexes has been addressed with

Net-workBlast by Sharan et al. [22], where the authors introduce a probabilisticmodel and propose a heuristic greedy approach to search for shared complexes.Koyut¨urk et al. [16] use a more elaborate scoring scheme based on an evolu-tionary model to compute local pairwise alignments of PPI networks. The

Iso-Rank algorithm by Singh et al. [23] approaches the global alignment problem bypreferably matching nodes which have a similar neighborhood, which is elegantlysolved as an eigenvalue problem. Kuchaiev et al. [17] take a similar approach.Their method

Graal matches nodes that share a similar distribution of so-calledgraphlets, which are small connected non-isomorphic induced subgraphs.In this paper we focus on pairwise global network alignment, where an align-ment is scored by summing up individual scores of aligned node and interactionpairs. Among the above mentioned methods,

IsoRank and

Graal use a scoringmodel that can be expressed in this manner.

Contribution.

We present an algorithm for global network alignment based onan integer linear programming (ILP) formulation, generalizing the well-studiedquadratic assignment problem (QAP). We improve upon an existing Lagrangianrelaxation approach presented in previous work [15] to obtain strong upper andlower bounds for the problem. We exploit the closeness to QAP and generalize parse Global Network Alignment 3 c ( v , v ) =  , if v = A and v = a ,1 , if v = B and v = b ,1 , if v = C and v = c ,1 , if v = D and v = d ,0 , otherwise. w ( v , v , w , w ) =  , if ( v , w ) ∈ E and ( v , w ) ∈ E ,0 , otherwise. G = ( V , E ) G = ( V , E ) AB D c ad ebC

Fig. 1: Example of a network alignment. With the given scoring function, thealignment has a score of 4 + 40 = 44.a dual descent method for updating the Lagrangian multipliers to the general-ized problem. We have implemented the revised algorithm from scratch as thesoftware tool natalie 2.0 . In an extensive computational study on protein in-teraction networks for six diﬀerent species, we compare natalie 2.0 to Graal and

IsoRank , evaluating the number of conserved edges as well as functionalcoherence of the modules in terms of GO annotation. We ﬁnd that natalie 2.0 outperforms the alternative methods with respect to quality and running time.Our software tool natalie 2.0 as well as all data sets used in this study arepublicly available at http://planet-lisa.net.

Given two simple graphs G = ( V , E ) and G = ( V , E ), an alignment a : V (cid:43) V is a partial injective function from V to V . As such we have that analignment relates every node in V to at most one node in V and that converselyevery node in V has at most one counterpart in V . An alignment is assigned areal-valued score using an additive scoring function s deﬁned as follows: s ( a ) = (cid:88) v ∈ V c ( v, a ( v )) + (cid:88) v,w ∈ V v

1. In Section 5 we will discussthis assumption. Rather than using the aforementioned constraints, we make useof a stronger set of constraints which we obtain by multiplying constraints (2)and (3) by x ik : (cid:88) ll (cid:54) = k y ikjl = (cid:88) ll (cid:54) = k x ik x jl ≤ (cid:88) l x ik x jl ≤ x ik , ∀ i, j, k, i < j (5) (cid:88) jj>i y ikjl = (cid:88) jj>i x ik x jl ≤ (cid:88) j x ik x jl ≤ x ik , ∀ i, k, l, k (cid:54) = l (6) parse Global Network Alignment 5 We proceed by splitting the variable y ikjl (where i < j and k (cid:54) = l ). In other words,we extend the objective function such that the counterpart of y ikjl becomes y jlik . This is accomplished by rewriting the dummy constraint in (6) to j (cid:54) = i .In addition, we split the weights: w ikjl = w jlik = ( w (cid:48) ikjl /

2) where w (cid:48) ikjl denotesthe original weight. Furthermore, we require that the counterparts of the splitdecision variables assume the same value, which amounts tomax x,y (cid:88) i,k c ik x ik + (cid:88) i,jij (cid:88) k,lk (cid:54) = l w ikjl y ikjl (ILP)s.t. (cid:88) l x jl ≤ ∀ j (7) (cid:88) j x jl ≤ ∀ l (8) (cid:88) ll (cid:54) = k y ikjl ≤ x ik ∀ i, j, k, i (cid:54) = j (9) (cid:88) jj (cid:54) = i y ikjl ≤ x ik ∀ i, k, l, k (cid:54) = l (10) y ikjl = y jlik ∀ i, j, k, l, i < j, k (cid:54) = l (11) y ikjl ∈ { , } ∀ i, j, k, l, i (cid:54) = j, k (cid:54) = l (12) x ik ∈ { , } ∀ i, k (13)We can solve the continuous relaxation of (ILP) via its Lagrangian dual bydualizing the linking constraints (11) with multiplier λ :min λ Z LD ( λ ) , (LD)where Z LD ( λ ) equalsmax x,y (cid:88) i,k c ik x ik + (cid:88) i,jij (cid:88) k,lk (cid:54) = l ( w ikjl − λ jlik ) y ikjl s.t. (7), (8), (9), (10), (12) and (13)Now that the linking constraints have been dualized, one can observe that the re-maining constraints decompose the variables into | V || V | disjoint groups, wherevariables across groups are not linked by any constraint, and where each group Mohammed El-Kebir, Jaap Heringa, and Gunnar W. Klau contains a variable x ik and variables y ikjl for j (cid:54) = i and l (cid:54) = k . Hence, we have Z LD ( λ ) = max x (cid:88) i,k [ c ik + v ik ( λ )] x ik (LD λ )s.t. (cid:88) l x jl ≤ ∀ j (14) (cid:88) j x jl ≤ ∀ l (15) x ik ∈ { , } ∀ i, k (16)which corresponds to a maximum weight bipartite matching problem on the so-called alignment graph G m = ( V ∪ V , E m ). In the general case G m is a completebipartite graph, i.e. E m = { ( i, k ) | i ∈ V , v ∈ V } . However, by exploitingbiological knowledge one can make G m more sparse by excluding biologically-unlikely edges (see Section 4). For the global problem, the weight of a matchingedge ( i, k ) is set to c ik + v ik ( λ ), where the latter term is computed as v ik ( λ ) = max y (cid:88) jj>i (cid:88) ll (cid:54) = k ( w ikjl + λ ikjl ) y ikjl + (cid:88) jj i , or w ikjl − λ jlik if j < i .So in order to compute Z LD ( λ ), we need to solve a total number of | V || V | + 1maximum weight bipartite matching problems, which, using the Hungarian al-gorithm [18, 20] can be done in O ( n ) time, where n = max( | V | , | V | ). In casethe alignment graph is sparse, i.e. O ( | E m | ) = O ( n ), Z LD ( λ ) can be computedin O ( n log n ) time using the successive shortest path variant of the Hungarianalgorithm [7]. It is important to note that for any λ , Z LD ( λ ) is an upper boundon the score of an optimal alignment. This is because any alignment a is feasibleto Z LD ( λ ) and does not violate the original linking constraints and thereforehas an objective value equal to s ( a ). In particular, the optimal alignment a ∗ is also feasible to Z LD ( λ ) and hence a ∗ ≤ Z LD ( λ ). Since the two sets of prob-lems resulting from the decomposition both have the integrality property [6],the smallest upper bound we can achieve equals the linear programming (LP)bound of the continuous relaxation of (ILP) [9]. However, computing the small-est upper bound by ﬁnding suitable multipliers is much faster than solving thecorresponding LP. Given solution ( x, y ) to Z LD ( λ ), we obtain a lower bound on s ( a ∗ ), denoted Z lb ( λ ), by considering the score of the alignment encoded in x . parse Global Network Alignment 7 In this section we will discuss strategies for identifying Lagrangian multipliers λ that yield an as small as possible gap between the upper and lower boundresulting from the solution to Z LD ( λ ). Subgradient optimization.

We start by discussing subgradient optimization,which is originally due to Held and Karp [10]. The idea is to generate a sequence λ , λ , . . . of Lagrangian multiplier vectors starting from λ = as follows: λ t +1 ikjl = λ tikjl − α · ( Z LD ( λ ) − Z lb ( λ )) (cid:107) g ( λ t ) (cid:107) g ( λ tikjl ) ∀ i, j, k, l, i < j, k (cid:54) = l (20)where g ( λ tikjl ) corresponds to the subgradient of multiplier λ tikjl , i.e. g ( λ tikjl ) = y ikjl − y jlik , and α is the step size parameter. Initially α is set to 1 and it is halvedif neither Z LD ( λ ) nor Z lb ( λ ) have improved for over N consecutive iterations.Conversely, α is doubled if M times in a row there was an improvement in either Z LD ( λ ) or Z lb ( λ ) [5]. In case all subgradients are zero, the optimal solutionhas been found and the scheme terminates. Note that this is not guaranteed tohappen. Therefore we abort the scheme after exceeding a time limit or a pre-speciﬁed number of iterations. In addition, we terminate if α has dropped belowmachine precision. Algorithm 1 gives the pseudo code of this procedure. Algorithm 1:

SubgradientOpt ( λ, M, N ) α ← n ← N ; m ← M [LB ∗ , UB ∗ ] ← [ Z lb ( λ ) , Z LD ( λ )] while g ( λ ) (cid:54) = 0 do λ ← λ − α ( Z LD ( λ ) − Z lb ( λ )) (cid:107) g ( λ t ) (cid:107) g ( λ t ) if [LB ∗ , UB ∗ ] \ [ Z lb ( λ ) , Z LD ( λ )] = ∅ then n ← n − else LB ∗ ← max[LB ∗ , Z lb ( λ )] UB ∗ ← min[UB ∗ , Z LD ( λ )] m ← m − if n = 0 then α ← α/ n ← N if m = 0 then α ← α ; m ← M return [LB ∗ , UB ∗ ] Dual descent.

In this section we derive a dual descent method which is anextension of the one presented in [1]. The dual descent method takes as a starting

Mohammed El-Kebir, Jaap Heringa, and Gunnar W. Klau point the dual of Z LD ( λ ): Z LD ( λ ) = min α,β (cid:88) i α i + (cid:88) k β k (21)s.t. α i + β k ≥ c ik + v ik ( λ ) ∀ i, k (22) α i ≥ ∀ i (23) β k ≥ ∀ k (24)where the dual of v ik ( λ ) is v ik ( λ ) = min µ,ν (cid:88) jj (cid:54) = i µ ikj + (cid:88) ll (cid:54) = k ν ikl (25)s.t. µ ikj + ν ikl ≥ w ikjl + λ ikjl ∀ j, l, j > i, l (cid:54) = k (26) µ ikj + ν ikl ≥ w ikjl − λ jlik ∀ j, l, j < i, l (cid:54) = k (27) µ ikj ≥ ∀ j (28) ν ikl ≥ ∀ l. (29)Suppose that for a given λ t we have computed dual variables ( α, β ) solving(21) with objective value Z LD ( λ t ), as well as dual variables ( µ ik , ν ik ) yieldingvalues v ik ( λ ) to linear programs (25). The goal now is to ﬁnd λ t +1 such thatthe resulting bound is better or just as good, i.e. Z LD ( λ t +1 ) ≤ Z LD ( λ t ). Weprevent the bound from increasing, by ensuring that the dual variables ( α, β )remain feasible to (21). This we can achieve by considering the slacks: π ik ( λ ) = α i + β k − c ik − v ik ( λ ) . So for ( α, β ) to remain feasible, we can only allow every v ik ( λ t ) to increase by as much as π ik ( λ t ). We can achieve such an increase byconsidering linear programs (25) and their slacks deﬁned as γ ikjl ( λ ) = (cid:40) µ ikj + ν ikl − w ikjl + λ ikjl , if j > i , µ ikj + ν ikl − w ikjl − λ jlik , if j < i , ∀ j, l, j (cid:54) = i, l (cid:54) = k, (30)and update the multipliers in the following way. Lemma 1.

The adjustment scheme below yields solutions to linear programs (25) with objective values v ik ( λ t +1 ) at most π ik ( λ t ) + v ik ( λ t ) for all i, k . λ t +1 ikjl = λ tikjl + ϕ ikjl (cid:20) γ ikjl ( λ t ) + τ ik (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t ) (cid:21) − ϕ jlik (cid:20) γ jlik ( λ t ) + τ jl (cid:18) n −

1) + 12( n − (cid:19) π jl ( λ t ) (cid:21) (31) for all j, l, i < j, k (cid:54) = l , where n = | V | , n = | V | , and ≤ ϕ ikjl , τ jl ≤ areparameters.Proof. We prove the lemma by showing that for any i, k there exists a feasiblesolution ( µ (cid:48) ik , ν (cid:48) ik ) to (25) whose objective value v ik ( λ t +1 ) is at most π ik ( λ t ) + parse Global Network Alignment 9 v ik ( λ t ). Let ( µ ik , ν ik ) be the solution to (25) given multipliers λ t . We claim thatsetting µ (cid:48) ikj = µ ikj + π ik ( λ t )2( n − ∀ j, j (cid:54) = iν (cid:48) ikl = ν ikj + π ik ( λ t )2( n − ∀ l, l (cid:54) = k, results in a feasible solution to (25) given multipliers λ t +1 . We start by showingthat constraints (26) and (27) are satisﬁed. From (31) the following bounds on λ t +1 follow. λ tikjl − γ jlik ( λ t ) − (cid:18) n −

1) + 12( n − (cid:19) π jl ( λ t ) ≤ λ t +1 ikjl ∀ j, l, j < i, l (cid:54) = kλ t +1 ikjl ≤ λ tikjl + γ ikjl ( λ t ) + (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t ) ∀ j, l, j < i, l (cid:54) = k. Therefore we have that the following inequalities imply constraints (26) and (27)for all j, l , j > i , l (cid:54) = k : µ (cid:48) ikj + ν (cid:48) ikl ≥ w ikjl + λ tikjl + γ ikjl ( λ t ) + (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t )and for all j, l , j < i , l (cid:54) = kµ (cid:48) ikj + ν (cid:48) ikl ≥ w ikjl − λ tjlik + γ ikjl ( λ t ) + (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t ) . Constraints (26) and (27) are indeed implied, as, for all j, l , j > i , l (cid:54) = k , µ (cid:48) ikj + ν (cid:48) ikl = µ ikj + ν ikl + (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t ) ≥ w ikjl + λ tikjl + γ ikjl ( λ t ) + (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t )and for all j, l , j < i , l (cid:54) = kµ (cid:48) ikj + ν (cid:48) ikl = µ ikj + ν ikl + (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t ) ≥ w ikjl − λ tjlik + γ ikjl ( λ t ) + (cid:18) n −

1) + 12( n − (cid:19) π ik ( λ t ) . Since µ ikj , ν ikl ≥ ∀ j, l, j (cid:54) = i, l (cid:54) = k ) and by deﬁnition π ik ( λ t ) ≥

0, constraints(28) and (29) are satisﬁed as well. The objective value of ( µ (cid:48) ik , ν (cid:48) ik ) is given by (cid:88) jj (cid:54) = i µ (cid:48) ikj + (cid:88) ll (cid:54) = k ν (cid:48) ikl = (cid:88) jj (cid:54) = i µ ikj + (cid:88) ll (cid:54) = k ν ikl + π ik ( λ t ) = v ik ( λ t ) + π ik ( λ t ) . Since (25) are minimization problems and there exist, for all i, k , feasible solu-tions with objective values v ik ( λ t ) + π ik ( λ t ), we can conclude that the objectivevalues of the solutions are bounded by this quantity. Hence, the lemma follows. We use ϕ = 0 . τ = 1, and perform the dual descent method L successivetimes (see Algorithm 2). Algorithm 2:

DualDescent ( λ, L ) ϕ ← .

5; [LB ∗ , UB ∗ ] ← [ Z lb ( λ ) , Z LD ( λ )] for n ← to L do foreach i, k, j, l, i < j, k (cid:54) = l do λ ikjl ← λ ikjl + ϕ ( γ ikjl + π ik ( λ )2( n − + π ik ( λ )2( n − )) − ϕ ( γ jlik + π jl ( λ )2( n − + π jl ( λ )2( n − )) LB ∗ ← max[LB ∗ , Z lb ( λ )] UB ∗ ← Z LD ( λ ) return [LB ∗ , UB ∗ ] Overall method.

Our overall method combines both the subgradient optimiza-tion and dual descent method. We do this performing the subgradient methoduntil termination and then switching over to the dual descent method. Thisprocedure is repeated K times (see Algorithm 3). Algorithm 3:

Natalie ( K, L, M, N ) λ ← ; [LB ∗ , UB ∗ ] ← [0 , ∞ ] for k ← to K do [LB ∗ , UB ∗ ] ← SubgradientOpt ( λ, M, N ) ∩ [LB ∗ , UB ∗ ] [LB ∗ , UB ∗ ] ← DualDescent ( λ, L ) ∩ [LB ∗ , UB ∗ ] return [LB ∗ , UB ∗ ] We implemented natalie in C ++ using the LEMON graph library (http://lemon.cs.elte.hu/). The successive shortest path algorithm for maximum weightbipartite matching was implemented and contributed to LEMON. Special carewas taken to deal with the inherent numerical instability of ﬂoating point num-bers. Our implementation supports both the GraphML and GML graph formats.Rather than using one big alignment graph, we store and use a diﬀerent align-ment graph for every local problem (LD ikλ ). This proved to be a huge improve-ment in running times, especially when the global alignment graph is sparse. natalie is publicly available at http://planet-lisa.net. From the STRING database v8.3 [24], we obtained PPI networks for the fol-lowing six species:

C. elegans (cel),

S. cerevisiae (sce),

D. melanogaster (dme), parse Global Network Alignment 11

R. norvegicus (rno),

M. musculus (mmu) and

H. sapiens (hsa). We only consid-ered interactions that were experimentally veriﬁed. Table 1 shows the sizes of thenetworks. We performed, using the BLOSUM62 matrix, an all-against-all globalsequence alignment on the protein sequences of all (cid:0) (cid:1) = 15 pairs of networks.We used aﬃne gap penalties with a gap-open penalty of 2 and a gap-extensionpenalty of 10. The ﬁrst experiment in Section 4.1 compares the raw performanceof IsoRank , Graal and natalie in terms of objective value. In Section 4.2we evaluate the biological relevance of the alignments produced by the threemethods. All experiments were conducted on a compute cluster with 2.26 GHzprocessors with 24 GB of RAM. species nodes annotated interactions cel (c) 5,948 4,694 23,496sce (s) 6,018 5,703 131,701dme (d) 7,433 6,006 26,829rno (r) 8,002 6,786 32,527mmu (m) 9,109 8,060 38,414hsa (h) 11,512 9,328 67,858

Table 1: Characteristics of input networks considered in this study. The columnscontain species identiﬁer, number of nodes in the network, number of annotatednodes thereof, and number of interactions

The objective function used for scoring alignments in

Graal counts the numberof mapped edges. Such an objective function is easily expressible in our frame-work using s ( a ) = |{ ( v, w ) ∈ E | ( a ( v ) , a ( w )) ∈ E }| and can also be modeledusing the IsoRank scoring function. In order to compare performance of themethods across instances, we normalize the scores by dividing by min( | E | , | E | ).This measure is called the edge-correctness by Kuchaiev et al. [17].As mentioned in Section 3, our method beneﬁts greatly from using a sparsealignment graph. To that end, we use the e-values obtained from the all-against-all sequence alignment to prohibit biologically unlikely matchings by only con-sidering protein-pairs whose e-value is at most 100. Note that this only applies to natalie as both Graal and

IsoRank consider the complete alignment graph.On each of the 15 instances, we ran

Graal with 3 diﬀerent random seeds andsampled the input parameter which balances the contribution of the graphletswith the node degrees uniformly within the allowed range of [0 , Iso-Rank , when setting the parameter α —which controls to what extent topologicalsimilarity plays a role—to the desired value of 1, very poor results were ob-tained. Therefore we also sampled this parameter within its allowed range andre-evaluated the resulting alignments in terms of edge-correctness. natalie was run with a time limit of 10 minutes and K = 3, L = 100, M = 10, N = 20. Forboth Graal and

IsoRank only the highest-scoring results were considered.Figure 2 shows the results.

IsoRank was only able to compute alignmentsfor three out of the 15 instances. On the other instances

IsoRank crashed, whichmay be due to the large size of the input networks. For

Graal no alignmentsconcerning sce could be computed, which is due to the large number of edges inthe network on which the graphlet enumeration procedure choked: in 12 hoursonly for 3% of the nodes the graphlet degree vector was computed. As for thelast three instances,

Graal crashed due to exceeding the memory limit inherentto 32-bit processes. Unfortunately no 64-bit executable was available. On theinstances for which

Graal could compute alignments, the performance—bothin solution quality and running time—is very poor when compared to

IsoRank and natalie . natalie outperforms IsoRank in both running time and solutionquality.

In order to measure the biological relevance of the obtained network alignments,we make use of the Gene Ontology (GO) [3]. For every node in each of the sixnetworks we obtained a set of GO annotations (see Table 1 for the exact num-bers). Each annotation set was extended to a multiset by including all ancestralGO terms for every annotation in the original set. Subsequently we employeda similarity measure that compares a pair of aligned nodes based on their GOannotations and also takes into account the relative frequency of each annota-tion [11]. Since the similarity measure assigns a score between 0 and 1 to everyaligned node pair, the highest similarity score one can get for any alignment is theminimum number of annotated nodes in either of the networks. Therefore we cannormalize the similarity scores by this quantity. Unlike the previous experiment,this time we considered the bitscores of the pairwise global sequence alignments.Similarly to

IsoRank parameter α , we introduced a parameter β ∈ [0 ,

1] suchthat the sequence part of the score has weight (1 − β ) and the topology part hasweight β . For both IsoRank and natalie we sampled the weight parametersuniformly in the range [0 ,

1] and showed the best result in Figure 3. There we cansee that both natalie and

IsoRank identify functionally coherent alignments.

Inspired by results for the closely related quadratic assignment problem (QAP),we have presented new algorithmic ideas in order to make a Lagrangian relax-ation approach for global network alignment practically useful and competitive.In particular, we have generalized a dual descent method for the QAP. We havefound that combining this scheme with the traditional subgradient optimizationmethod leads to fastest progress of upper and lower bounds.Our implementation of the new method, natalie 2.0 , works very well andfast when aligning biological networks, which we have shown in an extensive parse Global Network Alignment 13 c - s c - d c - r c - m c - h s - d s - r s - m s - h d - r d - m d - h r - m r - h m - h IsoRank ( α =1)GRAAL IsoRanknatalie (a) Edge correctness c - s c - d c - r c - m c - h s - d s - r s - m s - h d - r d - m d - h r - m r - h m - h IsoRank ( α = 1)GRAAL IsoRanknatalie (b) Running times in seconds Fig. 2: Performance of the three diﬀerent methods for the all-against-all speciescomparisons (15 alignment instances). Missing bars correspond to exceededtime/memory limits or software crashes. c - s c - d c - r c - m c - h s - d s - r s - m s - h d - r d - m d - h r - m r - h m - h IsoRank GRAAL natalie

Fig. 3: Biological relevance of the alignments measured via GO similaritystudy on the alignment of cross-species PPI networks. We have compared na-talie 2.0 to those state-of-the-art methods whose scoring schemes can be ex-pressed as special cases of the scoring scheme we propose. Currently, these meth-ods are

IsoRank and

Graal . Our experiments show that the Lagrangian re-laxation approach is a very powerful method and that it currently outperformsthe competitors in terms of quality of the results and running time.Currently, all methods, including ours, approach the global network align-ment problem heuristically, that is, the computed alignments are not guaran-teed to be optimal solutions of the problem. While the other approaches areintrinsically heuristic—both

IsoRank and

Graal , for instance, approximatethe neighborhood of a node and then match it with a similar node—the inex-actness in our methods has two causes that we plan to address in future work:On the one hand, there may still be a gap between upper and lower bound ofthe Lagrangian relaxation approach after the last iteration. We can use thesebounds, however, in a branch-and-bound approach that will compute provablyoptimal solutions. On the other hand, we currently do not consider the completebipartite alignment graph and may therefore miss the optimal alignment. Here,we will investigate preprocessing strategies, in the spirit of [25], to safely sparsifythe input bipartite graph without violating optimality conditions.The independence of the local problems (LD ikλ ) allows for easy paralleliza-tion, which, when exploited would lead to an even faster method. Another im-provement in running times might be achieved when considering more involvedheuristics for computing the lower bound, such as local search. More functionally-coherent alignments can be obtained when considering a scoring function wherenode-to-node correspondences are not only scored via sequence similarity butalso for instance via GO similarity. In certain cases, even negative weights for parse Global Network Alignment 15 topological interactions might be desired in which case one needs to reconsiderthe assumption of entries of matrix W being positive. Acknowledgments.

References

1. W. P. Adams and T. Johnson. Improved linear programming-based lower boundsfor the quadratic assignment problem. In

DIMACS Series in Discrete Mathematicsand Theoretical Computer Science , 1994.2. U. Alon. Network motifs: theory and experimental approaches.

Nat Rev Genet ,8(6):450–461, 2007.3. M. Ashburner, C. A. Ball, J. A. Blake, et al. Gene ontology: tool for the uniﬁcationof biology.

Nat Genet , 25, 2000.4. F. Ay, M. Kellis, and T. Kahveci. SubMAP: Aligning Metabolic Pathways withSubnetwork Mappings.

J Comput Biol , 18(3):219–35, 2011.5. A. Caprara, M. Fischetti, and P. Toth. A heuristic method for the set coverproblem.

Oper Res , 47:730–743, 1999.6. J. Edmonds. Path, trees, and ﬂowers.

Canadian J Math , 17:449–467, 1965.7. J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic eﬃciencyfor network ﬂow problems.

J ACM , 19:248–264, 1972.8. J. Flannick, A. Novak, B. S. Srinivasan, H. H. McAdams, and S. Batzoglou. Graem-lin: general and robust alignment of multiple large interaction networks.

GenomeRes , 16(9):1169–1181, 2006.9. M. Guignard. Lagrangean relaxation.

Top , 11:151–200, 2003.10. M. Held and R. M. Karp. The traveling-salesman problem and minimum spanningtrees: Part II.

Math Program , 1:6–25, 1971.11. S. Jaeger, C. Sers, and U. Leser. Combining modularity, conservation, and interac-tions of proteins signiﬁcantly increases precision and coverage of protein functionprediction.

BMC Genomics , 11(1):717, 2010.12. M. Kanehisa, S. Goto, M. Hattori, et al. From genomics to chemical genomics:new developments in KEGG.

Nucleic Acids Res , 34:D354–D357, 2006.13. R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W.Thatcher, editors,

Complexity of Computer Computations , pages 85–103. PlenumPress, 1972.14. B. P. Kelley, R. Sharan, R. M. Karp, et al. Conserved pathways within bacteriaand yeast as revealed by global protein network alignment.

P Natl Acad Sci USA ,100(20):11394–11399, 2003.15. G. W. Klau. A new graph-based method for pairwise global network alignment.

BMC Bioinform , 10 (Suppl 1):S59, 2009.16. M. Koyut¨urk, Y. Kim, U. Topkara, et al. Pairwise alignment of protein interactionnetworks.

J Comput Biol , 13(2):182–199, 2006.17. O. Kuchaiev, T. Milenkovic, V. Memisevic, W. Hayes, and N. Przulj. Topologicalnetwork alignment uncovers biological function and phylogeny.

J R Soc Interface ,7(50):1341–54, 2010.6 Mohammed El-Kebir, Jaap Heringa, and Gunnar W. Klau18. H. W. Kuhn. The Hungarian method for the assignment problem.

Nav Res LogistQ , 2(1-2):83–97, 1955.19. E. L. Lawler. The quadratic assignment problem.

Manage Sci , 9(4):586–599, 1963.20. J. Munkres. Algorithms for the assignment and transportation problems.

SIAM JAppl Math , 5:32–38, 1957.21. R. Sharan and T. Ideker. Modeling cellular machinery through biological networkcomparison.

Nat Biotechnol , 24(4):427–433, 2006.22. R. Sharan, T. Ideker, B. Kelley, R. Shamir, and R. M. Karp. Identiﬁcation of pro-tein complexes by comparative analysis of yeast and bacterial protein interactiondata.

J Comput Biol , 12(6):835–846, 2005.23. R. Singh, J. Xu, and B. Berger. Global alignment of multiple protein interactionnetworks with application to functional orthology detection.

P Natl Acad Sci USA ,105(35):12763–12768, 2008.24. D. Szklarczyk, A. Franceschini, M. Kuhn, et al. The STRING database in 2011:functional interaction networks of proteins, globally integrated and scored.

NucleicAcids Res , 39:561–568, 2010.25. I. Wohlers, R. Andonov, and G. W. Klau. Algorithm engineering for optimalalignment of protein structure distance matrices.

Related Researches

The Multiplicative Version of Azuma's Inequality, with an Application to Contention Analysis

by William Kuszmaul

Balanced Districting on Grid Graphs with Provable Compactness and Contiguity

by Cyrus Hettle

Deterministic Tree Embeddings with Copies for Algorithms Against Adaptive Adversaries

by Bernhard Haeupler

Approximately counting independent sets of a given size in bounded-degree graphs

by Ewan Davies

A Dynamic Data Structure for Temporal Reachability with Unsorted Contact Insertions

by Luiz F. Afra Brito

Semi-Streaming Algorithms for Submodular Matroid Intersection

by Paritosh Garg

Multivariate Analysis of Scheduling Fair Competitions

by Siddharth Gupta

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

by De Huang

Online Bin Packing with Predictions

by Spyros Angelopoulos

Minimum projective linearizations of trees in linear time

by Lluís Alemany-Puig

Parameterized complexity of computing maximum minimal blocking and hitting sets

by Júlio Araújo

A 2 -Approximation Algorithm for Flexible Graph Connectivity

by Sylvia Boyd

A Faster Algorithm for Finding Closest Pairs in Hamming Metric

by Andre Esser

Kernelization of Maximum Minimal Vertex Cover

by Júlio Araújo

Fractionally Log-Concave and Sector-Stable Polynomials: Counting Planar Matchings and More

by Yeganeh Alimohammadi

Optimal Construction of Hierarchical Overlap Graphs

by Shahbaz Khan

Gapped Indexing for Consecutive Occurrences

by Philip Bille

CountSketches, Feature Hashing and the Median of Three

by Kasper Green Larsen

A Refined Analysis of Submodular Greedy

by Ariel Kulik

Generalized Parametric Path Problems

by Prerona Chatterjee

Approximate Privacy-Preserving Neighbourhood Estimations

by Alvaro Garcia-Recuero

Coalgebra Encoding for Efficient Minimization

by Hans-Peter Deifel

Algorithms and Complexity on Indexing Founder Graphs

by Massimo Equi

A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs

by Sangsoo Park

Density Sketches for Sampling and Estimation

by Aditya Desai

«

1

2

3

4

»

Submitted on 22 Aug 2011 Updated

arXiv.org Original Source

NASA ADS

Google Scholar

Semantic Scholar