An introductory guide to aligning networks using SANA, the Simulated Annealing Network Aligner
Vol. 00 no. 00 2017, Pages 1–11
Wayne B. Hayes ∗ Department of Computer Science, University of California, Irvine CA 92697-3435, USA
This is a preprint. Citation to published version below.
Citation:
Hayes, Wayne B. “An Introductory Guide to Aligning Networks Using SANA, the Simulated Annealing Network Aligner.” In
Protein-Protein Interaction Networks, pp. 263-284. Humana, New York, NY, 2020.
ABSTRACT
Sequence alignment has had an enormous impact on our understanding of biology, evolution, and disease. The alignment of biological networks holds similar promise. Biological networks generally model interactions between biomolecules such as proteins, genes, metabolites, or mRNAs. There is strong evidence that the network topology (the “structure” of the network) is correlated with the functions performed, so that network topology can be used to help predict or understand function. However, unlike sequence comparison and alignment, which is an essentially solved problem, network comparison and alignment is an NP-complete problem for which heuristic algorithms must be used.
Here we introduce SANA, the Simulated Annealing Network Aligner. SANA is one of many algorithms proposed for the arena of biological network alignment. In the context of global network alignment, SANA stands out for its speed, memory efficiency, ease-of-use, and flexibility in producing alignments between 2 or more networks. SANA produces better alignments in minutes on a laptop than most other algorithms can produce in hours or days of CPU time on large server-class machines. We walk the user through how to use SANA for several types of biomolecular networks.
Availability: https://github.com/waynebhayes/sana
Contact: [email protected]
Supplementary information:
Available online.
A biological network consists of a set of nodes representing entities, with edges connecting entities that are related in some way. They come in many varieties, such as protein-protein interaction (PPI) networks (Williamson and Sutcliffe, 2010; Jaenicke and Helmreich, 2012), gene regulatory networks (Davidson, 2010; Karlebach and Shamir, 2008), gene-µRNA networks (Chen and Rajewsky, 2007; Prescott, 2012; Farazi et al., 2013; Kotlyar et al., 2015; Tokar et al., 2017), metabolic networks (Fiehn, 2002), brain connectomes (Milano et al., 2017), and many others (Junker and Schreiber, 2011). It is believed that the structure of the networks, in the form of the network topology, is related to the function of the entities (Davidson, 2010; Davis et al., 2015; Sporns, 2010). The alignment of such networks aims to use connectivity between nodes (the topology of the network) to aid extraction of information about the nodes and their function. Network alignments can be used to build taxonomic trees and find highly conserved pathways across distant species (Kuchaiev et al., 2010); by extension, finding such topological similarities may aid in transferring functional knowledge from better-understood species to less well-understood ones, much as sequence alignment has been doing for sequences for decades. Networks are even starting to have an influence on individual human health (Van El et al., 2013).

∗ To whom correspondence should be addressed ([email protected])

Network alignment is a fundamentally difficult problem: it is a generalization of the NP-Complete subgraph isomorphism problem (Cook, 1971; Garey and Johnson, 1979); and adding to the difficulty is that current data sets are very noisy (Von Mering et al., 2002). Therefore, modern alignment algorithms try to approximate solutions using heuristic approaches.

There are several sub-classes of network alignment. Global Network Alignment (GNA) is the task of attempting to completely align entire networks to each other; GNA applied to just two networks is called pairwise GNA (Kuchaiev et al., 2010; Malod-Dognin and Pržulj, 2015; Saraph and Milenković, 2014; Mamano and Hayes, 2017; Hashemifar and Xu, 2014; Sun et al., 2015; Patro and Kingsford, 2012), while aligning more than two whole networks is called multiple GNA. In contrast, Local Network Alignment (LNA) attempts to find similarity in the local wiring patterns among small groups of nodes, either in the same network, or across many networks. In all of these cases, alignments can map nodes 1-to-1, or many-to-many; the latter is more biologically realistic since, for example, one gene in yeast may have multiple homologs in mammals. However, the 1-to-1 assumption makes programming simpler and so the majority of aligners take the 1-to-1 mapping as a simplifying assumption. A more recent version of network alignment looks into modeling dynamic networks (see for example Vijayan and Milenković (2017)). An excellent comprehensive survey of all these types of alignments is provided by Faisal et al. (2015a). SANA was originally a 1-to-1 pairwise global network alignment algorithm, although we here also introduce a prototype multiple network alignment version.
© Wayne B. Hayes 2017.

Source code to SANA is available on GitHub at http://github.com/waynebhayes/SANA, and is best cloned from GitHub on the Unix command line using

    git clone http://github.com/waynebhayes/SANA

SANA is written in C++ and runs best on the Unix command line. It has been tested with gcc 4.8, 4.9, 5.2, and 5.4, and runs on Unix, Linux, Mac OS/X, and under the Windows-based Unix emulator Cygwin (http://cygwin.com), 32-bit or 64-bit. SANA has a rudimentary Web interface at http://sana.ics.uci.edu, and a rudimentary SANA app is available in the Cytoscape app store.

SANA expects its input networks to be in a two-column ASCII format we call edge list format: each line is one edge, specified by listing the two nodes at each end of the edge in arbitrary order (unless -nodes-have-types is specified, see below). Duplicate edges and self-loops are not allowed. We also supply a program called createEdgeList that can convert the following types of formats into SANA’s edgeList format: XML, GML, LEDA, .gw, CSV, LGF.

An alignment measure is any quantity designed to evaluate the quality of a network alignment. Alignment measures can be classified along many axes. The first axis is the distinction between objectives and what we call post-hoc measures. While both can be evaluated on any given alignment, any measure used to guide an alignment as it is being created is called an objective function; any measure not used to guide the alignment is generally applied after-the-fact as an independent measure of quality. A good alignment algorithm should be able to use virtually any measure as an objective, and also evaluate the alignment after-the-fact using any other measures which were not used as objectives.
Another axis along which measures can be classified is topological vs. biological. A topological measure quantifies a network alignment based solely on graph-theoretic grounds. Several such measures exist: EC (Kuchaiev et al., 2010), ICS (Patro and Kingsford, 2012), and $S^3$ (Saraph and Milenković, 2014) quantify the number of edges in one network that are mapped to edges in the other network(s); they are all described in more detail below. Other topological measures use graphlets to quantify local structure (Pržulj et al., 2004b; Milenković and Pržulj, 2008; Yaveroğlu et al., 2014; Malod-Dognin and Pržulj, 2015), while still others use graph measures such as spectral analysis (Patro and Kingsford, 2012) and degree similarity-based measures such as Importance (Hashemifar and Xu, 2014).

Biological measures. In contrast, biological measures are usually used to compare the nodes from different networks that have been paired together by the alignment. For genes or proteins, a common measure is the sequence similarity or BLAST score between the aligned nodes (Camacho et al., 2009); sequence similarity is also frequently combined with topology to produce a hybrid objective function (see for example Kuchaiev and Pržulj (2011); Saraph and Milenković (2014); Mamano and Hayes (2017); Malod-Dognin and Pržulj (2015), among many others). Another biology-based measure is the functional similarity between pairs of aligned proteins as expressed by GO (Gene Ontology) terms (Consortium, 2008). While many authors quantify the functional similarity exposed by an alignment using the mean value of various pairwise GO similarity measures across the alignment, such mean-of-pairwise-scores measures assume each pair of aligned proteins is independent of all others, which is not true in an alignment since every pair is implicitly related to every other pair via the alignment itself. This problem is alleviated by the NetGO score as implemented in SANA (Hayes and Mamano, 2017), which is a global rather than local scoring mechanism (see below for the meaning of local vs. global measures).

The final axis along which network alignment measures can be classified is what we refer to as local vs. global measures.
A local measure is one that involves evaluating node pairs that are aligned to each other, and has no explicit dependence on the alignment edges and thus no explicit dependence on network topology. Examples of local measures include sequence similarity and pairwise GO term similarity as described above; some local measures such as graphlet similarity (Kuchaiev et al., 2010; Malod-Dognin and Pržulj, 2015; Saraph and Milenković, 2014) and Importance (Hashemifar and Xu, 2014) include topology indirectly by pre-computing all-by-all pairwise local topological similarities between every node in one network and every node in the other.
Global measures are ones that implicitly or explicitly can be computed only on the entire alignment and have nothing to do with pairwise node similarities. The most common global measures are EC, ICS, and $S^3$, described in more detail below.

In order to more easily understand and discuss topological measures, we introduce an analogy between pairwise network alignment and the old board game of Battleship. A Battleship game consists of many holes in a board, and some pegs that are placed into the holes. In our analogy, assume $G_1$ is a “smaller” network with $n_1$ nodes and $m_1$ edges, and $G_2$ is a “larger” network with $n_2$ nodes and $m_2$ edges, and we assume that $n_1 \le n_2$; that is, $G_1$ is the smaller network in terms of number of nodes. We will furthermore depict $G_1$ as blue and $G_2$ as red. Consider Figure 1: this board has $n_2 = 6$ red holes with red edges painted between two holes if there is an edge between the two corresponding nodes in $G_2$. The smaller network $G_1$ is represented by $n_1 = 4$ blue pegs; edges between the pegs are represented by blue “laser beams” between the corresponding pegs (because laser beams don’t get tangled as pegs are moved from hole to hole). Any placement of the $n_1$ pegs into the $n_2$ holes represents an alignment between $G_1$ and $G_2$; for now we assume that each peg is placed into exactly one hole, so that there are exactly $n_2 - n_1$ empty holes. Furthermore, since mixing red and blue creates purple, we depict the alignment (far right of Figure 1) in purple: a blue peg in a red hole is purple, and a blue edge lying on top of a red one is also depicted as purple.

Fig. 1. A simple example of a network alignment. The smaller network $G_1$ (far left) has its pegs, numbered 1–4, and edges (“laser beams”) depicted in blue; the larger network $G_2$ (middle) has its holes and painted edges depicted in red. One possible alignment (in this case the “visually obvious” one) is depicted at the far right. Here, aligned nodes and edges are depicted as purple; unaligned laser beams from $G_1$ are still blue, and unaligned holes and edges from $G_2$ are still red. As stated in the text, in an alignment figure like the one on the right, the number of edges in $G_1$ is always $m_1$ = (purple + blue edges), and the number of edges in $G_2$ is always $m_2$ = (purple + red edges). Thus, from the figure, it can be easily seen that $EC = 3/5$ and $S^3 = 3/6$ (where 6 is the total number of edges visible across all colors on the subgraph induced by the alignment); also $ICS = 3/4$, since there are 4 edges induced in $G_2$ by the alignment (ie., by purple nodes). The purple network is called the Common Subgraph, and it can consist of several connected components. In this case there is only one Common Connected Subgraph consisting of 4 nodes and 3 edges.

EC, ICS, $S^3$. We can now define some edge-based topological measures based on this analogy. The fraction of laser beams that lie on top of painted edges is called the EC (Kuchaiev et al., 2010). The numerator of EC is the number of (purple) edges that are aligned between the two networks, call it $AE$ (an integer), while the denominator is $m_1$; note that since at most $m_1$ edges can be aligned, the value $EC = AE/m_1$ is always less than or equal to 1. The authors of MAGNA (Saraph and Milenković, 2014) noted that EC is asymmetric: in particular, if $n_1 = n_2$ then we can “turn the board upside down”, swapping the roles of pegs and holes. In that case, the EC changes because $G_1$ and $G_2$ are swapped: in particular, the numerator is always the number of aligned edges $AE$, but the denominator switches from $m_1$ to $m_2$.

The authors of MAGNA fixed the asymmetry of EC by introducing the Symmetric Substructure Score or $S^3$. Consider the rightmost section of Figure 1, which depicts a proposed alignment. In our analogy, if we “look down” on the alignment from above, we can see four different types of edges. There are: (i) $AE$ aligned (purple) edges; (ii) $UE_1$ unaligned (blue) edges from $G_1$; (iii) $UE_2^{in}$ unaligned (red) edges in $G_2$ induced between purple nodes; and (iv) $UE_2^{out}$ unaligned (red) edges outside the alignment (ie., not induced between purple nodes). Note that the following equations always hold: $m_1 = AE + UE_1$ and $m_2 = AE + UE_2^{in} + UE_2^{out}$. Whereas $EC = AE/m_1$, $S^3$ is defined as $AE/(AE + UE_1 + UE_2^{in})$, and is thus symmetric with respect to the interchange of $G_1$ and $G_2$. Another way of saying this is that both EC and $S^3$ are rewarded for purple edges in the numerator, but EC is penalized in its denominator only for blue edges, whereas $S^3$ is penalized in its denominator for both blue edges and red edges induced by the alignment.

Another measure called ICS, the Induced Conserved Substructure (Patro and Kingsford, 2012), measures $AE$ divided by the number of painted edges that exist only between holes that have pegs in them.
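The three edge-based measures defined above can be computed directly from two edge lists and a peg-to-hole mapping. The following is a minimal sketch, not SANA's implementation: the function names are hypothetical, and the toy network `g2_edges` is an assumption chosen only so that the counts match those quoted in the Figure 1 caption (it is not a transcription of the figure). The second placement shows that ICS can reach 1 while aligning fewer edges.

```python
def edge_set(edges):
    """Store undirected edges as frozensets so (u, v) and (v, u) compare equal."""
    return {frozenset(e) for e in edges}


def edge_measures(g1_edges, g2_edges, alignment):
    """Return (EC, ICS, S3) for a 1-to-1 alignment given as a dict peg -> hole.

    Assumes G1 has at least one edge and that the alignment is injective.
    """
    g1, g2 = edge_set(g1_edges), edge_set(g2_edges)
    # Image of every G1 edge (laser beam) under the alignment.
    mapped = {frozenset(alignment[u] for u in e) for e in g1}
    aligned = len(mapped & g2)                    # AE: purple edges
    filled = set(alignment.values())              # holes that contain a peg
    induced = sum(1 for e in g2 if e <= filled)   # AE + UE2_in
    unaligned_g1 = len(g1) - aligned              # UE1: blue edges
    ec = aligned / len(g1)
    ics = aligned / induced if induced else 0.0
    s3 = aligned / (unaligned_g1 + induced)       # AE / (AE + UE1 + UE2_in)
    return ec, ics, s3


# Hypothetical toy data: 4 pegs with 5 edges, 6 holes with 6 painted edges.
g1_edges = [(1, 4), (2, 3), (1, 2), (3, 4), (1, 3)]
g2_edges = [("a", "d"), ("a", "b"), ("b", "c"), ("b", "d"), ("e", "f"), ("c", "e")]

obvious = {1: "a", 2: "b", 3: "c", 4: "d"}
sparse = {1: "a", 4: "d", 2: "e", 3: "f"}

print(edge_measures(g1_edges, g2_edges, obvious))  # (0.6, 0.75, 0.5)
print(edge_measures(g1_edges, g2_edges, sparse))   # (0.4, 1.0, 0.4)
```

Computing the number of induced edges once and reusing it for both ICS and $S^3$ mirrors the identity $AE + UE_2^{in}$ = (painted edges between filled holes).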
ICS has the significant disadvantage that it can be maximized by finding a network alignment that minimizes the number of edges between filled holes (Saraph and Milenković, 2014; Vijayan et al., 2015; Mamano and Hayes, 2017), which can hardly be said to be a good alignment. Consider again Figure 1. The reason ICS is a bad measure is that we could make it equal to $2/2$, ie. 1, by moving node 2 to align with e and 3 to align with f; then there would be 2 purple edges (a-1 to d-4, and e-2 to f-3) and no red edges induced by the alignment on $G_2$, even though there would be 3 blue edges (1-2, 4-3, and 1-3) unaligned from $G_1$. Thus there exists an alignment with $ICS = 1$ even though it only exposes 2 edges of common topology, which is less common topology than is discovered by maximizing either EC or $S^3$. This demonstrates the general principle that choosing the right objective function is crucial to getting good alignments.

Footnote: EC is variously called Edge Coverage, Edge Correspondence, or Edge Correctness by various authors.

Graphlets (Pržulj et al., 2004a,b) are small, connected, induced subgraphs on a larger graph. They have myriad uses, such as quantifying global topological structure (Pržulj et al., 2004b; Yaveroğlu et al., 2014). Enumerating graphlets in a large graph is an NP-hard problem and much work has gone into heuristics to make their enumeration more efficient. SANA uses ORCA (Hočevar and Demšar, 2014) to exhaustively enumerate graphlets in a network. By computing an orbit degree vector (Milenković and Pržulj, 2008), one can create a local measure that compares the orbit degree vectors of two nodes (one from each network); that local measure can then be used as an objective to guide the alignment. GRAAL (Kuchaiev et al., 2010) was the first to use orbit degree vectors, and SANA uses the exact same mechanism. However, as networks grow larger, the exhaustive enumeration of their graphlets is becoming very expensive. For example, ORCA takes more than 24 hours to compute the orbit degree vectors when aligning the 2018 BioGRID (Chatr-Aryamontri et al., 2017) networks of H. sapiens and S. cerevisiae. Instead, we intend to move SANA towards statistical sampling of graphlets, which can be accomplished far faster and produce results with low frequency error and high confidence (see for example Rossi et al. (2017); Yang et al. (2018); Hasan et al. (2017)).
We believe that one of the major outstanding questions in network alignment is the design of good topological objective functions. While most measures that currently exist have been shown to correlate with interesting biological information, none have been shown to be substantially better than any other in terms of recovering relevant biology. For example, while $S^3$ is symmetric and can thus be considered a more aesthetically pleasing measure from a mathematical standpoint, it’s by no means clear that it actually produces better correlations with biology than EC. And while graphlets have been shown to correlate with biological information (Kuchaiev et al., 2010; Malod-Dognin and Pržulj, 2015; Davis et al., 2015), it is not clear that we know the best way to use them to recover the greatest amount of relevant biological information (cf. Section 3.1, especially Table 4). In general, the design of good topological objective functions is a wide-open area of research that deserves to be explored. SANA, with its speed and accuracy, is an ideal playground for exploring objective functions.

Footnote: In the GRAAL paper we used the term “graphlet degree vector” but it’s more correctly called an “orbit degree vector” because it’s a vector of orbit counts, not graphlet counts.

To explain what we mean by experimenting with objective functions, consider Figure 2. There are three orthogonal components to network alignment: (1) a (possibly vague) scientific or informational goal G; (2) an objective function M created by the user that attempts to formally encode G; and (3) an alignment algorithm S that builds an alignment trying to optimize M. In sequence alignment, the three orthogonal components are clearly delimited: the substitution/indel cost matrix encodes the goal the user wants, and tools like BLAST (Camacho et al., 2009) quickly find (near-)optimal solutions. Practitioners can use BLAST without having to understand the details of how it works. It is a trusted tool, like a C++ compiler is to a developer, or a linear solver to a scientist solving a linear system; practitioners iterate the familiar edit-compile-debug loop, gaining knowledge from the feedback process until they are satisfied that they have achieved their goal. Unfortunately, this edit-compile-debug loop is virtually impossible in the network alignment arena, due to (i) the lack of an algorithm fast enough to perform effective edit-compile-debug loops, (ii) the lack of a generally-accepted “gold standard” of network alignment, and (iii) the lack of a clear separation of the goal, its formalized objective, and the alignment tool. SANA fixes the first two; the third is a matter of scientific culture in the network alignment community that we hope to influence by spreading the use of SANA in conjunction with the process depicted in Figure 2.
The Software Development Cycle
1. Edit source of program P to implement ideas/changes/fix bugs so that it implements your science goal.
2. Compile P: create correct, efficient executable E(P) implementing P at machine level.
3. Run E(P), producing output.
4. Evaluate output, decide if P did what you wanted or expected.
5. Think how to modify P to better obtain your science goal.
6. Go back to step 1 (or possibly change science goal).

Proposed Alignment Objective Development Cycle
1. Edit objective function F to implement ideas/changes/fix bugs so that it implements your science goal.
2. Create an efficient algorithm S that optimizes the objective F(A, G1, G2) across all possible alignments A.
3. Run S(F, G1, G2), producing alignment A.
4. Evaluate alignment A, decide if F did what you wanted or expected.
5. Think how to modify F to better obtain your science goal.
6. Go back to step 1 (or possibly change science goal).

Fig. 2. Comparison of the standard software development cycle (left), and proposed cycle for developing new objective functions for alignment (right). Red highlights the step that should be entirely automated, requiring no effort on the user’s part.
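The separation between the objective F and the search algorithm S in the cycle above can be illustrated with a small sketch in which the optimizer accepts any objective as a plain function. This is an illustration only and not SANA's code: the names `anneal` and `edge_coverage`, the exponential temperature decay, the default temperatures, and the from-scratch re-evaluation of the objective at every step are all assumptions made to keep the example short.

```python
import math
import random


def edge_coverage(alignment, g1_edges, g2_edges):
    """Example objective F(A, G1, G2): fraction of G1 edges mapped onto G2 edges."""
    g2 = {frozenset(e) for e in g2_edges}
    hits = sum(1 for u, v in g1_edges
               if frozenset((alignment[u], alignment[v])) in g2)
    return hits / len(g1_edges)


def anneal(objective, g1_nodes, g1_edges, g2_nodes, g2_edges,
           steps=5000, t_initial=1.0, t_final=1e-3, seed=0):
    """Generic search S(F, G1, G2): maximize any objective over 1-to-1 alignments.

    Requires len(g1_nodes) <= len(g2_nodes).
    """
    rng = random.Random(seed)
    pegs = list(g1_nodes)
    holes = list(g2_nodes)
    rng.shuffle(holes)            # random start: peg k sits in holes[k]
    n1 = len(pegs)

    def as_dict():
        return dict(zip(pegs, holes[:n1]))

    current = objective(as_dict(), g1_edges, g2_edges)
    best, best_score = as_dict(), current
    decay = math.log(t_initial / t_final) / steps
    for step in range(steps):
        temperature = t_initial * math.exp(-decay * step)
        i = rng.randrange(n1)           # a peg to move
        j = rng.randrange(len(holes))   # another peg (swap) or an empty hole (move)
        if i == j:
            continue
        holes[i], holes[j] = holes[j], holes[i]
        candidate = objective(as_dict(), g1_edges, g2_edges)
        delta = candidate - current
        # Always accept improvements; accept bad moves with a probability
        # that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
            if current > best_score:
                best, best_score = as_dict(), current
        else:
            holes[i], holes[j] = holes[j], holes[i]   # undo the move
    return best, best_score


# Hypothetical toy input: a 4-node path aligned to a 6-node cycle.
g1_nodes, g1_edges = [1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]
g2_nodes = list("abcdef")
g2_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "a")]
best, score = anneal(edge_coverage, g1_nodes, g1_edges, g2_nodes, g2_edges)
print(best, score)
```

Swapping in a different objective only requires passing another function with the same three arguments; nothing in `anneal` needs to change, which is the point of step 2 in the proposed cycle.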
It may help here to (re-)state the obvious: the whole point of network alignment is to align networks based upon their network topology. This is a desirable goal because there is a strong belief that the topology of a network is somehow related to its function. For example, we believe that humans and chimpanzees are very close relatives, taxonomically speaking. If there is a particular protein $h_0$ in humans that performs a certain function by interacting with seven other proteins $h_1, h_2, \ldots, h_7$, then it is quite likely that there is a very similar protein $c_0$ in chimpanzees that also interacts with (close to) seven proteins $c_1, c_2, \ldots, c_7$ to perform virtually the same function. Another way of saying this is that the network topology of the protein-protein interaction networks of human and chimp are likely to be very similar in the vicinity of $h_0$ and $c_0$, respectively. As such, a natural network alignment between human and chimp should contain the ordered pairs $(h_0, c_0), (h_1, c_1), \ldots, (h_7, c_7)$. If the networks of interactions around $h_0$ and $c_0$ are in fact similar, then any network alignment algorithm worth its mettle, optimizing an objective that highlights such network similarities, should include the above pairs with high likelihood.

The problem, at least in the research area of protein-protein interaction (PPI) networks, is that the data on current PPI networks is extremely incomplete in terms of enumerating the edges in the PPI networks. For example, as of 2018, the most complete PPI network is that of S. cerevisiae, and it may be only about 50% complete; the human PPI network is probably less than 10% complete (Vidal, 2016); other species are far less complete still. For instance, we’d expect most mammals to have about the same number of interactions in their PPI networks, and yet the 2018 BioGRID Human network has almost 300,000 interactions, but mouse and rat have only 38,000 and 5,000 interactions listed, respectively. If Human is only 10% complete and currently contains 300,000 interactions, then we may expect the complete interactome to have over 1 million interactions. By this measure, mouse and rat are at most a few percent, and well less than one percent complete, respectively. Here’s the crux: if we are missing 90% or more of the edges in most mammal PPI networks, no network alignment algorithm based solely upon network topology has any hope of providing good alignments. This is the state of affairs in PPI network alignment.

Thus, it is no surprise that virtually every network alignment algorithm currently in existence must rely on using sequence similarity information to help give network alignments that show decent functional similarity. However, if network alignment is of any worth whatsoever, the use of sequence similarity should be viewed only as a temporary crutch (a necessary evil) until such time as the interactions in PPI networks are more completely enumerated.

On the other hand, since protein function is defined by the shape of the folded protein, and disrupting the function of a protein can be lethal, the folded structure of a protein tends to be better conserved than its sequence (Lesk and Chothia, 1986). This in turn suggests that the network of interactions may also be better conserved than sequence. If this is the case, then network alignment may ultimately be at least as useful as sequence alignment in terms of learning about protein function. Alas, we must wait until PPI networks are far more complete than they are today to test this hypothesis.
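The completeness estimate given earlier in this section can be written out explicitly. The worked arithmetic below takes the 10% figure for human at face value and assumes, as the text does, that mouse and rat have roughly as many true interactions as human; the percentages are illustrative roundings, not measured values.

```latex
\begin{align*}
\text{complete human interactome} &\approx \frac{300{,}000}{0.10} = 3{,}000{,}000 \text{ interactions},\\
\text{mouse completeness} &\approx \frac{38{,}000}{3{,}000{,}000} \approx 1.3\%,\\
\text{rat completeness} &\approx \frac{5{,}000}{3{,}000{,}000} \approx 0.17\%.
\end{align*}
```

Even with the more conservative lower bound of 1 million interactions, the same ratios give about 3.8% for mouse and 0.5% for rat, which is consistent with "at most a few percent" and "well less than one percent".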
Given two networks with $n_1 \le n_2$ nodes, respectively, the number of possible 1-to-1 pairwise global network alignments between them is exactly $n_2!/(n_2 - n_1)!$. This is an enormous number; for example, if the two networks each have thousands of nodes (not uncommon for protein-protein interaction networks), the number of possible alignments is astronomically large: with $n_1 = n_2 = 1000$ it is already $1000! \approx 4 \times 10^{2567}$. This is an enormous search space, far larger, for example, than the number of elementary particles in the known universe, which according to Wikipedia is paltry by comparison.

The task of a network alignment algorithm is to search through this enormous space of possible alignments, looking for ones that score well according to one or more of the measures described in Section 1.2. Since network alignment is an NP-complete problem, all such algorithms must use heuristics to navigate this enormous search space. Search methods abound; several good review papers exist (Clark and Kalita, 2014; Faisal et al., 2015b; Milano et al., 2017; Guzzi and Milenković, 2017); for an extensive comparison specifically showing that SANA outperforms about a dozen of the best existing algorithms, see Mamano and Hayes (2017). SANA is virtually unique in that it was designed from the start to be able to optimize any objective function, including the objective functions introduced by other researchers; a preliminary report shows that SANA outperforms over a dozen other algorithms at optimizing their own objective functions (Kanne and Hayes, 2017).

We believe that, in order to be of general use, a network alignment algorithm must satisfy the following properties:
Speed, if so desired.
SANA can produce better alignments in minutes than most other aligners can in hours. This is useful for many reasons: to perform test alignments; to experiment with objective functions; to perform multiple alignments of the same pair of networks in order to see which parts of the alignment, if any, come out the same each time (more on this later).
High quality of results, if so desired.
SANA’s primary user-tunable parameter is the amount of time the user wishes to wait. While SANA can produce better alignments in one minute on a laptop than many existing algorithms can do given hours of CPU, users can also tell SANA to spend any amount of time improving the alignment, such as 5 minutes, 3 hours, or a week. SANA generally produces better scoring alignments with longer run times, although we generally see a point of diminishing returns beyond a few hours.
It should be simple to use.
By this we mean that if there are any algorithmic parameters that crucially control the quality of the result, those parameters should be tuned automatically without user input; in other words, the user should not need to be an expert on the algorithm in order to understand how to use it. The primary internal parameter controlling the anneal is the temperature schedule, and by default SANA spends a minute or two automatically finding a near-optimal temperature schedule before starting the anneal. (Another algorithm called SailMCS (Larsen et al., 2016) also uses simulated annealing but fails to automatically determine a good temperature schedule, and so SANA produces alignments that are far superior to those of SailMCS (Kanne and Hayes, 2017).)
Providing confidence estimates on the quality of the alignment.
For example, if some set of pegs $P_1$ always end up in the same holes every time SANA is run and another set of pegs $P_2$ end up in different holes each time SANA is run, this suggests the set $P_1$ is confidently aligned, whereas we should be suspicious about the alignment of pegs in $P_2$. Few algorithms are capable of this sort of confidence testing of the alignment; SANA, on the other hand, is so fast that it is easy to look for such core alignments (Milenković et al., 2010); cf. Section 3.1.

Footnote (on the NP-completeness of network alignment): For those who are inclined to graph theory, the proof is trivial: finding a network alignment with an EC of exactly 1 is equivalent to solving the subgraph isomorphism problem.
Flexible with objective functions.
SANA has over a dozen pre-programmed objective functions that users can experiment with. In addition, users can supply SANA with externally computed similarity matrices, either node-to-node, or edge-to-edge. Finally, we have tried to make the code base of SANA clear so that anybody familiar with C++ can program new objective functions easily.
Able to handle nodes that have ASCII names rather than only allowing integers as node identifiers.
To a programmer, creating a mapping between ASCII names and integers is easy. To non-programmers this is not so easy, and many aligners have the inexcusable fault of insisting that nodes are named by sequential integers. SANA does this internally but allows users to use whatever names they want to identify nodes.
Available to plug in to existing popular tools such as Cytoscape.
SANA is available in the Cytoscape App store.
Able to handle multiple input graph formats.
Currently SANA only natively accepts networks in edge list format and LEDA .gw format. The former is a line-by-line list of edges (two nodes from the same network listed on one line), while the latter is a rather deprecated format used by an old version of LEDA (Mehlhorn and Naher, 1999). However, we do provide a converter called createEdgeList that outputs our edge list format given any of the following input formats: GML, XML, graphML, LEDA, CSV, and LGF.
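Because the edge list format is so simple, it can be checked before handing a file to an aligner. The sketch below is a hypothetical helper, not part of SANA or createEdgeList: it enforces the rules stated in the text (two columns per line, no self-loops, no duplicate edges, arbitrary node order on a line). Skipping blank lines is an assumption, and the node-type option mentioned earlier is ignored.

```python
def read_edge_list(lines):
    """Parse two-column edge-list text into a set of undirected edges.

    Raises ValueError on malformed lines, self-loops, or duplicate edges.
    """
    edges = set()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are skipped
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns, got {len(fields)}")
        u, v = fields
        if u == v:
            raise ValueError(f"line {number}: self-loop on {u}")
        edge = frozenset((u, v))  # node order on a line is arbitrary
        if edge in edges:
            raise ValueError(f"line {number}: duplicate edge {u} {v}")
        edges.add(edge)
    return edges


# Hypothetical node names; any whitespace-free ASCII names would do.
example = ["P1 P2", "P2 P3", "P3 P1"]
print(len(read_edge_list(example)))  # 3
```

Passing an open file works as well, for instance `read_edge_list(open("network.el"))`, where the file name is only a placeholder.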
SANA shares one important aspect with a few other aligners including MAGNA (Saraph and Milenković, 2014; Vijayan et al., 2015) and OptNetAlign (Clark and Kalita, 2015): it is a randomized search algorithm. Like these other algorithms, SANA starts with a random alignment and then starts to move pegs around between holes; each time it tries to swap or move pegs around, it asks if the objective function has gotten better or not. As time progresses, the alignment gets better according to the objective function. If the objective function is an easy one to optimize, SANA will quickly find the optimal or near-optimal alignment (Mamano and Hayes, 2017; Kanne and Hayes, 2017); in harder cases it will simply find better-and-better solutions as it is given more time.

The fact that SANA intentionally injects randomness has some surprising positive aspects. In particular, if there exist highly similar regions between the two networks $G_1$ and $G_2$, SANA is likely to find them and align them identically every time, despite starting with a different random alignment each time. If there are other parts of the networks that are dissimilar and there is no obvious way to align them correctly, those regions are likely to get aligned differently each time SANA is run. Given two regions $R_1$ in $G_1$ and $R_2$ in $G_2$, the more topologically similar $R_1$ is to $R_2$, the more likely it is that SANA will align them the same way every time it is run, independent of the randomness. Since SANA is extremely fast, and since it has this random aspect, it is relatively painless to run SANA many times on the same pair of networks and look for pairs of nodes that are aligned together frequently. We use the term core alignment to refer to pairs of nodes that are stable across many runs of SANA; the more frequently a pair of nodes is aligned together, the more confident we are that they truly belong together according to the objective function being optimized. So for example, if we run SANA 10 times on the same 2 networks and produce output files out0.align, out1.align, out2.align, ..., out9.align, then we can trivially measure the core frequencies on the Unix command line as follows:

    $ sort out?.align | uniq -c | sort -nr

The first sort puts identical lines from all 10 files side-by-side; the uniq -c counts how many copies of each line are side-by-side (thus measuring core frequency), and the final sort -nr then sorts the aligned pairs of nodes by frequency, most frequent pairs of nodes first; that is, the most confident parts of the alignment are listed first. Note that the output of the above command line is a list of pairs preceded by their frequency. Note in particular that, even though SANA is a 1-to-1 aligner per run, with multiple runs we can produce non-1-to-1 mappings between the two networks, along with a confidence level for each particular pair.

Currently, SANA aligns only two networks at a time. Each time, it produces a 1-to-1 mapping from the nodes of the smaller network to the nodes of the larger one (ie., an arrangement of pegs into holes). So technically, SANA is a global, pairwise, 1-to-1 alignment algorithm, the simplest type of global alignment algorithm. However, as we described above, SANA produces good alignments so quickly that it can be run many times on the same pair of networks in the same time it takes to run most other algorithms just once; by running SANA many times we effectively produce not only a non-1-to-1 mapping, but also a confidence estimate of each pair of nodes we output. So far as we are aware, no other algorithm produces such confidence estimates.

Furthermore, even though SANA technically aligns only 2 networks at a time, in the Appendix of this paper we describe a prototype version of multi-SANA that uses pairwise alignments to construct a multiple network alignment.

Thus, although SANA is technically only a 1-to-1 pairwise network aligner, it can effectively produce both many-to-many alignments (with confidences), and multiple alignments.
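For readers who prefer to post-process core frequencies in a script, the sort/uniq pipeline above has a direct counterpart in Python. This is a sketch under the assumption that each alignment file holds one whitespace-separated pair per line, as described for the two-column alignment output; the function name is hypothetical. Unlike the shell pipeline, it compares pairs after splitting on whitespace rather than comparing raw lines.

```python
from collections import Counter


def core_frequencies(runs):
    """Count how often each aligned pair occurs across runs.

    `runs` is an iterable of runs; each run is an iterable of lines
    (for example an open alignment file), one "peg hole" pair per line.
    Returns a list of ((peg, hole), count), most frequent first.
    """
    counts = Counter()
    for run in runs:
        for line in run:
            pair = tuple(line.split())
            if len(pair) == 2:      # ignore blank or malformed lines
                counts[pair] += 1
    return counts.most_common()


# Hypothetical output of three runs on the same pair of networks.
runs = [
    ["1 a", "2 b"],
    ["1 a", "2 c"],
    ["1 a", "2 b"],
]
for pair, count in core_frequencies(runs):
    print(count, *pair)
```

With real output files, the same function can be called as `core_frequencies(open(f"out{i}.align") for i in range(10))`. Dividing each count by the number of runs gives a per-pair confidence between 0 and 1.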
Table 1 contains a sequence of Unix shell commands that will download the repo from GitHub, compile SANA, and perform your first test of SANA to ensure everything works.

The most basic run of SANA requires the user only to specify which two networks to align; in Table 1 they are the 2018 BioGRID renditions of Rattus norvegicus (the common sewer rat, aka lab rat) and the single-celled yeast Schizosaccharomyces pombe. SANA defaults to using S3 as the objective function, and 5 minutes as the amount of time to perform simulated annealing. Total run time is about 6–7 minutes including the initial computation of the temperature schedule, which we now describe.

Simulated annealing only works well if the temperature schedule is chosen carefully. We must start with a temperature high enough that moves are essentially random, so that even bad moves are frequently accepted (this keeps us from getting stuck in local optima); and then end with a temperature low enough that only good moves are accepted (to home in on the best local maximum once we’ve found its general vicinity). Empirically, we are controlling the probability of accepting a bad move, or pBad; it must start close to 1, and end close to zero. Unfortunately there’s no analytical method to compute these extremes, so the first 1–2 minutes of SANA are spent estimating the initial temperature t_initial and the final temperature t_final that give a pBad starting near 1 and ending near zero, along with t_decay, the temperature decay rate that gets us from one to the other in the allotted time (5 minutes by default).

(Footnote: We are also working on functionality to produce core alignments in one run of SANA; that functionality may exist by the time this article goes to press, accessible via the command-line option “-cores”.)

Next you will see the statement “Start execution of SANA s3”, which says SANA is finally starting the anneal, optimizing s3. After that, you’ll see updates every few seconds as SANA progresses.
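The role of the temperature schedule described above can be illustrated with a small sketch. It assumes an exponential decay from t_initial to t_final and the textbook Metropolis acceptance rule; neither is confirmed by the text as SANA's exact implementation, and all numeric values are hypothetical.

```python
import math


def temperature(fraction, t_initial, t_final):
    """Temperature after `fraction` (0..1) of the run has elapsed,
    assuming exponential decay from t_initial down to t_final."""
    t_decay = math.log(t_initial / t_final)  # decay rate over the whole run
    return t_initial * math.exp(-t_decay * fraction)


def accept_probability(delta_score, temp):
    """Metropolis rule for a maximization problem: improvements are always
    accepted; a worsening move (delta_score < 0) is accepted with
    probability exp(delta_score / temp)."""
    if delta_score >= 0:
        return 1.0
    return math.exp(delta_score / temp)


# Hypothetical numbers, chosen only for illustration.
T_INITIAL, T_FINAL = 1.0, 1e-4
BAD_MOVE = -1e-3  # a move that lowers the score slightly

p_bad_start = accept_probability(BAD_MOVE, temperature(0.0, T_INITIAL, T_FINAL))
p_bad_end = accept_probability(BAD_MOVE, temperature(1.0, T_INITIAL, T_FINAL))
```

With these toy values the same bad move is accepted almost always at the start of the run and almost never at the end, which is the behaviour that pBad is meant to track.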
The periodic updates show the update number, the elapsed time so far, the current score, some theoretical statistical values that don’t concern us here, and the sampled pBad, which should start above 0.98 and end somewhere below about 1e-6.

Once SANA is finished running, there are exactly two output files (whose names can be changed with the “-o” option). The first, sana.out, contains as its first (long) line an internal representation of the alignment, followed by some human-readable statistics; an example is in Table 2. The second file, called sana.align, contains the actual alignment in two-column format: on each line, the left column contains a node (“peg”) from G1 and the right column is the aligned node (“hole”) from G2.

The default objective function is S3; changing the objective function is easy on the command line. For example, to have SANA optimize a 50-50 combination of EC and S3, type

    ./sana -ec 0.5 -s3 0.5 -fg1 ...

To turn off S3 entirely and perform an EC-only alignment, do

    ./sana -s3 0 -ec 1 -fg1 ...

To perform an alignment that optimizes 90% Importance as defined by HubAlign (Hashemifar and Xu, 2014), 5% graphlets as used by GRAAL (Kuchaiev et al., 2010), 5% EC, and no S3, do

    ./sana -s3 0 -importance 0.9 -graphlets 0.05 -ec 0.05 ...

Note that one does not need to manually ensure that all the weights specified on the command line add to 1; if they do not, SANA will simply re-normalize them all so that they add to 1.

There are many other objective functions defined by SANA; the currently implemented ones are listed in Table 3.
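The re-normalization of objective weights mentioned above can be pictured with a short sketch. It assumes the natural reading, that each weight is divided by the sum of all weights; the function name and the example weights are hypothetical and are not taken from SANA's source code.

```python
def normalize_weights(weights):
    """Scale non-negative objective weights so that they add to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


# Hypothetical weights that do not add to 1 (they add to 4).
requested = {"ec": 1.0, "s3": 1.0, "importance": 2.0}
normalized = normalize_weights(requested)
```

Under this assumption, asking for weights 1, 1 and 2 is equivalent to asking for 0.25, 0.25 and 0.5.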
As a part of our first publication on SANA (Mamano and Hayes, 2017), we wanted to automate the process of directly comparing against many other existing aligners. Thus, the external source code of over a dozen existing aligners was directly incorporated into SANA so that they can be called from the SANA command line. This was done to ensure consistent calling conventions to these other aligners during our comparisons. These other methods can be called from the SANA command line using the -method argument. In the SANA repo, these other aligners are in the directory wrappedAlgorithms; see the online SANA documentation for more details. The other aligners currently incorporated into SANA are LGRAAL (Malod-Dognin and Pržulj, 2015), MAGNA++ (Vijayan et al., 2015), HubAlign (Hashemifar and Xu, 2014), WAVE (Sun et al., 2015), NETAL (Neyshabur et al., 2013), MIGRAAL (Kuchaiev and Pržulj, 2011), GHOST (Patro and Kingsford, 2012), PISWAP (Chindelevitch et al., 2013), OptNetAlign (Clark and Kalita, 2015), SPINAL (Aladağ and Erten, 2013), GREAT (Crawford and Milenković, 2015), NATALIE 2.0 (El-Kebir et al., 2011), GEDEVO (Ibragimov et al., 2013), CytoGEDEVO (Malek et al., 2016), BEAMS (Alkan and Erten, 2014), HGRAAL (Milenković et al., 2010), and PINALOG (Phan and Sternberg, 2012).

(Footnote: If you are an author of one of these aligners and notice that SANA is not using your algorithm optimally, feel free to contact us with any corrections.)

Table 1. Getting started with SANA on the Unix command line. We first clone the repo from GitHub, then “make” SANA, then run it on the two smallest BioGRID 2018 networks: R. norvegicus and S. pombe. We then look at the output file sana.out, which contains scores and other useful information, as well as the actual alignment file sana.align. SANA has many command-line options; type “../sana -h | less” to see a long list of them.
As shown in Figure 2, SANA can be used to experiment withobjective functions; we believe that such experimentation is oneof the most important but apparently under-appreciated aspectsof the science of network alignment. Here we describe one suchexperiment with a very well-defined scientific goal.
Consider a set of gene-microRNA (µRNA) networks (Tokar et al., 2017), one network for each species. These networks are bipartite, meaning that genes interact with microRNAs, but neither genes nor microRNAs interact with their own type. Thus, when aligning two gene-µRNA networks, we wish to align genes from one network to genes in the other, and µRNAs in one to µRNAs in the other, but we should never align a gene to a µRNA or vice versa. In essence, the nodes have two types, and we must provide a type-specific network alignment.

At first, SANA did not have the functionality to provide a typed-node alignment. (It does now, using the -nodes-have-types argument, in which case we assume that the first column in the edge list is one type, and the second column is the other type. Only two types are supported at the moment.) The question was: how do the various topological objective functions compare in their ability to automatically align types correctly, given that typing is not enforced by the alignment algorithm?

Referring to Figure 2, the scientific goal is clear: maximize the fraction of nodes that are aligned to like-type nodes in the other network. The question is now, which topological objective function best achieves this scientific goal?
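The scientific goal stated above, the fraction of aligned pairs whose two nodes have the same type, is straightforward to measure once node types are known. The sketch below is a minimal illustration; the function name, the type labels and the toy node names are all hypothetical.

```python
def like_type_fraction(alignment, type_in_g1, type_in_g2):
    """Fraction of aligned (peg, hole) pairs whose nodes share a type.

    `alignment` is a list of (node_in_G1, node_in_G2) pairs; the two
    dictionaries map each node name to its type label.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for peg, hole in alignment if type_in_g1[peg] == type_in_g2[hole]
    )
    return matches / len(alignment)


# Hypothetical toy data: two genes and one microRNA in each network.
types_g1 = {"g1": "gene", "g2": "gene", "m1": "microRNA"}
types_g2 = {"h1": "gene", "h2": "gene", "n1": "microRNA"}
aligned = [("g1", "h2"), ("g2", "n1"), ("m1", "h1")]
fraction = like_type_fraction(aligned, types_g1, types_g2)
```

In this toy alignment only the pair (g1, h2) is gene-to-gene; the other two pairs mix a gene with a microRNA, so the like-type fraction is one third.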
We received 535 networks directly from one of the authors of Tokar et al. (2017). We chose 1,000 pairs of networks at random out of the $\binom{535}{2} = 142{,}845$ possible pairs of networks. For each pair of networks, we tested the following objective functions for their ability to correctly align nodes of like type to each other when this was not enforced: EC, S3, Importance (Hashemifar and Xu, 2014), GRAAL-type graphlet orbit signatures (Milenković and Pržulj, 2008; Kuchaiev et al., 2010), and LGRAAL-type graphlet orbit signatures (Malod-Dognin and Pržulj, 2015). To further test the dependence on runtime, we ran SANA on all the above objectives for all 1,000 network pairs for runtimes of 1 and 4 minutes. Finally, to look at the frequency of core alignments, we performed each of the above runs 5 times. The results are in Table 4.

One column of great interest is the “mix” column, which counts the number of times, out of the approximately 30 million pairs of aligned nodes, in which a gene from one network was aligned to a µRNA in the other network, which is the kind of mis-typed node alignment we are trying to avoid. The rows are sorted best-to-worst by this measure, in each of the 1-minute and 4-minute sub-tables. As we can see, the EC objective scores best at avoiding this kind of mis-typed alignment. In the 1-minute runs, EC aligns unlike-typed node pairs in only 0.65% of cases; S3 is a close second, mis-typing just under 1% of the aligned pairs of nodes. In contrast, HubAlign’s Importance measure (Hashemifar and Xu, 2014) is more than 20 times worse in terms of incorrectly aligning nodes of different types, doing so in about 15% of aligned pairs of nodes, while both graphlet measures fare the worst, aligning unlike-type nodes in over 20% of cases.

Even more interesting are the 4 minute runs, in which EC cuts its mis-typed node alignment in half, down to about 0.3% of aligned pairs, while all other measures fail to improve their “mix” column with the longer runtime.

Table 2. The sana.out file (whose name can be changed using the -o command-line option) contains information about the input networks (nodes, edges, connected components) and an analysis of the alignment (various measures applied to the entire alignment, and also applied to the common connected subgraphs).

Table 3. Measures accepted by SANA on the command line. Note that “Name” means “command-line option”, so for example to give ec a weight of 0.5, use “-ec 0.5” on the SANA command line.

Name             Description
s3               Symmetric Substructure Score (Saraph and Milenković, 2014)
ec               Edge Coverage/Correspondence/Correctness (Kuchaiev et al., 2010)
ics              Induced Conserved Structure (Patro and Kingsford, 2012)
graphlet         Orbit Degree Vector (ODV) Similarity (Milenković and Pržulj, 2008; Kuchaiev et al., 2010)
graphletlgraal   LGRAAL-normalization of ODV sim (Malod-Dognin and Pržulj, 2015)
go               Mean ResnikMax GO similarity (Resnik, 1995; Ashburner et al., 2000)
NetGO            Network-alignment-based GO similarity (Hayes and Mamano, 2017)
wec              Weighted EC (Sun et al., 2015)
esim             External file defining node-pair similarities
sequence         BLANT-based sequence similarities (Camacho et al., 2009)
lccs             Largest Common Connected Subgraph (Kuchaiev et al., 2010)
nc               Node Correctness (if known, defines the exact alignment)
spc              Shortest Path Conservation (Mamano and Hayes, 2017)
edgeCount        degree difference
edgeDensity      relative degree difference
importance       HubAlign’s Importance (Hashemifar and Xu, 2014)
nodeDensity      local node density
ewec             External edge-based similarity matrix, e.g., edge-graphlet similarity (Crawford and Milenković, 2015)
sequence         BLAST bit scores based on protein sequence similarity (Camacho et al., 2009)

Table 4. Table of results when testing various objective functions (leftmost column) for their ability to correctly align genes-to-genes, and µRNAs-to-µRNAs, when aligning a pair of gene-µRNA networks (Tokar et al., 2017). Objectives tested were EC (Kuchaiev et al., 2010), S3 (Saraph and Milenković, 2014), Importance (Hashemifar and Xu, 2014), graphlet (Milenković and Pržulj, 2008; Kuchaiev et al., 2010), and graphlet-LGRAAL (Malod-Dognin and Pržulj, 2015). The columns are as follows. pairs: total number of pairs of nodes aligned in all 1,000 network pairs that were run 5 times each. Gene: number of pairs in which a gene was correctly aligned to another gene. mix: number of pairs in which a gene in one network was aligned to a µRNA in the other. RNA: number of pairs in which a µRNA was aligned to another µRNA. coreFreq(XY) > 1: the number of aligned pairs that had a core frequency greater than 1 (indicating the objective function strongly prefers to align this pair of nodes together) for type-pairs GG, MG, and MM.

1-minute runs
objective         pairs      Gene       mix       RNA      coreFreq(GG)>1   coreFreq(MG)>1   coreFreq(MM)>1
ec                30424880   29953074   198792    273014   1268806          570              3169
s3                30424880   29986047   284470    154363   1093307          2947             688
importance        30241594   25434876   4658345   148373   651969           114137           1386
graphlet-GRAAL    30424880   24109670   6176510   138700   1902554          449738           17331
graphlet-LGRAAL   30424880   23056815   7305611   62454    1718519          584735           7086

4-minute runs
objective         pairs      Gene       mix       RNA      coreFreq(GG)>1   coreFreq(MG)>1   coreFreq(MM)>1
ec                30424880   30055465   97811     271604   1245103          208              5908
s3                30424880   29953309   283313    188258   1092319          3779             1508
importance        30292547   25473995   4669942   148610   652830           114815           1621
graphlet-GRAAL    30424880   24104880   6180836   139164   2208583          502806           25308
graphlet-LGRAAL   30424880   23051615   7310109   63156    2090416          692752           10504

Recall that if SANA aligns the same pair of nodes together in more than one run, we say that pair is in the core alignment, because the objective function is unlikely to align two nodes together more than once by chance. Another column of great interest is thus the coreFreq(MG) > 1 column, which tells us how frequently the objective function seems to strongly prefer mis-aligning a pair of nodes of different types. Again we see that the EC measure is by far the best measure by this criterion: in the 1 minute runs, only 570 mistyped pairs appear out of 30 million (about 2 per 100,000 pairs), while the 4 minute runs cut that “error rate” by more than half (to 208), suggesting that longer runs will do a better job of correctly aligning types. Meanwhile, S3 does about 5 times worse at 1 minute (2,947 pairs) and gets worse still in the 4 minute runs (3,779 pairs), while Importance and both graphlet measures misalign orders of magnitude more typed pairs, showing a strong preference for misaligning nodes in roughly 0.4% (Importance) to 2% (graphlets) of pairs.

We conclude that the EC measure is, by far, the best available objective function for this particular purpose among those we tested.
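The percentages quoted in this discussion can be re-derived directly from the counts in Table 4. The sketch below copies a few rows from that table (pairs, mix, and coreFreq(MG) > 1) and recomputes the rates; the dictionary layout and function names are only an illustrative convenience.

```python
from math import comb

# Counts copied from Table 4, keyed by (objective, runtime in minutes):
# (pairs, mix, coreFreq(MG) > 1).
TABLE4 = {
    ("ec", 1): (30424880, 198792, 570),
    ("s3", 1): (30424880, 284470, 2947),
    ("importance", 1): (30241594, 4658345, 114137),
    ("ec", 4): (30424880, 97811, 208),
    ("s3", 4): (30424880, 283313, 3779),
}


def mix_percent(objective, minutes):
    """Percentage of aligned pairs that join a gene to a microRNA."""
    pairs, mix, _ = TABLE4[(objective, minutes)]
    return 100.0 * mix / pairs


def mistyped_core_per_100k(objective, minutes):
    """Mistyped pairs with core frequency > 1, per 100,000 aligned pairs."""
    pairs, _, core_mg = TABLE4[(objective, minutes)]
    return 1e5 * core_mg / pairs


# Number of distinct pairs that can be drawn from 535 networks.
possible_pairs = comb(535, 2)
```

For example, `mix_percent("ec", 1)` reproduces the 0.65% quoted in the text, and `mix_percent("ec", 4)` the roughly 0.3% reached in the longer runs.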
For the moment we do not hypothesize why this is the case, but empirically the result seems iron-clad. While we agree that the S3 measure is mathematically more aesthetically pleasing and would seem to be a better measure intuitively, for this particular purpose EC seems to work better. The author finds the poor performance of graphlet-based measures particularly surprising, since the author is a strong believer that graphlets are a useful tool for network analysis (see for example Hasan et al. (2017)), and graphlets have certainly demonstrated their value in other contexts (Davis et al., 2015; Yaveroğlu et al., 2014). However, these results suggest that perhaps orbit degree signatures as they are currently defined (Milenković and Pržulj, 2008; Kuchaiev et al., 2010; Malod-Dognin and Pržulj, 2015) may not be the best way to leverage graphlet-based information in the context of global pairwise network alignment.

We have described the use of SANA (Mamano and Hayes, 2017), the Simulated Annealing Network Aligner, in the context of the pairwise 1-to-1 global alignment of biological networks. SANA provides many advantages over the many other aligners currently available: as a search algorithm, it is lightning fast, producing well-scoring alignments in minutes rather than hours; it provides a large array of objective functions users may wish to experiment with, as well as the facility to add more objectives in the future; it does not require the user to know much about the internal workings of the aligner in order to use it; and it is well on the way towards being fully integrated into popular network analysis tools such as Cytoscape. We have introduced the concept of objective function experimentation (cf. Figure 2 and Section 3.1), which we believe is at the core of future developments in network alignment. SANA’s speed and effectiveness make it the ideal aligner to implement the process depicted in Figure 2.
APPENDIX
A prototype of a multiple-network-alignment version of SANA is available in the SANA GitHub repo. Simply re-compile SANA with the -DWEIGHTED option on the command line (see the Makefile), and then consult the Bourne shell script multi-pairwise.sh; running it without any arguments provides a short help message.

Questions about SANA, comments, or feature requests should be directed to the author at [email protected].

REFERENCES
Alada˘g, A. E. and Erten, C. (2013). Spinal: scalable protein interaction networkalignment.
Bioinformatics , (7), 917–924.Alkan, F. and Erten, C. (2014). Beams: backbone extraction and merge strategy for theglobal many-to-many alignment of multiple ppi networks. Bioinformatics , (4),531–539.Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis,A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M.,Rubin, G. M., and Sherlock, G. (2000). Gene Ontology: tool for the unification ofbiology. Nature Genetics , (1), 25–29.Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J. S., Bealer, K., andMadden, T. L. (2009). Blast+: architecture and applications. BMC Bioinformatics , , 421. Chatr-Aryamontri, A., Oughtred, R., Boucher, L., Rust, J., Chang, C., Kolas, N. K.,O’Donnell, L., Oster, S., Theesfeld, C., Sellam, A., et al. (2017). The biogridinteraction database: 2017 update. Nucleic acids research , (D1), D369–D379.Chen, K. and Rajewsky, N. (2007). The evolution of gene regulation by transcriptionfactors and micrornas. Nature reviews. Genetics , (2), 93.Chindelevitch, L., Ma, C.-Y., Liao, C.-S., and Berger, B. (2013). Optimizing a globalalignment of protein interaction networks. Bioinformatics , (21), 2765–2773.Clark, C. and Kalita, J. (2014). A comparison of algorithms for the pairwise alignmentof biological networks. Bioinformatics , (16), 2351–2359.Clark, C. and Kalita, J. (2015). A multiobjective memetic algorithm for ppi networkalignment. Bioinformatics , (12), 1988–1998.Consortium, T. G. O. (2008). The gene ontology project in 2008. Nucleic AcidsResearch , (suppl 1), D440–D444.Cook, S. A. (1971). The complexity of theorem-proving procedures. In Proceedings ofthe third annual ACM symposium on Theory of computing , pages 151–158. ACM.Crawford, J. and Milenkovi´c, T. (2015). Great: graphlet edge-based network alignment.In
Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conferenceon , pages 220–227. IEEE.Davidson, E. H. (2010).
The regulatory genome: gene regulatory networks indevelopment and evolution . Academic press, USA.Davis, D., Yavero˘glu, ¨O. N., Malod-Dognin, N., Stojmirovic, A., and Prˇzulj, N.(2015). Topology-function conservation in protein–protein interaction networks.
Bioinformatics , (10), 1632–1639.El-Kebir, M., Heringa, J., and Klau, G. W. (2011). Lagrangian relaxation appliedto sparse global network alignment. In IAPR International Conference on PatternRecognition in Bioinformatics , pages 225–236. Springer.Faisal, F. E., Meng, L., Crawford, J., and Milenkovi´c, T. (2015a). The post-genomic eraof biological network alignment.
EURASIP Journal on Bioinformatics and SystemsBiology , (1), 3.Faisal, F. E., Meng, L., Crawford, J., and Milenkovi´c, T. (2015b). The post-genomic eraof biological network alignment. EURASIP Journal on Bioinformatics and SystemsBiology , (1), 1.Farazi, T. A., Hoell, J. I., Morozov, P., and Tuschl, T. (2013). Micrornas in humancancer. In MicroRNA Cancer Regulation , pages 1–20. Springer, Germany.Fiehn, O. (2002). Metabolomics-the link between genotypes and phenotypes. In
Functional Genomics , pages 155–171. Springer, Germany.Garey, M. and Johnson, D. (1979).
Computers and Intractability: A Guide to the Theoryof NP-Completeness . New York: W.H. Freeman, New York.Guzzi, P. H. and Milenkovi´c, T. (2017). Survey of local and global biological networkalignment: the need to reconcile the two sides of the same coin.
Briefings inbioinformatics , page bbw132.Hasan, A., Chung, P.-C., and Hayes, W. (2017). Graphettes: Constant-timedetermination of graphlet and orbit identity including (possibly disconnected)graphlets up to size 8.
PloS one , (8), e0181570.Hashemifar, S. and Xu, J. (2014). HubAlign: an accurate and efficient method forglobal alignment of proteinprotein interaction networks. Bioinformatics , (17),i438–i444.Hayes, W. B. and Mamano, N. (2017). Sana netgo: a combinatorial approach to usinggene ontology (go) terms to score network alignments. Bioinformatics , (8), 1345–1352.Hoˇcevar, T. and Demˇsar, J. (2014). A combinatorial approach to graphlet counting. Bioinformatics , (4), 559–565.Ibragimov, R., Malek, M., Guo, J., and Baumbach, J. (2013). Gedevo: an evolutionarygraph edit distance algorithm for biological network alignment. In OASIcs-OpenAccess Series in Informatics , volume 34. Schloss Dagstuhl-Leibniz-Zentrumfuer Informatik.Jaenicke, R. and Helmreich, E. (2012).
Protein-protein interactions , volume 23.Springer Science & Business Media, Germany.Junker, B. H. and Schreiber, F. (2011).
Analysis of biological networks , volume 2. JohnWiley & Sons, USA.Kanne, D. P. and Hayes, W. B. (2017). Sana: separating the search algorithm from theobjective function in biological network alignment, part 1: Search.Karlebach, G. and Shamir, R. (2008). Modelling and analysis of gene regulatorynetworks.
Nature reviews. Molecular cell biology, (10), 770.
Kotlyar, M., Pastrello, C., Sheahan, N., and Jurisica, I. (2015). Integrated interactions database: tissue-specific view of the human and model organism interactomes. Nucleic acids research, (D1), D536–D541.
Kuchaiev, O. and Pržulj, N. (2011). Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics, , 1390–1396.
Kuchaiev, O., Milenkovi´c, T., Memiˇsevi´c, V., Hayes, W., and Prˇzulj, N. (2010).Topological network alignment uncovers biological function and phylogeny.
Journal of The Royal Society Interface , (50), 1341–1354.Larsen, S. J., Alkærsig, F. G., Ditzel, H. J., Jurisica, I., Alcaraz, N., and Baumbach, J.(2016). A simulated annealing algorithm for maximum common edge subgraphdetection in biological networks. In Proceedings of the 2016 on Genetic andEvolutionary Computation Conference , pages 341–348. ACM.Lesk, A. and Chothia, C. (1986). The response of protein structures to amino-acidsequence changes.
Phil. Trans. R. Soc. Lond. A , (1540), 345–356.Malek, M., Ibragimov, R., Albrecht, M., and Baumbach, J. (2016). Cytogedevoglobalalignment of biological networks with cytoscape. Bioinformatics , (8), 1259–1261.Malod-Dognin, N. and Prˇzulj, N. (2015). L-graal: Lagrangian graphlet-based networkaligner. Bioinformatics .Mamano, N. and Hayes, W. B. (2017). Sana: Simulated annealing far outperforms manyother search algorithms for biological network alignment.
Bioinformatics (Oxford,England) , , 21562164.Mehlhorn, K. and Naher, S. (1999). Leda: A platform for combinatorial and geometriccomputing.
Cambridge University Press, United Kingdom.Milano, M., Guzzi, P. H., Tymofieva, O., Xu, D., Hess, C., Veltri, P., and Cannataro, M.(2017). An extensive assessment of network alignment algorithms for comparisonof brain connectomes.
BMC bioinformatics , (6), 235.Milenkovi´c, T. and Prˇzulj, N. (2008). Uncovering biological network function viagraphlet degree signatures. Cancer Inform. , (Epub 2008 Apr 14), 257–273.Milenkovi´c, T., Ng, W. L., Hayes, W., and Prˇzulj, N. (2010). Optimal networkalignment with graphlet degree vectors. Cancer Informatics , , 121–137.Neyshabur, B., Khadem, A., Hashemifar, S., and Arab, S. S. (2013). Netal: a newgraph-based method for global alignment of proteinprotein interaction networks. Bioinformatics , (13), 1654–1662.Patro, R. and Kingsford, C. (2012). Global network alignment using multiscale spectralsignatures. Bioinformatics , (23), 3105–3114.Phan, H. T. and Sternberg, M. J. (2012). Pinalog: a novel approach to align proteininteraction networks—implications for complex detection and function prediction. Bioinformatics , (9), 1239–1245.Prescott, D. M. (2012). Cell Biology A Comprehensive Treatise V3: Gene Expression:The Production of RNA’s , volume 3. Elsevier, Amsterdam-London-New York-Oxford-Paris-Shannon-Tokyo.Prˇzulj, N., Wigle, D., and Jurisica, I. (2004a). Functional topology in a network ofprotein interactions.
Bioinformatics , (3), 340–348. Prˇzulj, N., Corneil, D. G., and Jurisica, I. (2004b). Modeling interactome: scale-free orgeometric? Bioinformatics , (18), 3508–3515.Resnik, P. (1995). Using information content to evaluate semantic similarity in ataxonomy. arXiv preprint cmp-lg/9511007 .Rossi, R. A., Zhou, R., and Ahmed, N. K. (2017). Estimation of graphlet statistics. arXiv preprint arXiv:1701.01772 .Saraph, V. and Milenkovi´c, T. (2014). MAGNA: maximizing accuracy in globalnetwork alignment. Bioinformatics , (20), 2931–2940.Sporns, O. (2010). Networks of the Brain . MIT press, USA.Sun, Y., Crawford, J., Tang, J., and Milenkovi`c, T. (2015). Simultaneous optimizationof both node and edge conservation in network alignment via WAVE. In M. Pop andH. Touzet, editors,
Algorithms in Bioinformatics , volume 9289 of
Lecture Notes inComputer Science , pages 16–39. Springer Berlin Heidelberg, Germany.Tokar, T., Pastrello, C., Rossos, A. E., Abovsky, M., Hauschild, A.-C., Tsay, M., Lu,R., and Jurisica, I. (2017). mirdip 4.1integrative database of human microrna targetpredictions.
Nucleic acids research , (D1), D360–D370.Van El, C. G., Cornel, M. C., Borry, P., Hastings, R. J., Fellmann, F., Hodgson, S. V.,Howard, H. C., Cambon-Thomsen, A., Knoppers, B. M., Meijers-Heijboer, H., et al. (2013). Whole-genome sequencing in health care: recommendations of the europeansociety of human genetics. European Journal of Human Genetics , (6), 580.Vidal, M. (2016). How much of the human protein interactome remains to be mapped?Vijayan, V. and Milenkovi´c, T. (2017). Aligning dynamic networks with dynawave. Bioinformatics , (10), 1795–1798.Vijayan, V., Saraph, V., and Milenkovi´c, T. (2015). Magna++: Maximizing accuracyin global network alignment via both node and edge conservation. Bioinformatics , (14), 2409–2411.Von Mering, C., Krause, R., Snel, B., Cornell, M., et al. (2002). Comparativeassessment of large-scale data sets of protein-protein interactions. Nature , (6887), 399.Williamson, M. P. and Sutcliffe, M. J. (2010). Protein–protein interactions.Yang, C., Lyu, M., Li, Y., Zhao, Q., and Xu, Y. (2018). Ssrw: A scalable algorithm forestimating graphlet statistics based on random walk. In International Conference onDatabase Systems for Advanced Applications , pages 272–288. Springer.Yavero˘glu, ¨O. N., Malod-Dognin, N., Davis, D., Levnajic, Z., Janjic, V., Karapandza,R., Stojmirovic, A., and Prˇzulj, N. (2014). Revealing the hidden language ofcomplex networks.
Scientific reports, , 4547.

SANA: separating the search algorithm from the objective function in biological network alignment, Part 1: Search–Supplementary Material
Dillon Kanne, Wayne B. Hayes ∗ Department of Computer Science, University of California, Irvine CA 92697-3435, USA
ABSTRACT
In this supplementary material we discuss TAME and its trianglecorrectness measure as well as give a full chart of the runtimes of allalgorithms tested.
Simulated annealing was built on the assumption that almost every move increases or decreases the net energy (score) of a system, just as moving particles in annealing metal increases or decreases the net energy of the metal. In the case of network alignment, this assumes that (almost) every change to the alignment has some measurable effect on the objective function; there should be some kind of guidance for a large fraction of moves. Some solution spaces are not like that, but are instead extremely sparse, with many scores of exactly zero interspersed with almost delta-function-like jumps in the score. In such a solution space, almost every move has no effect at all, such as moving from a zero-scoring alignment to another alignment also scoring zero. Even the “perfect” objective function, which scores some “perfect” alignment with 1 and every other alignment with 0, is useless for random search: its optimum is impossible to find because almost every move has no effect on the score, so there is no guidance for “good” and “bad” moves. In such cases, there is little hope of “converging” on a good solution since even if one finds oneself temporarily in the vicinity of a delta-function increase in the score, the very nature of the random search (with a non-zero temperature) means we are likely to wander back into zero-score “flatlands,” and spend most of our time there. We mentioned these restrictions on simulated annealing in section 2.1.1 of the main paper when introducing SANA.

With the system of moves we have chosen for SANA, one such flatland is “triangle alignment,” where the only thing the objective function cares about is the number of aligned triangles. This is the objective function used in TAME (?). With the moves we currently use in SANA, we were unable to get a good triangle alignment, for the above reasons. In this sense there can indeed be objectives that SANA may not excel at compared to a hand-coded deterministic algorithm.
We hypothesize that SANA could easily be modified to do well at aligning triangles. First, we would exhaustively list all the triangles in each graph, for which asymptotically optimal algorithms exist (?); then, we would program SANA’s “moves” to swap or move entire triangles.

[Figure 1: chart titled “TAME (Triangle Conservation)”; vertical axis: Score; horizontal axis: Network Pairs (the 28 pairs from RN-SP to DM-HS).]

Fig. 1. The chart of TAME’s comparison against SANA. It follows the same pattern as the first two columns of Figure 1 of the main publication.
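The triangle-listing step just described, and the related question of how many edges take part in any triangle at all, can be sketched as follows. This is a simple illustration based on intersecting neighbour sets, not the asymptotically optimal algorithm the text alludes to; the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def triangles(edges):
    """List every triangle {u, v, w} in an undirected simple graph."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    found = set()
    for u, v in edges:
        # Any common neighbour of u and v closes a triangle on edge (u, v).
        for w in neighbours[u] & neighbours[v]:
            found.add(frozenset((u, v, w)))
    return found


def fraction_of_edges_in_triangles(edges):
    """Fraction of distinct edges that belong to at least one triangle."""
    unique_edges = {frozenset(edge) for edge in edges}
    in_triangle = set()
    for triangle in triangles(edges):
        for pair in combinations(triangle, 2):
            in_triangle.add(frozenset(pair))
    return len(in_triangle & unique_edges) / len(unique_edges)


# Hypothetical toy graph: one triangle a-b-c plus a path c-d-e.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
toy_fraction = fraction_of_edges_in_triangles(toy_edges)
```

In the toy graph three of the five edges lie on the single triangle, so an objective that only counts conserved triangles would ignore the remaining two edges entirely.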
However, we believe this would be a waste of time because the triangles in most of the biological networks we’ve encountered involve only a minority of the edges in the networks; we find it hard to believe that any good alignment algorithm could possibly recover relevant biology by completely ignoring most of the edges in the network. In the BioGRID networks tested in this paper, the fraction of edges participating in triangles ranged from as low as 18% in the case of CElegans to as high as 92% in the case of SCerevisiae. Most of the networks have less than half of their edges in triangles, indicating that a majority of topological information is discarded when using Triangle Conservation.

Figure 1 includes the chart for SANA’s comparison against only TAME. It follows the key in Figure 1 of the main publication. The chart shown has similar graphical issues around y = 1 to NATALIE 2.0 due to the scale for the top half of the chart being very different than the bottom half of the chart.

The network alignment algorithms take a variable amount of time to align networks. Table 1 shows how long each network aligner took for each pair of networks. We performed all tests on AMD Opteron 6378 processors. We only count the time that each algorithm took to make the alignment; any overhead or preparation is not included. ModuleAlign is not included in this table because its code is too difficult to run in parallel properly and takes too long to run in series to be worthwhile. We originally ran ModuleAlign in series but forgot to measure the exact time, and rerunning it would take longer than is warranted. ModuleAlign took from around an hour to sometimes more than ten hours.

Table 1. A complete runtime summary of all comparisons. The network aligners are organized from left to right by speed. All numbers are in seconds. Some network aligners, OptNetAlign and LGRAAL, used a user-specified amount of time. A dash marks a pair for which no runtime is listed.

Pair      PROPER    Hubalign   WAVE       SANA       GHOST       LGRAAL     OptNetAlign   NATALIE 2.0   GREAT       MAGNA++
RN-SP     4.33      10.563     46.439     1200.982   218.272     3626.067   21627.105     36318.183     1782        6588.55
RN-CE     2.19      14.884     67.499     1205.878   254.168     3602.511   21594.192     36798.131     1710.79     7946.815
RN-MM     7.36      18.619     89.104     1204.306   357.098     3604.936   21597.395     37682.798     1584.07     10036.277
RN-SC     141.93    86.6       108.63     1207.698   697.682     3678.167   21620.571     -             -           37620.052
RN-AT     6.8       22.609     116.183    1202.232   494.11      3653.807   21598.76      39227.251     2464.4      12815.721
RN-DM     5.21      43.941     151.403    1202.941   761.318     3646.642   21594.732     6252.963      4954.68     23582.27
RN-HS     43.25     57.022     -          1204.202   1576.823    3718.981   21596.822     -             -           59503.783
SP-CE     3.08      18.746     86.451     1205.717   417.435     3618.937   21612.816     37658.221     5909.08     9233.363
SP-MM     4.35      23.241     118.53     1207.585   568.282     3621.64    21590.237     39494.238     5926.76     11278.529
SP-SC     32.42     54.946     142.788    1211.021   921.838     4035.292   21616.769     -             -           41411.814
SP-AT     5.71      32.811     147.993    1207.724   733.266     3790.599   21632.481     9208.076      7320.69     14183.744
SP-DM     7.8       57.022     206.362    1203.379   908.696     3757.407   21650.206     12863.58      11967.41    25254.116
SP-HS     31.28     95.469     -          1204.168   2020.892    3820.667   21644.772     -             -           61487.093
CE-MM     6.94      96.675     286.315    1208.865   934.614     3663.708   21641.083     5541.833      15797.56    12687.727
CE-SC     36.81     133.092    346.997    1211.463   1406.377    3647.845   21619.654     -             -           41783.25
CE-AT     7.46      96.456     370.909    1203.619   1203.285    3618.605   21620.149     6869.122      17449.32    15844.895
CE-DM     16.16     139.15     481.428    1201.269   1544.476    3720.279   21656.884     15684.91      19398.99    26565.248
CE-HS     49.9      235.182    -          1207.018   2979.279    3750.567   21581.82      -             -           62952.649
MM-SC     247.92    290.84     667.415    1203.642   2383.905    3890.048   21648.963     -             -           44150.335
MM-AT     22.02     269.954    770.317    1202.983   1888.38     3902.632   21632.147     10884.46      59376.92    18204.402
MM-DM     24.14     438.859    914.006    1206.956   2518.58     3687.144   21597.686     23804.774     66183.56    29844.522
MM-HS     110.46    621.394    -          1206.642   4655.939    5872.466   21618.983     -             -           65119.34
SC-AT     221.57    660.456    1131.38    1211.012   7490.71     4020.427   21760.065     -             -           48313.407
SC-DM     59.23     593.628    1495.722   1214.728   9495.281    3799.839   21689.732     -             -           62692.563
SC-HS     176.36    1080.103   -          1210.358   13686.569   5248.067   21773.541     -             -           100999.695
AT-DM     46.91     648.73     1642.546   1205.803   3653.613    3962.908   21602.428     142527.574    239526.32   33663.098
AT-HS     182.43    862.562    -          1216.563   6687.567    4718.554   21608.316     -             -           68265.577
DM-HS     96.58     1735.286   -          1223.887   12888.451   6127.944   21641.357     -             -           82430.435
Average   57.164    301.387    447.067    1207.237   2976.675    3993.096   21631.060     30721.074     30756.837   36944.974

SANA takes a few extra minutes at the start of each run to calculate the temperature schedule. If this is included, SANA takes around 1400 seconds.

GHOST took almost 2 years of CPU time (11 days on 64 cores) in some cases to make the spectral signatures for individual networks. This time is not included in this chart.

© Wayne B. Hayes 2017.