Safety of Flow Decompositions in DAGs
Shahbaz Khan, Department of Computer Science, University of Helsinki, Finland
Alexandru I. Tomescu, Department of Computer Science, University of Helsinki, Finland
Abstract
Network flows are some of the most studied combinatorial optimization problems with innumerable applications. Any flow on a directed acyclic graph (DAG) G having n vertices and m edges can be decomposed into a set of O(m) paths, with applications ranging from network routing to the assembly of biological sequences. In some of these applications, the flow decomposition corresponds to some particular data that need to be reconstructed from the flow. Thus, such applications require finding paths (or subpaths) appearing in all possible flow decompositions, referred to as safe paths.
Recently, Ma, Zheng, and Kingsford [WABI 2020] addressed a related problem in a probabilistic framework. In a follow-up work, they gave a quadratic-time algorithm based on a global criterion, for a generalized version (AND-Quant) of the corresponding verification problem, i.e., reporting if a given flow path is safe. Our contributions are as follows:
A simple characterization for the safety of a given path based on a local criterion, which can be directly adapted to give an optimal linear time verification algorithm.
A simple enumeration algorithm that reports all maximal safe paths on a flow network in O(mn) time. The algorithm reports all safe paths using a compact representation of the solution (called 𝒫_c), which is Ω(mn) in the worst case, but merely O(m + n) in the best case.
An improved enumeration algorithm where all safe paths ending at every vertex are represented as funnels using O(n + |𝒫_c|) space. These can be computed and used to report all maximal safe paths, using time linear in the total space required by funnels, with an extra logarithmic factor.
Overall we present a simple characterization for the problem leading to an optimal verification algorithm and a simple enumeration algorithm. The enumeration algorithm is improved using the funnel structures for safe paths, which may be of independent interest.
Mathematics of computing → Graph algorithms; Mathematics of computing → Network flows; Theory of computation → Network flows; Networks → Network algorithms
Keywords and phrases safety, flows, networks, funnel, directed acyclic graphs
Acknowledgements
We thank Romeo Rizzi and Edin Husić for helpful discussions. This work was partially funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO) and partially by the Academy of Finland (grants No. 322595, 328877).
Network flows are a central topic in computer science, with countless practical applications. Assuming that the flow network has a unique source s and a unique sink t, every flow can be decomposed into a collection of weighted s-t paths and cycles [12]. For DAGs such a decomposition contains only paths. Such a path (and cycle) view of a flow indicates how information optimally passes from s to t. For example, this decomposition is a key step in network routing problems (e.g. [16, 9, 15, 23]), transportation problems (e.g. [24, 25]), or in the more recent and prominent application of reconstructing biological sequences (RNA transcripts, see e.g. [26, 31, 14, 6, 30, 36], or viral quasi-species genomes, see e.g. [3, 2]).
Finding the decomposition with the minimum number of paths (and cycles) is NP-hard, even if the flow network is a directed acyclic graph (DAG) [33]. On the theoretical side, this hardness result led to research on approximation algorithms [15, 29, 27, 23, 4, 5], and FPT algorithms [17]. On the practical side, many approaches employ a standard greedy width heuristic [33], which repeatedly removes an s-t path carrying the largest amount of flow. Another pseudo-polynomial-time heuristic was recently proposed by [28], in connection with biological data, that tries to iteratively simplify the graph such that the flow decomposition problem can be solved locally at some vertices.
In the routing and transportation applications, an optimal flow decomposition indicates how to send some information from s to t, and thus any optimal decomposition is satisfactory. However, this is not the case in the prominent application of reconstructing biological sequences, since each flow path represents a reconstructed sequence: a different optimal set of flow paths encodes different biological sequences, which may differ from the real ones.
For a concrete example, consider the following application. In complex organisms, a gene may produce several RNA molecules (RNA transcripts, i.e., strings over an alphabet of four characters), each having a different abundance. Currently, given a sample, one can read the RNA transcripts and find their abundances using high-throughput sequencing [35]. This technology produces short overlapping substrings of the RNA transcripts. The main approach for recovering the RNA transcripts from such data is to build an edge-weighted DAG from these fragments and to transform the weights into flow values by various optimization criteria, and then to decompose the resulting flow into an "optimal" set of weighted paths (i.e., the RNA transcripts and their abundances in the sample) [21].
Clearly, if there are multiple optimal flow decomposition solutions, then the reconstructed RNA transcripts may not match the original ones, and thus be incorrect.
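The ambiguity described above can be illustrated with a minimal sketch. The toy network, its vertex names, and the helper `induced_flow` are all hypothetical and chosen only for illustration: two unit-flow routes cross at a vertex v, and two different sets of paths induce exactly the same flow, so neither set can be preferred from the flow alone.

```python
from collections import defaultdict


def induced_flow(weighted_paths):
    """Sum the weights of the given paths over every edge they traverse."""
    flow = defaultdict(int)
    for weight, path in weighted_paths:
        for u, v in zip(path, path[1:]):
            flow[(u, v)] += weight
    return dict(flow)


# Hypothetical toy network: two unit-flow routes cross at vertex "v".
flow = {
    ("s", "a"): 1, ("s", "b"): 1,
    ("a", "v"): 1, ("b", "v"): 1,
    ("v", "c"): 1, ("v", "d"): 1,
    ("c", "t"): 1, ("d", "t"): 1,
}

# Two candidate decompositions, each a list of (weight, path) pairs.
straight = [(1, ["s", "a", "v", "c", "t"]), (1, ["s", "b", "v", "d", "t"])]
crossed = [(1, ["s", "a", "v", "d", "t"]), (1, ["s", "b", "v", "c", "t"])]

# Both sets of paths reproduce the same flow on every edge.
both_valid = induced_flow(straight) == flow and induced_flow(crossed) == flow
```

In this sketch the subpath s, a, v appears in both decompositions, whereas a, v, c appears in only one of them, which is the kind of distinction the safety notion is meant to capture.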
Recently, Ma et al. [19] were the first to address the issue of multiple solutions to the flow decomposition problem, under a probabilistic framework. Later, they [20] solved a problem (AND-Quant), which, in particular, leads to a quadratic-time algorithm for the following problem: given a flow in a DAG, and edges e_1, e_2, ..., e_k, decide if in every flow decomposition there is always a decomposed flow path passing through all of e_1, e_2, ..., e_k. Thus, by taking the edges e_1, e_2, ..., e_k to be the edges of a path P, the AND-Quant problem can decide if a path P (i.e., a given biological sequence) appears in all flow decompositions. This indicates that P is likely part of some original RNA transcript.
We build upon the AND-Quant problem, by addressing the flow decomposition problem under the safety framework [32]. For a problem admitting multiple solutions, a partial solution is said to be safe if it appears in all solutions to the problem. For example, a path P is safe for the flow decomposition problem, if for any flow decomposition into paths {P_1, ..., P_k}, it holds that P is a subpath of some P_i. In this paper, we will consider any flow decomposition as a valid solution, not only the ones of minimum cardinality. Dropping the minimality criterion is motivated by both theory and practice. On the one hand, since finding one minimum-cardinality flow decomposition is NP-hard, we believe that finding all safe paths for them is also intractable. On the other hand, given the various issues with sequencing data, practical methods usually incorporate different variations of the minimum-cardinality criterion, see e.g. [6, 3, 2]. Thus, safe paths for all flow decompositions are likely correct for many practical variations of the flow decomposition problem.
The safety framework was introduced by Tomescu and Medvedev [32] for the genome assembly problem from bioinformatics, where e.g. a solution may be a circular walk covering every edge of a graph at least once. Safety has precursors in combinatorial optimization, under the name of persistency.
For example, persistent edges present in all maximum bipartite matchings were studied by Costa [10]. Incidentally, persistency has also been studied for the maximum flow problem, by finding persistent edges always having a non-zero flow value in any maximum flow solution ([8] acknowledges [18] for first addressing the problem); this is easily verified by checking whether the maximum flow decreases after removing the corresponding edge.
Our contributions can be succinctly described as follows:
A simple local characterization resulting in an optimal verification algorithm: We give a characterization for a safe path P based on excess flow, a local property of P.
▶ Theorem 1. A path P is w-safe iff its excess flow f_P ≥ w > 0.
The previous work [20] on AND-Quant describes a global characterization using the maximum flow of the entire graph transformed according to P, requiring O(mn) time. The excess flow is a local property of P which is thus computable in time linear in the length of P. This also directly gives a simple verification algorithm which is optimal.
▶ Theorem 2. Given a flow graph (DAG) having n vertices and m edges, it can be preprocessed in O(m) time to verify the safety of any path P in O(|P|) = O(n) time.
Simple enumeration algorithm: The characterization results in a simple algorithm for reporting all maximal safe paths using a flow decomposition of the graph.
▶ Theorem 3. Given a flow graph (DAG) having n vertices and m edges, all its maximal safe paths can be reported in O(|𝒫_f|) = O(mn) time, where 𝒫_f is some flow decomposition.
This approach starts with a candidate solution and simply uses the characterization on its subpaths in an efficient manner (a similar approach was previously used by [10, 1]). The solution of the algorithm is reported using a compact representation (referred to as 𝒫_c), whose size can be Ω(mn) in the worst case but merely O(m + n) in the best case.
Improved enumeration algorithm: The enumeration algorithm can be improved by collectively considering all the safe paths ending at a vertex, which form a unique structure called a funnel [22], that can be used to report all maximal safe paths.
▶ Theorem 4. Given a flow graph (DAG) with n vertices and m edges, the funnel structures for the left maximal safe paths ending at its vertices
(a) require total O(n + |𝒫_c|) space for all vertices,
(b) can be computed in O(m log n + n + |𝒫_c|) time,
(c) can be used to report all maximal safe paths in O(n + |𝒫_c| log n) time,
where 𝒫_c is the concise representation of the solution.
This approach exploits the structure of the safe paths (unlike the simple algorithm) to identify how maximal safe solutions overlap, and thus results in efficient algorithms. This often leads to output-sensitive enumeration algorithms, whose running time is parametrized by the size of the output (a similar approach was previously used by [7]).
We consider a DAG G = (V, E) with n vertices and m edges, where each edge e has a flow f(e) passing through it (also called its weight). Any two edges are called siblings if they share either their source vertex, or their target vertex. For each vertex u, f_in(u) and f_out(u) denote the total flow on its incoming edges and the total flow on its outgoing edges, respectively. A vertex v in the graph is called a source if f_in(v) = 0 and a sink if f_out(v) = 0. Every other vertex v satisfies the conservation of flow f_in(v) = f_out(v), making the graph a flow graph.
For a path P in the graph, |P| represents the number of its edges. For a set of paths 𝒫 = {P_1, ..., P_k} we denote its total size (number of edges) by |𝒫| = |P_1| + ... + |P_k|. Similarly, for any subgraph F of the graph, we denote its total size (number of edges) by |F|.
For any flow graph (DAG), its flow decomposition is a set of weighted paths 𝒫_f such that the flow on each edge in the flow graph equals the sum of the weights of the paths containing it. A flow decomposition of a graph can be computed in O(|𝒫_f|) = O(mn) time using the classical Ford-Fulkerson algorithm [12].
A path P is called w-safe if, in every possible flow decomposition, P is a subpath of some paths in 𝒫_f whose total weight is at least w. If P is w-safe with w > 0, we call P a safe flow path, or simply a safe path. Intuitively, for any edge e with non-zero flow, we ask where the flow on e came from. We would like to report the maximal path ending with e along which at least w > 0 units of flow always reach e (see Figure 1). A safe path is left maximal (or right maximal) if extending it to the left (or right) with any edge makes it unsafe. A safe path is maximal if it is both left and right maximal.
Figure 1
The prefix of the blue path up to e contributes at least 2 units of flow to e, as the rest may enter the path by the edges (red) with flow 4 and 2. Similarly, the suffix of the blue path from e maintains at least 1 unit of flow from e, as the rest may exit the path from the edges (red) with flow 5 and 2. Both these safe paths are maximal as they cannot be extended left or right.
The safety of a path can be characterized by its excess flow (see Figure 2), defined as follows.
Figure 2 The excess flow of a path is the incoming or outgoing flow (blue) that passes through the path despite the flow (red) leaking at its internal vertices.
▶ Definition 5 (Excess flow). The excess flow f_P of a path P = {u_1, u_2, ..., u_k} is
$$ f_P \;=\; f(u_1,u_2) \;-\; \sum_{\substack{u_i \in \{u_2,\dots,u_{k-1}\} \\ v \neq u_{i+1}}} f(u_i,v) \;=\; f(u_{k-1},u_k) \;-\; \sum_{\substack{u_i \in \{u_2,\dots,u_{k-1}\} \\ v \neq u_{i-1}}} f(v,u_i), $$
where the former and latter expressions are called the diverging and converging criterion, respectively.
The converging and diverging criteria are equivalent by the conservation of flow on internal vertices. The idea behind the excess flow f_P is that the incoming flow (or outgoing flow) of a path P forces at least f_P units of flow to pass through the whole of P, despite the flow leaking at its internal vertices (Figure 2). Since the path decomposition before entering a vertex does not affect the path decomposition on leaving a vertex, the absence of positive excess flow renders the path unsafe. The following results give the simple characterization and some additional properties of safe paths.
▶ Theorem 1.
A path P is w-safe iff its excess flow f_P ≥ w > 0.
Proof. The excess flow f_P of a path P trivially makes it w-safe for every 0 < w ≤ f_P, as it is necessitated by conservation of flow. If f_P < w, then P is surely not w-safe, because a path decomposition can have the paths entering P exit P using an out-edge of an intermediate node without passing completely through P. In the worst case, such exiting paths would leave only f_P flow (by definition) passing completely through P, which is less than w. Additionally, for a path to be safe, it must hold that w > 0. ◀
▶ Lemma 6.
For any path in a flow graph (DAG), we have the following properties:
(a) Any subpath of a w-safe path is also at least w-safe.
(b) The converging and diverging criteria for a path P = {u_1, ..., u_k} are equivalent to
$$ f_P \;=\; \sum_{i=1}^{k-1} f(u_i,u_{i+1}) \;-\; \sum_{i=2}^{k-1} f_{out}(u_i) \;=\; \sum_{i=1}^{k-1} f(u_i,u_{i+1}) \;-\; \sum_{i=2}^{k-1} f_{in}(u_i). $$
(c) Adding an edge (u, v) to the start, or the end, of a path reduces its excess flow by f_in(v) − f(u, v), or f_out(u) − f(u, v), respectively.
(d) A safe path starting (or ending) with an edge e implies that the path replacing e with its maximum weight sibling is also safe.
(e) Two safe paths cannot merge at an intermediate vertex (or vertices) and then diverge.
Proof. (a) Since all flows are positive, any suffix subpath P′ of P has f_P′ ≥ f_P by truncating the corresponding negative terms in the diverging criterion of the excess flow. Similarly, any prefix subpath P′ of P has f_P′ ≥ f_P by truncating the corresponding negative terms in the converging criterion of the excess flow.
(b) Expanding the values of f_in(u_i) and f_out(u_i) results in the original criteria.
(c) Using the converging criterion in (b), adding an edge at the start of a path modifies its excess flow by f(u, v) − f_in(v). Similarly, using the diverging criterion in (b), adding an edge at the end of a path modifies its excess flow by f(u, v) − f_out(u).
(d) Let the safe path P start (or end) with e, and consider an alternate path that replaces e with its maximum weight sibling e*. The excess flow of the alternate path can be computed by the diverging (or converging) criterion to be f_P − f(e) + f(e*) ≥ f_P > 0.
(e) Let the safe paths P and P′ merge at a vertex v_1, entering v_1 respectively by e_1 and e′_1, and then diverge at a vertex v_2, leaving v_2 respectively by e_2 and e′_2 (see Figure 3). Using Lemma 6(a), the subpaths {e_1, ..., e_2} and {e′_1, ..., e′_2} of the safe paths P and P′, respectively, are also safe. By the diverging criterion of the safe path P we have f(e_1) > f(e′_2). On the other hand, by the converging criterion of the safe path P′ we have f(e′_2) > f(e_1), which is a contradiction. ◀
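As an informal check of Definition 5 and Lemma 6(c), the following sketch evaluates both criteria on a small flow graph. The graph, its vertex names, and the helper names (`diverging_excess`, `converging_excess`, `f_in`, `f_out`) are assumptions made for this illustration only; paths are given as vertex lists with at least one edge.

```python
from collections import defaultdict

# Hypothetical flow on a small DAG (source "s", sink "t"); the flow is
# conserved at every internal vertex.
flow = {
    ("s", "a"): 5, ("s", "b"): 3, ("a", "c"): 4, ("a", "d"): 1,
    ("b", "c"): 3, ("c", "d"): 2, ("c", "t"): 5, ("d", "t"): 3,
}

out_edges, in_edges = defaultdict(dict), defaultdict(dict)
for (u, v), w in flow.items():
    out_edges[u][v] = w
    in_edges[v][u] = w


def f_in(v):
    return sum(in_edges[v].values())


def f_out(u):
    return sum(out_edges[u].values())


def diverging_excess(path):
    """First edge minus the flow leaving internal vertices off the path."""
    excess = flow[(path[0], path[1])]
    for u, nxt in zip(path[1:-1], path[2:]):
        excess -= f_out(u) - out_edges[u][nxt]
    return excess


def converging_excess(path):
    """Last edge minus the flow entering internal vertices from off the path."""
    excess = flow[(path[-2], path[-1])]
    for prev, u in zip(path[:-2], path[1:-1]):
        excess -= f_in(u) - in_edges[u][prev]
    return excess


path = ["s", "a", "c", "t"]
both = (diverging_excess(path), converging_excess(path))

# Lemma 6(c): extend a shorter path by one edge and apply the stated reduction.
prepended = diverging_excess(["a", "c", "t"]) - (f_in("a") - flow[("s", "a")])
appended = diverging_excess(["s", "a", "c"]) - (f_out("c") - flow[("c", "t")])
```

On this example both criteria agree on the path s, a, c, t, and the two incremental updates reach the same value, which is what makes constant-time extension of a path possible.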
Figure 3 The diverging and converging criteria applied to the paths P and P′, respectively.
The characterization of a safe path in a flow graph can be directly adapted to simple algorithms for verification and enumeration of safe paths.
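Both adaptations can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' implementation: the function names (`preprocess`, `excess_flow`, `is_safe`, `maximal_safe_subpaths`) and the example flow are hypothetical, every path has at least one edge, all flows are positive, and flow is conserved at internal vertices. The scan reports maximal safe subpaths along one given path as pairs of vertex indices, growing a window to the right and shrinking it from the left with the constant-time updates of Lemma 6(c).

```python
from collections import defaultdict


def preprocess(flow):
    """O(m) pass computing the total incoming and outgoing flow per vertex."""
    f_in, f_out = defaultdict(int), defaultdict(int)
    for (u, v), w in flow.items():
        f_out[u] += w
        f_in[v] += w
    return f_in, f_out


def excess_flow(path, flow, f_out):
    """Lemma 6(b): edge flows minus outgoing flow of the internal vertices."""
    total = sum(flow[(u, v)] for u, v in zip(path, path[1:]))
    return total - sum(f_out[u] for u in path[1:-1])


def is_safe(path, flow, f_out):
    return excess_flow(path, flow, f_out) > 0


def maximal_safe_subpaths(path, flow, f_in, f_out):
    """Two-pointer scan; returns (i, j) meaning the subpath path[i..j]."""
    def edge(k):
        return flow[(path[k], path[k + 1])]

    def drop_if_appended(k):
        # Lemma 6(c): appending edge k lowers the excess by f_out(u) - f(u, v).
        return f_out[path[k]] - edge(k)

    last = len(path) - 2              # index of the last edge of the path
    lo = hi = 0                       # window holds edges lo..hi
    excess = edge(0)
    found = []
    while True:
        # Grow to the right while the extended subpath stays safe.
        while hi < last and excess - drop_if_appended(hi + 1) > 0:
            excess -= drop_if_appended(hi + 1)
            hi += 1
        found.append((lo, hi + 1))
        if hi == last:
            return found
        # Shrink from the left until the next edge can be appended. An
        # emptied window carries f_in of the shared vertex, so by flow
        # conservation the next edge alone is always accepted.
        while excess - drop_if_appended(hi + 1) <= 0:
            excess += f_in[path[lo + 1]] - edge(lo)
            lo += 1


# Hypothetical flow graph used only for this illustration.
flow = {
    ("s", "a"): 5, ("s", "b"): 3, ("a", "c"): 4, ("a", "d"): 1,
    ("b", "c"): 3, ("c", "d"): 2, ("c", "t"): 5, ("d", "t"): 3,
}
f_in, f_out = preprocess(flow)
reported = maximal_safe_subpaths(["s", "b", "c", "d", "t"], flow, f_in, f_out)
```

Note that the pairs returned are maximal only with respect to the scanned path; a subpath may still be extendable by an edge that lies outside it.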
The characterization (Theorem 1) can be directly adapted to verify the safety of a path optimally. We preprocess the graph to compute the incoming flow f_in(u) and outgoing flow f_out(u) for each vertex u in total O(m) time. Using Lemma 6(b), the time taken to verify the safety of any path P is O(|P|) = O(n), resulting in the following theorem.
▶ Theorem 2.
Given a flow graph (DAG) having n vertices and m edges, it can be preprocessed in O(m) time to verify the safety of any path P in O(|P|) = O(n) time.
The maximal safe paths can be reported in O(mn) time by computing a candidate decomposition of the flow into paths, and verifying the safety of its subpaths using the characterization and a scan with the commonly used two-pointer approach.
The candidate flow decomposition can be computed in O(mn) time using the classical flow decomposition algorithm [12], resulting in O(m) paths 𝒫_f each of O(n) length. Now, we compute the maximal safe paths along each path P ∈ 𝒫_f by a two-pointer scan as follows. We start with the subpath containing the first two edges of the path P. We compute its excess flow f, and if f > 0 we append the next edge of P to the subpath on the right, updating f using Lemma 6(c). Otherwise, if f ≤ 0, we remove the first edge of the subpath from the left, again updating f using Lemma 6(c) (removing an edge (u, v) would conversely modify the flow by f_in(v) − f(u, v)). We stop when the end of P is reached with a positive excess flow.
The excess flow can be updated in O(1) time when adding an edge to the subpath on the right or removing an edge from the left. If the excess flow of a subpath P′ is positive and on appending the next edge it ceases to be positive, we report P′ as a maximal safe path by reporting only its two indices on the path P. Thus, given a path of length O(n), all its maximal safe paths can be reported in O(n) time, and hence require total O(mn) time for the O(m) paths in the flow decomposition 𝒫_f, resulting in the following theorem.
▶ Theorem 3.
Given a flow graph (DAG) having n vertices and m edges, all its maximal safe paths can be reported in O(|𝒫_f|) = O(mn) time, where 𝒫_f is some flow decomposition.
Concise representation
The solution can be reported using a concise representation (referred to as 𝒫_c) consisting of a set of paths, as follows. We add to 𝒫_c every subpath of each path P ∈ 𝒫_f that contains maximal safe paths, along with the indices of the solution on the path. Thus, for one or more overlapping maximal safe subpaths from P we add a single path to 𝒫_c which is the union of all such maximal safe paths, making the paths added to 𝒫_c of minimal length. Finally, we also remove the duplicates among the maximal safe subpaths reported from different paths in 𝒫_f using a trie [11], making the set of paths in 𝒫_c minimal. Thus, we define 𝒫_c as follows.
▶ Definition 7 (Concise representation 𝒫_c). A minimal set of paths having a minimal length such that every safe path of the flow network is a subpath of some path in the set.
▶ Remark 8.
In the worst case, the algorithm is optimal for some DAGs having |𝒫_c| = |𝒫_f| = Ω(mn), but in general |𝒫_c| can be as small as O(m + n) (see Figure 4). Thus, improving this bound requires us to not use a flow decomposition (and hence a candidate solution).
Figure 4
The example demonstrates the worst case (left) and the best case (right) graphs where the simple enumeration algorithm is optimal and inefficient, respectively. We have two paths A = {a_1, ..., a_k} and B = {b_1, ..., b_k}. The set C = {c_1, ..., c_k} has edges from a_k and the set D = {d_1, ..., d_k} has edges to b_1. Choosing k = n/4 and a subset of C × D as edges, we get a graph with any n and m. Let there be flow k on the black edges and unit flow on the red edges. (a) In the worst case graph (left) the flow on the remaining edges is according to flow conservation, assuming a_1 as the source and b_k as the sink. Each edge in C × D necessarily has a separate path in 𝒫_f from a_1 to b_k, with k maximal safe paths between {a_i, b_i} for all 1 ≤ i ≤ k because every path from a_i to b_1 has excess flow i. This ensures that |𝒫_c| = |𝒫_f| = Ω(mn). (b) In the best case graph (right) the two edges from a_{k−1} to a_k and from b_1 to b_2 carry equal flow, and the remaining edges have flow according to conservation of flow. Each edge in C × D has a safe path of O(1) size from a_k to b_1. Here we still have |𝒫_f| = Ω(mn) but |𝒫_c| = O(m + n).
We compute the maximal safe paths more efficiently by using dynamic programming along the topological ordering in DAGs. Lemma 6(e) gives a unique structure to the left maximal safe paths ending at a vertex v because there is no possibility of two safe paths diverging after converging. Such a structure was previously studied for an entire DAG satisfying the no diverging after converging property, and referred to as a funnel [22].
We shall first describe the funnel structure F_u of the left maximal safe paths ending at a vertex u. Thereafter, we will describe how this F_u can be efficiently used to incrementally build F_v for each out-neighbour v of u. Finally, we will describe how to report all the maximal safe paths ending at a vertex v using F_v.
The key idea behind the efficiency of our algorithm is bounding the overall size of the funnel structures on all the vertices and associating the work done with the total size of all the funnel structures. Our algorithm computes all maximal safe paths of a DAG in O(n + (m + |𝒫_c|) log n) time, where 𝒫_c is the concise representation of the solution reported by the simple algorithm.
We describe F_v using the notion of source or sink paths of a vertex v in a DAG. A source path is any path to the vertex v from a source of the DAG, whereas a sink path is any path from the vertex v to a sink of the DAG. The funnel structure (see Figure 5) and its properties are described as follows, which also proves Theorem 4(a).
Figure 5
The funnel structure of the safe paths ending at v, having diverging trees (blue) and a converging tree (green). The characteristic edge (black) of a path is the first edge incident on the converging tree.
▶ Theorem 9 (Funnel).
The left maximal safe paths entering a vertex v form a funnel, where:
(a) Every vertex has either a unique source path or a unique sink path, or both.
(b) The funnel has a single converging tree (having the vertices with a unique sink path) succeeding possibly multiple diverging trees (having the remaining vertices, with a unique source path).
(c) The characteristic edge of each flow path is its first edge incident on the converging tree.
The size of the funnel structure F_v at v is O(n_v + m_v), where n_v is the number of vertices having a safe path ending at v, and m_v is the number of paths in 𝒫_c containing v. For all v ∈ V, the total size is O(n + |𝒫_c|).
Proof.
We first prove the properties of the funnel (also derivable from previous work [22]) and then the space complexity.
(a) Any vertex having both multiple source paths and multiple sink paths has two different safe paths that merge at (or before) the vertex and then diverge at (or after) it, which contradicts Lemma 6(e).
(b) All the vertices having a unique sink path (terminating at v, as the funnel contains only the safe paths ending at v) form a single converging tree rooted at v (see Figure 5). Similarly, the remaining vertices, which have unique source paths (Theorem 9(a)), form a set of disjoint diverging trees since there can be multiple sources. Finally, since a vertex with a unique sink path cannot reach a vertex with multiple sink paths, the vertices of the converging tree always succeed those of the diverging trees in any path.
(c) The first incident edge on the converging tree for a path in F_v would be from a vertex with a unique source path to a vertex with a unique sink path. So the edge can be a part of exactly one path, which it characterizes (see Figure 5). Such an edge always exists for a path because it always starts from a source (hence with a unique source path) and ends at v (hence with a unique sink path).
The number of vertices (and hence edges of the converging and diverging trees) in the funnel is O(n_v) (by definition). Further, the non-tree edges (if any) are the characteristic edges for some safe path containing v, and each such safe path is present in a unique path in 𝒫_c (as each path contains v exactly once). Thus, each characteristic edge is uniquely associated with a path in 𝒫_c containing v, resulting in total O(m_v) non-tree edges in the funnel. Overall, since n_v = O(n) and v contributes to the length of at least m_v paths in 𝒫_c, the total size of all funnels is O(n + |𝒫_c|). ◀
The funnel of a vertex is incrementally built on processing its incoming neighbours.
We process the vertices of the DAG in topological order, ensuring that before we process a vertex its funnel is completely built while processing its incoming neighbours.
We now describe how we build the funnels of the out-neighbours of a vertex u using F_u (see Algorithm 1). Since every edge is trivially safe, every outgoing edge (u, v) of u is added to the corresponding F_v. For updating any F_v with the safe paths through u, we traverse F_u in the reverse topological order and verify the safety of each such path using the converging criterion. An edge from F_u is thus added to F_v if some safe path contains the edge and (u, v). We distinguish the unique maximum weighted outgoing edge (u, v*) of u (if any), and use different approaches to incrementally build F_{v*} and F_v (v ≠ v*). This is because, v* being unique, we can spend O(|F_u|) time for F_{v*}, and only O(|ΔF_v|) time for every other F_v, where ΔF_v are the new edges added to F_v while processing u. Thus, the time required to build all F_v is bounded by the total size of all F_v, i.e., O(n + |𝒫_c|).
Algorithm 1
Incrementally building funnels
Build-Funnels():
    forall u ∈ V in topological order do Build-From(u)

Build-From(u):
    forall (u, v) ∈ E do Add (u, v) to F_v

    /* Computing F_{v*} */
    (u, v*) ← Maximum weight out-edge of u
    forall x ∈ F_u do f[x] ← 0                  // max safe flow leaving x
    f[u] ← f(u, v*)
    forall y ∈ F_u in reverse topological order do
        forall (x, y) ∈ F_u do
            /* Safe flow f′ leaving x */
            f′ ← f[y] − f_in[y] + f(x, y)
            f[x] ← max(f[x], f′)
            if f′ > 0 and (x, y) ∉ F_{v*} then Add (x, y) to F_{v*}

    /* Computing F_v for v ≠ v* */
    forall (u, v ≠ v*) ∈ E do Add v to N_s, N_p
    Sort v ∈ N_p by weights of (u, v)
    y ← u, P ← {y}
    while N_p ≠ ∅ do
        (x, y) ← Max weight edge entering y
        P ← {x} ∪ P
        f[x] ← f[x] − f(u, v*)
        forall v ∈ N_s do
            f′ ← f[x] + f(u, v)
            if f′ > 0 and (x, y) ∉ F_v then Add (x, y) to F_v      // suffix path
            else Remove v from N_s
        while f[x] + f(u, head(N_p)) < 0 do
            v ← pop(N_p), y′ ← z ← y
            while y′ ∉ F_v do y′ ← Next of y′ in P
            Add P[z, y′] in F_v                                    // prefix path
        y ← x

Building F_{v*}
We traverse F_u in reverse topological order, computing the maximum excess flow f[x] of a path from x to v* through u. This is computed by considering f[y] of each out-neighbour y of x in F_u, where the excess flow of the x-v* path is computed in O(1) time using the y-v* path by adding the edge (x, y) to its start (Lemma 6(c)). Every such edge (x, y) resulting in a safe x-v* path is added to F_{v*}, if not already present (it may have been added while processing some other F_{u′}). Note that if x is in the converging tree, there is a unique sink path from x (Theorem 9(b)), hence x considers a single y where f[y] represents a single safe y-v* path. However, in the diverging trees y may have multiple sink paths (Theorem 9(b)), but we process only f[y] representing the y-v* safe path with the maximum excess flow. This is because such a path would always be extendable to add (x, y) if any other y-v* path is extendable (Lemma 6(c)). Since we cannot process y for each path separately in O(|F_u|) time, we process y only once using f[y] for its y-v* safe path with the maximum excess flow. Thus, we process each edge of F_u once to build F_{v*}, taking total O(|F_u|) time.
Building F_v for v ≠ v*
We limit the paths in F_u that are processed to build F_v using the following property.
▶ Lemma 10.
For a vertex u and an edge (u, v) where v ≠ v*, F_u adds a single path to F_v.
Proof. If F_u adds multiple paths to F_v, by Lemma 6(d) such paths are also extendable to some maximum weighted outgoing edge (u, v′) of u, where v′ ≠ v. This contradicts Lemma 6(e) as two safe paths containing u converge in F_u and then diverge to v and v′. ◀
Thus, we process exactly one path in F_u to be added to F_v in the reverse topological order. Using Lemma 6(d), this path, say P, contains (u, v) preceded by the maximum weight incoming edge (y, u) of u, and the maximum weight incoming edge (x, y) of y, and so on. For a vertex v, a suffix of this path, say P_v, is safe as long as its excess flow is positive. This can be verified efficiently using Lemma 6(c) (similar to F_{v*}). Moreover, the maximum excess flow of a safe x-v path can be computed using f[x] (computed for F_{v*}) by subtracting f(u, v*) and adding f(u, v) (Lemma 6(c)). However, P_v cannot be simply added to F_v using O(|P_v|) time, because as described earlier we aim to use only O(|ΔF_v|) time.
Figure 6 P_v overlaps a path (black) in F_v except at its suffix (green) and its prefix (blue).
Consider Figure 6: if P_v has a subpath already present in F_v, we cannot traverse the entire P_v independently for v. So, we traverse (and add to F_v) only the suffix and the prefix (if any) of P_v not in F_v independently, and the rest only once for all out-neighbours of u. The suffix of P_v is simply added in the reverse order until it reaches a vertex y already present in F_v. Using Lemma 6(e), y is in the diverging tree of F_v as the new suffix containing (u, v) always exists. Hence y has a unique source path which either contains the rest of P_v, or begins from some vertex y′ ∈ P_v. For the latter case, we compute the start (say z) of P_v for all out-neighbours of u together, and then add the corresponding prefix of P_v, say P_v[z, y′], for each P_v independently.
Note that the out-neighbours of u may have different y, y′, and z. To compute y, y′, and z for all out-neighbours of u, we add the out-neighbours to the lists N_s and N_p to compute the corresponding suffix and prefix, respectively. We traverse P backwards, adding the suffix edges to F_v for all v ∈ N_s. When for some v the corresponding y ∈ F_v is reached, we simply remove v from N_s and continue for the other vertices, adding each suffix path to the corresponding F_v (until N_s = ∅). For adding prefix paths we need to compute z for each P_v, beyond which P_v is not safe. Thus, we traverse backwards in P together for all v until some P_v is unsafe. This v would always be the one with the lowest weight (u, v) among (u, x) for x ∈ N_p by Lemma 6(d), as the remaining path is the same for all. Hence we initially sort v ∈ N_p in the increasing order of the weights of (u, v), so that while traversing P backwards we check the safety of P_v only for the first element in N_p (head(N_p)). When P_v for head(N_p) becomes unsafe we repeatedly remove head(N_p) from N_p until P_v is safe for the rest of v ∈ N_p. For a removed vertex v, we traverse its prefix path (if any) forwards from z until we reach y′ ∈ F_v and add it to F_v. Again, we continue traversing back on P until all prefixes are added (until N_p = ∅). The process thus requires O(|ΔF_v|) time for adding only new edges (prefixes and suffixes) to F_v, O(|F_u|) time for traversing the path P, and O(deg(u) log n) time for considering all out-neighbours independently and sorting them.
Since topological sort takes time linear in the number of edges, the total time required to build all the funnels is O(m log n + Σ_v |F_v|) = O(m log n + n + |𝒫_c|) (Theorem 9), proving Theorem 4(b).
▶ Theorem 11.
Given a flow graph (DAG) with $n$ vertices and $m$ edges, the funnel structures for the left maximal safe paths ending at its vertices can be computed in $O(m \log n + n + |\mathcal{P}_c|)$ time, where $\mathcal{P}_c$ is the concise representation of the solution.

The funnel $\mathcal{F}_u$ contains all the left maximal safe paths ending at the vertex $u$. To report all maximal safe paths ending at $u$, we need to address the following two issues:

1. Some of these safe paths may not start from a source of $\mathcal{F}_u$, but rather from some vertex $v$ in a diverging tree where the residual source path was added by some overlapping safe path, as there can be multiple sink paths for $v$ (Theorem 9(b)). Thus, to represent a maximal safe path ending at $u$ we report its characteristic edge (Theorem 9) along with its start vertex in $\mathcal{F}_u$. For completeness, we also report its excess flow. Thus, all maximal safe paths ending at $u$ are reported in $Sol_u$, which contains a triplet (characteristic edge, start vertex, excess flow) for each such path.
2. Some of these paths may not be right maximal, namely if the path from its start vertex is extendable to the right of $u$ while remaining safe. Such paths must be removed from the solution, hence we create another funnel $\mathcal{F}^*_u$ (initialized with $\mathcal{F}_u$) and remove such paths from it without affecting the other maximal safe paths.

Thus, the funnel $\mathcal{F}^*_u$ and $Sol_u$ are reported as the solution.

Approach
The start vertices of these maximal safe paths can be easily computed by visiting the vertices of $\mathcal{F}_u$ in the reverse topological order, computing their excess flow incrementally using that of its suffix path (Lemma 6(c)). When such a path becomes unsafe, we know that its suffix path is left maximal. Also, if a safe path reaches a source of $\mathcal{F}_u$, it is left maximal. To verify if any such path is also right maximal, we compute its excess flow after adding the edge $(u, v^*)$ (Lemma 6(d)) to its right using Lemma 6(c). If the new path is not safe, the suffix path is reported as maximal safe using its characteristic edge, start vertex, and excess flow. Otherwise, the path is not right maximal (and hence not maximal), so we delete its characteristic edge with left and right extensions (not shared with any maximal safe paths) from $\mathcal{F}^*_u$ (see Figure 7).

Note that verifying if a path is right maximal safe takes $O(1)$ time. If it is not right maximal safe, $\mathcal{F}^*_u$ is updated by deleting some edges, taking overall $O(|\mathcal{F}_u|)$ time. However, computing the start vertex for each maximal safe path independently is not efficient, because a vertex $v$ in the diverging tree (see Figure 7) will be processed multiple times as it can have multiple sink paths (Theorem 9(b)). The total time for processing such vertices in $\mathcal{F}_u$ can be $O(m_u n_u)$ (recall Theorem 9). We instead process $v$ once for all such safe paths containing $v$ as follows. While extending the safe paths starting from $v$ to the left, if the path $P$ with the minimum excess flow is safe, the remaining paths are also safe. Otherwise, $P$ is either removed from $\mathcal{F}^*_u$ if it is not right maximal, or is reported as maximal. In either case, we evaluate the next minimum weight path, and so on. Thus, processing a vertex either proceeds backward from it or removes a safe path to be evaluated, taking total $O(n_u + m_u)$ time.

Figure 7 Multiple safe paths (red) starting at a vertex $v$ in the diverging tree. Extensions (purple, dashed) of two characteristic edges, to be deleted if their safe paths are not right maximal.

Heap Structure $\mathcal{H}_v$

For processing the diverging trees having multiple safe paths starting at each vertex, we use any mergeable-heap [13, 34] structure, which is a min-heap where all operations including merging two heaps take amortized $O(\log n)$ time. Additionally, we use the standard lazy update propagation technique, where some value can be added to every element of the heap in $O(1)$ time, which is propagated to individual elements when they are accessed. Now, every element of a heap characterizing a safe path contains:

edg: Denotes the characteristic edge of the safe path.
val: The excess flow of the path without applying the lazy update.
upd: The lazy update which is to be reduced from the val.

For each vertex $v \in \mathcal{F}_u$, we maintain such a mergeable-heap structure $\mathcal{H}_v$, which stores an element corresponding to each safe path from $v$ to $u$. For a vertex $v$ in the converging tree, having a single sink path (Theorem 9(b)) and hence $|\mathcal{H}_v| = 1$, we do not need the lazy update, so we keep $upd = -1$; adding such an element takes $O(1)$ time, being inserted into an empty heap. Similarly, for a vertex $x$ with a single out-neighbour $y$ in a diverging tree, merging $\mathcal{H}_y$ into an empty $\mathcal{H}_x$ takes $O(1)$ time.

Algorithm
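The heap structure $\mathcal{H}_v$ with lazy updates, on which the algorithm relies, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses a skew heap, chosen here only for brevity as one possible mergeable heap, and it folds the pair (val, upd) into a single current value per node plus a pending tag that is pushed to the children on access. The names `Node`, `merge`, `decrease_all`, and `extract_min` are hypothetical.

```python
class Node:
    """One safe path: its characteristic edge and its current excess flow."""
    def __init__(self, edg, val):
        self.edg = edg
        self.val = val      # excess flow, correct whenever this node is a root
        self.lazy = 0       # amount still to be subtracted from both subtrees
        self.left = None
        self.right = None

def _push_down(node):
    # Propagate the pending subtraction to the children.
    for child in (node.left, node.right):
        if child is not None:
            child.val -= node.lazy
            child.lazy += node.lazy
    node.lazy = 0

def merge(a, b):
    """Merge two skew heaps and return the new root (minimum val on top)."""
    if a is None:
        return b
    if b is None:
        return a
    if b.val < a.val:
        a, b = b, a
    _push_down(a)
    a.right = merge(a.right, b)
    a.left, a.right = a.right, a.left   # skew heap: swap children after merging
    return a

def decrease_all(root, delta):
    """Subtract delta from every element in O(1) by tagging the root."""
    if root is not None:
        root.val -= delta
        root.lazy += delta
    return root

def extract_min(root):
    """Return (minimum node, remaining heap)."""
    _push_down(root)
    rest = merge(root.left, root.right)
    root.left = root.right = None
    return root, rest

# Two paths share a heap, both lose 2 units of excess flow, a third path joins.
h = merge(Node(("a", "u"), 5), Node(("b", "u"), 3))
h = decrease_all(h, 2)
h = merge(h, Node(("c", "u"), 2))
top, h = extract_min(h)
print(top.edg, top.val)   # ('b', 'u') 1
```

The recursive `merge` keeps the sketch short; a production version would likely use an iterative merge or another mergeable heap to avoid deep recursion on large inputs.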
Our algorithm first adds a single edge path $(v, u)$ to the heap of each in-neighbour $v$ of $u$ (see Algorithm 2). Now, moving from $y$ to $x$ through an edge $(x, y)$, we consider two cases.

(a) If $y$ is in the converging tree, then adding $(x, y)$ is always safe (recall that a different start vertex is only possible for the paths starting from a diverging tree). So for the current path, edg is updated to $(x, y)$ (being incident on $y$). The change in the excess flow on adding $(x, y)$ (Lemma 6(c)) is updated either in val (if $x$ is in the converging tree) or in upd (if $x$ is in a diverging tree). This updated element is inserted into $\mathcal{H}_x$.
(b) If $y$ is in a diverging tree, we first verify the safety of adding $(x, y)$ to the minimum excess flow path in $\mathcal{H}_y$ ($head(\mathcal{H}_y)$). If it is not safe, it is extracted from the heap and dealt with (explained later). We continue to check for unsafe paths in $\mathcal{H}_y$ and extract them until $head(\mathcal{H}_y)$ is safe. When all paths in $\mathcal{H}_y$ are safe, the change in excess flow by adding $(x, y)$ (Lemma 6(c)) is updated in $head(\mathcal{H}_y).upd$, and $\mathcal{H}_y$ is merged into $\mathcal{H}_x$.

Algorithm 2 Reporting maximal safe paths in $\mathcal{F}_u$

    forall (x, u) ∈ F_u do
        Add [(x, u), f(x, u), −1] to empty H_x
    upd* ← f_out(u) − f(u, v*)                      /* update to check right maximal */
    forall y ≠ u ∈ F_u in reverse topological order do
        forall (x, y) ∈ F_u do
            upd ← f_in(y) − f(x, y)
            if H_y ≠ ∅ then                         // extendable safe paths in H_y
                if head(H_y).upd = −1 then          // y in converging tree
                    if x has unit out-degree in F_u then    // x in converging tree
                        Add [(x, y), head(H_y).val − upd, −1] to empty H_x
                    else                            // x in diverging tree
                        Add [(x, y), head(H_y).val, upd] to H_x
                else                                // y in diverging tree
                    while head(H_y).val − head(H_y).upd − upd ≤ 0 do    // path unsafe
                        top ← Extract min from H_y
                        if top.val − top.upd − upd* > 0 then    // path not maximal
                            Remove top.edg with extensions from F*_u
                        else                        // path maximal
                            Add [top.edg, y, top.val − top.upd] to Sol_u
                    head(H_y).upd ← head(H_y).upd + upd
                    if x has unit out-degree in F_u then
                        Merge H_y into empty H_x
                    else
                        Merge H_y into H_x
    forall source v ∈ F_u do
        while H_v ≠ ∅ do
            top ← Extract min from H_v
            if top.val − top.upd − upd* > 0 then    // path not maximal
                Remove top.edg with extensions from F*_u
            else                                    // path maximal
                Add [top.edg, v, top.val − top.upd] to Sol_u
    Report F*_u and Sol_u

The unsafe paths extracted from the heaps, and the safe paths in the heaps of the sources of $\mathcal{F}_u$, are processed so that they are either removed from $\mathcal{F}^*_u$ (if not right maximal) or added to the solution after applying the lazy update.

Analysis
Now, each of the $m_u$ safe paths in $\mathcal{F}_u$ uses one insert and one extract-min operation on a non-empty heap, while moving from the converging tree to a diverging tree, and while being added to (or discarded from) the solution, respectively, using $O(m_u)$ operations. Also, every merge operation into a non-empty heap combines at least two different safe paths in a heap, using a total of $O(m_u)$ operations. The remaining operations (including those on empty heaps) take $O(1)$ time each while traversing $\mathcal{F}_u$, taking $O(|\mathcal{F}_u|)$ time. Thus, processing each funnel takes $O(m_u \log n + |\mathcal{F}_u|)$ time, requiring a total of $O(n + |\mathcal{P}_c| \log n)$ time, which also proves Theorem 4(c).

▶ Theorem 12.
Given a flow graph (DAG) with $n$ vertices and $m$ edges and the funnel structures for the left maximal safe paths ending at its vertices, the maximal safe paths can be reported in $O(n + |\mathcal{P}_c| \log n)$ time, where $\mathcal{P}_c$ is the concise representation of the solution.

Conclusion

We studied the safety of flow paths in a given flow graph (DAG), which has applications in various domains, most prominently the assembly of biological sequences. The previous work characterized such paths (and their generalizations) using a global criterion. We presented a simpler characterization based on a more efficiently computable local criterion, which can be directly adapted into an optimal verification algorithm. It also results in a simple enumeration algorithm, which is optimal for some worst-case examples. However, it is inefficient if the size of the concise representation of the solution is small. We improved this algorithm by exploiting a unique structure of a set of safe paths, called a funnel, resulting in a running time parameterized by the size of this concise representation of the solution.

Our improved enumeration algorithm improves the running time of the simple algorithm only when the size of the flow decomposition $O(|\mathcal{P}_c|)$ is larger than $O(n)$. Also, it uses extra space if $|\mathcal{P}_c| = o(n)$. In the future, it would be interesting to see if it is possible to report the solution using optimal space, because both the paths in $\mathcal{P}_c$ and the different $\mathcal{F}_v$ may have a lot of redundancy. Ideally, we would like to see an algorithm taking linear time in addition to the optimal output size (similar to what was achieved by [7] for a different problem). Another interesting extension to this problem having practical significance is finding safe paths for those flow decompositions whose paths have a certain minimum weight threshold.

References

[1] Nidia Obscura Acosta, Veli Mäkinen, and Alexandru I. Tomescu. A safe and complete algorithm for metagenomic assembly. Algorithms Mol. Biol., 13(1):3:1–3:12, 2018. doi:10.1186/s13015-018-0122-7.
[2] Jasmijn A. Baaijens, Bastiaan Van der Roest, Johannes Köster, Leen Stougie, and Alexander Schönhuth. Full-length de novo viral quasispecies assembly through variation graph construction. Bioinform., 35(24):5086–5094, 2019. doi:10.1093/bioinformatics/btz443.
[3] Jasmijn A. Baaijens, Leen Stougie, and Alexander Schönhuth. Strain-aware assembly of genomes from mixed samples using flow variation graphs. In Russell Schwartz, editor, Research in Computational Molecular Biology - 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings, volume 12074 of Lecture Notes in Computer Science, pages 221–222. Springer, 2020. doi:10.1007/978-3-030-45257-5_14.
[4] Georg Baier, Ekkehard Köhler, and Martin Skutella. On the k-splittable flow problem. In European Symposium on Algorithms, pages 101–113. Springer, 2002.
[5] Georg Baier, Ekkehard Köhler, and Martin Skutella. The k-splittable flow problem. Algorithmica, 42(3-4):231–248, 2005.
[6] Elsa Bernard, Laurent Jacob, Julien Mairal, and Jean-Philippe Vert. Flipflop: Fast lasso-based isoform prediction as a flow problem, 2013.
[7] Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, and Elia C. Zirondelli. From omnitigs to macrotigs: a linear-time algorithm for safe walks – common to all closed arc-coverings of a directed graph, 2020. arXiv:2002.10498.
[8] Katarína Cechlárová and Vladimír Lacko. Persistency in combinatorial optimization problems on matroids. Discret. Appl. Math., 110(2-3):121–132, 2001. doi:10.1016/S0166-218X(00)00279-1.
[9] Rami Cohen, Liane Lewin-Eytan, Joseph Seffi Naor, and Danny Raz. On the effect of forwarding table size on sdn network utilization. In IEEE INFOCOM 2014-IEEE conference on computer communications, pages 1734–1742. IEEE, 2014.
[10] Marie-Christine Costa. Persistency in maximum cardinality bipartite matchings. Operations Research Letters, 15(3):143–149, 1994. doi:10.1016/0167-6377(94)90049-3.
[11] Rene De La Briandais. File searching using variable length keys. In Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, page 295–298, New York, NY, USA, 1959. Association for Computing Machinery. doi:10.1145/1457838.1457895.
[12] L. R. Ford and D. R. Fulkerson. Flows in Networks. Princeton University Press, USA, 2010.
[13] Michael L. Fredman and Robert Endre Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM, 34(3):596–615, July 1987. doi:10.1145/28869.28874.
[14] Thomas Gatter and Peter F Stadler. Ryūtō: network-flow based transcriptome reconstruction. BMC bioinformatics, 20(1):190, 2019.
[15] Tzvika Hartman, Avinatan Hassidim, Haim Kaplan, Danny Raz, and Michal Segalov. How to split a flow? In , pages 828–836. IEEE, 2012.
[16] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer. Achieving high utilization with software-driven wan. In Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, pages 15–26, 2013.
[17] Kyle Kloster, Philipp Kuinke, Michael P O'Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D Sullivan, and Andrew van der Poel. A practical fpt algorithm for flow decomposition and transcript assembly. In , pages 75–86. SIAM, 2018.
[18] Vladimír Lacko. Persistency in optimization problems on graphs and matroids. Master's thesis, UPJŠ Košice, 1998.
[19] Cong Ma, Hongyu Zheng, and Carl Kingsford. Exact transcript quantification over splice graphs. In Carl Kingsford and Nadia Pisanti, editors, , volume 172 of LIPIcs, pages 12:1–12:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.WABI.2020.12.
[20] Cong Ma, Hongyu Zheng, and Carl Kingsford. Finding ranges of optimal transcript expression quantification in cases of non-identifiability. bioRxiv, 2020. doi:10.1101/2019.12.13.875625.
[21] Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. doi:10.1017/CBO9781139940023.
[22] Marcelo Garlet Millani, Hendrik Molter, Rolf Niedermeier, and Manuel Sorge. Efficient algorithms for measuring the funnel-likeness of dags. J. Comb. Optim., 39(1):216–245, 2020.
[23] Brendan Mumey, Samareh Shahmohammadi, Kathryn McManus, and Sean Yaw. Parity balancing path flow decomposition and routing. In , pages 1–6. IEEE, 2015.
[24] Jan Peter Ohst. On the Construction of Optimal Paths from Flows and the Analysis of Evacuation Scenarios. PhD thesis, University of Koblenz and Landau, Germany, 2015.
[25] Nils Olsen, Natalia Kliewer, and Lena Wolbeck. A study on flow decomposition methods for scheduling of electric buses in public transport based on aggregated time–space network models. Central European Journal of Operations Research, 2020. doi:10.1007/s10100-020-00705-6.
[26] Mihaela Pertea, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. Stringtie enables improved reconstruction of a transcriptome from rna-seq reads. Nature biotechnology, 33(3):290–295, 2015.
[27] Krzysztof Pieńkosz and Kamil Kołtyś. Integral flow decomposition with minimum longest path length. European Journal of Operational Research, 247(2):414–420, 2015.
[28] Mingfu Shao and Carl Kingsford. Theory and a heuristic for the minimum path flow decomposition problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(2):658–670, 2017.
[29] Vorapong Suppakitpaisarn. An approximation algorithm for multiroute flow decomposition. Electronic Notes in Discrete Mathematics, 52:367–374, 2016. INOC 2015 – 7th International Network Optimization Conference. doi:10.1016/j.endm.2016.03.048.
[30] Alexandru I. Tomescu, Travis Gagie, Alexandru Popa, Romeo Rizzi, Anna Kuosmanen, and Veli Mäkinen. Explaining a weighted DAG with few paths for solving genome-guided multi-assembly. IEEE ACM Trans. Comput. Biol. Bioinform., 12(6):1345–1354, 2015. doi:10.1109/TCBB.2015.2418753.
[31] Alexandru I Tomescu, Anna Kuosmanen, Romeo Rizzi, and Veli Mäkinen. A novel min-cost flow method for estimating transcript expression with rna-seq. BMC bioinformatics, 14(S5):S15, 2013.
[32] Alexandru I. Tomescu and Paul Medvedev. Safe and complete contig assembly through omnitigs. Journal of Computational Biology, 24(6):590–602, 2017. Preliminary version appeared in RECOMB 2016.
[33] Benedicte Vatinlen, Fabrice Chauvet, Philippe Chrétienne, and Philippe Mahey. Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths. European Journal of Operational Research, 185(3):1390–1401, 2008.
[34] Jean Vuillemin. A data structure for manipulating priority queues. Communications of the ACM, 21(4):309–315, 1978. URL: https://doi.org/10.1145/359460.359478.
[35] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
[36] Lucia Williams, Gillian Reynolds, and Brendan Mumey. Rna transcript assembly using inexact flows. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)