Genome assembly, from practice to theory: safe, complete and linear-time
Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, Elia C. Zirondelli
GGenome assembly, from practice to theory:safe, complete and linear-time
Massimo Cairo , Romeo Rizzi , Alexandru I. Tomescu , and Elia C. Zirondelli Department of Computer Science, University of Helsinki, Finland, [email protected] Department of Computer Science, University of Verona, Italy, [email protected] Department of Mathematics, University of Trento, Italy, [email protected]
Abstract
Genome assembly asks to reconstruct an unknown string from many shorter substrings ofit. Even though it is one of the key problems in Bioinformatics, it is generally lacking majortheoretical advances. Its hardness stems both from practical issues (size and errors of real data),and from the fact that problem formulations inherently admit multiple solutions. Given these, attheir core, most state-of-the-art assemblers are based on finding non-branching paths ( unitigs )in an assembly graph. While such paths constitute only partial assemblies, they are likely to becorrect. More precisely, if one defines a genome assembly solution as a closed arc-covering walk of the graph, then unitigs appear in all solutions, being thus safe partial solutions.Until recently, it was open what are all the safe walks of an assembly graph. Tomescu andMedvedev (RECOMB 2016) characterized all such safe walks ( omnitigs ), thus giving the firstsafe and complete genome assembly algorithm. Even though omnitig finding was later improvedto quadratic time by Cairo et al. (ACM Trans. Algorithms 2019), it remained open whether thecrucial linear-time feature of finding unitigs can be attained with omnitigs. That is, whether all the strings that can be correctly assembled from a graph can be obtained in a time feasible forbeing implemented in practical genome assemblers.We answer this question affirmatively, by describing a surprising O ( m )-time algorithm to identify all maximal omnitigs of a graph with n nodes and m arcs, notwithstanding the existenceof families of graphs with Θ( mn ) total maximal omnitig size. This is based on the discovery ofa family of walks ( macrotigs ) with the property that all the non-trivial omnitigs are univocalextensions of subwalks of a macrotig. This has two consequences:1. A linear-time output-sensitive algorithm enumerating all maximal omnitigs.2. A compact O ( m ) representation of all maximal omnitigs, which allows, e.g., for O ( m )-timecomputation of various statistics on them.Our results close a long-standing theoretical question inspired by practical genome assemblers,originating with the use of unitigs in 1995. They are also crucial in covering problems incor-porating additional practical constraints. Thus, we envision our results to be at the core of areverse transfer from theory to practical and complete genome assembly programs, as has beenthe case for other key Bioinformatics problems. a r X i v : . [ c s . D M ] N ov Introduction
Theoretical and practical background of genome assembly.
Genome assembly is one ofthe flagship problems in Bioinformatics, along with other problems originating in–or highly moti-vated by–this field, such as edit distance computation, reconstructing and comparing phylogenetictrees, text indexing and compression. In genome assembly, we are given a collection of strings (or reads ) and we need to reconstruct the unknown string (the genome) from which they originate.This is motivated by sequencing technologies that are able to read either “short” strings (100-250length, Illumina technology), or “long” strings (10.000-50.000 length, Pacific Biosciences or OxfordNanopore technologies) in huge amounts from the genomic sequence(s) in a sample. For example,the SARS-CoV-2 genome was obtained in [60] from short reads using the MEGAHIT assembler [41].Other leading Bioinformatics problems have seen significant theoretical progress in major Com-puter Science venues, culminating (just to name a few) with both positive results, see e.g. [16, 59]for phylogeny problems, [21, 6, 34] for text indexing, [22, 7, 35] for text compression, and negativeresults, see e.g. [4, 1, 5, 20] for string matching problems. However, the genome assembly problemis generally lacking major theoretical advances.One reason for this stems from practice: the huge amount of data (e.g. the 3.1 Billion longhuman genome is read 50 times over) which impedes slower than linear-time algorithms, errors ofthe sequencing technologies (up to 15% for long reads), and various biases when reading certaingenomic regions [49]. Another reason stems from theory: historically, finding an optimal genomeassembly solution is considered NP-hard under several formulations [51, 33, 32, 45, 48, 29, 50], but,more fundamentally, even if one outputs a 3.1 Billion characters long string, this is likely incorrect,since problem formulations inherently admit a large number of solutions of such length [36].Given all these setbacks, most state-of-the-art assemblers, including e.g. MEGAHIT [41] (forshort reads), or wtdbg2 [54] (for long reads), generally employ a very simple and linear-time strat-egy, dating back to 1995 [32]. They start by building an assembly graph encoding the overlapsof the reads, such as a de Bruijn graph [52] or an overlap graph [47] (graphs are directed in thispaper). After some simplifications to this graph to remove practical artifacts such as errors, attheir core they find strings labeling paths whose internal nodes have in-degree and out-degree equalto 1 (called unitigs ), approach dating back to 1995 [32]. That is, they do not output entire genomeassemblies, but only shorter strings that are likely to be present in the sequenced genome, sinceunitigs do not branch at internal nodes.The issue of multiple solutions to a problem has deep roots in Bioinformatics, but is in factcommon to many real-world problems from other fields. In such problems, we seek ultimate knowl-edge of an unknown object (e.g., a genome) but have access only to partial observations from it(e.g., reads). A standard paradigm is to apply Occam’s razor principle which favors the simplestmodel explaining the data. As such, the reconstruction problem is cast in terms of an optimizationproblem, to be addressed by various mathematical, computational and technological paradigms.While this approach has been extremely successful in Bioinformatics, it is not always robust.First, the optimization problem might admit several optimal solutions, and thus several interpre-tations of the observed data. A standard way to tackle this is to enumerate all solutions [25, 38].Second, the problem formulation might be inaccurate, or the data might be incomplete, and thetrue solution might be a sub-optimal one. One could then enumerate all the first k -best solu-tions to it [18, 19], hoping that the true solution is among such first k ones. The motivation ofsuch enumeration algorithms is that e.g. later “one can apply more sophisticated quality criteria,wait for data to become available to choose among them, or present them all to human decision-makers” [18]. However, both approaches do not scale when the number of solutions is large, andare thus unfeasible in genome assembly. 1 afe and complete algorithms: A theoretical framing of practical genome assembly. With the aim of enhancing the widely-used practical approach of assembling just unitigs—as thosewalks considered to be present in any possible assembly solution—a result in a major Bioinformaticsvenue [57] asked what is the “limit” of the correctly reconstructible information from an assemblygraph . Moreover, is all such reconstructible information still obtainable in linear time , as in thecase of the popular unitigs? Variants of this question also appeared in [27, 8, 48, 55, 39, 9], whileother works already considered simple linear-time generalizations of unitigs [53, 46, 30, 36], withoutknowing if the “assembly limit” is reached.To make this question precise, [57] introduced the following safe and complete framework. Givena notion of solution to a problem (e.g. a type of walk in a graph), a partial solution (e.g. some shorterwalk in the graph) is called safe if it appears (e.g. is a subwalk) in all solutions. An algorithmreporting only safe partial solutions is called a safe algorithm . A safe algorithm reporting all safepartial solutions is called safe and complete . A safe and complete algorithm outputs all and onlywhat is likely part of the unknown object to be reconstructed, synthesizing all solutions from thepoint of view of correctness.
Safety generalizes the existing notion of persistency : a single node or edge was called persistent if it appears in all solutions [28, 15, 12], for example persistent edges for maximum bipartitematchings [15]. However, it also has roots in other Bioinformatics works [58, 13, 23, 61], whichconsidered the aligned symbols—the reliable regions —appearing in all optimal (and sub-optimal)alignments of two strings. The reliable regions of an alignment of proteins were shown in [58] tomatch in a significant proportion the true ones determined experimentally.There are many theoretical formulations of genome assembly as an optimization problem, e.g. ashortest common superstring of all the reads [51, 33, 32], or some type of shortest walk coveringall nodes or arcs of the assembly graph [53, 45, 46, 31, 29, 50, 48]. However, it is widely ac-knowledged [48, 50, 44, 49, 43, 37] that, apart from some being NP-hard, these formulations arelacking in several aspects, for example they collapse repeated regions of a genome. At present,given the complexity of the problem, there is no definitive notion of a “good” genome assemblysolution. Therefore, [57] considered as genome assembly solution any closed arc-covering walk ofa graph, where arc-covering means that it passes through each arc at least once. The main ben-efit of considering any arc-covering walk is that safe walks for them are safe also for any possiblerestriction such covering walks (e.g. by some additional optimality criterion ). Put otherwise, safewalks for all arc-covering walks are more likely to be correct than safe walks for some peculiar typeof arc-covering walks.Moreover, closed arc-covering walks in widely-used de Bruijn graphs are in one-to-one corre-spondence with circular strings having the same set of k -mers of the reads (a k -mer of a set ofstrings is any length- k string appearing as a substring of some string in the set). More precisely,consider the following setting mentioned in [57]. A de Bruijn graph of order k has the set of all( k − -mers of the reads as the set of nodes, and the set of all k -mers of the reads as arcs (fromtheir length-( k −
1) prefix to their length-( k −
1) suffix). The most basic notion of genome assemblysolution (for circular genomes) are circular strings having the same set of k -mers as the reads, whichcorrespond exactly to closed arc-covering walks of the de Bruijn graph of order k . See Figure 1and Appendix A for a more detailed description. For example, closed arc-covering walks are a common relaxation of the fundamental notions of closed Eulerianwalk (we now pass through each arc at least once ), and of closed Chinese postman walk (i.e. a closed arc-coveringwalk of minimum length ) [26], which were mentioned in [48] as unsatisfactory models of genome assembly. G GT TA ACGG GA CC CGTACCGTACGGACG CGT CCG ACG ACG GTA CGT CGG
CGC
TAC GTA GGA
GCG
ACC TAC GACC G T ACCGGGA C TAC G
Figure 1:
Left: A circular string and a set strings of length 3, drawn as black arrows. Middle: The samecircular string shown as linearized (in blue) and the same strings (in black). The two strings in bold overlapthe beginning and the end of the circular string. Right: The de Bruijn graph of order k = 3 built from theinput strings. The set of input string corresponds to the set of arcs. The closed arc-covering walk spellingthe circular string on the left is shown in red. However, the graph admits several closed arc-covering walks(also an Eulerian one), all spelling thus a circular string with the same set of k -mers as the circular stringon the left. Prior results on safety in closed arc-covering walks.
It is immediate to see that unitigs aresafe walks for closed arc-covering walks. A first safe generalization of unitigs consisted of those pathswhose internal nodes have only out-degree equal to 1 (with no restriction on their in-degree) [53].Further, these safe paths have been generalized in [46, 30, 36] to those partitionable into a prefixwhose nodes have in-degree equal to 1, and a suffix whose nodes have out-degree equal to 1.
All safe walks for closed arc-covering walks were characterized by [57, 56] with the notion of omnitigs ,see Definition 1 and Fig. 2. This lead to the first safe and complete genome assembly algorithm(obtained thus 20 years after unitigs were first considered), outputting all maximal omnitigs inpolynomial time (maximal omnitigs are those which are not sub-walks of other omnitigs).
Definition 1 (Omnitig) . A walk W = e . . . e (cid:96) is an omnitig if, for all ≤ i ≤ j ≤ (cid:96) , there isno non-empty path ( forbidden path ) from the tail of e j to the head of e i − , with first arc differentfrom e j and last arc different from e i − . e e i e i e j e ` P Figure 2:
The walk e . . . e (cid:96) is not an omnitig because there is a non-empty path forbidden path P . Furthermore, through experiments on “perfect” human read datasets, [57] also showed thatstrings labeling omnitigs are about 60% longer on average than unitigs, and contain about 60%more biological content on average. Thus, once other issues of real data (e.g. errors) are added tothe problem formulation, omnitigs (and the safe walks for such extended models) have the potentialto significantly improve the quality of genome assembly results. Nevertheless, for this to be possible,one first needs the best possible results for omnitigs (given e.g. the sheer size of the read datasets),and a full comprehension of them, otherwise, such extensions are hard to solve efficiently.Cairo et al. [11] recently proved that the length of all maximal omnitigs of any graph with n nodes and m arcs is O ( nm ), and proposed an O ( nm )-time algorithm enumerating all maximalomnitigs. This was also proven to be optimal, in the sense that they constructed families of graphswhere the total length of all maximal omnitigs is Θ( nm ). However, it was left open if it is necessaryto pay O ( nm ) even when the total length of the output is smaller. Moreover, that algorithm cannotbreak this barrier, because e.g. O ( m )-time traversals have to be done for O ( n ) cases.This theoretical question is crucial also from the practical point of view: assembly graphs havethe number of nodes and arcs in the order of millions, and yet the total length of the maximal3mnitigs is almost linear in the size of the graph. For example, the compressed (see Definition 37)de Bruijn graph of human chromosome 10 (length 135 million) has 467 thousand arcs [11, Table 1],and the length of all maximal omnitigs (i.e. their total number of arcs, not their total string length)is 893 thousand. Moreover, even though this chromosome is only about 4% of the full humangenome, the running time of the quadratic algorithm of [11] on its compressed de Bruijn graphis about 30 minutes. Both of these facts imply that a linear-time output sensitive enumerationalgorithm has also a big potential for practical impact. Our results.
Our main result is an O ( m )-size representation of all maximal omnitigs , based ona careful structural decomposition of the omnitigs of a graph. This is surprising, given that thereare families of graphs with Θ( nm ) total length of maximal omnitigs [11]. Theorem 2.
Given a strongly connected graph G with n nodes and m arcs, there exists a O ( m ) -size representation of all maximal omnitigs, consisting of a set M of walks ( maximal macrotigs ) oftotal length O ( n ) and a set F of arcs, such that every maximal omnitig is the univocal extension of either a subwalk of a walk in M , or of an arc in F .Moreover, M , F , and the endpoints of macrotig subwalks univocally extending to maximalomnitigs can be computed in time O ( m ) . Since the univocal extension U ( W ) of a walk W can be trivially computed in time linear in thelength of U ( W ), we get the linear-time output sensitive algorithm as an immediate corollary: Corollary 3.
Given a strongly connected graph G , it is possible to enumerate all maximal omnitigsof G in time linear in their total length. We obtain Theorem 2 using two interesting ingredients. The first is a novel graph structure( macronodes ), obtained after a compression operation of ‘easy’ nodes and arcs (Section 4). Thesecond is a connection to a recent result by Georgiadis et al. [24] showing that it is possible toanswer in O (1)-time strong connectivity queries under a single arc removal, after a linear-timepreprocessing of the graph (notice that a forbidden path is defined w.r.t. two arcs to avoid).Theorem 2 has additional practical implications. First, omnitigs are also representable in thesame (linear) size as the commonly used unitigs. Second, maximal macrotigs enable various O ( m )-time operations on maximal omnitigs (without listing them explicitly), by pre-computing the uni-vocal extensions from any node, needed in Theorem 2. For example, given that the number ofmaximal omnitigs is O ( m ) [11], this implies the following result: Corollary 4.
Given a strongly connected graph G with m arcs, it is possible to compute the lengthsof all maximal omnitig in total time O ( m ) . Corollary 4 leads to a linear-time computation of various statistics about maximal omnitigs,such as minimum, maximum, and average length (useful e.g. in [14]). One can also use this to filterout subfamilies of them (e.g. those of length smaller and / or larger than a given value) beforeenumerating them explicitly. Note that the total length of the maximal omnitigs is at least m , since every arc is an omnitig. The univocal extension U ( W ) of a walk W is obtained by appending to W the longest path whose nodes (exceptfor the last one) have out-degree 1, and prepending to W the longest path whose nodes (except for the first one) havein-degree 1; see Section 2 for the formal definition. ignificance of our results. This paper closes the issue of finding safe walks for a fundamentalmodel of genome assembly ( any closed arc-covering walk), a long-standing theoretical question,inspired by practical genome assemblers, and originating with the use of unitigs in 1995 [32].However, we envision a reverse transfer from theory to practical and complete genome assemblyprograms, as has been the case in other Bioinformatics problems.Trivially, safe walks for all closed arc-covering walks are also safe for more specific types ofarc-covering walks. Moreover, while a genome soltuion defined as a single closed arc-covering walkdoes not incorporate several practical issues of real data, in a follow-up work [10] we show thatomnitigs are the basis of more advanced models handling many practical aspects. For example, toallow more types of genomes to be assembled, one can define an assembly solution as a set of closedwalks that together cover all arcs [2], which is the case in metagenomic sequencing of bacteria.For linear chromosomes (as in eukaryotes such as human), or when modeling missing sequencingcoverage, one can analogously consider one, or many, such open walks [56, 57]. Safe walks forall these models are subsets of omnitigs [2, 10]. Moreover, when modeling sequencing errors, ormutations present e.g. only in the mother copy of a chromosome (and not in the father’s copy),one can require some arcs not to be covered by a solution walk, or even to be “invisible” from thepoint of view safety. Finding safe walks for such models is also based on first finding omnitigs-likewalks [10].Notice that such separation between theoretical formulations and their practical embodimentsis common for many classical problems in Bioinformatics. For example, computing edit distance isoften replaced with computing edit distance under affine gap costs [17], or enhanced with variousheuristics as in the well-known BLAST aligner [3]. Also text indexes such as the FM-index [21] areextended in popular read mapping tools (e.g. [42, 40]) with many heuristics handling errors andmutations in the reads.Finally, our results show that safe partial solutions enjoy interesting combinatorial properties,further promoting the persistency and safety frameworks. For real-world problems admitting mul-tiple solutions, safe and complete algorithms are more pragmatic than the classical approach ofoutputting an arbitrary optimal solution. They are also more efficient than enumerating all solu-tions, or only the first k -best solutions, because they already synthesize all that can be correctlyreconstructed from the input data. We highlight here our key structural and algorithmic contributions, and give the formal details inSections 4 to 6. We start with the minimum terminology needed to understand this section, anddefer the rest of the notation to Section 3.
Terminology.
Functions t ( · ) and h ( · ) denote the tail node and the head node, respectively, of anarc or walk. We classify the nodes and arcs of a strongly connected graph as follows (see Figure 3a): • a node v is a join node if its in-degree d − ( v ) satisfies d − ( v ) >
1, and a join-free node otherwise.An arc f is called a join arc if h ( f ) is a join node, and a join-free arc otherwise. • a node v is a split node if its out-degree d + ( v ) satisfies d + ( v ) >
1, and a split-free node otherwise. An arc g is called a split arc if t ( g ) is a split node, and a split-free arc otherwise. • a node or arc is called bivalent if it is both join and split, and it is called biunivocal if it isboth split-free and join-free. 5 f g b b g f b b g ′ f g u ℳ u v ℳ v ℳ v (a) v f g g f b b g ′ f g u ℳ u v ℳ v ℳ v (b) Figure 3:
Left: Given a bivalent node v , the macronode M v is the subgraph of G induced by the nodesreaching v with a split-free path (in red), and the nodes reachable from v with a join-free path (in blue).These two types of nodes induce the two trees of the macronode. By definition, every arc with endpointsin different macronodes are bivalent (in green). The remaining bivalent arcs have endpoints in the samemacronode (in purple). Right: The only omnitig traversing the bivalent node v is f g ; e.g., by the X-intersection Property neither f g is an omnitig ( b f f is a forbidden path) nor f g is an omnitig ( g g b is a forbidden path). Extending the micro-omnitig f g to the right we notice that f g g is an omnitig andby the Y-intersection Property f g g (cid:48) is not an omnitig ( g b is a forbidden path). Hence, the only maximalright-micro omnitig is f g g b , and the only maximal left-micro omnitig is b f f g . Merging the two on f g , we obtain the maximal microtig b f f g g b . A walk W is split-free (resp., join-free ) if all its arcs are split-free (resp., join-free). Given a walk W , its univocal extension U ( W ) is defined as W − W W + , where W − is the longest join-free path to t ( W ) and W + is the longest split-free path from h ( W ) (observe that they are uniquely defined). Structure.
The main structural insight of this paper is that omnitigs enjoy surprisingly limitedfreedom, in the sense that any omnitig can be seen as a concatenation of walks in a very specificset. In order to give the simplest exposition, we first simplify the graph by contracting biunivocalnodes and arcs. The nodes of the resulting graph can now be partitioned into macronodes (seeFigure 3a and Definition 13), where each macronode M v is uniquely identified by a bivalent node v (its center ). We can now split the problem by first finding omnitigs inside each macronode, andthen characterizing the ways in which omnitigs from different macronodes can combine.We discover a key combinatorial property of how omnitigs can be extended: there are at mosttwo ways that any omnitig can traverse a macronode center (see also Figure 3b): Theorem 5 (X-intersection Property) . Let v be a bivalent node. Let f and f be distinct joinarcs with h ( f ) = h ( f ) = v ; let g and g be distinct split arcs with t ( g ) = t ( g ) = v . We have: i ) If f g and f g are omnitigs, then d + ( v ) = d − ( v ) = 2 . ii ) If f g is an omnitig, then there are no omnitigs f g (cid:48) with g (cid:48) (cid:54) = g , nor f (cid:48) g with f (cid:48) (cid:54) = f . In order to prove the X-intersection Property, we prove an even more fundamental property:once an omnitig traverses a macronode center, for any node it meets after the center node, there is atmost one way of continuing from that node (Y-intersection Property, Corollary 17), see Figure 3b.The basic intuition is that if there are more than one possibilities, then strong connectivity createsforbidden paths.Given an omnitig f g traversing the bivalent node v , we define the maximal right-micro omnitig as the longest extension f gW in the macronode M v (see Figure 3b and Definition 15). The maximalleft-micro omnitig is the symmetrical omnitig W f g . By Theorem 5, there are at most two maximal6ight-micro omnitigs and two maximal left-micro omnitigs. The merging of a maximal left- andright-micro omnitig on f g is called a maximal microtig (see Figure 3b and Definition 15; notice thata microtig is not necessarily an omnitig). These at most two maximal microtigs represent “forcedomnitig tracks” that must be followed by any omnitig crossing v .We now describe how omnitigs can advance from one macronode to another. Notice that anyarc having endpoints in different macronodes is a bivalent arc (Lemma 14). In Lemma 20 we provethat for every maximal microtig ending with a bivalent arc b , there is at most one maximal microtigstarting with b . As such, when an omnitig track exits a macronode, there is at most one way ofconnecting it with an omnitig track from another macronode. It is natural to merge all omnitigtracks (i.e. maximal microtigs) on all bivalent arcs between different macronodes, and thus obtain maximal macrotigs (Definition 23 and Fig. 6). The total size of all maximal macrotigs is O ( n )(Theorem 27), and they are a representation of all maximal omnitigs, except for those that areunivocal extensions of the arcs of F , see below and Lemma 28. Algorithms.
Our algorithms first build the set M of maximal macrotigs, and then identifymaximal omnitigs inside them. The set F of arcs univocally extending to the remaining maximalomnitigs will be the set of bivalent arcs not appearing in M (Lemma 28).Crucial to the algorithms is an extension primitive deciding what new arc (if any) to choosewhen extending an omnitig (recall that the X- and Y-intersection Properties limits the number ofsuch arcs to one). Suppose we have an omnitig f W , with f a join arc, and we need to decide ifit can be extended with an arc g out-going from h ( W ). Naturally, this extension can be foundby checking that there is no forbidden path from t ( g ) = h ( W ). However, this forbidden path canpotentially end in any node of W . Up to this point, [56, 57, 11] need to do an entire O ( m ) graphtraversal to check if any node of W is reachable by a forbidden path. We prove here a new keyproperty: Theorem 6 (Extension Property) . Let f W be an omnitig in G , where f is a join arc. Then f W g is an omnitig if and only if g is the only arc with t ( g ) = h ( W ) such that there exists a path from h ( g ) to h ( f ) in G (cid:114) f . Thus, for each arc g with t ( g ) = h ( W ), we can do a single reachability query under one arcremoval: “does h ( g ) reach h ( f ) in G (cid:114) f ?” Since the target of the reachability query is also thehead of the arc excluded f , then we can apply an immediate consequence of the results of [24]: Theorem 7 ([24]) . Let G be a strongly connected graph with n nodes and m arcs. It is possible tobuild an O ( n ) -space data structure that, after O ( m + n ) -time preprocessing, given a node w and anarc f , tests in O (1) worst-case time if there is a path from w to h ( f ) in G (cid:114) f . Using the Extension Property and Theorem 7, we can thus pay O (1) time to check each out-outgoing arc g , before discovering the one (if any) with which to extend f W . In Section 6 wedescribe how to transform the graph to have constant degree, so that we pay O (1) per node.This transformation also requires slight changes to the maximal omnitig enumeration algorithm tomaintain the linear-time output sensitive complexity (see Section 6.3). We first use the ExtensionProperty when building the left- and right-maximal micro omnitigs, and then when identifyingmaximal omnitigs inside macrotigs, as follows.Once we have the set M of maximal macrotigs, we scan each macrotig with two pointers, a leftone always on a join arc f , and a right one always on a split arc g (see Figure 4 and Algorithm 5).Both pointers move from left to right in such a way that the subwalk between them is always anomnitig. The subwalk is grown to the right by moving the right pointer as long as it remains an7 g f g b = g b g f g b b f g f g b f u v w ℳ w ℳ v ℳ u b f g f g b = g b g f g b b f g f g b f u v w ℳ w ℳ v ℳ u b Figure 4:
Any maximal omnitig is identified (in solid blue) either by a macrotig interval (from a join arc f to a split arc g ; left), or by a bivalent arc b not appearing in any macrotig (right). The full maximal omnitigis obtained by univocal extension (dotted blue), extension which may go outside of the maximal macrotig. omnitig (checked with the Extension Property). When growing to the right is no longer possible,the omnitig is shrunk from the left by moving the left pointer. This technique runs in time linearto the total length of the maximal macrotigs, namely O ( n ).In Figure 5 we work out all these notions on a concrete example. All nodes have in- and out-degree at most 2 (a) Nodes and arcs color-coded as in Figure 3a. (b) Maximal microtigs.(c) Maximal macrotigs. (d) The maximal omnitigs obtained from maximalmacrotigs (univocal extensions are dotted). All othermaximal omnitigs are univocal extensions of the biva-lent arcs not appearing in macrotigs.
Figure 5:
A concrete example of the main notions of this paper. In Figures 5b to 5d walks have differentcolors for visual distinguishability.
In this paper, a graph is a tuple G = ( V, E ), where V is a finite set of nodes , E is a finite multi-setof ordered pair of nodes called arcs . Parallel arcs and self-loops are allowed. For an arc e ∈ E ( G ),we denote G (cid:114) e = ( V, E (cid:114) { e } ). The reverse graph G R of G is obtained by reversing the directionof every arc. In the rest of this paper, we assume a fixed strongly connected graph G = ( V, E ),with | V | = n and | E | = m ≥ n .A walk in G is a sequence W = ( v , e , v , e , . . . , v (cid:96) − , e (cid:96) , v (cid:96) ), (cid:96) ≥
0, where v , v , . . . , v (cid:96) ∈ V ,and each e i is an arc from v i − to v i . Sometimes we drop the nodes v , . . . , v (cid:96) of W , and write W e . . . e (cid:96) . If an arc e appears in W , we write e ∈ W . We say that W goes from t ( W ) = v to h ( W ) = v (cid:96) , has length (cid:96) , contains v , . . . , v (cid:96) − as internal nodes , starts with e , endswith e (cid:96) , and contains e , . . . , e (cid:96) − as internal arcs . A walk W is called empty if it has length zero,and non-empty otherwise. There exists exactly one empty walk (cid:15) v = ( v ) for every node v ∈ V , and t ( (cid:15) v ) = h ( (cid:15) v ) = v . A walk W is called closed if it is non-empty and t ( W ) = h ( W ), otherwise it is open . The concatenation of walks W and W (cid:48) (with h ( W ) = t ( W (cid:48) )) is denoted W W (cid:48) .A walk W = ( v , e , v , . . . , e (cid:96) , v (cid:96) ) is called a path when the nodes v , v , . . . , v (cid:96) are all distinct,with the exception that v (cid:96) = v is allowed (in which case we have either a closed or an emptypath). To simplify notation, we may denote a walk W = ( v , e , v , . . . , e (cid:96) , v (cid:96) ) as a sequence of arcs,i.e. W = e . . . e (cid:96) . Subwalks of open walks are defined in the standard manner. For a closed walk W = e . . . e (cid:96) − , we say that W (cid:48) = e (cid:48) . . . e (cid:48) j is a subwalk of W if there exists i ∈ { , . . . , (cid:96) − } suchthat for every k ∈ { , . . . , j } it holds that e (cid:48) k = e ( i + k ) mod (cid:96) .A closed arc-covering walk exists if and only if the graph is strongly connected. We are interestedin the (safe) walks that are subwalks of all closed arc-covering walks, characterized in [57]. Theorem 8 ([57]) . Let G be a strongly connected graph different from a closed path. Then a walk W is a subwalk of all closed arc-covering walks of G if and only if W is an omnitig . Observe that W is an omnitig in G if and only if W R is an omnitig in G R . Moreover, anysubwalk of an omnitig is an omnitig. For every arc e , its univocal extension U ( e ) is an omnitig.A walk W satisfying a property P is right-maximal (resp., left-maximal ) if there is no walk W e (resp., eW ) satisfying P . A walk satisfying P is maximal if it is left- and right-maximal w.r.t. P .Notice that if G is a closed path, then every walk of G is an omnitig. As such, it is relevant tofind the maximal omnitigs of G only when G is different from a closed path. Thus, in the rest ofthis paper our strongly connected graph G is considered to be different from a closed path, evenwhen we do not mention it explicitly. In this section, unless otherwise stated, we assume that the input graph is compressed , in the sensethat it has no biunivocal nodes and arcs. In some algorithms we will also require that the graphhas constant in- and out-degree. In Section 6 we show how these properties can be guaranteed, bytransforming any strongly connected graph G with m arcs, in time O ( m ), into a compressed graphof constant degree and with O ( m ) nodes and arcs.In a compressed graph all arcs are split, join or bivalent. Moreover, in compressed graphs, thefollowing observation holds. Observation 9.
Let G be a compressed graph. Let f and g be a join and a split arc, respectively,in G . The following holds: ( i ) if f W g is a walk, then W has an internal node which is a bivalent node; ( ii ) if gW f is a walk, then gW f contains a bivalent arc. In the rest of this paper we will use the following technical lemmas (omitted proofs are inSection 6.2.).
Lemma 10.
Every maximal omnitig of a compressed graph contains both a join arc and a splitarc. Moreover, it has a bivalent arc or an internal bivalent node.
Lemma 11.
Let e be a join or a split arc. No omnitig can traverse e twice. Lemma 12.
Let u be a bivalent node. No omnitig contains u twice as an internal node. .1 Macronodes We now introduce a natural partition of the nodes of a compressed graph; each class of such apartition (i.e. a macronode ) contains precisely one bivalent node. We identify each class with theunique bivalent node they contain. All other nodes belonging to the same class are those that eitherreach the bivalent node with a join-free path or those that are reached by the bivalent node witha split-free path (recall Figure 3a).
Definition 13 (Macronode) . Let v be a bivalent node of G . Consider the following sets: R + ( v ) := { u ∈ V ( G ) : ∃ a join-free path from v to u } ; R − ( v ) := { u ∈ V ( G ) : ∃ a split-free path from u to v } . The subgraph M v induced by R + ( v ) ∪ R − ( v ) is called the macronode centered in v . Lemma 14.
In a compressed graph G , the following properties hold:i) The set { V ( M v ) : v is a bivalent node of G } is a partition of V ( G ) .ii) In a macronode M v , R + ( v ) and R − ( v ) induce two trees with common root v , but oriented inopposite directions. Except for the common root, the two trees are node disjoint, all nodes in R − ( v ) being join nodes and all nodes in R + ( v ) being split nodes.iii) The only arcs with endpoints in two different macronodes are bivalent arcs. To analyze how omnitigs can traverse a macronode and the degrees of freedom they have inchoosing their directions within the macronode, we introduce the following definitions.
Central-micro omnitigs are the smallest omnitigs that cross the center of a macronode.
Left- and right-micro omnitigs start from a central-micro omnitig and proceed to the periphery of a macronode.Finally, we combine left- and right-micro omnitigs into microtigs (which are not necessarily omnitigsthemselves); recall Figure 3b.
Definition 15 (Micro omnitigs, microtigs) . Let f be a join arc and g be a split arc, such that f g is an omnitig. • The omnitig f g is called a central-micro omnitig . • An omnitig f gW ( W f g , resp.) that does not contain a bivalent arc as an internal arc iscalled a right-micro omnitig (respectively, left-micro omnitig ). • A walk W = W f gW , where W f g and f gW are, respectively, a left-micro omnitig, and aright-micro omnitig, is called a microtig . Given a join arc f , we first find central micro-omnitigs (of the type f g ) with the genericfunction RightExtension ( G, f, W ) from Algorithm 1, where W is a join-free path (possibly empty).This extension uses the following weak version of the Extension Property (since W is join-free). Tobuild up the intuition, we also give a self-contained proof of this weaker result. Lemma 16 (Weak form of the Extension Property (Theorem 6)) . Let f W be an omnitig in G ,where f is a join arc and W is a join-free path. Then f W g is an omnitig if and only if g is theonly arc with t ( g ) = h ( W ) such that there exists a path from h ( g ) to h ( f ) in G (cid:114) f . roof. To prove the existence of an arc g , which satisfies the condition, consider any closed path P f (cid:48) in G , where f (cid:48) is an arbitrary sibling join arc of f . Notice that W is a prefix of P f (cid:48) , since f W is an omnitig, since otherwise one can easily find a forbidden path for the omnitig f W as asubpath of
P f (cid:48) , from the head of the very first arc of
P f (cid:48) that is not in W to h ( f (cid:48) ). Therefore, let g be the the first arc of P f (cid:48) after the prefix W , in such a way that the suffix of P f (cid:48) starting from h ( g ) is a path to h ( f ) in G (cid:114) f .For the direct implication, assume that there is a path P in G (cid:114) f from h ( g (cid:48) ), where g (cid:48) siblingof g and g (cid:48) (cid:54) = g , to h ( f ). Then, this forbidden path P contradicts the fact that f W g is an omnitig.For the reverse implication, assume that f W g is not an omnitig. Then take any forbidden path P for f W g . Since f W is an omnitig, P must start with some g (cid:48) sibling arc of g, g (cid:48) (cid:54) = g . Since W isjoin-free, then P must end in h ( f ) with the last arc different from f . Therefore, P is a path from h ( g (cid:48) ) to h ( f ) in G (cid:114) f .Not only Lemma 16 gives us an efficient extension mechanism, but it also immediately impliesthe Y-intersection Property (for clarity of reusability, we state both its symmetric variants). Corollary 17 (Y-intersection Property) . Let f W g be an omnitig, where f is a join arc, and g isa split arc.i) If W is a join-free path (possibly empty), then for any g (cid:48) a sibling split arc of g , the walk f W g (cid:48) is not an omnitig.ii) If W is a split-free path (possibly empty), then for any f (cid:48) a sibling join arc of f , the walk f (cid:48) W g is not an omnitig.
We now use the Y-intersection Property to prove the X-intersection Property.
Proof of the X-intersection Property (Theorem 5).
For point i ), assume there exists an arc g ,distinct from g and g , such that t ( g ) = v . Consider any shortest closed path g P (with P possibly empty), which exists by the strong connectivity of G . Let f be the last arc of P . If f (cid:54) = f then g P is a forbidden path for the omnitig f g , since g (cid:54) = g . Otherwise, if f = f then g P is a forbidden path for the omnitig f g , since g (cid:54) = g . In both cases we reached a contradiction,therefore g and g are the only arcs in G with t ( g ) = t ( g ) = v . To prove that f and f are theonly arcs in G with h ( f ) = h ( f ) = v one can proceed by symmetry.Point ii ) follows from Corollary 17 (by taking the W path of its statement to be empty) andfrom the symmetric analogue of Corollary 17.Given an omnitig f g , we obtain the maximal right-micro omnitig with function MaximalRightMicroOmnitig ( G, f, g ) from Algorithm 1. This works by extending f g , as much aspossible, with the function
RightExtension ( G, f, W ) (where initially W = g ). This extension stopswhen reaching the periphery of the macronode (i.e. a bivalent arc). Lemma 18.
The functions in Algorithm 1 are correct. Moreover, assuming that the graph hasconstant degree, we can preprocess it in time O ( m + n ) time, so that RightExtension ( G, f, W ) runsin constant time, and MaximalRightMicroOmnitig ( G, f, g ) runs in time linear in its output size.Proof. For
RightExtension ( G, f, W ), recall Lemma 16 and Theorem 7 and that the input graph isa compressed graph, and as such every node has constant degree.For
MaximalRightMicroOmnitig ( G, f, g ), notice that every iteration of the while loop increasesthe output by one arc and takes constant time, since
RightExtension ( G, f, W ) runs in O (1) time.11 lgorithm 1: Functions
RightExtension and
MaximalRightMicroOmnitig . Function
RightExtension ( G, f, W ) Input :
The compressed graph G , f W omnitig with W join-free. Returns :
The unique arc e such that f W e is an omnitig, if it exists. Otherwise, nil . S ← { e ∈ E ( G ) | t ( e ) = h ( W ) and there is a path from h ( e ) to h ( f ) in G (cid:114) f } if there is exactly one arc e ∈ S then return e return nil Function
MaximalRightMicroOmnitig ( G, f, g ) Input :
The compressed graph G , f g omnitig with f join arc and g split arc. Returns :
The path W such that f gW is a maximal right-micro omnitig. W ← empty path while True do if f gW ends with a bivalent arc then return W e ← RightExtension ( G, f, W ) if e = nil then return W W ← W e
Algorithm 2 is the procedure to obtain all maximal microtigs of a compressed graph. It firstfinds all central micro-omnitigs f g (with
RightExtension ( G, f, ∅ )), and it extends each to the right(i.e. forward in G ) and to the left (i.e. forward in G R ) with MaximalRightMicroOmnitig .To prove the correctness of Algorithm 2, we need to show some structural properties of micro-omnitigs and microtigs, as follows.
Lemma 19.
Let f g be a central-micro omnitig. The following hold:i) There exists at most one maximal right-micro omnitig f gW , and at most one maximal left-micro omnitig
W f g .ii) There exists a unique maximal microtig containing f g .Proof.
We prove only the first of the two symmetric statements in i ). If g is a bivalent arc, the claimtrivially holds by definition of maximal right-micro omnitig. Otherwise, a minimal counterexampleconsists of two right-micro omnitigs f gP g and f gP g (with P a join-free path possibly empty),with g and g distinct sibling split arcs. Since gP is a join-free path, the fact that both f gP g f gP g are omnitigs contradicts the Y-intersection Property (Corollary 17).For ii ), given f g , by i ) there exists at most one maximal left-micro omnitig W f g and at mostone maximal right-micro omnitig f gW , as such there is at most one maximal microtig W f gW . Lemma 20.
Let e be an arc. The following hold:i) if e is not a bivalent arc, then there exists at most one maximal microtig containing e .ii) if e is a bivalent arc, there exist at most two maximal microtigs containing e , of which atmost one is of the form eW , and at most one is of the form W e .Proof. By symmetry, in i ) we only prove the case in which e is a split-free arc. Notice that byLemma 14, h ( e ) belongs to a uniquely determined macronode M u of G ; let P be the split-free pathin G , from h ( e ) to u . Let f be the last arc of eP ( f = e if P is empty). By the X-intersection12 lgorithm 2: Function
AllMaximalMicrotigs Function
AllMaximalMicrotigs ( G ) Input :
The compressed graph G . Returns :
All the maximal microtigs in G . S ← ∅ foreach bivalent node u in G do foreach join arc f with h ( f ) = u do foreach split arc g with t ( g ) = u do if g = RightExtension ( G, f, ∅ ) then (cid:46)f g is a central-micro omnitig (cid:46) applied symmetrically for left- and right-micro omnitigs W ← MaximalRightMicroOmnitig ( G R , g, f ) W ← MaximalRightMicroOmnitig ( G, f, g ) add W f gW to S return S Property (Theorem 5), there exists at most one split arc g with t ( g ) = u = h ( f ) such that f g is anomnitig; if it exists, f g is a central-micro omnitig, hence by Lemma 19, there is at most one maximalleft-micro omnitig W f g . Finally, if such a maximal left-micro omnitig exists, eP g is a subwalk of
W f g , by the Y-intersection Property (Corollary 17). Otherwise, a minimal counterexample consistsof paths f Rg (subpath of eP g ) and f Rg (subpath of W f g ), where f (cid:54) = f and R is a split-freepath, since it is subpath of the split-free path eP ; since both f Rg and f Rg are omnitigs, thiscontradicts the Y-intersection Property.For ii ), we again prove only one of the symmetric cases. The proof is identical to the above,since by Lemma 14, h ( e ) belongs to a unique macronode M v of G . As such, e belongs to atmost one maximal microtig eW in M v . Symmetrically, t ( e ) belongs to a uniquely determinedmacronode M v of G . Thus, e belongs to at most one maximal microtig W e within M v . Theorem 21 (Maximal microtigs) . The maximal microtigs of any strongly connected graph G with n nodes, m arcs, and arbitrary degree have total length O ( n ) , and can be computed in time O ( m ) ,with Algorithm 2.Proof. First we prove the O ( n ) bound on the total length. As we explain in Section 6 we cantransform G into a compressed graph G (cid:48) such that G (cid:48) has n (cid:48) ≤ n nodes and m (cid:48) ≤ m arcs.Since G (cid:48) has at most n (cid:48) macronodes (recall that macronodes partition the vertex set, Lemma 14),and every macronode has at most two maximal microtigs, then number of maximal microtigs is atmost 2 n (cid:48) . The total length of all maximal microtigs is bounded as follows. Every internal arc ofa maximal microtig is not a bivalent arc, by definition. Since every non-bivalent arc appears in atmost one maximal microtig (Lemma 20), and there are at most n (cid:48) non-bivalent arcs in any graphwith n (cid:48) nodes, then the number of internal arcs in all maximal microtigs is at most n (cid:48) . Summingup for each maximal microtigs its two non-internal arcs (i.e., its first and last arc), we obtain thatthe total length of all maximal microtigs is at most 2 n (cid:48) + n (cid:48) = 3 n (cid:48) , thus O ( n ).As mentioned, in Section 6 we show how to transform G into a compressed graph G (cid:48) with O ( m )arcs, O ( m ) nodes, and constant degree. On this graph we can apply Algorithm 2. Since every13ode of the graph has constant degree, the if check in Line 6 runs a number of times linear in thesize O ( m ) of the graph. Checking the condition in Line 6 takes constant time, by Lemma 18; inaddition, the condition is true for every central-micro omnitig f g of the graph. The then blockcomputes a maximal microtig and takes linear time in its size, Lemma 18. By Lemma 20 we findevery microtig in linear total time. In this section we analyze how omnitigs go from one macronode to another. Macronodes areconnected with each other by bivalent arcs (Lemma 14), but merging microtigs on all possiblebivalent arcs may create too complicated structures. However, this can be avoided by a simpleclassification of bivalent arcs: those that connect a macronode with itself ( self-bivalent ) and thosethat connect two different macronodes ( cross-bivalent ), recall Figure 5.
Definition 22 (Self-bivalent and cross-bivalent arcs) . A bivalent arc b is called a self-bivalent arc if U ( b ) goes from a bivalent node to itself. Otherwise it is called a cross-bivalent arc . A macrotig is now obtained by merging those microtigs from different macronodes which overlaponly on a cross-bivalent arc, see also Figure 6.
Definition 23 (Macrotig) . Let W be any walk. W is called a macrotig if1. W is an microtig, or2. By writing W = W b W b . . . b k − W k − b k W k , where b , . . . , b k are all the internal bivalentarcs of W , the following conditions hold:(a) the arcs b , . . . , b k are all cross-bivalent arcs, and(b) W b , b W b , . . . , b k − W k − b k , b k W k are all microtigs. Notice that the above definition does not explicitly forbid two different macrotigs of the form W bW and W bW . However, Lemma 20 shows that there cannot be two different microtigs bW and bW , thus we immediately obtain: Lemma 24.
For any macrotig W there exists a unique maximal macrotig containing W .Proof. W.l.o.g., a minimal counterexample consists of a non-right-maximal macrotig
W b , such thatthere exist two distinct microtigs bW and bW (notice that b is a cross-bivalent arc). By Lemma 20applied to b , we obtain bW = bW , a contradiction.The macrotig definition also does not forbid a cross-bivalent arc to be used twice inside amacrotig. In Lemma 26 below we prove that also this is not possible, using the following result. Lemma 25 ([11]) . For any two distinct non-sibling split arcs g, g (cid:48) , write g ≺ g (cid:48) if there exists anomnitig gP g (cid:48) where P is split-free. Then, the relation ≺ is acyclic. Lemma 26.
Let W be a macrotig and let e be an arc of W . If e is self-bivalent, then e appears atmost twice in W (as first or as last arc of W ). Otherwise, e appears only once.Proof. If e is self-bivalent, then Definition 23 implies that e is either the first arc of W , the last arcof W , or both. Thus, e appears at most twice.Suppose now that e is not self-bivalent. We first consider the case when e is a split arc. We aregoing to prove that between any two consecutive non-self-bivalent split arcs the relation ≺ from14 g f g b = g b g f g b b f g f g b f u v w ℳ w ℳ v ℳ u b Figure 6:
Three macronodes M u , M v , M w (as gray areas) with arcs color-coded as in Figure 3a. Blackwalks mark their five maximal microtigs: b g . . . b , b i . . . f i g i . . . b i +1 ( i ∈ { , , } ), b . . . f g ( g = b ).The maximal macrotig M is obtained by overlapping them on the cross-bivalent arcs b , b , b , b , i.e. M = b . . . b . . . b . . . b . . . b . . . b . Notice that no arc is contained twice, with the exception of the cross-bivalentarc b , appearing as the first and last arc of M (Lemma 26). Bivalent nodes (e.g. u, v ) can appear (at most)twice in M (by the X-intersection Property and Lemma 19). Algorithm 3:
Function
AllMaximalMacrotigs . Function
AllMaximalMacrotigs ( G ) Input :
The compressed graph G . Returns :
All the maximal macrotigs in G . S ← AllMaximalMicrotigs ( G ) while ∃ W b ∈ S and bW ∈ S with b cross-bivalent arc and non-empty W , W do remove W b and bW from S add W bW to S return S Lemma 25 holds. Indeed, let g and g (cid:48) be two consecutive (i.e. closest distinct) non-self-bivalentsplit arcs along W : that is gP g (cid:48) subwalk of W , with P a split-free path. Notice that g and g (cid:48) arenot sibling arcs; since otherwise, g is a self-bivalent arc, by Observation 9. If t ( g (cid:48) ) is not a bivalentnode, then P is empty. In this case, g is a join-free arc, so gg (cid:48) is an omnitig; as such, g ≺ g (cid:48) .Otherwise, if t ( g (cid:48) ) is a bivalent node, then gP g (cid:48) is a left-micro omnitig and so it is an omnitig; assuch, again, g ≺ g (cid:48) .Suppose for a contradiction that e is traversed twice. Since there are no internal self-bivalentarcs (as argued at the beginning of the proof), this would result in a cycle in the relation ≺ , whichcontradicts Lemma 25.When e is a non-self-bivalent join arc, we proceed symmetrically. First, notice that the relationdefined in Lemma 25 is symmetric: if f and f (cid:48) are two distinct non-sibling join arcs such that f P f (cid:48) ,with P a join-free path, then f ≺ f (cid:48) . The claim above can be symmetrically adapted to hold forany two closest distinct non-self-bivalent join arcs f and f (cid:48) within a macrotig (i.e. correspondingto a subwalk of W of the form f P f (cid:48) , with P a join-free path). Moreover, f and f (cid:48) are not siblings;since otherwise, f (cid:48) is a self-bivalent arc, by Observation 9.Hence, by the acyclicity property of the relation ≺ on the reverse graph, the claim also holdsfor non-self-bivalent join arcs.Therefore, we can construct all maximal macrotigs by repeatedly joining microtigs overlappingon cross-bivalent arcs, as long as possible, as in Algorithm 3.15 heorem 27 (Maximal macrotigs) . The maximal macrotigs of any strongly connected graph G with n nodes, m arcs, and arbitrary degree have total length O ( n ) , and can be computed in time O ( m ) , with Algorithm 3.Proof. By Theorem 21, G has O ( n ) maximal microtigs, of total length O ( n ). By Lemma 26, everymaximal microtig is contained in a unique maximal macrotig (and it appears only once inside sucha macrotig), and the length of each maximal macrotig is at most the sum of the lengths of itsmaximal microtigs; thus, we have that the total length of all maximal macrotigs is at most O ( n ).Using Algorithm 2, we can get all the O ( n ) maximal microtigs of G in time O ( m ) (Theorem 21).Once we have them, we can easily implement Algorithm 3 in O ( m )-time. The correctness of thisalgorithm is guaranteed by Lemma 26. We begin by proving the first part of Theorem 2. Theorem 27 guarantees that the total length ofmaximal macrotigs is O ( n ). Thus, it remains to prove the following lemma, since Lemma 24 showsthat any macrotig is a subwalk of a maximal macrotig. Lemma 28 (Maximal omnitig representation) . Let W be a maximal omnitig. The followings hold: i ) If W contains an internal bivalent node, then W is of the form U ( f W (cid:48) g ) , where f is thefirst join arc of W and g (cid:54) = f is the last split arc of W , and W (cid:48) is a possibly empty walk.Moreover, f W (cid:48) g is a macrotig. ii ) Otherwise, W is of the form U ( b ) , where b is a bivalent arc, and b does not belong to anymacrotig.Proof. To prove i ), let u be an internal bivalent node of W , and let f u and g u be, respectively, thejoin arc and the split arc of W with h ( f u ) = u = t ( g u ); both such f u and g u exist, since u is aninternal node of W . Therefore, since W contains at least f u and g u , let f and g be, respectivelythe first join arc and the last split arc of W . Observe that f is either f u or it appears before f u in W ; likewise, g is either g u or it appears after g u in W . Thus, f comes before g , and we can write W = W − f W (cid:48) gW + , where W (cid:48) is the subwalk of W , possibly empty, from h ( f ) to t ( g ). Therefore,by the maximality of W , we have W = W − f W (cid:48) gW + = U ( f W (cid:48) g ).To prove that the subwalk f W (cid:48) g of W is a macrotig, we prove by induction that any walk ofthe form f W (cid:48) g , where f is a join arc and g is a split arc, is a macrotig. The induction is on thelength of W (cid:48) . Case 1: W (cid:48) contains no internal bivalent arcs. Since f W (cid:48) g contains a bivalent node (Observa-tion 9), it is of the form f W (cid:48) g = W (cid:48) f (cid:48) g (cid:48) W (cid:48) , with h ( f (cid:48) ) = t ( g (cid:48) ) = u bivalent node. Noticethat W (cid:48) f (cid:48) g (cid:48) W (cid:48) is an microtig and thus it is a macrotig, by definition. Case 2: f W (cid:48) g contains an internal bivalent arc b , i.e. f W (cid:48) g = W (cid:48) bW (cid:48) , with W (cid:48) , W (cid:48) non empty.By induction, W (cid:48) b and bW (cid:48) are macrotigs and both contain a bivalent node as internal node.Suppose b is a self-bivalent arc, then both W (cid:48) b and bW (cid:48) would contain the same bivalentnode u as internal node, contradicting Lemma 12. Thus, b is a cross-bivalent arc and W (cid:48) bW (cid:48) is also a macrotig, by definition.For ii ), notice that if W contains no internal bivalent node then it contains a unique bivalentarc b , by Lemma 10 and Observation 9. Thus, by the maximality of W , it holds that W = U ( b ).It remains to prove that there is no macrotig containing b .16uppose for a contradiction that there is a maximal left-micro omnitig M containing b . Bydefinition, M is of the form bW M f M g M . Notice that W g M is an omnitig, because M is an omnitigand the arcs of W before b are join-free, so W g M can have no forbidden path. This contradicts thefact that W is maximal.Symmetrically, we have that there is no maximal right-micro omnitig containing b . Thus, bydefinition, b appears in no microtig, and thus in no macrotig. Remark 29.
The number of maximal omnitigs containing an internal bivalent node (i.e., univocalextensions of a maximal macrotig subwalk) is O ( n ) , by maximality and by the fact that the totallength of maximal macrotigs is O ( n ) (Theorem 27). Next, we are going to prove the second, algorithmic, part of Theorem 2. By Theorem 27 we cancompute the maximal macrotigs of G in time O ( m ). We can trivially obtain in O ( m ) time the set F of arcs not appearing in the maximal macrotigs. It remains to show how to obtain the subwalksof the maximal macrotigs univocally extending to maximal omnitigs.We first prove an auxiliary lemma needed for the proof of the Extension Property (Theorem 6). Lemma 30.
Let f W be an omnitig, where f is a join arc. Let P be a path from t ( P ) = h ( W ) to anode in W , such that the last arc of P is not an arc of f W . Then no internal node of P is a nodeof W .Proof. Consider P W the longest suffix of P , such that no internal node of P W is a node of W .If P W = P , the lemma trivially holds. Let now W = ( u , e , u , e , . . . , e k , u k ). Let u i = t ( P W )and u j = h ( P W ). If i ≥ j , then P W is a forbidden path for f W ,a contradiction. Hence, assume i < j < k . Let f (cid:48) W Q be a closed path. Consider the walk Z = P W e j +1 . . . e k Q . Notice that e i +1 / ∈ Z and f / ∈ Z . Thus Z can transformed in a forbidden path for f W , from u i to h ( f ). Proof of the Extension Property (Theorem 6).
As seen in Lemma 16, at least one g exists whichsatisfies the condition. Assume g is a split arc, otherwise the statement trivially holds.First, assume that there is a g (cid:48) sibling split arc of g and a path P from h ( g ) to h ( f ) in G (cid:114) f .We prove that there exists a forbidden path for f W g . Let P W be the prefix of P ending in the firstoccurrence of a node in W (i.e., no node of P W belongs to W , except for h ( P W )). Notice that g (cid:48) P W is a forbidden path for the omnitig f W g (it is possible, but not necessary, that h ( P W ) = h ( f )).Second, take any forbidden path P for the omnitig f W g . We prove that there exists a g (cid:48) siblingsplit arc of g and a path from h ( g ) to h ( f ) in G (cid:114) f . Notice that t ( P ) = h ( W ) = t ( g ), otherwise P would be a forbidden path for f W . As such, P starts with a split arc g (cid:48) (cid:54) = g and, by Lemma 30, P does not contain f . Thus, the suffix of P from h ( g (cid:48) ) is a path in G (cid:114) f from h ( g (cid:48) ) to h ( f ).To describe the algorithm that identifies all maximal omnitigs (Algorithm 5), we first intro-duce an auxiliary procedure (Algorithm 4), which uses the Extension Property (Theorem 7) andTheorem 6 to find the unique possible extension of an omnitig. Corollary 31.
Algorithm 4 is correct. Moreover, assuming that the graph has constant degree, wecan preprocess it in time O ( m + n ) time, so that Algorithm 4 runs in constant time. Maximal omnitigs are identified with a two-pointer scan of maximal macrotigs (Algorithm 5):a left pointer always on a join arc f and a right pointer always on a split arc g , recall Figure 4. Forthe sake of completeness, we write Algorithm 5 so that it also outputs the maximal omnitigs. InSection 6.3 we explain what changes are needed when the graph does not have constant degree. Lemma 32 (Maximal omnitig enumeration) . Algorithm 5 is correct and, if the compressed graphhas constant degree, it runs in time linear in the total size of the graph and of its output. lgorithm 4: Function
IsOmnitigRightExtension Function
IsOmnitigRightExtension ( G, f, g ) Input :
The compressed graph G . A join arc f and a split arc g such that thereexists a walk f W g where f W is an omnitig. Returns :
Whether f W g is also an omnitig. S ← { g (cid:48) ∈ E ( G ) | t ( g (cid:48) ) = t ( g ) and there is a path from h ( g (cid:48) ) to h ( f ) in G (cid:114) f } return True if S = { g } and False otherwise
Proof.
Algorithm 3 returns every maximal macrotig in O ( m ) time, by Theorem 27.By Lemma 28, any maximal omnitig W is either of the form U ( f W (cid:48) g ) (where f W (cid:48) g is amacrotig, and thus also a subwalk of a maximal macrotig, by Lemma 24), or of the form W = U ( b ),where b is a bivalent arc not appearing in any macrotig.In the latter case, such omnitigs are outputted in Line 2. In the former case, it remains to provethat the external while cycle, in Line 5, outputs all the maximal omnitigs of the form U ( f W (cid:48) g )where f W (cid:48) g is contained in a maximal macrotig f ∗ Xg ∗ .At the beginning of the first iteration, W = U ( X [ f..g (cid:48) ]) is left-maximal since f = f ∗ . The firstinternal while cycle, in Line 6, ensures that W = U ( X [ f..g ]) is also right-maximal, at which point itis printed in output. Then, the second internal while cycle, in Line 10, ensures that W = U ( X [ f..g (cid:48) ])is a left-maximal omnitig, and the external cycle repeats.To prove the running time bound, observe that each iteration of the foreach cycle takes timelinear in the total size of the maximal macrotig X and of its output (by Corollary 31), and thatthe total size of all maximal macrotigs is linear, by Theorem 27. In this section, we describe three transformations of the given graph G to guarantee the assumptionof compression and constant degree on every node. It is immediate to see that they and their inversescan be performed in linear time. The first transformation allows us to reduce to the case in which the graph has constant out-degree(see Figure 7 for an example).
Transformation 1.
Given G , for every node v with d + ( v ) > , let e , e , . . . , e k be the arcs out-going from v . Replace v with the path ( v , e (cid:48) , v , e (cid:48) , . . . , e (cid:48) k − , v k − ) , where v , . . . , v k − are newnodes, and e (cid:48) , . . . , e (cid:48) k − are new edges. Each arc e i with t ( e i ) = v in G now has t ( e i ) = t ( e (cid:48) i ) = v i ,except for e k which has t ( e k ) = v k − . By also applying the symmetric transformation, the problem on the original graph G is thusreduced to a graph G (cid:48) with constant out- and in-degree. Notice that the number of arcs of G (cid:48) is still O ( m ), where m is the number of arcs of the original graph. As such, we can obtain the macrotigsof G (cid:48) in O ( m ) time. The trivial strategy to obtain all maximal omnitigs of G is to enumerate allmaximal omnitigs of G (cid:48) , and from these contract all the new arcs introduced by the transformation(while also removing duplicate maximal omnitigs, if necessary). However, thus may invalidate thelinear-time complexity of the enumeration step, since the length of the maximal omnitigs of G may18 lgorithm 5: Computing all maximal omnitigs
Input :
The compressed graph G . Outputs :
All maximal omnitigs of G . B ← { b bivalent arc | b does not occur in any W ∈ AllMaximalMacrotigs ( G ) } foreach b ∈ B do output U ( b ) foreach f ∗ Xg ∗ ∈ AllMaximalMacrotigs ( G ) do (cid:46) With the notation X [ f..g ], we refer to the subwalk of f ∗ W g ∗ starting with theoccurrence of f in f ∗ X (unique by Lemma 26) and ending with the occurrence of g in Xg ∗ (unique by Lemma 26). f ← f ∗ , g ← nil , g (cid:48) ← first split arc in Xg ∗ while g (cid:48) (cid:54) = nil do while g (cid:48) (cid:54) = nil and IsOmnitigRightExtension ( f, g (cid:48) ) do (cid:46) Grow X [ f..g ] to the right as long as possible g ← g (cid:48) g (cid:48) ← next split arc in Xg ∗ after g(cid:46)X [ f..g ] cannot be grown to the right anymore output U ( X [ f..g ]) while g (cid:48) (cid:54) = nil and not IsOmnitigRightExtension ( f, g (cid:48) ) do (cid:46) Shrink X [ f..g ] from the left until it can be grown to the right again i ← index of next join arc in f ∗ X after f v e ... e k T1 −−→ v e e v e e e k − v k − e k − e k Figure 7:
An example of Transformation 1 ( T
1) applied to the node v , where e , . . . , e k ∈ δ + ( v ) are thearcs with tail equal to v . be super-linear in total maximal omnitig length of G , see Figure 8. In Section 6.3 we explain howwe can easily modify the maximal omnitig enumeration step to maintain the O ( m ) output-sensitivecomplexity.To prove the correctness of Transformation 1, we proceed as follows. Let c e ( G ) be the graphobtained from G by contracting an arc e ( contracting e means that we remove e and identify itsendpoints). For every walk W of G , we denote by c e ( W ) the walk of c e ( G ), obtained from W byremoving every occurrence of e (here we regard walks as sequences of arcs). In the following, weregard c e as a surjective function from the family of walks of G to the family of walks of c e ( G ). Observation 33.
When e is a split-free or join-free arc, then c e is a bijection when restricted tothe closed (arc-covering) walks, or to the open walks of G whose first and last arc are different than e . Lemma 34.
Let e be a join-free arc of G . A walk W (cid:48) of c e ( G ) is an omnitig of c e ( G ) if and only wW e f wW eP PR ve e e m … … e ′ m − 2 e ′ ′ e ′ ′ m − 2 e e e m − 2 e m e ′ e m − 1 f wW e f wW eP PR ve e e m … … e ′ m − 2 e ′ ′ e ′ ′ m − 2 e e e m − 2 e m e ′ e m − 1 Figure 8:
Left: A graph G made up of a single node and m ≥ e , . . . , e m . Its m maximalomnitigs are e , . . . , e m . Right: The graph G (cid:48) obtained from G by applying Transformation 1 and itssymmetric transformation; the nodes of G (cid:48) have in-degree and out-degree at most 2. Notice that the numberof arcs of G (cid:48) is O ( m ). The m maximal omnitigs of G (cid:48) are of the form U ( e i ) = e (cid:48) · · · e (cid:48) i − e i e (cid:48)(cid:48) i − · · · e (cid:48)(cid:48) (for i ∈ { , . . . , m } ). Notice that their total length is Θ( m ), thus one cannot enumerate all maximal omnitigs of G (cid:48) and convert these to maximal omnitigs of G . However, one can stop all univocal extensions of the arcs e i when reaching arcs introduced by the transformations in G (cid:48) , see Section 6.3. if there exists an omnitig W of G such that W (cid:48) = c e ( W ) .Proof. Consider the shortest walk W of G such that W (cid:48) = c e ( W ). Notice that the first and last arcof W are different than e . Moreover, W (cid:48) is an omnitig of c e ( G ) iff W is an omnitig of G . Indeed,for every circular covering C of G it holds that C avoids W iff c e ( C ) avoids W (cid:48) . Corollary 35.
Let e be a join-free arc of G . A walk W (cid:48) of c e ( G ) is a maximal omnitig of c e ( G ) ifand only if there exists a maximal omnitig W of G such that W (cid:48) = c e ( W ) .Proof. Let W be a maximal omnitig of G . Then c e ( W ) is an omnitig of c e ( G ) by Lemma 34.Moreover, if W (cid:48) was an omnitig of c e ( G ) strictly containing c e ( W ), then there would exist anomnitig W of G such that W (cid:48) = c e ( W ), by Lemma 34. Clearly, W would contain W and contradictits maximality. Therefore, c e ( W ) is a maximal omnitig of c e ( G ).For the converse, let W (cid:48) be a maximal omnitig of c e ( G ). Let W be the shortest and uniqueminimal walk of G such that W (cid:48) = c e ( W ). By Lemma 34, W is an omnitig of G . Let W be anymaximal omnitig of G containing W . We claim that c e ( W ) = W (cid:48) = c e ( W ), which concludes theproof. If not, then c e ( W ) would strictly contain W (cid:48) and contradict its maximality since also c e ( W )would be an omnitig of c e ( G ) by Lemma 34. Lemma 36.
Let G be a graph and let G (cid:48) be the graph obtained by applying Transformation 1 to G . Then a walk W of G is a maximal omnitig of G if and only if there exists a maximal omnitig W (cid:48) of G (cid:48) such that W is the string obtained from W (cid:48) by suppressing all the arcs introduced withthe transformation.Proof. Notice that G is obtained by applying c e to each arc e introduced by Transformation 1, thatis, to each arc of G (cid:48) that is not an arc of G . Notice that W is the string obtained from W (cid:48) bysuppressing all the arcs introduced with the transformation if and only if W is obtained from W (cid:48) by contracting each arc e introduced by Transformation 1. Apply Corollary 35. We start by recalling the definition of compressed graph.
Definition 37 (Compressed graph) . A graph G is compressed if it contains no biunivocal nodesand no biunivocal arcs. v e v e v ‘ − v ‘ e ‘ T2 −−→ v v ‘ ev v ‘ e T3 −−→ v Figure 9:
An example of Transformation 2 ( T
2) applied to the path P = ( v , e , . . . , e (cid:96) , v (cid:96) ), where v , . . . , v (cid:96) − are biunivocal nodes and e is the new arc from v to v (cid:96) . The Transformation 3 ( T
3) com-presses biunivocal arcs.
To obtain a compressed graph, we introduce two transformations. The first one removes biu-nivocal nodes, by replacing those paths whose internal nodes are biunivocal with a single arc fromthe tail of the path to its head (see Figure 9 for an example).
Transformation 2.
Given G , for every longest path P = ( v , e , . . . , e (cid:96) − , v (cid:96) ) , (cid:96) ≥ , such that v , . . . , v (cid:96) − are biunivocal nodes, we remove v , . . . , v (cid:96) − and their incident arcs from G , and weadd a new arc from v to v (cid:96) . This transformation is widely used in the genome assembly field, and it clearly preserves themaximal omnitigs of G : if P = ( v , e , . . . , e (cid:96) − , v (cid:96) ) , (cid:96) ≥ v , . . . , v (cid:96) − are biunivocalnodes, in any closed arc-covering walk of G , whenever e appears it is always followed by e , . . . , e (cid:96) .The last transformation contracts the biunivocal arcs of the graph (see Figure 9 for an example). Transformation 3.
Given G , we contract every biunivocal arc e , namely we set t ( e (cid:48) ) = t ( e ) forevery out-going arc from h ( e ) and remove the node h ( e ) . Also this transformation preserves the maximal omnitigs of G because every maximal omnitigwhich contains an endpoint of e , also contains e . Notice that after Transformations 2 and 3, themaximum in-degree and the maximum out-degree are the same as in the original graph.In the remainder of this section we prove some lemmas stated in Section 4. Lemma 14.
In a compressed graph G , the following properties hold:i) The set { V ( M v ) : v is a bivalent node of G } is a partition of V ( G ) .ii) In a macronode M v , R + ( v ) and R − ( v ) induce two trees with common root v , but oriented inopposite directions. Except for the common root, the two trees are node disjoint, all nodes in R − ( v ) being join nodes and all nodes in R + ( v ) being split nodes.iii) The only arcs with endpoints in two different macronodes are bivalent arcs.Proof. For i ), let u and v be distinct bivalent nodes and suppose that there exists x ∈ V ( M u ) ∩ V ( M v ). W.l.o.g., assume x is a join node (the case where x is a split node is symmetric). Bydefinition, x ∈ R − ( u ) ∩ R − ( v ) holds. Let P u and P v be split-free paths from x to u and to v ,respectively. Notice that x can not be a bivalent node, since otherwise from x no split-free pathcan start. Since the out-degree of x is one, P u and P v share a prefix of length at least one, butsince u and v are distinct bivalent nodes, P u and P v differ by at least one arc. Let e be the first arcsuch that e ∈ P u , but e / ∈ P v , and let e (cid:48) be its sibling arc, with e (cid:48) / ∈ P u , but e (cid:48) ∈ P v . Notice that t ( e ) = w is a join node, since it belongs to split-free paths, but it also has out-degree two, since w = t ( e ) = t ( e (cid:48) ); hence w is an internal bivalent node of split-free paths, a contradiction.Properties ii ) and iii ) trivially follow from the definition of macronode.21 emma 10. Every maximal omnitig of a compressed graph contains both a join arc and a splitarc. Moreover, it has a bivalent arc or an internal bivalent node.Proof.
Consider an omnitig W composed only of split-free arcs. Notice first that W is a path.Consider any arc e , with h ( e ) = t ( W ) and observe that eW is an omnitig, since the only out-going arcs of internal nodes of eW are arcs of eW ; thus there is no forbidden path between anytwo internal nodes of eW . Therefore, W is not a maximal omnitig. Symmetrically, no maximalomnitig is composed only of join-free arcs. This already implies the first claim in the statement:any maximal omnitig W contains at least one join arc f and at least one split arc g . If f = g then W contains the bivalent arc f . Otherwise, either W contains a subwalk of the form f W (cid:48) g or itcontains a subwalk of the form gW (cid:48) f , where W (cid:48) might be an empty walk. In the first case W hasan internal node which is bivalent, by Observation 9( i ). In the second case W contains a bivalentarc, by Observation 9( ii ). Lemma 11.
Let e be a join or a split arc. No omnitig can traverse e twice.Proof. By symmetry, we only consider the case of two sibling split arcs g and g (cid:48) . Since prefixesand suffixes of omnitigs are omnitigs, then a minimal violating omnitig would be of the form gZg ,with g / ∈ Z . Since G is strongly connected, then there exists a simple cycle C of G with g (cid:48) ∈ C andwith g (cid:48) as its first arc. Notice that g / ∈ C , since C is simple. Consider then the first node u sharedby both C and Z , and let e be the arc of C with h ( e ) = u . Clearly, e / ∈ Z ; in addition, e (cid:54) = g , since C is a path. Let C u represent the prefix of C ending in u . Therefore, C u is a forbidden path forthe omnitig gZg , since it starts from t ( g ) = t ( g (cid:48) ), with g (cid:48) (cid:54) = g , and it ends in u with e / ∈ Z . Lemma 12.
Let u be a bivalent node. No omnitig contains u twice as an internal node.Proof. Suppose for a contradiction, there exist an omnitig W that contains u twice as internalnode. Since u is an internal node of W , we can distinguish the case in which an omnitig containstwice a central-micro omnitig that traverses u , and the case in which an omnitig contains boththe central-micro omnitigs that traverse u . In the first case, let f g be the central-micro omnitigof an omnitig W that traverses u . Notice that f is a join arc contained twice in W , contradictingLemma 11. In the latter case, let f g and f g the two central-micro omnitigs that traverse u ,with f (cid:54) = f and g (cid:54) = g . Consider W to be a minimal violating omnitig of the form f g ¯ W f g .Notice that u / ∈ ¯ W , by minimality; hence g ¯ W f is a forbidden path, contradicting W being anomnitig. Given the input strongly connected graph G with m arcs, and non-constant degree, denote by G (cid:48) the graph with constant in-degree and out-degree obtained by applying Transformation 1 andits symmetric. The trivial strategy to obtain the set of maximal omnitigs of G , given the set ofmaximal omnitigs G (cid:48) , is to:1. Contract in the maximal omnitigs all the arcs which were introduced by Transformation 1.2. Remove any duplicate omnitig which may occur due to this contraction (i.e., two differentmaximal omnitigs in G (cid:48) which result in the same walk in the G , after the contraction).In general, the above procedure may require more than linear time in the final output size, recallFigure 8. 22e avoid this, as follows. Let M and M (cid:48) denote the set of maximal macrotigs of G and G (cid:48) ,respectively, and let F and F (cid:48) denote the set of bivalent arcs not appearing in any macrotig, of G and G (cid:48) , respectively (recall Theorem 2).First, since G (cid:48) has O ( m ) arcs, then also the maximal macrotigs M (cid:48) have total length O ( m ),and both M (cid:48) and F (cid:48) can be obtained in O ( m ) time. From M (cid:48) , one can obtain M in time O ( m ),by contracting the arcs introduced by the transformation. However, while contracting such arcs,we must keep track of the pair of arcs ( f, g ) corresponding to maximal omnitigs, as follows.We modify Algorithm 5 to also report, for each macrotig X (cid:48) of G (cid:48) and for each maximal omnitigof the form U ( X (cid:48) [ f..g ]) (in the order they were generated by the algorithm), the indexes of the arcs f and g in X (cid:48) . We now contract the arcs of X (cid:48) by removing from X (cid:48) every occurrence of the arcsintroduced by the transformation, and updating the indexes of f and g so that they still point atthe first and last arc of the walk obtained from X (cid:48) [ f..g ], after the contraction. Second, to avoidduplicates, we scan the pair of indexes of f and g along each macrotig, and remove any duplicatedpair (if duplicates are present, they must occur consecutively, and thus they can be removed inlinear time).Second, the transformations do not introduce bivalent arcs, thus F = F (cid:48) . This also impliesthat the arcs introduced by the transformation appear either inside macrotigs, or inside univocalextensions U ( · ). Having the set of maximal macrotigs M and the new arc pairs ( f, g ) inside themaximal macrotigs in M , it now suffices to perform the univocal extensions U ( · ) inside the originalgraph G . We thank Sebastian Schmidt for useful comments, including the observation that the bound onthe total length of all maximal macrotigs can be improved to O ( n ) (from O ( m ) initially), ShahbazKhan for helpful discussions and comments, and Bastien Cazaux for discussions on the shortestsuperstring problem. This work was partially funded by the European Research Council (ERC)under the European Union’s Horizon 2020 research and innovation programme (grant agreementNo. 851093, SAFEBIO) and by the Academy of Finland (grants No. 322595, 328877). References [1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results forLCS and other sequence similarity measures. In Venkatesan Guruswami, editor,
IEEE 56thAnnual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA,17-20 October, 2015 , pages 59–78. IEEE Computer Society, 2015.[2] Nidia Obscura Acosta, Veli M¨akinen, and Alexandru I. Tomescu. A safe and complete algo-rithm for metagenomic assembly.
Algorithms for Molecular Biology , 13(1):3:1–3:12, 2018.[3] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basiclocal alignment search tool.
Journal of molecular biology , 215(3):403–410, 1990.[4] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratictime (unless SETH is false). In Rocco A. Servedio and Ronitt Rubinfeld, editors,
Proceedingsof the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015,Portland, OR, USA, June 14-17, 2015 , pages 51–58. ACM, 2015.235] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? InIrit Dinur, editor,
IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA , pages 457–466.IEEE Computer Society, 2016.[6] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. InDavid B. Shmoys, editor,
Symposium on Theory of Computing, STOC 2014, New York, NY,USA, May 31 - June 03, 2014 , pages 148–193. ACM, 2014.[7] Djamal Belazzougui and Simon J. Puglisi. Range predecessor and lempel-ziv parsing. In RobertKrauthgamer, editor,
Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium onDiscrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016 , pages 2053–2071. SIAM, 2016.[8] S´ebastien Boisvert, Fran¸cois Laviolette, and Jacques Corbeil. Ray: simultaneous assemblyof reads from a mix of high-throughput sequencing technologies.
Journal of computationalbiology , 17(11):1519–1533, 2010.[9] G. Bresler, M. Bresler, and D. Tse. Optimal Assembly for High Throughput Shotgun Sequenc-ing.
BMC Bioinformatics , 14(Suppl 5):S18, 2013.[10] Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I Tomescu, andElia C Zirondelli. Genome assembly, a universal theoretical framework: unifying and general-izing the safe and complete algorithms.
Submitted , November 2020.[11] Massimo Cairo, Paul Medvedev, Nidia Obscura Acosta, Romeo Rizzi, and Alexandru I.Tomescu. An Optimal O ( nm ) Algorithm for Enumerating All Walks Common to All ClosedEdge-covering Walks of a Graph. ACM Trans. Algorithms , 15(4):48:1–48:17, 2019.[12] Katar´ına Cechl´arov´a. Persistency in the assignment and transportation problems.
Mat. Meth.OR , 47(2):243–254, 1998.[13] Kun-Mao Chao, Ross C. Hardison, and Webb Miller. Locating well-conserved regions withina pairwise alignment.
CABIOS , 9(4):387–396, 1993.[14] Rayan Chikhi and Paul Medvedev. Informed and automated k -mer size selection for genomeassembly. Bioinformatics , 30(1):31–37, 06 2013.[15] Marie Costa. Persistency in maximum cardinality bipartite matchings.
Oper. Res. Lett. ,15(3):143–9, 1994.[16] Bart(cid:32)lomiej Dudek and Pawe(cid:32)l Gawrychowski. Computing quartet distance is equivalent tocounting 4-cycles. In
Proceedings of the 51st Annual ACM SIGACT Symposium on Theory ofComputing , STOC 2019, pages 733–743, New York, NY, USA, 2019. Association for ComputingMachinery.[17] Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison.
Biological sequenceanalysis: probabilistic models of proteins and nucleic acids . Cambridge university press, 1998.[18] David Eppstein. K-best enumeration.
Bulletin of the EATCS , 115, 2015.[19] David Eppstein. k -best enumeration. In Encyclopedia of Algorithms . 2015.2420] Massimo Equi, Roberto Grossi, Veli M¨akinen, and Alexandru I. Tomescu. On the complexityof string matching for graphs. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini,and Stefano Leonardi, editors, , volume 132 of
LIPIcs , pages55:1–55:15. Schloss Dagstuhl - Leibniz-Zentrum f¨ur Informatik, 2019.[21] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In , pages 390–398. IEEE Computer Society, 2000.[22] Paolo Ferragina, Igor Nitto, and Rossano Venturini. On the bit-complexity of lempel-zivcompression. In Claire Mathieu, editor,
Proceedings of the Twentieth Annual ACM-SIAMSymposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009 ,pages 768–777. SIAM, 2009.[23] A Friemann and S Schmitz. A new approach for displaying identities and differences amongaligned amino acid sequences.
Comput Appl Biosci , 8(3):261–265, Jun 1992.[24] Loukas Georgiadis, Giuseppe F Italiano, and Nikos Parotsidis. Strong connectivity in directedgraphs under failures, with applications. In
Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms , pages 1880–1899. SIAM, 2017.[25] Roberto Grossi.
Enumeration of Paths, Cycles, and Spanning Trees , pages 640–645. SpringerNew York, New York, NY, 2016.[26] Meigu Guan. Graphic programming using odd and even points.
Chinese Math. , 1:237–277,1962.[27] A. Gu´enoche. Can we recover a sequence, just knowing all its subsequences of given length?
Computer Applications in the Biosciences , 8(6):569–574, 1992.[28] P. L. Hammer, P. Hansen, and B. Simeone. Vertices belonging to all or to no maximum stablesets of a graph.
SIAM Journal on Algebraic Discrete Methods , 3(4):511–522, 1982.[29] Iu, V. L. Florent’ev, A. A. Khorlin, K. R. Khrapko, and V. V. Shik. Determination of thenucleotide sequence of DNA using hybridization with oligonucleotides. A new method.
DokladyAkademii nauk SSSR , 303(6):1508–1511, 1988.[30] Benjamin Grant Jackson.
Parallel methods for short read assembly . PhD thesis, Iowa StateUniversity, 2009.[31] Evgeny Kapun and Fedor Tsarev. De Bruijn superwalk with multiplicities problem is NP-hard.
BMC Bioinformatics , 14(Suppl 5):S7, 2013.[32] John D. Kececioglu and Eugene W. Myers. Combinatorial algorithms for DNA sequenceassembly.
Algorithmica , 13(1/2):7–51, 1995.[33] John Dimitri Kececioglu.
Exact and approximation algorithms for DNA sequence reconstruc-tion . PhD thesis, University of Arizona, Tucson, AZ, USA, 1992.[34] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWTconstruction and optimal LCE data structure. In Moses Charikar and Edith Cohen, editors,
Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC2019, Phoenix, AZ, USA, June 23-26, 2019 , pages 756–767. ACM, 2019.2535] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors.In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors,
Proceedings of the 50thAnnual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA,USA, June 25-29, 2018 , pages 827–840. ACM, 2018.[36] Carl Kingsford, Michael C Schatz, and Mihai Pop. Assembly complexity of prokaryoticgenomes using short reads.
BMC Bioinformatics , 11(1):21, 2010.[37] Carl Kingsford, Michael C Schatz, and Mihai Pop. Assembly complexity of prokaryoticgenomes using short reads.
BMC bioinformatics , 11(1):21, 2010.[38] Masashi Kiyomi.
Reverse Search; Enumeration Algorithms , pages 1840–1842. Springer NewYork, New York, NY, 2016.[39] Ka-Kit Lam, Asif Khalak, and David Tse. Near-optimal assembly for shotgun sequencing withnoisy reads.
BMC Bioinform. , 15(S-9):S4, 2014.[40] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2.
NatureMethods , 9(4):357, 2012.[41] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. Megahit:an ultra-fast single-node solution for large and complex metagenomics assembly via succinctde bruijn graph.
Bioinformatics , 31(10):1674–1676, 2015.[42] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheelertransform.
Bioinformatics , 25(14):1754–1760, 2009.[43] Veli M¨akinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu.
Genome-ScaleAlgorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing .Cambridge University Press, 2015.[44] Paul Medvedev. Modeling biological problems in computer science: a case study in genomeassembly.
Briefings in bioinformatics , 20(4):1376–1383, 2019.[45] Paul Medvedev and Michael Brudno. Maximum likelihood genome assembly.
Journal ofcomputational biology , 16(8):1101–1116, 2009.[46] Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability ofmodels for sequence assembly. In
WABI , pages 289–301, 2007.[47] Eugene W. Myers. The fragment assembly string graph. In
ECCB/JBI , page 85, 2005.[48] Niranjan Nagarajan and Mihai Pop. Parametric complexity of sequence assembly: theory andapplications to next generation sequencing.
Journal of computational biology , 16(7):897–908,2009.[49] Niranjan Nagarajan and Mihai Pop. Sequence assembly demystified.
Nature Reviews Genetics ,14(3):157–167, 2013.[50] Giuseppe Narzisi, Bud Mishra, and Michael C Schatz. On algorithmic complexity of biomolec-ular sequence assembly problem. In
Algorithms for Computational Biology , pages 183–195.Springer, 2014. 2651] Hannu Peltola, Hans S¨oderlund, Jorma Tarhio, and Esko Ukkonen. Algorithms for some stringmatching problems arising in molecular genetics. In
IFIP Congress , pages 59–64, 1983.[52] P. A. Pevzner. l-Tuple DNA sequencing: computer analysis.
Journal of Biomolecular Structure& Dynamics , 7(1):63–73, August 1989.[53] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNAfragment assembly.
Proceedings of the National Academy of Sciences , 98(17):9748–9753, 2001.[54] Jue Ruan and Heng Li. Fast and accurate long-read assembly with wtdbg2.
Nature Methods ,17(2):155–158, 2020.[55] Ilan Shomorony, Samuel H. Kim, Thomas A. Courtade, and David N. C. Tse. Information-optimal genome assembly via sparse read-overlap graphs.
Bioinform. , 32(17):494–502, 2016.[56] Alexandru I. Tomescu and Paul Medvedev. Safe and Complete Contig Assembly Via Omnitigs.In Mona Singh, editor,
Research in Computational Molecular Biology - 20th Annual Confer-ence, RECOMB 2016, Santa Monica, CA, USA, April 17-21, 2016, Proceedings , volume 9649of
Lecture Notes in Computer Science , pages 152–163. Springer, 2016.[57] Alexandru I. Tomescu and Paul Medvedev. Safe and complete contig assembly through om-nitigs.
Journal of Computational Biology , 24(6):590–602, 2017.[58] Martin Vingron and Patrick Argos. Determination of reliable regions in protein sequencealignments.
Prot. Engin. , 3(7):565–569, 1990.[59] Virginia Vassilevska Williams, Joshua R. Wang, Richard Ryan Williams, and Huacheng Yu.Finding four-node subgraphs in triangle time. In
Proceedings of the Twenty-Sixth AnnualACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January4-6, 2015 , pages 1671–1680, 2015.[60] Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao,Jun-Hua Tian, Yuan-Yuan Pei, Ming-Li Yuan, Yu-Ling Zhang, Fa-Hui Dai, Yi Liu, Qi-MinWang, Jiao-Jiao Zheng, Lin Xu, Edward C. Holmes, and Yong-Zhen Zhang. A new coronavirusassociated with human respiratory disease in china.
Nature , 579(7798):265–269, 2020.[61] M Zuker. Suboptimal sequence alignment in molecular biology. alignment with error analysis.
J Mol Biol , 221(2):403–420, Sep 1991.
A Bioinformatics motivation
As mentioned in the Introduction, closed arc-covering walks were considered in [56, 57] motivatedby the genome assembly problem from Bioinformatics. We briefly review that motivation here, forthe sake of completeness, and further refer the reader also to [49, 43].The genome assembly problem asks for the reconstruction of a genome string from a set R ofshort strings ( reads ) sequenced from the genome. From the read data, one usually builds a graph,and models the genome to be assembled as a certain type of walk in the graph.One of the most popular types of graphs is the so-called de Bruijn graph. For a fixed integer k ,shorter than the read length, the de Bruijn graph of order k is obtained by adding a node for every k − k −
1) appearing in the reads. Moreover, for every k -mer27 G GT TA ACGG GA CCC G TACCGGAC GCG GT TA ACGG GA CCCG GT TA ACGG GA CC CGTACCGTACGGACG CGT CCG ACG ACG GTA CGT CGG
CGC
TAC GTA GGA
GCG
ACC TAC GACC G T ACCGGGA C TAC G C G T ACCGGGA C TAC G (a) Top left: a circular string and a set of strings ( reads ) of length 3 sequenced from it, drawn as blackarrows. Bottom left: The same circular string now shown as linearized (in blue) and the same set of reads(in black) sequenced from it. Notice that the two reads in bold overlap the beginning and the end of thecircular string. Right: the de Bruijn graph G of order k = 3 built from the set of reads. Every distinct( k − k −
1) appearing in the reads is a node, and every k -mer appearing in the readsis an edge. We show as dotted-green a path appearing in all closed arc-coverings of G . We show the samepath also as a substring of the two circular strings from Figures 10b and 10c.) CG GT TA ACGG GA CCC G TACCGGAC GCG GT TA ACGG GA CCCG GT TA ACGG GA CC CGTACCGTACGGACG CGT CCG ACG ACG GTA CGT CGG
CGC
TAC GTA GGA
GCG
ACC TAC GACC G T ACCGGGA C TAC G C G T ACCGGGA C TAC G (b) Left: In red, a closed arc-covering of G , which is also Eulerian. On the right, the circular string “spelled”by it, obtained by naturally reading the k -mers of its edges and merging their ( k − set of k -mers as the string from Figure 10a, but is different from it. CG GT TA ACGG GA CCC G TACCGGAC GCG GT TA ACGG GA CCCG GT TA ACGG GA CC CGTACCGTACGGACG CGT CCG ACG ACG GTA CGT CGG
CGC
TAC GTA GGA
GCG
ACC TAC GACC G T ACCGGGA C TAC G C G T ACCGGGA C TAC G (c) Left: In red, a closed arc-covering of G spelling the same string as the circular string from Figure 10a. Figure 10:
An example of a circular string, a de Bruijn graph built from a set of reads sequenced from it,two closed arc-covering walks of the graph, and a safe path appearing in all closed arc-covering walks of thegraph. This safe path is biunivocal, but, as exemplified in Figure 5, other graphs can admit more complexsafe walks.
28n the reads, one also adds an arc from the node representing its length-( k −
1) prefix to the noderepresenting its length-( k −
1) suffix. See Figure 10a.Given such a graph, there are various ways of modeling the genome assembly solution. Assumingthat the genome is circular (like in the case of most bacteria), a basic approach is to model it asa closed