A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs
Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, Eric Rivals
Sangsoo Park, Samsung Electronics, Korea
Sung Gwan Park, Samsung Electronics, Korea
Bastien Cazaux, LIRMM, Univ Montpellier, CNRS, France
Kunsoo Park, Seoul National University, Korea
Eric Rivals, LIRMM, Univ Montpellier, CNRS, France
Abstract
The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. The best known algorithm constructs the HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of the lengths of the strings in P. In this paper we present a new algorithm to construct the HOG in O(||P||) time and space. Hence, the construction time and space of the HOG are better than those of the overlap graph, which are O(||P|| + n^2).

2012 ACM Subject Classification Theory of computation → Pattern matching
Keywords and phrases overlap graph, hierarchical overlap graph, shortest superstring problem, border array
For a set of strings, a superstring of the set is a string that has all strings in the set as substrings. The shortest superstring problem is to find a shortest superstring of a set of strings. This problem is known to play an important role in DNA assembly, which is the problem of reconstructing an entire genome from short sequencing reads. Despite its importance, the shortest superstring problem is known to be NP-hard [6]. As a result, extensive research has been done to find good approximation algorithms for the shortest superstring problem [2, 18, 11, 13, 19, 20].

The shortest superstring problem reduces to finding a shortest Hamiltonian path in a graph that encodes overlaps between the strings [2, 12, 16], which is the distance graph or, equivalently, the overlap graph. The overlap graph [15] of a set of strings is a graph in which each string constitutes a node and an edge connecting two nodes shows the longest overlap between them. Many approaches for approximating the shortest superstring problem focus on the overlap graph and try to find good approximations of its Hamiltonian path [11, 13].

Given a set of strings P = {s_1, s_2, ..., s_n}, computing the overlap graph of P is equivalent to solving the all-pairs suffix-prefix problem, which is to find the longest overlap for every ordered pair of strings in P. The best theoretical bound for this problem is O(||P|| + n^2) [8], where ||P|| is the sum of the lengths of the strings in P. Since the input size of the problem is O(||P||) and the output size is O(n^2), this bound is optimal. There has also been extensive research on the all-pairs suffix-prefix problem from the practical point of view [7, 10, 17] because it is the first step in DNA assembly.

© Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals; licensed under Creative Commons License CC-BY 4.0

Figure 1
Data structures built with P = {aacaa, aagt, gtc}. (a) Aho-Corasick trie. (b) Extended Hierarchical Overlap Graph. (c) Hierarchical Overlap Graph.

Recently, Cazaux and Rivals [4, 5] proposed a new graph which stores the overlap information, called the hierarchical overlap graph (HOG). The HOG is a graph with two types of edges (which will be defined in Section 2) in which a node represents either a string or the longest overlap between a pair of strings. The extended hierarchical overlap graph (EHOG) is also a graph with two types of edges, in which a node represents either a string or an overlap between a pair of strings (which may not be the longest one). For example, Figure 1 shows the EHOG and the HOG built with P = {aacaa, aagt, gtc}. Even though the HOG and the EHOG may be the same for some input instances, there are instances where the ratio of the EHOG size to the HOG size tends to infinity with respect to the number of nodes. Therefore, the HOG has an advantage over the EHOG from both practical and theoretical points of view.

The HOG also has an advantage compared to the overlap graph [5]. First, the HOG uses only O(||P||) space, while the overlap graph needs O(||P|| + n^2) space in total. For input instances with many short strings, the HOG uses a considerably smaller amount of space than the overlap graph. Second, the HOG contains the relationships between the overlaps themselves, since the overlaps appear as nodes in the HOG. In contrast, the overlap graph stores only the lengths of the longest overlaps, and thus we cannot find the relationship between two overlaps easily. Therefore, the HOG stores more information than the overlap graph, while using less space.

There have been many works to compute the HOG and the EHOG efficiently. Computing the EHOG from P costs O(||P||) time, which is optimal [3]. For computing the HOG, Cazaux and Rivals proposed an O(||P|| + n^2) time algorithm using O(||P|| + n × min(n, max{|s| : s ∈ P})) space [5]. Recently, Park et al.
[14] gave an O(||P|| log n) time algorithm using O(||P||) space by using the segment tree data structure.

In this paper we present a new algorithm to compute the HOG, which uses O(||P||) time and space, both of which are optimal. The algorithm is based on the Aho-Corasick trie [1] and the border array [9]. Therefore, the construction time and space of the HOG are better than those of the overlap graph, which are O(||P|| + n^2), and this fact may lead to many applications of the HOG. For example, consider the problem of finding an optimal cycle cover in the overlap graph built with a set P = {s_1, s_2, ..., s_n} of strings. This problem typically needs to be solved when finding good approximations of shortest superstrings. A greedy algorithm to solve the optimal cycle cover problem on the overlap graph was given in [2], which takes O(||P|| + n^2) time. Recently, Cazaux and Rivals proposed an O(||P||) time algorithm to solve the optimal cycle cover problem given the HOG or EHOG of P [4]. By using our result in this paper, the optimal cycle cover problem can be solved in O(||P||) time and space by using the HOG instead of the overlap graph.

The rest of the paper is organized as follows. In Section 2 we give preliminary information for the HOG and formalize the problem. In Section 3 we present an O(||P||) time and space algorithm for computing the HOG. In Section 4 we conclude and discuss future work.

We consider strings over a constant-size alphabet Σ. The length of a string s is denoted by |s|. Given two integers 1 ≤ l ≤ r ≤ |s|, the substring of s which starts at position l and ends at position r is denoted by s[l..r]. Note that s[l..r] is a prefix of s when l = 1, and a suffix of s when r = |s|. If a prefix (suffix) of s is different from s, we call it a proper prefix (suffix) of s. Given two strings s and t, an overlap from s to t is a string which is both a proper suffix of s and a proper prefix of t.
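To make the definition of an overlap concrete, here is a small illustrative Python sketch (ours, not part of the paper's algorithm) that enumerates all overlaps from s to t by checking every proper prefix of t against the corresponding proper suffix of s:

```python
def overlaps(s, t):
    """All overlaps from s to t: strings that are simultaneously
    a proper suffix of s and a proper prefix of t."""
    return [t[:l] for l in range(1, min(len(s), len(t)))
            if s[len(s) - l:] == t[:l]]

# With s = t = "aacaa", both "a" and "aa" are overlaps from s to itself.
print(overlaps("aacaa", "aacaa"))  # ['a', 'aa']
```

By the definition, the empty string ϵ is also an overlap from any s to any t; the sketch lists only the non-empty ones.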
Given a set P = {s_1, s_2, ..., s_n} of strings, the sum of the |s_i|'s is denoted by ||P||. We define the hierarchical overlap graph and the extended hierarchical overlap graph as in [5].

▶ Definition 1 (Hierarchical Overlap Graph). Given a set P = {s_1, s_2, . . . , s_n}, we define Ov(P) as the set of the longest overlaps from s_i to s_j for 1 ≤ i, j ≤ n. The hierarchical overlap graph of P, denoted by HOG(P), is a directed graph with a vertex set V = P ∪ Ov(P) ∪ {ϵ} and an edge set E = E_1 ∪ E_2, where E_1 = {(x, y) ∈ V × V | x is the longest proper prefix of y} and E_2 = {(x, y) ∈ V × V | y is the longest proper suffix of x}.

▶ Definition 2 (Extended Hierarchical Overlap Graph). Given a set P = {s_1, s_2, . . . , s_n}, we define Ov^+(P) as the set of all overlaps from s_i to s_j for 1 ≤ i, j ≤ n. The extended hierarchical overlap graph of P, denoted by EHOG(P), is a directed graph with a vertex set V^+ = P ∪ Ov^+(P) ∪ {ϵ} and an edge set E^+ = E^+_1 ∪ E^+_2, where E^+_1 = {(x, y) ∈ V^+ × V^+ | x is the longest proper prefix of y} and E^+_2 = {(x, y) ∈ V^+ × V^+ | y is the longest proper suffix of x}.

Figure 1 shows the Aho-Corasick trie [1], the EHOG, and the HOG built with P = {aacaa, aagt, gtc}. It is shown in [5] that the EHOG is a contracted form of the Aho-Corasick trie, and the HOG is a contracted form of the EHOG.

As in the Aho-Corasick trie, each node u in the HOG or the EHOG corresponds to a string (denoted by S(u)), which is the concatenation of all labels on the path from the root (the node representing ϵ) to u.

There are two types of edges in the EHOG and the HOG, as in the Aho-Corasick trie: tree edges and failure links. An edge (u, v) is a tree edge (an edge in E^+_1 or E_1, solid line in Figure 1) in an EHOG (HOG) if S(u) is the longest proper prefix of S(v) in the EHOG (HOG). It is a failure link (an edge in E^+_2 or E_2, dotted line in Figure 1) in an EHOG (HOG) if S(v) is the longest proper suffix of S(u) in the EHOG (HOG).
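Before turning to the efficient construction, the vertex set of HOG(P) in Definition 1 can be pinned down by a quadratic-time brute-force reference sketch (ours, for illustration only); on the set P of Figure 1 it reproduces the node set of the HOG:

```python
def longest_overlap(s, t):
    """Longest proper suffix of s that is also a proper prefix of t
    (the empty string if no non-empty overlap exists)."""
    for l in range(min(len(s), len(t)) - 1, 0, -1):
        if s[len(s) - l:] == t[:l]:
            return t[:l]
    return ""

def hog_vertices(P):
    """V = P ∪ Ov(P) ∪ {ϵ}, where Ov(P) collects the longest overlap
    of every ordered pair (i = j allowed), as in Definition 1."""
    ov = {longest_overlap(s, t) for s in P for t in P}
    return set(P) | ov | {""}

P = ["aacaa", "aagt", "gtc"]
print(sorted(hog_vertices(P)))  # ['', 'aa', 'aacaa', 'aagt', 'gt', 'gtc']
```

Here '' stands for ϵ; the non-trivial overlap nodes are aa (e.g. from aacaa to aagt) and gt (from aagt to gtc), matching the HOG of Figure 1. Section 3 computes the same set Ov(P) in O(||P||) time.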
Given a set P = {s_1, s_2, . . . , s_n} of strings, we can build the EHOG of P in O(||P||) time and space [5]. Furthermore, given EHOG(P) and Ov(P), we can compute HOG(P) in O(||P||) time and space [5]. Therefore, the bottleneck of computing HOG(P) is computing Ov(P) efficiently.

In this section we introduce an algorithm to build the HOG of P = {s_1, s_2, . . . , s_n} in O(||P||) time. For simplicity of presentation, we assume that there are no two distinct strings s_i, s_j ∈ P such that s_i is a substring of s_j. Our algorithm directly computes HOG(P) (and Ov(P)) from the Aho-Corasick trie of P in O(||P||) time.

Assume that we have the Aho-Corasick trie of P, including the failure links. We define R(u) as follows:

R(u) = { i ∈ {1, . . . , n} | S(u) is a proper prefix of s_i }.   (1)

That is, R(u) is the set of indices of the strings in the subtree rooted at u if u is an internal node, or the empty set if u is a leaf node.

For each input string s_i, we do the following operation separately, which finds the longest overlap from s_i to each string in P. Consider the path (v_0, v_1, . . . , v_l) which starts from the leaf representing s_i and follows the failure links until it reaches the root, i.e., S(v_0) = s_i and v_l is the root of the tree. By definition of the failure link, the strings corresponding to the nodes on the path are suffixes of s_i. If there are an index j and a node v_k on the path such that j ∈ R(v_k), then S(v_k) is both a suffix of s_i and a proper prefix of s_j, so S(v_k) is an overlap from s_i to s_j.

S(v_k) for 0 < k ≤ l is the longest overlap from s_i to s_j if and only if j ∈ R(v_k) and there is no m such that 0 ≤ m < k and j ∈ R(v_m). If such an m exists, then S(v_m) is a longer overlap from s_i to s_j than S(v_k), so S(v_k) is not the longest overlap. Therefore, we get the following lemma.

▶ Lemma 3.
S(v_k) is the longest overlap from s_i to s_j if and only if j ∈ R(v_k) − R(v_{k−1}) − . . . − R(v_0).

Therefore, if |R(v_k) − R(v_{k−1}) − . . . − R(v_0)| > 0, then S(v_k) is the longest overlap from s_i to s_j for every j ∈ R(v_k) − R(v_{k−1}) − . . . − R(v_0), and thus S(v_k) ∈ Ov(P). Hence, we aim to compute |R(v_k) − R(v_{k−1}) − . . . − R(v_0)| for every 0 < k ≤ l.

Given an index k, we define k + 1 auxiliary sets of indices I_k(k), I_k(k−1), . . . , I_k(0) in a recursive manner as follows:

I_k(k) = R(v_k)
I_k(m) = I_k(m + 1) − R(v_m)   for m = k−1, k−2, . . . , 0.

I_k(0) is R(v_k) − R(v_{k−1}) − . . . − R(v_0) in Lemma 3, and we want to compute |I_k(0)|. For every 0 ≤ m < k, I_k(m) = I_k(m + 1) − R(v_m) ⊆ I_k(m + 1), and thus |I_k(m)| = |I_k(m + 1)| − |I_k(m + 1) − I_k(m)| holds. By summing up these equations for 0 ≤ m < k, we get |I_k(0)| = |I_k(k)| − Σ_{m=0}^{k−1} |I_k(m + 1) − I_k(m)|. Since I_k(k) = R(v_k) and I_k(m + 1) − I_k(m) = I_k(m + 1) − (I_k(m + 1) − R(v_m)) = I_k(m + 1) ∩ R(v_m), we have

|I_k(0)| = |R(v_k)| − Σ_{m=0}^{k−1} |I_k(m + 1) ∩ R(v_m)|.   (2)

Figure 2
Aho-Corasick trie built with P = {caccgc, ccgcg, ccgca, cgct, gcc}.

We also define a new function up(u) for a node u as follows.

▶ Definition 4.
Given a node u in the Aho-Corasick trie, up(u) is defined as the first ancestor of u (excluding u itself) on the path that starts at u and follows the failure links until it reaches the root node. Here, an ancestor is defined with respect to the tree consisting of the tree edges of the Aho-Corasick trie.

Note that up(u) is well defined when u is not the root node, since the root node is always an ancestor of u. When u is the root node, up(u) is not defined.

Now we analyze the value of |I_k(m + 1) ∩ R(v_m)| in Equation (2) for each 0 ≤ m < k. We use the path (v_0, v_1, ..., v_5) in Figure 2 as a running example, i.e., l = 5 and 0 < k ≤ 5.

▶ Lemma 5. |I_k(m + 1) ∩ R(v_m)| is |R(v_m)| if up(v_m) = v_k; it is 0 otherwise.

Proof.
We divide the relationship between v_m and v_k into two cases.

1. v_m is outside the subtree rooted at v_k.
Assume that I_k(m + 1) ∩ R(v_m) is not empty, i.e., there exists j ∈ I_k(m + 1) ∩ R(v_m). Then j ∈ R(v_k) ∩ R(v_m) should hold, since I_k(m + 1) ⊆ I_k(k) = R(v_k). Therefore, both v_m and v_k should be ancestors of the leaf corresponding to s_j. Because |S(v_m)| > |S(v_k)|, v_k should be an ancestor of v_m. But since v_m is outside the subtree rooted at v_k, v_k cannot be an ancestor of v_m, which is a contradiction. Therefore such a j does not exist, which shows that I_k(m + 1) ∩ R(v_m) = ∅ and |I_k(m + 1) ∩ R(v_m)| = 0.
For example, consider the case with k = 4 and m = 3 in Figure 2. Since I_4(4) = R(v_4) = {1, 2, 3, 4} and R(v_3) = {5}, we can see that I_4(4) ∩ R(v_3) = ∅.

2. v_m is inside the subtree rooted at v_k.
In this case, v_k is an ancestor of v_m, and we further divide it into two cases.
a. There exists q such that m < q < k and v_q is an ancestor of v_m.
We get R(v_m) ⊆ R(v_q) because v_q is an ancestor of v_m. Since I_k(m + 1) = R(v_k) − R(v_{k−1}) − ... − R(v_{m+1}) and m < q < k, we have I_k(m + 1) ⊆ R(v_k) − R(v_q). Therefore, I_k(m + 1) ∩ R(v_m) ⊆ (R(v_k) − R(v_q)) ∩ R(v_q) = ∅. That is, I_k(m + 1) ∩ R(v_m) = ∅ and |I_k(m + 1) ∩ R(v_m)| = 0.
b. For every q such that m < q < k, v_q is not an ancestor of v_m.
Here we show that R(v_m) ⊆ I_k(m + 1). Consider an index x ∈ R(v_m). Since v_k is an ancestor of v_m, we have x ∈ R(v_k). Moreover, for any q such that m < q < k, neither is v_q an ancestor of v_m, nor is v_m an ancestor of v_q. That is, R(v_q) ∩ R(v_m) = ∅ and thus x ∉ R(v_q). Therefore, we have x ∈ I_k(m + 1) = R(v_k) − R(v_{k−1}) − . . . − R(v_{m+1}). In conclusion, R(v_m) ⊆ I_k(m + 1) and thus |I_k(m + 1) ∩ R(v_m)| = |R(v_m)|.
For example, consider the case with k = 4 and m = 1 in Figure 2.
Since I_4(2) = R(v_4) − R(v_3) − R(v_2) = {1, 2, 3} and R(v_1) = {2, 3}, we can see that R(v_1) ⊆ I_4(2) and I_4(2) ∩ R(v_1) = R(v_1).

In summary, |I_k(m + 1) ∩ R(v_m)| = |R(v_m)| in case 2(b), and 0 otherwise. In case 2(b), v_k is an ancestor of v_m and there is no q such that m < q < k and v_q is an ancestor of v_m. In other words, v_k is the first ancestor of v_m on the path starting from v_m and following the failure links repeatedly, which means that up(v_m) = v_k. ◀

▶ Theorem 6.
For every 0 < k ≤ l, |I_k(0)| = |R(v_k)| − Σ_{v_m} |R(v_m)|, where the sum ranges over all v_m with 0 ≤ m < k and up(v_m) = v_k.

Proof.
From Equation (2), we have |I_k(0)| = |R(v_k)| − Σ_{m=0}^{k−1} |I_k(m + 1) ∩ R(v_m)|. By Lemma 5, we have Σ_{m=0}^{k−1} |I_k(m + 1) ∩ R(v_m)| = Σ_{v_m : up(v_m)=v_k} |R(v_m)|. By merging the two equations, we have the theorem. ◀

Now let us consider the relationship between u and up(u). S(up(u)) is a proper suffix of S(u), because up(u) can be reached from u through failure links. Furthermore, S(up(u)) is a proper prefix of S(u), because up(u) is an ancestor of u. That is, S(up(u)) is a border [9] of S(u). Moreover, following the failure links from u visits the suffixes of S(u) in decreasing order of length, and S(up(u)) is the first border we visit, so S(up(u)) is the longest border of S(u). Since each node in the Aho-Corasick trie corresponds to a prefix of some s_i, we can compute up(u) for all nodes u by computing the border array of every s_i as follows. Let pnode_i(l) be the node which corresponds to s_i[1..l], and border_i(l) be the length of the longest border of s_i[1..l]. Then we have the following equation for every s_i and 1 ≤ l ≤ |s_i|:

up(pnode_i(l)) = pnode_i(border_i(l)).   (3)

If we store pnode_i and border_i as arrays, we can compute pnode_i, border_i, and up(u) in O(||P||) time and space, because border_i can be computed in O(||P||) time using the algorithm in [9].

For example, let us consider Figure 2, which shows the Aho-Corasick trie built with the set P = {s_1 = caccgc, s_2 = ccgcg, s_3 = ccgca, s_4 = cgct, s_5 = gcc} of strings. For each string, we compute its border array, and get border_1 = (0, 0, 1, 1, 0, 1), border_2 = (0, 1, 0, 1, 0), border_3 = (0, 1, 0, 1, 0), border_4 = (0, 0, 1, 0), and border_5 = (0, 0, 0), and we get the pnode_i's by traversing the Aho-Corasick trie. Now we can compute up by using pnode_i and border_i. For example, let us consider v_1 = pnode_2(4), which represents ccgc. Since the longest border of ccgc is c, which has length 1, we have border_2(4) = 1.
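The border arrays above can be checked with the classical linear-time border array computation of [9]; a minimal Python sketch (ours, 0-indexed, so b[l-1] equals border_i(l) in the paper's 1-indexed notation):

```python
def border_array(s):
    """b[l-1] = length of the longest border of s[1..l]
    (the 'failure function' of Knuth-Morris-Pratt [9])."""
    b = [0] * len(s)
    k = 0  # length of the current longest border
    for l in range(1, len(s)):
        while k > 0 and s[l] != s[k]:
            k = b[k - 1]          # fall back to the next shorter border
        if s[l] == s[k]:
            k += 1                # extend the border by one character
        b[l] = k
    return b

for s in ["caccgc", "ccgcg", "ccgca", "cgct", "gcc"]:
    print(s, border_array(s))     # e.g. caccgc [0, 0, 1, 1, 0, 1]
```

Each character is compared O(1) times amortized, giving the O(||P||) total time used in the analysis above.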
As a result, we have up(v_1) = up(pnode_2(4)) = pnode_2(border_2(4)) = pnode_2(1) = v_4 by Equation (3). Note that v_4 represents c, which is the longest border of ccgc.

We are now ready to describe an algorithm to compute the HOG of P in O(||P||) time and space. First, we build the Aho-Corasick trie with P and a border array for each s_i. By using the border arrays, we compute up(u) for every node u except the root. Next, we compute |R(u)| for each node u by a post-order traversal of the Aho-Corasick trie. For each string s_i, we start from the leaf node corresponding to s_i and follow the failure links until we reach the root. For each node v_k that we visit, we compute its corresponding |I_k(0)| = |R(v_k)| − Σ_{v_m : up(v_m)=v_k} |R(v_m)|. If |I_k(0)| > 0, we mark v_k to be included in the HOG.

Algorithm 1 Build HOG in linear time
1:  procedure Build-HOG(P)
2:      Build the Aho-Corasick trie with P
3:      Compute border arrays border_i for 1 ≤ i ≤ n
4:      Compute up(u) for each node u
5:      Compute |R(u)| for each node u
6:      Mark the root as included in HOG(P)
7:      For each node u, initialize Child(u) with an empty set
8:      for i ← 1 to n do
9:          u ← leaf corresponding to s_i in the Aho-Corasick trie
10:         Mark u as included in HOG(P)
11:         while u ≠ root do
12:             I(u) ← |R(u)|
13:             for all u′ ∈ Child(u) do
14:                 I(u) ← I(u) − |R(u′)|
15:             if I(u) > 0 then
16:                 Mark u as included in HOG(P)
17:             Child(u) ← an empty set
18:             Add u to Child(up(u))
19:             u ← failure link of u
20:     Build HOG(P) with the marked nodes

Algorithm 1 shows the algorithm to compute the HOG. Lines 2-5 compute the preliminaries for the algorithm, while lines 6-19 compute the nodes to be included in the HOG. Note that the loop of lines 8-19 works separately for each input string s_i. We consider the v_k's in increasing order of k, and thus if up(v_m) = v_k, then m < k. Hence, Child(v_k) in line 13 stores every v_m such that up(v_m) = v_k, by line 18 of the previous iterations. For each node u = v_k in lines 11-19, I(u) correctly computes |I_k(0)|, since we get |R(v_k)| in line 12 and subtract every |R(v_m)| with v_k = up(v_m) in lines 13-14. By Theorem 6, lines 12-14 correctly compute |I_k(0)|. We build HOG(P) in line 20 by removing the unmarked nodes and contracting the edges while traversing the Aho-Corasick trie once, as in [5].

For example, let us consider the set P = {s_1 = caccgc, s_2 = ccgcg, s_3 = ccgca, s_4 = cgct, s_5 = gcc} of strings. Figure 2 shows the Aho-Corasick trie built with P. Consider the path starting from the node representing s_1 and following the failure links until the root node. The path (v_0, v_1, v_2, v_3, v_4, v_5) is marked with dotted lines in Figure 2. By definition of up, we get up(v_0) = up(v_1) = up(v_2) = v_4 and up(v_3) = up(v_4) = v_5. Therefore, we can compute the |I_k(0)|'s as follows:

|I_0(0)| = |R(v_0)| = 0
|I_1(0)| = |R(v_1)| = 2
|I_2(0)| = |R(v_2)| = 1
|I_3(0)| = |R(v_3)| = 1
|I_4(0)| = |R(v_4)| − |R(v_0)| − |R(v_1)| − |R(v_2)| = 4 − 0 − 2 − 1 = 1
Note that |R(v_0)| = 0 by definition of R(u). Since v_1, v_2, v_3, and v_4 have positive |I_k(0)|'s, we mark them to be included in the HOG. We do this process for every s_i.

Now we show that Algorithm 1 runs in O(||P||) time and space. Computing the Aho-Corasick trie, a border array for each string, and up(u) and |R(u)| for each node u costs O(||P||) time and space. Furthermore, for a given index i, lines 13-14 are executed at most |s_i| times, since line 18 is executed at most |s_i| times and thus the sum of the |Child(u)|'s is at most |s_i|. Therefore, lines 9-19 run in O(|s_i|) time for a given i, and thus lines 8-19 run in O(||P||) time in total. They also use O(|s_i|) additional space to store the Child lists. Lastly, we can build HOG(P) from the marked nodes in O(||P||) time and space [5]. Therefore, Algorithm 1 runs in O(||P||) time and space. We remark that Algorithm 1 can be modified so that it builds the HOG from an EHOG instead of the Aho-Corasick trie, while it still costs O(||P||) time and space.

▶ Theorem 7.
Given a set P of strings, HOG(P) can be built in O(||P||) time and space.

We have presented an O(||P||) time and space algorithm to build the HOG, which improves upon an earlier O(||P|| log n) time solution. Since the input size of the problem is O(||P||), the algorithm is optimal.

There are some interesting topics about the HOG and the EHOG which deserve future work. As mentioned in the introduction, the shortest superstring problem has gained a lot of interest [2, 18, 11, 13]. Since many algorithms dealing with the shortest superstring problem are based on the overlap graph, the HOG may give better approximation algorithms for the shortest superstring problem by using the additional information that the HOG has when compared to the overlap graph.

Acknowledgements
S. Park, S.G. Park and K. Park were supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00551, Framework of Practical Algorithms for NP-hard Graph Problems). B. Cazaux and E. Rivals acknowledge funding from the Labex NumeV (GEM flagship project, ANR 2011-LABX-076) and from the Marie Skłodowska-Curie Innovative Training Network ALPACA (grant 956229).
References
[1] A. V. Aho and M. J. Corasick. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975. doi:10.1145/360825.360855.
[2] A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. Journal of the ACM, 41(4):630–647, 1994.
[3] B. Cazaux, R. Cánovas, and E. Rivals. Shortest DNA cyclic cover in compressed space. In DCC, pages 536–545, 2016.
[4] B. Cazaux and E. Rivals. A linear time algorithm for shortest cyclic cover of strings. Journal of Discrete Algorithms, 37:56–67, 2016.
[5] B. Cazaux and E. Rivals. Hierarchical overlap graph. Information Processing Letters, 155:105862, 2020.
[6] J. Gallant, D. Maier, and J. Astorer. On finding minimal length superstrings. Journal of Computer and System Sciences, 20(1):50–58, 1980. doi:10.1016/0022-0000(80)90004-5.
[7] G. Gonnella and S. Kurtz. Readjoiner: A fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics, 13(1):82, 2012. doi:10.1186/1471-2105-13-82.
[8] D. Gusfield, G. M. Landau, and B. Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Information Processing Letters, 41(4):181–185, 1992.
[9] D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977. doi:10.1137/0206024.
[10] J. Lim and K. Park. A fast algorithm for the all-pairs suffix–prefix problem. Theoretical Computer Science, 698:14–24, 2017. doi:10.1016/j.tcs.2017.07.013.
[11] M. Mucha. Lyndon words and short superstrings. In SODA, pages 958–972. SIAM, 2013.
[12] E. W. Myers. The fragment assembly string graph. Bioinformatics, 21 Suppl 2:ii79–ii85, 2005.
[13] K. Paluch. Better approximation algorithms for maximum asymmetric traveling salesman and shortest superstring, 2014. arXiv:1401.3670.
[14] S. G. Park, B. Cazaux, K. Park, and E. Rivals. Efficient construction of hierarchical overlap graphs. In SPIRE, 2020.
[15] H. Peltola. Algorithms for some string matching problems arising in molecular genetics. In IFIP Congress, pages 53–64, 1983.
[16] P. A. Pevzner, H. Tang, and M. S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753, 2001.
[17] M. H. Rachid and Q. Malluhi. A practical and scalable tool to find overlaps between sequences. BioMed Research International, 2015, 2015. doi:10.1155/2015/905261.
[18] Z. Sweedyk. A 2½-approximation algorithm for shortest superstring. SIAM Journal on Computing, 29(3):954–986, 2000.
[19] J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science, 57(1):131–145, 1988.
[20] E. Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5(1):313–323, 1990.