Optimal Construction of Hierarchical Overlap Graphs

Abstract

Genome assembly is a fundamental problem in Bioinformatics, where for a given set of overlapping substrings of a genome, the aim is to reconstruct the source genome. The classical approaches to solving this problem use assembly graphs, such as de Bruijn graphs or overlap graphs, which maintain partial information about such overlaps. For genome assembly algorithms, these graphs present a trade-off between overlap information stored and scalability. Thus, Hierarchical Overlap Graph (HOG) was proposed to overcome the limitations of both these approaches. For a given set P of n strings, the first algorithm to compute HOG was given by Cazaux and Rivals [IPL20] requiring O(||P||+n^2) time using superlinear space, where ||P|| is the cumulative sum of the lengths of strings in P. This was improved by Park et al. [SPIRE20] to O(||P||\log n) time and O(||P||) space using segment trees, and further to O(||P||\frac{\log n}{\log \log n}) for the word RAM model. Both these results described an open problem to compute HOG in optimal O(||P||) time and space. In this paper, we achieve the desired optimal bounds by presenting a simple algorithm that does not use any complex data structures. At its core, our solution improves the classical result [IPL92] for a special case of the All Pairs Suffix Prefix (APSP) problem from O(||P||+n^2) time to optimal O(||P||) time, which may be of independent interest.



Shahbaz Khan, University of Helsinki, Finland

Mathematics of computing → Trees; Theory of computation → Data compression; Theory of computation → Pattern matching

Keywords and phrases

Hierarchical Overlap Graphs, String algorithms, Genome assembly

Funding

This work was funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO).

Acknowledgements

I would like to thank Alexandru I. Tomescu for helpful discussions, and for critical review and insightful suggestions which helped me in refining the paper.

1 Introduction

Genome assembly is one of the oldest and most fundamental problems in Bioinformatics [16]. Due to practical limitations, sequencing an entire genome as a single complete string is not possible; rather, a collection of substrings of the genome (called reads) is sequenced. The goal of a sequencing technology is to produce a collection of reads that cover the entire genome and have sufficient overlap amongst the reads. This allows the source genome to be reconstructed by ordering the reads using this overlap information. The genome assembly problem thus aims at computing the source genome given such a collection of overlapping reads. Most approaches to genome assembly capture this overlap information in an assembly graph, which can then be efficiently processed to assemble the genome. The prominent approaches use assembly graphs such as de Bruijn graphs [17] and Overlap graphs [12] (also called string graphs [13]), which have been used successfully in various practical assemblers [23, 3, 14, 2, 18, 19].

The de Bruijn graphs are built over k length substrings (or k-mers) of the reads as nodes, with arcs denoting k − 1 length overlaps between k-mers. Their prominent advantage is that their size is linear in that of the input. However, their limitations include losing information about the relationship of k-mers with the reads, and not being able to represent overlaps of size other than k − 1.

As a result, Hierarchical Overlap Graphs (HOG) were proposed in [6, 7] as an alternative to overcome such limitations of the two types of assembly graphs. The HOG has nodes for all the longest overlaps between every pair of strings, and edges connecting strings to their suffix and prefix, using linear space. Note that Overlap graphs have edges representing longest overlaps between strings requiring quadratic size, whereas HOG has additional nodes for longest overlaps between strings requiring linear size by exploiting pairs of strings having the same longest overlaps. Thus, it is a promising alternative to both de Bruijn graphs and Overlap graphs to better solve the problem of genome assembly. Also, since it records whether two pairs of strings have the same overlap, it has the potential to better solve the approximate shortest superstring problem [22], which has applications in both genome assembly and data compression [20, 4]. Some applications of HOG have been studied in [6, 5].

Cazaux and Rivals [7] presented the first algorithm to build HOG efficiently. They showed how HOG can be computed for a set of n strings P in O(||P|| + n^2) time, where ||P|| represents the cumulative sum of lengths of strings in P. However, they required O(||P|| + n × min(n, max_{p ∈ P} |p|)) space, which is superlinear in input size. Park et al. [15] improved it to O(||P||\log n) time requiring linear space using segment trees [8], assuming a constant sized alphabet. For the word RAM model, they further improved it to O(||P||\frac{\log n}{\log \log n}) time. For practical implementation, both these results build HOG using an intermediate Extended HOG (EHOG), which reduces the memory footprint of the algorithm. Also, both these results [7, 15] mentioned as an open problem the construction of HOG using optimal O(||P||) time and space. We answer this open question positively using a simple algorithm, which results in the following.

▶ Theorem 1 (Optimal HOG). For a set of strings P, the Hierarchical Overlap Graph can be computed using O(||P||) time and space.

Moreover, unlike [15], our algorithm does not use any complex data structures for its implementation. Also, we do not assume any limitations on the alphabet. Finally, like [7, 15], our algorithm can also use EHOG as an intermediate step for improving memory footprint in practice. Note that the sizes of EHOG and HOG can even be identical for some instances, but their ratio can tend to infinity for some families of graphs [7]. Thus, despite the existence of an optimal algorithm for computing EHOG, an optimal algorithm for computing HOG is significant from both theoretical and practical viewpoints.

Outline of the paper

We first describe notations and preliminaries that are used in our paper in Section 2. In Section 3, we briefly describe the previous approaches to compute HOG. Thereafter, Section 4 describes our core result in three stages for simplicity of understanding, each building over the previous to give the optimal algorithm. Finally, we present the conclusions in Section 5.

2 Preliminaries

Given a finite set P = {p_1, ..., p_n} of n non-empty strings over a finite alphabet, we denote the size of a string p_i by |p_i| and the cumulative size of P by ||P|| = \sum_{i=1}^{n} |p_i| (≥ n, as the strings are not empty). For a string p, any substring that starts from the first character of p is called a prefix of p, whereas any substring which ends at the last character of p is called a suffix of p. A prefix or suffix of p is called proper if it is not the same as the complete p. For an ordered pair of strings (p_1, p_2), a string is called their overlap if it is both a proper suffix of p_1 and a proper prefix of p_2, where ov(p_1, p_2) denotes the longest such overlap. Also, for the set of strings P, Ov(P) denotes the set of all ov(p_i, p_j) for 1 ≤ i, j ≤ n. An empty string is denoted by ϵ. We also use the notions of HOG, EHOG and Aho-Corasick trie as follows.

Figure 1. Given P = {aabaa, aadbd, dbdaa}, the figure shows from left to right the Aho-Corasick Trie (A), Extended Hierarchical Overlap Graph (E) and Hierarchical Overlap Graph (H) of P.

▶ Definition 2 (Hierarchical Overlap Graph [7]). Given a set of strings P = {p_1, ..., p_n}, its Hierarchical Overlap Graph is a directed graph H = (V, E), where V = P ∪ Ov(P) ∪ {ϵ} and E = E_1 ∪ E_2, having E_1 = {(x, y) : x is the longest proper prefix of y in V} as tree edges, and E_2 = {(x, y) : y is the longest proper suffix of x in V} as suffix links.

Extended HOG of P (referred as E) is also similarly defined [7], having additional nodes corresponding to every overlap (not just longest) between each pair of strings in P, with the same definition of edges. The construction of both these structures uses the Aho-Corasick Trie [1], which is computable in O(||P||) time and space. The Aho-Corasick Trie of P (referred as A) contains all prefixes of strings in P as nodes, with the same definition for edges. All these structures are essentially trees having the empty string ϵ as the root, and the strings of P as its leaves. A tree edge (x, y) is labelled with the substring of y not present in x. Hence, despite being a graph due to the presence of suffix links (also called failure links), we abuse the notions used for tree structures when applying to A, E or H (ignoring suffix links). Also, while referring to a node v of A, E or H, we represent its corresponding string with v as well.

Consider Figure 1 for a comparison of A, E and H for P = {aabaa, aadbd, dbdaa}. Since A contains all prefixes as nodes, the tree edges have single-character labels. However, E contains only the overlaps among strings of P, so it can potentially have fewer internal nodes ({a, aa, d, dbd}) than A. Further, H contains only longest overlaps, so it can potentially have even fewer internal nodes ({aa, dbd}).

Now, to compute E or H one must only remove some internal nodes from A and adjust the edge labels accordingly. This requires the computation of all overlaps among strings in P for E, which is further restricted to only the longest overlaps for H. For a string p_i ∈ P (leaf of A), all its prefixes are its ancestors in A, whereas all its suffixes are on the path following the suffix links from it (referred as suffix path). Thus, every internal node is implicitly the prefix of its descendant leaves, and to be an overlap it must merely be a suffix of some string in P [22]. Hence, to compute the internal nodes of E (the overlaps) from A, one simply traverses the suffix paths from all the leaves of A and removes the non-traversed internal nodes (see Figure 1). However, to compute H from A (or E) we need to find only the longest overlaps, so we use the following criterion to identify the internal nodes of H.

▶ Lemma 3. An internal node v in A (or E) of P is ov(p_i, p_j) for two strings p_i, p_j ∈ P iff v is an overlap of (p_i, p_j) and no descendant of v is an overlap of (p_i, p_j).
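The definitions of overlap, ov(p_1, p_2) and Ov(P) given above can be checked by brute force. The following Python sketch is not part of the algorithm of this paper: it compares suffixes and prefixes directly, which is far from linear time, and the function names `ov` and `longest_overlaps` are illustrative choices. It is only meant as a reference for verifying which internal nodes the HOG of a small example should contain.

```python
from itertools import product


def ov(p1: str, p2: str) -> str:
    """Longest string that is a proper suffix of p1 and a proper prefix of p2."""
    # Proper means strictly shorter than both strings, so k < min(|p1|, |p2|).
    for k in range(min(len(p1), len(p2)) - 1, 0, -1):
        if p1[len(p1) - k:] == p2[:k]:
            return p2[:k]
    return ""  # the empty string is always an overlap


def longest_overlaps(P):
    """Ov(P): the set of ov(p_i, p_j) over all ordered pairs, including i = j."""
    return {ov(a, b) for a, b in product(P, repeat=2)}


P = ["aabaa", "aadbd", "dbdaa"]
hog_node_set = set(P) | longest_overlaps(P) | {""}  # V = P ∪ Ov(P) ∪ {ϵ}
```

For the set P of Figure 1, the non-empty longest overlaps found this way are aa and dbd, which are the internal nodes of H.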

Proof. The ancestor of a node v in A is its proper prefix and hence is shorter than v. If two internal nodes of A are both overlaps of (p_i, p_j), they are both prefixes of p_j and hence have an ancestor-descendant relationship, where the descendant is the longer one. Thus, the longest overlap ov(p_i, p_j) cannot have a descendant which is an overlap of (p_i, p_j). ◀

Hence, to compute Ov(P) (or the nodes of H), we need to check for each internal node v whether it is the lowest overlap (in A) for some pair (p_i, p_j). This implies that v is a suffix of some p_i, such that for some descendant leaf p_j, no suffix of p_i is on the path from v to p_j (see Figure 1).

3 Previous results

Cazaux and Rivals [7] were the first to study H, where they used E [5] as an intermediate step in the computation of H. They showed that E can be constructed in O(||P||) time and space from A [1], which itself is computable in O(||P||) time and space. In order to compute H, the main bottleneck is the computation of Ov(P), after which we simply remove the internal nodes not in Ov(P) from E (or A), in O(||P||) time and space. They gave an algorithm to compute Ov(P) in O(||P|| + n^2) time using O(||P|| + n × min(n, max{|p_i|})) space. This procedure was recently improved by Park et al. [15] to require O(||P||\log n) time and O(||P||) space using segment trees, assuming a constant sized alphabet. For the word RAM model they further improve the time to O(||P||\frac{\log n}{\log \log n}). The main ideas of the previous results can be summarized as follows.

Computing Ov(P) in O(||P|| + n^2) time [7]

The algorithm computes Ov(P) by considering the internal nodes in a bottom-up manner, where a node is processed after its descendants. For each internal node u, they compute the list R_l(u) (called L_u in our algorithm) of all leaves having u as a suffix. Now, while processing a node u, they check whether u is a suffix of some leaf v for which the path to at least one of the descendant leaves of u (say x) does not have a suffix of v, implying u = ov(v, x). To perform this task, they maintain a bit-vector for all leaves (suffix v), which is marked if no such descendant path exists from u for such leaves. For a leaf v, the bit is implicitly marked if all children of u have the bit for v marked. Otherwise, if v ∈ R_l(u) it is marked, adding u to H, else it is left unmarked. The space requirement is dominated by that of this bit-vector, and it is computed only at the branching nodes, taking total O(||P|| + n^2) time.

Computing Ov(P) in O(||P||\log n) time [15]

The algorithm first orders the strings in P lexicographically in O(||P||) time (this requires a constant sized alphabet). This allows them to define an interval of leaves which are the descendants of each internal node in E. Now, for each leaf u (suffix) they start with an unmarked array corresponding to all leaves (prefix). Then, starting from u, they follow its suffix path and at each internal node v check if some descendant leaf x (prefix) is unmarked. In such a case v = ov(u, x) and hence v is added to H. Before moving further along the suffix path, the interval corresponding to all the descendant leaves (prefix) of v is marked in the array. Since both query and update (mark) over an interval can be performed in O(\log n) time using a segment tree, the total time taken is O(||P||\log n) using O(||P||) space.

4 Our algorithm

Our main contribution is an alternate procedure to compute Ov(P) in O(||P||) time and space, which results in an optimal algorithm for computing H for P in O(||P||) time and space. Our overall approach is similar to that of the original algorithm [7] with the exception of a procedure to mark the internal nodes that belong to H, i.e., Mark_H.
The algorithm except for the procedure Mark_H takes O(||P||) time and space (also shown in [7]). We describe our algorithm for Mark_H in three stages, first for a single prefix leaf requiring O(||P||) time, then for all prefix leaves requiring overall O(||P|| + n^2) time (similar to [7]), and finally improving it to overall O(||P||) time, which is optimal. The algorithm can be applied on any of A or E, both computable in O(||P||) time and space.

We first describe our overall approach in Algorithm 1. After computing A, for each internal node v, we compute the list L_v of all the leaves having v as a suffix. As described earlier, this can be done by following the suffix path of each leaf x, adding x to L_y for every internal node y on the path. Using this information of suffix (in L_v) and prefix (implicit in A) we mark the nodes of A to be added in the HOG H. We shall describe this procedure Mark_H later on. Thereafter, in order to compute H we simply merge the unmarked internal nodes of A with their parents. This process is carried out using a DFS traversal of A (ignoring suffix links), where for each unmarked internal node v we move all its edges to its parent, prepending their labels with the label of the parent edge of v.

Algorithm 1 Hierarchical Overlap Graphs

    A ← Aho-Corasick Trie of P                  // Trie with suffix links
    foreach internal node v of A do
        L_v ← ∅                                 // List of leaves with suffix v
    foreach leaf x of A do                      // Compute all L_v
        y ← Suffix link of x in A
        while y ≠ ϵ do                          // ϵ is the root of A
            Add x to L_y
            y ← Suffix link of y in A
    in_H ← Mark_H(A, L)                         // Procedure to mark nodes of H in flags in_H
    foreach node v ∈ A in DFS order do          // Compute H
        if in_H[v] = 0 then
            Merge v with its parent

As previously described, A can be computed in O(||P||) time and space [1]. Computing L_v for all v ∈ A requires each leaf p_i to follow its suffix path in O(|p_i|) time, and add p_i to at most |p_i| different L_y, requiring total O(||P||) time for all p_i ∈ P. This also limits the total size of L_v over all v ∈ A to O(||P||). Since the merge operation on a node v requires O(deg(v)) cost, computing H using in_H requires total O(|A|) = O(||P||) time as well. Thus, we have the following theorem (also proved in [7]).

▶ Theorem 4. For a set of strings P, the computation of the Hierarchical Overlap Graph except for the Mark_H operation requires O(||P||) time and space.

Marking the nodes of H

We shall describe our procedure to mark the nodes of H in three stages for simplicity of understanding. First, we shall describe how to mark all internal nodes representing all longest overlaps ov(·, v) from a single leaf v (prefix) in A, using O(||P||) time. Thereafter, we extend this to compute such overlaps from all leaves in A together using O(||P|| + n^2) time. Finally, we shall improve this to our final procedure requiring optimal O(||P||) time. All three procedures require O(||P||) space.
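The preprocessing used by all three stages, namely building A with its suffix links and collecting the lists L_v by following suffix paths, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: the array-based node representation and the names `build_trie` and `suffix_lists` are hypothetical, leaves are identified by the index of their string in P, and the strings of P are assumed to be distinct with none a prefix of another, so that every string ends at a leaf.

```python
from collections import deque


def build_trie(P):
    """Aho-Corasick trie of P as parallel arrays indexed by node id (root = 0)."""
    child = [{}]   # child[v][c] = node reached from v by character c
    fail = [0]     # suffix (failure) link of each node
    depth = [0]    # length of the string represented by each node
    rep = [0]      # index of one string of P that has the node as a prefix
    leaf = []      # leaf[i] = node representing P[i]
    for i, p in enumerate(P):
        v = 0
        for c in p:
            if c not in child[v]:
                child[v][c] = len(child)
                child.append({})
                fail.append(0)
                depth.append(depth[v] + 1)
                rep.append(i)
            v = child[v][c]
        leaf.append(v)
    # Suffix links in breadth-first order: longest proper suffix present in the trie.
    queue = deque(child[0].values())
    while queue:
        v = queue.popleft()
        for c, w in child[v].items():
            f = fail[v]
            while f and c not in child[f]:
                f = fail[f]
            fail[w] = child[f].get(c, 0)
            queue.append(w)
    return child, fail, depth, rep, leaf


def suffix_lists(fail, leaf):
    """L[v] = indices i such that node v is a non-empty proper suffix of P[i]."""
    L = [[] for _ in fail]
    for i, x in enumerate(leaf):
        y = fail[x]
        while y != 0:          # node 0 is the root, i.e. the empty string
            L[y].append(i)
            y = fail[y]
    return L


P = ["aabaa", "aadbd", "dbdaa"]
child, fail, depth, rep, leaf = build_trie(P)
L = suffix_lists(fail, leaf)
# Map each node with a non-empty list to the strings having it as a suffix.
exposed = {P[rep[v]][:depth[v]]: [P[i] for i in L[v]]
           for v in range(len(child)) if L[v]}
```

The keys of `exposed` are exactly the nodes of A reached by some suffix path. Each string p_i contributes at most |p_i| entries, so the lists have total size O(||P||).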

Figure 2. Overlaps of v with all leaves, where c = ov(x, v) and b = ov(z, v) are in Ov(P).

Marking all nodes ov(·, v) for a leaf v

In order to compute all longest overlaps of a leaf v (see Figure 2), we need to consider all its prefixes (ancestors in A) according to Lemma 3. Here the internal nodes a, b and c are prefixes of v and also suffixes of x, whereas z only has suffixes a and b. Thus, we have L_a = L_b = {x, z} and L_c = {x}. Thus, given that a, b and c are ancestors of v, a and b are valid overlaps of (x, v) and (z, v), whereas c is only a valid overlap of (x, v). Using Lemma 3, for being the longest overlap of a pair of strings, no descendant should be an overlap of the same pair of strings. Hence, c = ov(x, v) and b = ov(z, v), but a is not the longest overlap for any pair of strings because of b and c. Processing L_u for all ancestors u of the leaf (prefix) requires O(||P||) time. Thus, a simple way to mark all the longest overlaps of strings with prefix v in O(||P||) time is as follows:

Mark_H for ov(·, v): Traverse the ancestral path of v from the root to v, storing the last internal node y containing x in L_y for each leaf x of A. On reaching v, mark the stored internal nodes for each x.

Marking all nodes in Ov(P)

We now describe how to perform this procedure for all leaves (prefix) together (see Algorithm 2) using stacks to keep track of the last encountered internal node for each leaf (suffix). The main reason behind using stacks is to avoid processing L_u multiple times (for different prefixes). For each internal node, we initialize the flag denoting membership in H to zero, whereas the root and leaves of A are implicitly in H. For each leaf (suffix) we initialize an empty stack. Now, we traverse A in DFS order (ignoring suffix links). As in the case of a single leaf (prefix), the stack S_x maintains the last internal node v containing a leaf x (suffix) in L_v. This node v is added to the stack S_x of the leaf x (suffix) when v is first visited by the traversal, and removed from the stack S_x when it is last visited. This exposes the previously added internal nodes on the stack. Finally, on visiting a leaf v (prefix), each non-empty stack S_x of a leaf x (suffix) exposes the internal node last added on its top, which is the longest overlap ov(x, v) by Lemma 3. We mark such internal nodes as being present in H. The correctness follows from the same arguments used for the first approach.

Algorithm 2 Mark_H(A, L)

    foreach internal node v of A do
        in_H[v] ← 0                             // Flag for membership in H
    foreach leaf v of A do
        in_H[v] ← 1                             // Leaves implicitly in H
    in_H[ϵ] ← 1                                 // Root implicitly in H
    foreach leaf v of A do
        S_v ← ∅                                 // Stack of exposed suffix
    foreach node v ∈ A in DFS order do          // Compute all in_H[v]
        if internal node v first visited then
            foreach leaf x in L_v do
                Push v on S_x                   // Expose v on stacks of L_v
        if internal node v last visited then
            foreach leaf x in L_v do
                Pop v from S_x                  // Remove v from stacks of L_v
        if leaf v visited then
            foreach leaf node x do
                if S_x ≠ ∅ then
                    in_H[Top of S_x] ← 1        // Mark ov(x, v)
    Return in_H

In order to analyze the procedure we need to consider the processing of L_v and S_x for all v, x ∈ A, in addition to traversing A. Since the total size of all L_v is O(||P||), processing it twice (on the first and last visit of v) requires O(||P||) time. This also includes the time to push and pop nodes from the stacks, requiring O(1) time per element while processing L_v. However, on visiting a leaf (prefix) by the traversal, we need to evaluate all S_x and mark the top of the non-empty stacks. Since we consider n leaves (prefix), each processing all stacks of n leaves (suffix), we require O(n^2) time. For analyzing size, we need to consider only S_x in addition to L_v. Since the nodes in all S_x are added once from some L_v, the total size of all stacks S_x is bounded by the size of all lists L_v, i.e. O(||P||) (as proved earlier). Thus, this procedure requires O(||P|| + n^2) time and O(||P||) space to mark all nodes in Ov(P).

Optimizing Mark_H

As described earlier, the only operation not bounded by O(||P||) time is the marking of internal nodes while processing the leaves (prefix), which considers the stacks of all leaves (suffix). Note that this procedure is overkill, as the same top of a stack can be marked again when processing different leaves (prefix), whereas the total number of nodes entering and leaving stacks is O(||P||). Thus, we ensure that we do not have to process the stacks of all leaves (suffix) on processing the leaves (prefix) of A, and instead we only process those stacks which were not processed earlier to mark the same top. Note that the same internal node may be marked again when exposed in different stacks, but we ensure that it is not marked again while processing the same stack.

Consider Algorithm 3, where we maintain a list S of non-empty stacks whose tops are not marked. Now, whenever a new node is added to a stack, the stack clearly has an unmarked top, so it is added to S. And when a node is removed from a stack, the stack is added to S if its new top was not previously marked. Since S is a list, its members are additionally maintained using flags in_S for each stack corresponding to leaves (suffix) of A, so that the same stack is not added multiple times to S. Now, on processing the leaves (prefix) of A, we only process the stacks in S, marking their tops and removing them from S. Clearly, stacks are added to S only while processing L_v, hence overall we can mark O(|L_v|) nodes for all v, requiring total O(||P||) time. And the time taken in removing stacks from S is bounded by the total size of all S_x, which is also O(||P||). Thus, we can perform Mark_H using optimal O(||P||) time and space, which results in our main result (using Theorem 4).

▶ Theorem 1 (Optimal HOG). For a set of strings P, the Hierarchical Overlap Graph can be computed using O(||P||) time and space.

Algorithm 3 Mark_H(A, L)

    foreach internal node v of A do
        in_H[v] ← 0                             // Flag for membership in H
    foreach leaf v of A do
        in_H[v] ← 1                             // Leaves implicitly in H
    in_H[ϵ] ← 1                                 // Root implicitly in H
    foreach leaf v of A do
        S_v ← ∅                                 // Stack of exposed suffix
    S ← ∅                                       // List of stacks with unmarked tops
    foreach leaf v of A do
        in_S[v] ← 0                             // Flag for membership of S_v in S
    foreach node v ∈ A in DFS order do          // Compute all in_H[v]
        if internal node v first visited then
            foreach leaf x in L_v do            // Expose v on stacks of L_v
                Push v on S_x
                if in_S[x] = 0 then             // Add S_x to S if not present
                    in_S[x] ← 1
                    Add S_x to S
        if internal node v last visited then
            foreach leaf x in L_v do            // Remove v from stacks of L_v
                Pop v from S_x
                if S_x ≠ ∅ and in_H[Top of S_x] = 0 and in_S[x] = 0 then
                                                // S_x non-empty with unmarked top and not present in S
                    in_S[x] ← 1
                    Add S_x to S
        if leaf v visited then
            foreach S_x ∈ S do
                if S_x ≠ ∅ then                 // S_x may have been emptied by pops since it was added
                    in_H[Top of S_x] ← 1        // Mark ov(x, v)
                in_S[x] ← 0
                Remove S_x from S               // Remove S_x with marked top from S
    Return in_H

5 Conclusions

Genome assembly is one of the most prominent problems in Bioinformatics, and it traditionally relies on de Bruijn graphs or Overlap graphs, each having limitations of either loss of information or quadratic space requirements. Hierarchical Overlap Graphs provide a promising alternative that may result in better algorithms for genome assembly. The previous results on computing these graphs were not scalable (due to the quadratic time bound) or required complicated data structures (segment trees). Moreover, computing HOG in optimal time and space was mentioned as an open problem in both the previous results [7, 15]. We present a simple algorithm that achieves the desired bounds, using only elementary data structures such as stacks and lists. We hope our algorithm directly, or after further simplification, results in wider adoption of HOGs in developing better genome assembly algorithms.

References

[1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975.
[2] Dmitry Antipov, Anton I. Korobeynikov, Jeffrey S. McLean, and Pavel A. Pevzner. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinform., 32(7):1009–1015, 2016.
[3] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son K. Pham, Andrey D. Prjibelski, Alex Pyshkin, Alexander Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19(5):455–477, 2012.
[4] Avrim Blum, Tao Jiang, Ming Li, John Tromp, and Mihalis Yannakakis. Linear approximation of shortest superstrings. J. ACM, 41(4):630–647, 1994.
[5] Rodrigo Cánovas, Bastien Cazaux, and Eric Rivals. The compressed overlap index. CoRR, abs/1707.05613, 2017.
[6] Bastien Cazaux, Rodrigo Cánovas, and Eric Rivals. Shortest DNA cyclic cover in compressed space. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, pages 536–545. IEEE, 2016.
[7] Bastien Cazaux and Eric Rivals. Hierarchical overlap graph. Inf. Process. Lett., 155, 2020.
[8] Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008.
[9] Hieu Dinh and Sanguthevar Rajasekaran. A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly. Bioinform., 27(14):1901–1907, 2011.
[10] Dan Gusfield, Gad M. Landau, and Baruch Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41(4):181–185, 1992.
[11] Jihyuk Lim and Kunsoo Park. A fast algorithm for the all-pairs suffix-prefix problem. Theor. Comput. Sci., 698:14–24, 2017.
[12] Eugene W. Myers. The fragment assembly string graph. In ECCB/JBI, page 85, 2005.
[13] Eugene W. Myers. The fragment assembly string graph. Bioinformatics, 21(2):79–85, January 2005. doi:10.1093/bioinformatics/bti1114.
[14] Sergey Nurk, Dmitry Meleshko, Anton I. Korobeynikov, and Pavel A. Pevzner. metaSPAdes: A new versatile de novo metagenomics assembler. In Mona Singh, editor, Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Santa Monica, CA, USA, April 17-21, 2016, Proceedings, volume 9649 of Lecture Notes in Computer Science, page 258. Springer, 2016.
[15] Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals. Efficient construction of hierarchical overlap graphs. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 277–290. Springer, 2020.
[16] Hannu Peltola, Hans Söderlund, Jorma Tarhio, and Esko Ukkonen. Algorithms for some string matching problems arising in molecular genetics. In IFIP Congress, pages 59–64, 1983.
[17] P. A. Pevzner. l-Tuple DNA sequencing: computer analysis. Journal of Biomolecular Structure & Dynamics, 7(1):63–73, August 1989.
[18] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748–9753, 2001.
[19] Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinform., 26(12):367–373, 2010.
[20] Z. Sweedyk. A 2½-approximation algorithm for shortest superstring. SIAM J. Comput., 29(3):954–986, 1999.
[21] William H. A. Tustumi, Simon Gog, Guilherme P. Telles, and Felipe A. Louza. An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms, 37:34–43, 2016.
[22] Esko Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5(3):313–323, 1990.
[23] Daniel R. Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5):821–829, May 2008. doi:10.1101/gr.074492.107.