Algorithms and Complexity on Indexing Founder Graphs
Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, Veli Mäkinen
Department of Computer Science, University of Helsinki, Finland. Contact: veli.makinen@helsinki.fi
February 26, 2021
Abstract
We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that can then be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs repeat-free founder graphs when constructed from a gapless MSA and repeat-free elastic founder graphs when constructed from a general MSA with gaps.

We give a linear time algorithm and a parameterized near linear time algorithm to construct a repeat-free founder graph and a repeat-free elastic founder graph, respectively. The two algorithms combine techniques from the founder segmentation algorithms (Cazaux et al., SPIRE 2019) and the fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a tailored succinct index structure to support queries of arbitrary length in the paths of a repeat-free (elastic) founder graph. In addition, we show how to turn a repeat-free (elastic) founder graph into a Wheeler graph (Gagie et al., TCS 2017) in polynomial time. This reduction enables the use of the versatile machinery (Alanko et al., SODA 2020) developed for Wheeler graphs. Furthermore, we show that a property such as repeat-freeness is essential for indexability. In particular, we show that unless the Strong Exponential Time Hypothesis (SETH) fails, one cannot build an index on an elastic founder graph in polynomial time to support fast queries.

This is an extended version of a conference paper in WABI 2020 [35].

1 Introduction
Computational pangenomics [1] revolves around the problem of expressing a reference genome of a species in a more meaningful way than as a string of symbols. The basic problem in such generalized representations is that one should still be able to support string matching type of operations on the content. Another problem is that any representation generalizing a set of sequences also expresses sequences that may not be part of the real pangenome. That is, a good representation should have a feature to control over-expressiveness and simultaneously support efficient queries.

In this paper, we develop the theory around one promising pangenome representation candidate, the founder graph. This graph is a natural derivative of segmentation algorithms [2, 3] related to founder sequences [4].

Consider a set of individuals represented as lists of variations from a common reference genome. Such a set can be expressed as a variation graph or as a multiple sequence alignment. The former expresses the reference as a backbone of an automaton and adds a subpath for each variant. The latter applies all variations of an individual to the reference, creating a row for each individual in a multiple alignment. Figure 1 shows an example of both structures with 6 very short genomes.

A multiple alignment of much fewer founder sequences can be used to approximate the input multiple alignment as well as possible, meaning that each original row can be mapped to the founder multiple alignment with a minimum amount of row changes (discontinuities). Finding an optimal set of founders is NP-hard [5], but one can solve relaxed problem statements in linear time [2, 3], which are sufficient for our purposes. As an example of the usefulness of founders, Norri et al. [2] showed that, on a large public dataset of haplotypes of human genome, the solution was able to replace 5009 haplotypes with only 130 founders so that the average distance between row jumps was over 9000 base pairs [2]. This means that alignments of short reads (e.g. 100 bp) very rarely hit a discontinuity, and the space requirement drops from terabytes to just tens of gigabytes. Figure 1 shows such a solution on our toy example.

A block graph is a labelled directed acyclic graph consisting of consecutive blocks, where a block represents a set of sequences of the same length as parallel (unconnected) nodes. There are edges only from nodes of one block to the nodes of the next block. A founder graph is a block graph with blocks representing the segments of founder sequences corresponding to an optimal segmentation [2]. Fig. 1 visualises such a founder graph: there the founder set is divided into 3 blocks, with the first, the second, and the third containing sequences of length 4, 4, and 3, respectively. The coloured connections between sequences in consecutive blocks define the edges. Such graphs interpreted as automata recognise the input sequences just like variation graphs, but otherwise recognise a much smaller subset of the language. With different optimisation criteria to compute the founder blocks, one can control the expressiveness of this pangenome representation.

In this paper, we show that there is a natural subclass of founder graphs that admit efficient index structures supporting exact string matching on the paths of such graphs. Moreover, we give a linear time algorithm to construct such founder graphs from a given multiple alignment. The construction algorithm can also be adjusted to produce a subclass of degenerate strings [6], which also support efficient indexing.
Figure 1: (a) Input multiple alignment, (b) a set of founders with common recombination positions as a solution to a relaxed version of founder reconstruction, (c) a variation graph encoding the input and (d) a founder graph. Here the input alignment and the resulting variation graph are unrealistically bad; the example is made to illustrate the founders.

The founder graph definition given above only makes sense if we assume that our input multiple alignment is gapless, meaning that the alignment is simply produced by putting strings of equal length under each other, like in Figure 1. For the sake of presentation, we develop our theory framework first for founder graphs under gapless multiple alignments. Then we extend the framework to cover the general case by introducing elastic founder graphs, which are a generalization of elastic strings [7].

We start in Sect. 2 by putting the work into the context of related work. In Sect. 3 we introduce the basic notions and tools. In Sect. 4 we study the property, namely repeat-freeness, of founder graphs that enables indexing. In Sect. 5 we give the linear time construction algorithm. In Sect. 6 we develop a succinct index structure that supports exact string matching in linear time. In Sect. 7 we show how to convert repeat-free founder graphs into Wheeler graphs. This gives an alternative index structure with more general functionalities. In Sect. 8 we consider the general case of having gap symbols in the multiple alignment. In Sect. 9 we show that some property like repeat-freeness is necessary for being able to index founder graphs: we show that unless the Strong Exponential Time Hypothesis (SETH) fails, one cannot index elastic founder graphs in polynomial time to support linear time queries.
2 Related work

Indexing directed acyclic graphs (DAGs) for exact string matching on their paths was first studied by Sirén et al. in WABI 2011 [8]. A generalization of the Burrows-Wheeler transform [9] was proposed that supported near-linear time queries. However, the proposed transformation can grow exponentially in size in the worst case. Many practical solutions have been proposed since then that either limit the search to short queries or use more time on queries [10, 11, 12, 13, 14, 15]. More recently, such approaches have been captured by the theory on
Wheeler graphs [16, 17, 18]. Since it is NP-hard to recognize if a given graph is Wheeler [17], it is of interest to look for other graph classes that could provide some indexability functionality. Unfortunately, quite simple graphs turn out to be hard to index [19, 20] (under the Strong Exponential Time Hypothesis). But as we will see later, further restrictions on graphs change the situation: we show that there exists a family of founder graphs that can be indexed in linear time to support linear time queries.

Block graphs also have a tight connection to generalized degenerate (GD) strings and their elastic version. These can also be seen as DAGs with a very specific structure. Matching a GD string is computationally easier, and even linear time online algorithms can be achieved to compare two such strings, as analyzed by Alzamel et al. [6]. The elastic counterpart requires more care, as studied by Bernardini et al. [7]. Our results on founder graphs can be cast on GD strings and elastic strings, as we will show later.

Finally, our indexing solution has connections to succinct representations of de Bruijn graphs [21, 22, 23]. Compared to de Bruijn graphs, which are cyclic and have limited memory (k-mer length), our solution retains the linear structure of the block graph.

3 Definitions and basic tools

We denote integer intervals by [i..j]. Let Σ = {1, ..., σ} be an alphabet of size |Σ| = σ. A string T[1..n] is a sequence of symbols from Σ, i.e. T ∈ Σ^n, where Σ^n denotes the set of strings of length n under the alphabet Σ. A suffix of string T[1..n] is T[i..n] for 1 ≤ i ≤ n. A prefix of string T[1..n] is T[1..i] for 1 ≤ i ≤ n. A substring of string T[1..n] is T[i..j] for 1 ≤ i ≤ j ≤ n. Substring T[i..j] where j < i is defined as the empty string.

The lexicographic order of two strings A and B is naturally defined by the order of the alphabet: A < B iff A[1..i] = B[1..i] and A[i+1] < B[i+1] for some i ≥ 0. If i + 1 > min(|A|, |B|), then the shorter string is regarded as smaller. However, we usually avoid this implicit comparison by adding an end marker to the strings. Concatenation of strings A and B is denoted AB.

As mentioned in the introduction, our goal is to compactly represent a gapless multiple sequence alignment (MSA) using a founder graph. In this section we formalize these concepts. A gapless multiple sequence alignment
MSA[1..m, 1..n] is a set of m strings drawn from Σ, each of length n. Intuitively, it can be thought of as a matrix in which each row is one of the m strings. Such a structure can be partitioned into what we call a segmentation, that is, a collection of sets of shorter strings that can represent the original alignment.

Definition 1 (Segmentation). Let
MSA[1..m, 1..n] be a gapless multiple alignment and let R_1, R_2, ..., R_m be the strings in the MSA. A segmentation S of the MSA is a set of b sets of strings S_1, S_2, ..., S_b such that for each 1 ≤ i ≤ b there exists an interval [x(i)..y(i)] such that S_i = { R_t[x(i)..y(i)] | 1 ≤ t ≤ m }. Furthermore, it holds that x(1) = 1, y(b) = n, and y(i) = x(i+1) − 1 for all 1 ≤ i < b, so that S_1, S_2, ..., S_b covers the MSA. We call W(S_i) = y(i) − x(i) + 1 the width of S_i.

Figure 2: An example of two strings, GCG and ATTCGATA, occurring in G(S).

A segmentation of a MSA can naturally lead to the construction of a founder graph. Let us first introduce the definition of a block graph.
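As a concrete illustration of Definition 1, the following minimal Python sketch (our own illustration, not from the paper) extracts the segment sets S_i and their widths from a gapless MSA, given a list of block intervals. Note that it uses 0-based, half-open column intervals rather than the paper's 1-based inclusive [x(i)..y(i)].

```python
def segmentation(msa, intervals):
    """Given a gapless MSA (a list of equal-length strings) and a list of
    half-open column intervals tiling [0, n), return the segment sets
    S_1..S_b and their widths W(S_i)."""
    n = len(msa[0])
    assert all(len(row) == n for row in msa)
    # The intervals must tile the columns: x(1)=0, y(b)=n, y(i)=x(i+1).
    assert intervals[0][0] == 0 and intervals[-1][1] == n
    assert all(intervals[i][1] == intervals[i + 1][0]
               for i in range(len(intervals) - 1))
    segments = [{row[x:y] for row in msa} for (x, y) in intervals]
    widths = [y - x for (x, y) in intervals]
    return segments, widths

# The three founders of Figure 1(b), segmented into blocks of widths 4, 4, 3:
founders = ["ATATCGTATAT", "TTACGATTATC", "CATGTCGAGAG"]
S, W = segmentation(founders, [(0, 4), (4, 8), (8, 11)])
```

On this toy input the first block is the set {ATAT, TTAC, CATG}, matching the first block of the founder graph in Figure 1(d).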
Definition 2 (Block Graph). A block graph is a graph G = (V, E, ℓ) where ℓ: V → Σ+ is a function that assigns a string label to every node and for which the following properties hold:

1. {V_1, V_2, ..., V_b} is a partition of V, that is, V = V_1 ∪ V_2 ∪ ... ∪ V_b and V_i ∩ V_j = ∅ for all i ≠ j;
2. if (v, w) ∈ E then v ∈ V_i and w ∈ V_{i+1} for some 1 ≤ i ≤ b − 1;
3. if v, w ∈ V_i then |ℓ(v)| = |ℓ(w)| for each 1 ≤ i ≤ b, and if v ≠ w, then ℓ(v) ≠ ℓ(w).

As a convention, we call every V_i a block and every ℓ(v) a segment.

Given a segmentation S of a MSA, we can define the founder graph as a block graph induced by S. The idea is to have a graph in which the nodes represent the strings in S while the edges retain the information of how such strings can be recombined to spell any sequence in the original MSA.

Definition 3 (Founder Graph). A founder graph is a block graph G(S) = (V, E, ℓ) induced by segmentation S as follows: For each 1 ≤ i ≤ b we have S_i = { ℓ(v) : v ∈ V_i }, and (v, w) ∈ E if and only if there exist i ∈ [1..b−1] and t ∈ [1..m] such that v ∈ V_i, w ∈ V_{i+1}, and R_t[j..j + |ℓ(v)| + |ℓ(w)| − 1] = ℓ(v)ℓ(w) with j = 1 + Σ_{h=1}^{i−1} W(S_h).

We regard the edges of founder and block graphs to be directed from left to right. Consider a path P in G(S) between any two nodes. The label ℓ(P) of P is the concatenation of the labels of the nodes on the path. Let Q be a query string. We say that Q occurs in G(S) if Q is a substring of ℓ(P) for some path P of G(S). Figure 2 illustrates such queries. In our example in Figure 1, the intervals corresponding to the segmentation would be [1..4], [5..8], [9..11].

A trie [24] of a set of strings is a rooted directed tree with the outgoing edges of each node labeled by distinct characters, such that there is a root-to-leaf path spelling each string in the set; the shared part of the root-to-leaf paths to two different leaves spells the common prefix of the corresponding strings. Such a trie can be computed in O(N log σ) time, where N is the total length of the strings, and it supports string queries that require O(q log σ) time, where q is the length of the queried string.

An Aho-Corasick automaton [25] is a trie of a set of strings with additional pointers (fail-links). While scanning a query string, these pointers (and some shortcut links on them) allow one to identify all the positions in the query at which a match for any of the strings occurs. Construction of the automaton takes the same time as that of the trie. Queries take O(q log σ + occ) time, where occ is the number of matches.

A suffix array [26] of string T is an array SA[1..n+1] such that SA[i] = j if T'[j..n+1] is the i-th smallest suffix of string T' = T·0, where T ∈ {1, 2, ..., σ}^n and 0 is the end marker. Thus, SA[1] = n + 1. The Burrows-Wheeler transform
BWT[1..n+1] [9] of string T is defined as BWT[i] = T'[SA[i] − 1], where T' = T·0 and T'[0] is regarded as T'[n+1] = 0.

A bidirectional BWT index [27, 28] is a succinct index structure based on some auxiliary data structures on the BWT. Given a string T ∈ Σ^n, with σ ≤ n, such an index occupying O(n log σ) bits of space can be built in randomized O(n) time, and it supports finding in O(q) time if a query string Q[1..q] appears as a substring of T [28]. Moreover, the query returns an interval pair ([i..j], [i'..j']) such that the suffixes of T starting at positions SA[i], SA[i+1], ..., SA[j] share a common prefix matching the query. Interval [i'..j'] is the corresponding interval in the suffix array of the reverse of T. Let ([i..j], [i'..j']) be the interval pair corresponding to query substring Q[l..r]. A bidirectional backward step updates the interval pair ([i..j], [i'..j']) to the corresponding interval pair when the query substring Q[l..r] is extended to the left into Q[l−1..r] (left extension) or to the right into Q[l..r+1] (right extension). Such a step takes constant time [28].

A fully-functional bidirectional BWT index [29] expands the steps to allow contracting symbols from the left or from the right. That is, substring Q[l..r] can be modified into Q[l+1..r] (left contraction) or into Q[l..r−1] (right contraction), and the corresponding interval pair can be updated in constant time.

Among the auxiliary structures used in BWT-based indexes, we explicitly use the rank and select structures: a string B[1..n] from a binary alphabet is called a bitvector. Operation rank(B, i) returns the number of 1s in B[1..i]. Operation select(B, j) returns the index i containing the j-th 1 in B. Both queries can be answered in constant time using an index requiring o(n) bits in addition to the bitvector itself [30].

We summarize a result that is sufficient for our purposes:

Lemma 1 ([29]). Given a text T of length n from an alphabet {1, 2, ..., σ}, it is possible to construct in O(n) randomized time and O(n log σ) bits of space a bidirectional BWT index that supports left extensions and right contractions in O(1) time.

We now describe how to replace the randomized linear time construction of the bidirectional BWT index with a deterministic one, so that left extensions and right contractions are supported in O(log σ) time. We need only a single-directional subset of the index of Belazzougui and Cunial [29], consisting of an index on the BWT, augmented with balanced parentheses representations of the topologies of the suffix tree of T and of the suffix link tree of the reverse of T, such that the nodes corresponding to maximal repeats in both topologies are marked [29].

Belazzougui et al. showed how to construct the BWT of a string in O(n) deterministic time and O(n log σ) bits of space [28]. Their algorithm can be used to construct both the BWT of T and the BWT of the reverse of T in O(n) time. We can build the bidirectional BWT index [27, 31] by indexing both BWTs as wavelet trees [32]. The bidirectional index can then be used to construct both of the required tree topologies in O(n log σ) time using bidirectional extension operations and the counter-based topology construction method of Belazzougui et al. [28].
See the supplement of a paper on variable order Markov models by Cunial et al. [33] for more details on the construction of the tree topologies. The topologies are then indexed for various navigational operations required by the contraction operation described in [29].

This index enables left extensions in O(log σ) time using the BWT of T, and right contractions in constant time using the succinct tree topologies, as shown in [29]. We summarize the results in the lemma below.

Lemma 2.
Given a text T of length n from an alphabet {1, 2, ..., σ}, it is possible to construct in O(n log σ) deterministic time and O(n log σ) bits of space a bidirectional BWT index that supports left extensions in O(log σ) time and right contractions in O(1) time.

We note that it may be possible to improve the time of left extensions to O(log log σ) by replacing monotone minimum perfect hash functions with slower yet deterministic linear time constructable data structures in all constructions leading to Theorem 6.7 of Belazzougui et al. [28].

4 Indexable founder graphs

We now show that there exists a family of founder graphs that admit a polynomial time constructable index structure supporting fast string matching. First, a trivial observation: the input multiple alignment is a founder graph for the segmentation consisting of only one segment. Such a founder graph (a set of sequences) can be indexed in linear time to support linear time string matching [28]. Now, the question is, are there other segmentations that allow the resulting founder graph to be indexed in polynomial time? We show that this is the case.
Definition 4. Founder graph G(S) is repeat-free if each ℓ(v) for v ∈ V occurs exactly once in G(S).

Our example graph (Fig. 1) is not quite repeat-free, as TAT occurs also as a substring of paths starting with ATAT.

Proposition 1. Repeat-free founder graphs can be indexed in polynomial time to support polynomial time string queries.

To prove the proposition, we construct such an index and show how queries can be answered efficiently.

Let P(v) be the set of all paths starting from node v and ending in a sink node. Let P(v, i) be the set of suffix path labels { ℓ(L)[i..] | L ∈ P(v) } for 1 ≤ i ≤ |ℓ(v)|. Consider sorting P = ∪_{v ∈ V, 1 ≤ i ≤ |ℓ(v)|} P(v, i) in lexicographic order. Then one can binary search any query string Q in P to find out if it occurs in G(S) or not. The problem with this approach is that P is of exponential size.

However, if we know that G(S) is repeat-free, we know that the lexicographic order of ℓ(L)[i..], L ∈ P(v), is fully determined by the prefix ℓ(v)[i..|ℓ(v)|]ℓ(w) of ℓ(L)[i..], where w is the node following v on the path L. Let P'(v, i) denote the set of suffix path labels cut in this manner. Now the corresponding set P' = ∪_{v ∈ V, 1 ≤ i ≤ |ℓ(v)|} P'(v, i) is no longer of exponential size. Consider again binary searching a string Q on sorted P'. If Q occurs in P', then it occurs in G(S). If not, Q has to have some ℓ(v) for v ∈ V as its substring in order to occur in G(S).

To figure out if Q contains ℓ(v) for some v ∈ V as its substring, we build an Aho-Corasick automaton [25] for { ℓ(v) | v ∈ V }. Scanning this automaton takes O(|Q| log σ) time and returns such v ∈ V if it exists.

To verify such a potential match, we need several tries [24]. For each v ∈ V, we build tries R(v) and F(v) on the sets { ℓ(u)^{-1} | (u, v) ∈ E } and { ℓ(w) | (v, w) ∈ E }, respectively, where X^{-1} denotes the reverse x_{|X|} x_{|X|-1} ... x_1 of string X = x_1 x_2 ... x_{|X|}.

Assume now that we have located (using the Aho-Corasick automaton) v ∈ V with ℓ(v) such that ℓ(v) = Q[i..j], where v is at the k-th block of G(S).
We continue searching Q[1..i−1] from trie R(v). If we reach a leaf after scanning Q[i'..i−1], we continue searching Q[1..i'−1] on trie R(v'), where v' ∈ V is the node at block k−1 of G(S) corresponding to the leaf we reached in the trie. If the search succeeds after reading Q[1], we have found a path in G(S) spelling Q[1..j]. We repeat the analogous procedure with Q[j..|Q|] starting from trie F(v). That is, we can verify a candidate occurrence of Q in G(S) in O(|Q| log σ) time, as the search in the tries takes O(log σ) time per step.

We are now ready to specify a theorem that reformulates Proposition 1 in detailed form.

Theorem 1.
Let G = (V, E) be a repeat-free founder graph with blocks V_1, V_2, ..., V_b such that V = V_1 ∪ V_2 ∪ ... ∪ V_b. We can preprocess an index structure for G in O((N + W|E|) log σ) time, where {1, ..., σ} is the alphabet for node labels, W = max_{v ∈ V} |ℓ(v)|, N = Σ_{v ∈ V} |ℓ(v)|, and σ ≤ N. Given a query string Q[1..q] ∈ {1, ..., σ}^q, we can use the index structure to find out if Q occurs in G. This query takes O(|Q| log σ) time.

Proof. To see that the approach stated above works correctly, we need to show that it suffices to verify exactly one arbitrary candidate occurrence identified by the Aho-Corasick automaton. Indeed, for contradiction, assume our verification fails in finding Q starting from a candidate match Q[i..j] = ℓ(v), but there is another candidate match Q[i'..j'] = ℓ(w), v ≠ w, resulting in an occurrence of Q in G. First, we can assume that [i..j] is not included in [i'..j'] and [i'..j'] is not included in [i..j], since such nested cases would contradict the repeat-free property of G. Now, starting from the candidate match Q[i'..j'], we will find an occurrence of Q[i..j] in G when extending to the left or to the right. This occurrence cannot be ℓ(v), as we assumed the verification starting from Q[i..j] = ℓ(v) fails. That is, we found another occurrence of ℓ(v) in G, which is a contradiction with the repeat-free property. Hence, the verification starting from an arbitrary candidate match is sufficient.

With preprocessing time O(N log σ) we can build the Aho-Corasick automaton [25]. The tries can be built in O((Σ_{v ∈ V}(Σ_{(u,v) ∈ E} |ℓ(u)| + Σ_{(v,w) ∈ E} |ℓ(w)|)) log σ) = O(|E| W log σ) time. The search for a candidate match and the following verification take O(|Q| log σ) time.

We are left with the case of short queries not spanning a complete node label. To avoid the costly binary search in sorted P', we instead construct the unidirectional BWT index [28] for the concatenation C = Π_{i ∈ {1, 2, ..., b}} Π_{v ∈ V_i, (v,w) ∈ E} ℓ(v)ℓ(w)0. Concatenation C is thus a string of length O(|E| W) from the alphabet {0, 1, 2, ..., σ}. The unidirectional BWT index for C can be constructed in O(|C|) time, so that in O(|Q|) time one can find out if Q occurs in C [28]. This query equals that of the binary search in P'.

We remark that founder graphs have a connection with generalized degenerate strings (GD strings) [6]. In a GD string, sets of strings of equal length are placed one after the other to represent in a compact way a bigger set of strings. Such a set contains all possible concatenations of those strings, which are obtained by scanning the individual sets from left to right and selecting one string from each set. The length of the strings in a specific set is called width, and the sum of all the widths of all sets in a GD string is the total width. Given two GD strings of the same total width, it is possible to determine if the intersection of the sets of strings that they represent is non-empty in time linear in the size of the GD strings [6]. Thus, the special case in which one of the two GD strings is just a standard string can be seen also as a special case of a founder graph in which every segment is fully connected with the next one and the length of the query string is equal to the maximal length of a path in the graph.

We consider the question of indexing GD strings (fully connected block graphs) to search for queries Q shorter than the total width. We can exploit the repeat-free property to yield such an index.
The results of Theorem 1 hold for a repeat-free GD string, a.k.a. a fully connected repeat-free founder graph.
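To make the short-query mechanism from the proof of Theorem 1 concrete, here is a minimal Python sketch (our own illustration, not the paper's implementation) that builds the concatenation C of all edge labels ℓ(v)ℓ(w), separator-delimited, and answers queries that span at most two consecutive nodes by plain substring search. The real index replaces the naive scan with a BWT index, and longer queries require the Aho-Corasick and trie verification described above. The blocks and edges below are transcribed from Figure 1(d); the separator "$" stands in for the 0 symbol.

```python
# Nodes are identified by (block index, index within block).
def build_concatenation(blocks, edges):
    """Concatenate l(v)l(w) over all edges, separated by a sentinel."""
    parts = []
    for (bi, vi), (bj, wi) in sorted(edges):
        parts.append(blocks[bi][vi] + blocks[bj][wi])
    return "$".join(parts)

def occurs_short(C, Q):
    """Check a query that spans at most two consecutive nodes."""
    return any(Q in part for part in C.split("$"))

# Founder graph of Figure 1(d): 3 blocks; edges follow the three founders.
blocks = [["ATAT", "TTAC", "CATG"],
          ["CGTA", "GATT", "TCGA"],
          ["TAT", "ATC", "GAG"]]
edges = {((0, 0), (1, 0)), ((0, 1), (1, 1)), ((0, 2), (1, 2)),
         ((1, 0), (2, 0)), ((1, 1), (2, 1)), ((1, 2), (2, 2))}
C = build_concatenation(blocks, edges)
```

For example, the query TACGAT crosses the edge TTAC → GATT and is found inside the part "TTACGATT" of C.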
Observe that max(N, W|E|) ≤ mn, where m and n are the number of rows and the number of columns, respectively, in the multiple sequence alignment from which the founder graph is induced. That is, the index construction algorithms of the above theorems can be seen to be almost linear time in the (original) input size. We study succinct variants of these indexes in Sect. 6, and also improve the construction and query times to linear as a side product.

5 Construction of repeat-free founder graphs

Now that we know how to index repeat-free founder graphs, we turn our attention to the construction of such graphs from a given MSA. For this purpose, we will adapt the dynamic programming segmentation algorithms for founders [2, 3].

The idea is as follows. Let S be a segmentation of MSA[1..m, 1..n]. We say S is valid if it induces a repeat-free founder graph G(S) = (V, E). We build such a valid S for prefixes of the MSA from left to right, minimizing the maximum block length needed.

5.1 Characterization lemma

Given a segmentation S and founder graph G(S) = (V, E) induced by S, we can ensure that it is valid by checking if, for all v ∈ V, ℓ(v) occurs in the rows of the MSA only in the interval of the block V_i, where V_i is the block of V such that v ∈ V_i.

Lemma 3 (Characterization). Let x(i) = 1 + Σ_{h=1}^{i−1} W(S_h). A segmentation S is valid if and only if, for all blocks V_i ⊆ V, 1 ≤ t ≤ m and j ≠ x(i), if v ∈ V_i then R_t[j..j + |ℓ(v)| − 1] ≠ ℓ(v).

Proof. To see that this is a necessary condition for the validity of S, notice that each row of the MSA can be read through G, so if ℓ(v) occurs elsewhere than inside the block, then these extra occurrences make S invalid.
To see that this is a sufficient condition for the validity of S, we observe the following:

a) For all (v, w) ∈ E, ℓ(v)ℓ(w) is a substring of some row of the input MSA.

b) Let (x, u), (u, y) ∈ E be two edges such that U = ℓ(x)ℓ(u)ℓ(y) is not a substring of any row of the input MSA. Then any substring of U either occurs in some row of the input MSA or it includes ℓ(u) as its substring.

c) Thus, any substring of a path in G either is a substring of some row of the input MSA, or it includes ℓ(u) of case b) as its substring.

d) Let α be a substring of a path of G that includes ℓ(u) as its substring. If ℓ(z) = α for some z ∈ V, then ℓ(u) appears at least twice in the MSA. Substring α makes S invalid only if ℓ(u) does.

Among the valid segmentations, we wish to select an optimal segmentation under some goodness criteria. Earlier work [2, 3] has considered various goodness criteria, solving the associated segmentation problems in linear time. In this paper, we focus on minimizing the maximum width of the segments. For example, the (non-valid) segmentation in Figure 1 has score 4, as the intervals of the segments are [1..4], [5..8], and [9..11], with widths W(S_1) = 4 − 1 + 1 = 4, W(S_2) = 8 − 5 + 1 = 4, and W(S_3) = 11 − 9 + 1 = 3.

Let s(j') be the score of a minimum scoring valid segmentation S_1, S_2, ..., S_b of prefix MSA[1..m, 1..j'], where the score is defined as max_{i: 1 ≤ i ≤ b} W(S_i). We can compute

s(j) = min_{j': 0 ≤ j' ≤ v(j)} max(j − j', s(j')),   (1)

where v(j) is the largest integer such that segment MSA[1..m, v(j)+1..j] is valid. The segment is valid iff each substring MSA[i, v(j)+1..j], for 1 ≤ i ≤ m, occurs as many times in MSA[1..m, v(j)+1..j] as in the whole MSA. If such v(j) does not exist for some j, we set v(j) = 0.
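The validity test used above can be spelled out naively (a quadratic-time Python sketch of our own, not the paper's BWT-based algorithm), using 0-based, half-open column ranges: a segment is valid iff each row's segment string occurs in the rows of the MSA only inside the segment's columns.

```python
def count_occs(text, pat):
    """Count (possibly overlapping) occurrences of pat in text."""
    return sum(1 for k in range(len(text) - len(pat) + 1)
               if text[k:k + len(pat)] == pat)

def segment_is_valid(msa, a, b):
    """Check validity of columns [a, b): every distinct segment string must
    occur in the rows of the MSA exactly as often as it occurs aligned to
    these columns (no extra occurrences elsewhere)."""
    for seg in {row[a:b] for row in msa}:
        aligned = sum(row[a:b] == seg for row in msa)      # inside the block
        total = sum(count_occs(row, seg) for row in msa)   # anywhere in rows
        if total != aligned:
            return False
    return True

founders = ["ATATCGTATAT", "TTACGATTATC", "CATGTCGAGAG"]
```

On the toy founders the first block (columns 0..4) is invalid, since ATAT occurs a second time near the end of the first row; the trivial single-segment segmentation is valid.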
The intuition is that [j'+1..j] forms the last valid segment of the segmentation, and since s(j') is the score of an optimal valid segmentation of MSA[1..m, 1..j'], the maximum of s(j') and the length of the last segment decides the score s(j). To initialize the recurrence, one can set s(0) = 0 and s(j) = ∞ for the values 1 ≤ j ≤ J for which v(j) is undefined. The recurrence can then be applied for j ∈ [J+1..n].

We can compute the score s(n) of the optimal segmentation of the MSA in O(n·s_max) time after preprocessing the values v(j) in O(mn·s_max·log σ) time, where s_max = max_{j: v(j) > 0} s(j). For the former, one can start comparing max(j − j', s(j')) from j' = v(j), decreasing j' by one each step and maintaining s(j) as the minimum value so far. Once the value j − j' grows bigger than the current s(j), we know that the value of s(j) can no longer decrease. This means we can decrease j' exactly s(j) − (j − v(j)) ≤ s(j) times. For preprocessing, we build the bidirectional BWT index of the MSA in O(mn) time [28]. At column j, consider the trie containing the reverses of the rows of MSA[1..m, 1..j]. Search the trie paths from the bidirectional BWT index until the number of leaves in each trie subtree equals the length of the corresponding BWT interval. Let j' be the column closest to j where this holds for all trie paths. Then one can set v(j) = j'. The O(m(j − v(j)) log σ) time construction of the trie has to be repeated for each column. As j − v(j) ≤ s(j), the claimed preprocessing time follows.

We can do the preprocessing for the values v(j) in O(mn) time. The idea is to build a BWT index on the MSA rows, and then search all rows backward from right to left in parallel.
Oncewe reach a column j (cid:48) where all suffixes have altogether exactly m occurrences (their union ofBWT intervals is of size m ), then MSA [1 ..m, j (cid:48) ..n ] is a valid segment. Then we can drop lastcolumn (do right-contract on all rows) and continue left-extensions until finding the largest j (cid:48) such that MSA [1 ..m, j (cid:48) ..n −
1] is a valid segment. Continuing this way, we can find foreach column j the value v ( j ) = j (cid:48) −
1. The bottleneck of the approach is the computationof the size of the union of intervals, but we can avoid a trivial computation by exploitingthe repeat-free property and the order in which these intervals are computed.
Theorem 2.
Given a multiple sequence alignment
MSA[1..m, 1..n] with each MSA[i, j] ∈ [1..σ], the values v(j) for each 1 ≤ j ≤ n can be computed in randomized O(mn) time or deterministic O(mn log σ) time. Here v(j) is the largest integer such that segment MSA[1..m, v(j)+1..j] is valid.

Proof. Let us build the bidirectional BWT index [28] of the MSA rows concatenated into one long string, with some separator symbols added between the rows. We will run several algorithms in synchronization over this BWT index, but we explain them first as if they were run independently.

Algorithm 1 searches in parallel all rows from right to left, advancing each by one position at a time. Let k be the number of parallel steps done so far. We maintain a bitvector M that at the k-th step stores M[i] = 1 iff BWT[i] is the k-th last symbol of some row.

Algorithm 2 uses the variable-length sliding window approach of Belazzougui and Cunial [29] to compute the values v(j). Let the first row of MSA be T[1..n]. Search T[1..n] backwards in the fully-functional bidirectional BWT index [29]. Stop the search at T[j′+1..n] such that the corresponding BWT interval [i′..i] contains only suffixes originating from column j′+1 of the MSA, that is, spelling MSA[a, j′+1..n] in the concatenation, for some rows a. Set v_b(n) = j′ for row b = 1. Contract T[n] from the search string and modify the BWT interval accordingly [29]. Continue the search (decreasing j′ by one at each step) to find T[j′+1..n−1] such that the corresponding BWT interval [i′..i] contains only suffixes originating from column j′+1. Update v_b(n−1) = j′ for row b = 1. Continue like this throughout T. Repeat the process for all remaining rows b ∈ [2..m], to compute v_2(j), v_3(j), ..., v_m(j) for all j. Set v(j) = min_i v_i(j) for all j.

Let us call the instances of Algorithm 2 run on the rest of the rows Algorithms 3, 4, ..., m+1.

Let the current segment considered in Algorithms 2 to m+1 be [j′+1..j]. The problematic part in them is checking if the corresponding active BWT intervals [i′_a..i_a] of Algorithms a ∈ {2, 3, ..., m+1} contain only suffixes originating from column j′+1. To solve this, we run Algorithm 1 as well as Algorithms 2 to m+1 in synchronization so that we are at the k-th step in Algorithm 1 when we are processing interval [j′+1..j] in the rest of the algorithms, for k = n − j′. In addition, we maintain bitvectors B and E such that B[i′_a] = 1 and E[i_a] = 1 for a ∈ {2, 3, ..., m+1}. For each M[i] that we set to 1 at step k with B[i] = 0 and E[i] = 0, we check if M[i−1] = 1 and M[i+1] = 1. If and only if this check fails on any i, there is a suffix starting outside column j′+1. This follows from the fact that each suffix starting at column j′+1 must be contained in exactly one of the distinct intervals of the set I = {[i′_a..i_a]}_{a ∈ {2,...,m+1}}. This is because I cannot contain nested interval pairs, as all strings in segment [j′+1..j] of MSA are of equal length, and thus their BWT intervals cannot overlap except if the intervals are exactly the same.

Finally, the running time of the main algorithm is O(mn) or O(mn log σ), using Lemma 1 or Lemma 2, respectively, since the bitvectors are manipulated locally only on indexes that are maintained as variables during the execution.

Recall Eq. (1). Before proceeding to the involved optimal solution, we give some insights by first improving the running time to logarithmic per entry. As it holds v(1) ≤ v(2) ≤ ··· ≤ v(n), the range where the minimum is taken grows as j grows. Now, [j′..j′+s(j′)] can be seen as the effect range of s(j′): for columns j > j′+s(j′) the maximum of the options is j − j′. Consider maintaining (key, value) pairs (s(j′)+j′, s(j′)) in a binary search tree (BST). When computing s(j), we should have the pairs (s(j′)+j′, s(j′)) for 1 ≤ j′ ≤ v(j) in the BST. The value s(j) can then be computed by taking a range minimum on the BST values with keys in the range [j..∞]. Such a query is easy to solve in O(log n) time. If there is nothing in the interval, then s(j) = j − v(j). Since this is a semi-open interval on keys in the range [1..n], the BST can be replaced by a van Emde Boas tree to obtain an O(n log log n) time computation of all values [34]. Alternatively, we can remove elements from the BST once they can no longer be answers to queries, and we get an O(n log s_max) solution.
To obtain a better running time, we need to exploit more structural properties of the recurrence. Cazaux et al. [3] considered a similar recurrence and gave a linear time solution for it. In what follows, we modify that technique to work with valid ranges. For j between 1 and n, we define

    x(j) = max Argmin_{j′ ∈ [1..v(j)]} max(j − j′, s(j′)).

Lemma 4.
For any j ∈ [1..n−1], we have x(j) ≤ x(j+1).

Proof. By the definition of x(·), for any j ∈ [1..n], we have max(j − j′, s(j′)) ≥ max(j − x(j), s(x(j))) for j′ ∈ [1..x(j)−1], and max(j − j′, s(j′)) > max(j − x(j), s(x(j))) for j′ ∈ [x(j)+1..v(j)].

Assume that there exists j ∈ [1..n−1] such that x(j+1) < x(j). In this case, x(j+1) ∈ [1..x(j)−1] and we have max(j − x(j+1), s(x(j+1))) ≥ max(j − x(j), s(x(j))). As v(j+1) ≥ v(j), x(j) ∈ [x(j+1)+1..v(j+1)] and thus max(j+1 − x(j+1), s(x(j+1))) < max(j+1 − x(j), s(x(j))). As x(j+1) < x(j), we have j − x(j+1) > j − x(j). To simplify the proof, we take A = j − x(j+1), B = s(x(j+1)), C = j − x(j) and D = s(x(j)). Hence, we have max(A, B) ≥ max(C, D), max(A+1, B) < max(C+1, D) and A > C. Now we are going to prove that this system admits no solution.

• Case where A = max(A, B) and C = max(C, D). As A > C, we have A+1 > C+1 and thus max(A+1, B) > max(C+1, D), which is impossible because max(A+1, B) < max(C+1, D).

• Case where B = max(A, B) and C = max(C, D). We can assume that B > A (in the other case, we take A = max(A, B)), and as A > C, we have B > C+1 and thus max(A+1, B) > max(C+1, D), which is impossible because max(A+1, B) < max(C+1, D).

• Case where A = max(A, B) and D = max(C, D). We have A ≥ D and A > C, thus max(A+1, B) > max(C+1, D), which is impossible because max(A+1, B) < max(C+1, D).

• Case where B = max(A, B) and D = max(C, D). We have B ≥ D and A > C, thus max(A+1, B) ≥ max(C+1, D), which is impossible because max(A+1, B) < max(C+1, D).

Lemma 5.
By initialising s(1) to a threshold K, for any j ∈ [1..n], we have s(j) ≤ max(j, K).

Proof. We proceed by induction. The base case is obvious because s(1) = K ≤ max(1, K). As s(j) = min_{j′: 1 ≤ j′ ≤ v(j)} max(j − j′, s(j′)), by using induction, s(j) ≤ min_{j′: 1 ≤ j′ ≤ v(j)} max(j − j′, max(j′, K)) ≤ max(j, K).

Thanks to Lemma 5, by taking the threshold K = n+1, the values s(j) are in O(n) for all j in [1..n].

Lemma 6. Given j⋆ ∈ [x(j−1)+1..v(j)], we can compute in constant time if j⋆ = max Argmin_{j′ ∈ [j⋆..v(j)]} max(j − j′, s(j′)).

Proof.
We just need to compare k = max(j − j⋆, s(j⋆)) and s(j◇), where j◇ ∈ Argmin_{j′ ∈ [j⋆+1..v(j)]} s(j′). If k is smaller than s(j◇), then k is smaller than all the s(j′) with j′ ∈ [j⋆+1..v(j)], and thus smaller than all the max(j − j′, s(j′)). Hence we have j⋆ = max Argmin_{j′ ∈ [j⋆..v(j)]} max(j − j′, s(j′)). Otherwise, s(j◇) ≤ k, and as k ≥ j − j⋆ > j − j◇, we have max(j − j◇, s(j◇)) ≤ k. In this case j⋆ ≠ max Argmin_{j′ ∈ [j⋆..v(j)]} max(j − j′, s(j′)). By using the constant time semi-dynamic range maximum query by Cazaux et al. [3] on the array s(·), we can obtain j◇ in constant time, and thus check the equality in constant time.

Theorem 3.
The values s(j) of Eq. (1), for all j ∈ [1..n], can be computed in O(n) time after a randomized O(mn) time or deterministic O(mn log σ) time preprocessing. The optimal segmentation defined by Eq. (1) yields a repeat-free founder graph.

Proof. We begin by preprocessing all the values v(j) in O(mn) randomized or O(mn log σ) deterministic time (Theorem 2). The idea is to compute all the values s(j) in increasing order of j by using the values x(j). For each j ∈ [1..n], we check all the j′ from x(j−1) to v(j) with the equality of Lemma 6 until one is true and thus corresponds to x(j). Finally, we add s(j) = max(j − x(j), s(x(j))) to the constant time semi-dynamic range maximum query and continue with j + 1.

Recall the indexing solutions of Sect. 4 and the definitions from Sect. 3. We now show that the explicit tries and the Aho–Corasick automaton can be replaced by some auxiliary data structures associated with the Burrows–Wheeler transform of the concatenation

    C = Π_{i ∈ {1,2,...,b}} Π_{v ∈ V_i, (v,w) ∈ E} ℓ(v)ℓ(w).

Consider the interval SA[i..k] in the suffix array of C corresponding to the suffixes having ℓ(v) as prefix for some v ∈ V. From the repeat-free property it follows that this interval can be split into two subintervals, SA[i..j] and SA[j+1..k], such that the suffixes in SA[i..j] start with ℓ(v) and the suffixes in SA[j+1..k] start with ℓ(v)ℓ(w), where (v, w) ∈ E. Moreover, BWT[i..j] equals the multiset {ℓ(u)[|ℓ(u)|] : (u, v) ∈ E} sorted in lexicographic order. This follows by considering the lexicographic order of the suffixes ℓ(u)[|ℓ(u)|]ℓ(v)··· for (u, v) ∈ E. That is, BWT[i..j] (interpreted as a set) represents the children of the root of the trie R(v) considered in Sect.
4.

We are now ready to present the search algorithm that uses only the BWT of C and some small auxiliary data structures. We associate two bitvectors B and E to the BWT of C as follows. We set B[i] = 1 and E[k] = 1 iff SA[i..k] is a maximal interval with all suffixes starting with ℓ(v) for some v ∈ V.

Consider the backward search with query Q[1..q]. Let SA[j′..k′] be the interval after matching the shortest suffix Q[q′..q] such that BWT[j′] = 0. Let i = select(B, rank(B, j′)) and k = select(E, rank(B, j′)). If i ≤ j′ and k′ ≤ k, index j′ lies inside an interval SA[i..k] where all suffixes start with ℓ(v) for some v. We modify the range into SA[i..k], and continue with the backward step on Q[q′−1]. We call this procedure the expanded backward search.

We can now strictly improve Theorem 1 as follows.

Theorem 4.
Let G = (V, E) be a repeat-free founder graph (or a repeat-free GD string) with blocks V_1, V_2, ..., V_b such that V = V_1 ∪ V_2 ∪ ··· ∪ V_b. We can preprocess an index structure for G occupying O(W|E| log σ) bits in O(W|E|) time, where {1, ..., σ} is the alphabet for node labels and W = max_{v ∈ V} |ℓ(v)|. Given a query string Q[1..q] ∈ {1, ..., σ}^q, we can use the expanded backward search with the index structure to find out if Q occurs in G. This query takes O(|Q|) time.

Proof. As we expand the search interval in the BWT, it is evident that we still find all occurrences of short patterns that span at most two nodes, as in the proof of Theorem 1. We need to show that a) the expansions do not yield spurious occurrences for such short patterns, and b) the expansions yield exactly the occurrences for long patterns that we earlier found with the Aho–Corasick and tries approach. In case b), notice that after an expansion step we are indeed in an interval SA[i..k] where all suffixes match ℓ(v) and thus correspond to a node v ∈ V. The suffix of the query processed before reaching interval SA[i..k] must be at least of length |ℓ(v)|. That is, to mimic the Aho–Corasick approach, we should continue with the trie R(v). This is identical to taking a backward step from BWT[i..k], and continuing therein to follow the rest of this implicit trie.

To conclude case b), we still need to show that we reach all the same nodes as when using Aho–Corasick, and that the search in the other direction with L(v) can be avoided. These follow from case a), as we will see.

In case a), before doing the first expansion, the search is identical to the original algorithm in the proof of Theorem 1. After the expansion, all matches to be found are those of case b). That is, no spurious matches are reported.
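The interval-expansion step of the expanded backward search can be illustrated with naive rank/select on the bitvectors B and E (a plain-Python sketch, 0-indexed, not the succinct implementation):

```python
def rank1(bv, i):
    """Number of 1-bits in bv[0..i]."""
    return sum(bv[: i + 1])

def select1(bv, r):
    """Position of the r-th 1-bit (r is 1-based), or None if it does not exist."""
    count = 0
    for pos, bit in enumerate(bv):
        count += bit
        if count == r:
            return pos
    return None

def expand(B, E, jp, kp):
    """If SA[jp..kp] lies inside a maximal interval SA[i..k] whose suffixes all
    start with some node label (B marks its start i, E its end k), return [i..k];
    otherwise return [jp..kp] unchanged."""
    r = rank1(B, jp)
    if r == 0:
        return (jp, kp)
    i, k = select1(B, r), select1(E, r)
    if i is not None and k is not None and i <= jp and kp <= k:
        return (i, k)
    return (jp, kp)
```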
Finally, no search interval can include two distinct node labels, so the search reaches the only relevant node label, where the Aho–Corasick and trie search simulation takes place. We reach all such nodes that can yield a full match for the query, as the proof of Theorem 1 shows that it is sufficient to follow an arbitrary candidate match.

As we only need the standard backward step, we can use a unidirectional BWT index, constructable in deterministic O(W|E|) time, supporting backward steps in constant time [28].

Wheeler graphs are a class of labeled graphs that admit an efficient index for path queries [16]. In this section we show how to transform a repeat-free block graph into an equivalent Wheeler graph. This gives an alternative way to index a repeat-free block graph.

We view a block graph as a nondeterministic finite automaton (NFA) by adding a new initial state and edges from the source node to the starts of the first block, and expanding each row of each block to a path of states. Since in automata the labels are usually on the edges instead of the nodes, we define that the label of an edge is the label of the destination node.

Suppose we have a repeat-free NFA F. First we determinize it with the standard subset construction for the reachable subsets of states. The states of the DFA are subsets of states of the NFA such that there is an edge from subset S to subset S′ with label c iff S′ is the set of states at the destinations of edges labeled with c from S. We only represent the subsets of states reachable from the subset containing only the initial state. We call the deterministic graph G. See Figures 3 and 4 for an example.

Denote with P(v) the set of path labels from the initial state to v. Since the graph is a DFA, all sets P(v) are disjoint. Denote with P_min(v) and P_max(v) the colexicographic minimum and maximum of P(v), respectively (recall that the colexicographic order of strings is the lexicographic order of the reverses of the strings).
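The subset construction described above, with labels on the destination nodes as in the founder-graph NFA, can be sketched as follows (hypothetical input encoding):

```python
from collections import deque

def determinize(edges, labels, initial):
    """Standard subset construction over the reachable subsets of states.
    edges: dict node -> list of successor nodes; labels: dict node -> symbol
    (edge labels are the labels of destination nodes).  Returns the DFA as
    (dfa_edges, states), where states are frozensets of NFA nodes."""
    start = frozenset([initial])
    dfa_edges = {}                      # (state, symbol) -> state
    states = {start}
    queue = deque([start])
    while queue:
        S = queue.popleft()
        by_label = {}                   # group successors of S by their label
        for u in S:
            for w in edges.get(u, []):
                by_label.setdefault(labels[w], set()).add(w)
        for c, T in by_label.items():
            T = frozenset(T)
            dfa_edges[(S, c)] = T
            if T not in states:         # only reachable subsets are represented
                states.add(T)
                queue.append(T)
    return dfa_edges, states
```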
We denote the colexicographic order relation with ≺.

We say that a node v is atomic if for all path labels α in the graph, we have α ∈ P(v) iff P_min(v) ⪯ α ⪯ P_max(v). A DFA is Wheeler if and only if all its nodes are atomic [18]. In this case, the Wheeler order < of nodes is defined so that v < u if α ≺ β for some strings α ∈ P(v) and β ∈ P(u). This is well-defined exactly when all nodes are atomic.

We will expand the graph so that all its nodes become atomic. To achieve this, we process the nodes in breadth-first order from the initial state. If the current node v is at the end of a block, we do nothing. Otherwise, we apply the following transformation. Suppose the in-neighbors of v are u_1, ..., u_k and the out-neighbors of v are w_1, ..., w_l. We delete v, add k new nodes v_1, ..., v_k, and add the sets of edges {(u_i, v_i) | 1 ≤ i ≤ k} and {(v_i, w_j) | 1 ≤ i ≤ k and 1 ≤ j ≤ l}. In other words, we distribute the in-edges of v and duplicate the out-edges. The automaton remains deterministic after this. Figure 5 shows an example of the expansion.

We denote the resulting graph after all transformations with G′. First we note that, by construction, the language of G′ is the same as the language of G: any path from the initial state in G can be mapped to a corresponding path with the same label in G′ and vice versa. Next, we show that the graph is Wheeler-sortable:

Lemma 7.
Every node v in G′ is atomic.

Proof. If v is in the first block of the block graph, then |P(v)| = 1, so v is trivially atomic. Suppose v is not in the first block. Then all strings in P(v) are of the form αβγ, with |α| ≥ |β| > |γ| ≥ 0, such that β occurs in F only once (repeat-free property) and γ is a prefix of some row of some block. The strings β and γ are the same for all strings in P(v). Now since β occurs only once in F, βγ also occurs only once in F. Consider a string δ such that P_min(v) ⪯ δ ⪯ P_max(v). Then δ must be suffixed by βγ, so it must be that δ ∈ P(v), because the only occurrence of βγ leads to v.

Next, we show that the transformation from F to G′ increases the number of nodes by at most a polynomial amount.

Lemma 8.
The number of nodes in G′ is at most O(nm²), where n is the length of a longest path in F and m is the maximum number of rows in a block.

Proof. Each non-source node v of G′ can be associated to a pair (u, α), where u is the node of G′ reached by walking from v back to the end of the previous block, and α is the label of the path from u to v. If v is in the first block, we set u to the added source node. If v is at the end of a block, we set u = v and set α to be the empty string. These pairs are all distinct because G′ is deterministic.

We bound the number of nodes in G′ by bounding the number of possible distinct pairs (u, α). For simplicity, we do not count the initial state in our calculations. Let end(b) be the set of nodes at the end of block b, and let w_b and h_b be the width and height of block b, respectively. Then the number of pairs (u, α) is at most Σ_{b=1}^{B} (|end(b)| + Σ_{u ∈ end(b−1)} (w_b − 1) h_b), where B is the number of blocks, and block b = 0 corresponds to the added initial state of the automaton. The term |end(b)| accounts for the pairs (u, α) where α is the empty string, and the term Σ_{u ∈ end(b−1)} (w_b − 1) h_b accounts for the rest. By plugging in the bounds h_b ≤ m and |end(b)| ≤ m, we have at most Bm + m² Σ_{b=1}^{B} w_b = Bm + m²n nodes, which is O(m²n).

Since G′ is a Wheeler graph, we can find a Wheeler order by running the XBWT sorting algorithm on a spanning tree of G′, as shown in [18]. Finally, we show how to find the Wheeler DFA of minimal size that recognizes the same language as F. One option is to use the minimization algorithm of Alanko et al. [18]. However, it is difficult to implement and is unnecessarily general for our case.

According to Alanko et al. [18], a Wheeler DFA is minimal if all successive pairs of states in the Wheeler order have a different label or are Myhill–Nerode-inequivalent. We denote the Myhill–Nerode equivalence relation with ≈. Recall that u ≈ v is true iff for every string α, running the automaton from u with string α gives the same result as running the automaton from v with string α, i.e. either both runs accept α or both runs reject α.
The motivation for the definition is that if u ≈ v, then states u and v can be merged without changing the language of the automaton.

In our setup, every state is considered accepting, so this means that u ≉ v if there is a string α such that there is a path with label α from exactly one of u and v. Note that if u and v are from different columns, then u ≉ v, because we can select the distinguishing extension α as the label of any maximal path from the node that is closer to the source node.

We will try to merge nodes of G′ column by column from right to left. When we merge a set of nodes, the in-neighbors of the merged node will be the union of the in-neighbors of the individual nodes, and the same for the out-neighbors. First we merge all nodes from the last column with the same label. Suppose then that we are in column i, and by induction the nodes in column i+1 are pairwise Myhill–Nerode inequivalent or have different labels. We merge a set of Wheeler-adjacent nodes S from column i having the same label if the sets of out-neighbors of all nodes in S are the same and all the out-neighbors have the same label. In this case u ≈ v for all u, v ∈ S by the induction assumption. After we have done all the possible merges from column i, if we have two Wheeler-adjacent nodes u, v remaining with the same label, then it must be that u ≉ v, because otherwise they would have been merged. Therefore, after we have processed all columns, the resulting WDFA is minimal.
Figure 3: Repeat-free block NFA. The last columns of each block are highlighted.
Figure 4: The DFA resulting from the subset construction for the NFA in Figure 3. The numbers above the nodes specify the subset of NFA states corresponding to the DFA state.
Figure 5: The Wheeler DFA resulting from running our Wheeler expansion algorithm on the DFA in Figure 4.

Table 1: Segment of MSA with and without gaps.

    segment of MSA    gaps removed
    -AC-CGATC-        ACCGATC
    -A-CCGATCC        ACCGATCC
    AAC-CGATC-        AACCGATC
    AAC-CGA-C-        AACCGAC
We have so far assumed that our input is a gapless multiple alignment. Let us now consider how to extend the results to the general case. The idea is that gaps are only used in the segmentation algorithm to define the valid ranges, and that is the only place where special attention needs to be taken; elsewhere, whenever a substring from the MSA rows is read, gaps are treated as empty strings. That is, A-GC-TA- becomes AGCTA.

It turns out that allowing gaps in the MSA indeed makes the computation of valid ranges more difficult. To see this, consider the example in Table 1. The second column in Table 1 shows the sequences after the gaps are removed. Without even seeing the rest of the MSA, one can see that this is not a valid block, as the first string is a prefix of the second. With gapless MSAs this was not possible, and the algorithm in Sect. 5.3 exploited this fact.

In fact, valid ranges as defined earlier do not make sense with gapped MSAs: Two rows can be identical after gaps are removed, but due to the shift by gaps, they are considered as different entities in the segmentation. To avoid this artificial side-effect, we redefine the valid range concept, and show that our algorithms extend to this natural extension that we call a repeat-free elastic founder graph. This naming comes from an analogy to elastic strings [7].

A multiple sequence alignment MSA[1..m, 1..n] is a set of m strings drawn from Σ ∪ {-}, each of length n. Here - ∉ Σ is the gap symbol.
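Treating gaps as empty strings amounts to deleting the gap symbol; a minimal sketch reproducing the Table 1 example and its prefix problem:

```python
def spell(x):
    """Return the string with the gap symbols '-' removed from x."""
    return x.replace("-", "")

# The segment of Table 1: after gap removal, the first row becomes a proper
# prefix of the second, which cannot happen in a gapless segment of
# equal-length rows.
rows = ["-AC-CGATC-", "-A-CCGATCC", "AAC-CGATC-", "AAC-CGA-C-"]
spelled = [spell(r) for r in rows]
```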
Definition 5 (Elastic segmentation). Let
MSA[1..m, 1..n] be a multiple alignment and let R_1, R_2, ..., R_m be the strings in MSA. An elastic segmentation S of MSA is a set of b sets of strings S_1, S_2, ..., S_b such that for each 1 ≤ i ≤ b there exists an interval [x(i)..y(i)] such that S_i = {spell(R_t[x(i)..y(i)]) | 1 ≤ t ≤ m}, where spell(X) returns the string with the gap symbols removed from X. Furthermore, it holds x(1) = 1, y(b) = n, and y(i) = x(i+1) − 1 for all 1 ≤ i < b, so that S_1, S_2, ..., S_b covers the MSA. We call W(S_i) = y(i) − x(i) + 1 the width of S_i.

Recall the block graphs from Sect. 3. Given an elastic segmentation S of an MSA, we can define the elastic founder graph as the block graph induced by S.

Definition 6 (Elastic founder graph). An elastic founder graph is a block graph G(S) = (V, E, ℓ) induced by an elastic segmentation S as follows: For each 1 ≤ i ≤ b we have S_i = {ℓ(v) : v ∈ V_i}, and (v, w) ∈ E if and only if there exist i ∈ [1..b−1] and t ∈ [1..m] such that v ∈ V_i, w ∈ V_{i+1} and spell(R_t[j..j + |ℓ(v)| + |ℓ(w)| − 1]) = ℓ(v)ℓ(w) with j = 1 + Σ_{h=1}^{i−1} W(S_h).

Table 2: Example of a valid segment whose extension to the left is non-valid. Here X, Y, and Z, with |X| = |Y|, do not appear elsewhere in the MSA, except that Z is a prefix of Y. The longer segment is non-valid, because X is not a prefix but a suffix of AX. Reversing the definition does not help, as the same segment contains AZ as a prefix of AY.

    segment of MSA      longer segment of MSA
    -X                  A-X
    -Y                  A-Y
    Z-^{|Y|−|Z|}        AZ-^{|Y|−|Z|}

Now we need a slight modification of the repeat-free concept in order to handle the variable-length strings in blocks.
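Definition 6 admits a direct construction; a small hypothetical sketch (0-based, half-open column intervals for the blocks, edges recorded as (block index, source label, target label) triples):

```python
def elastic_founder_graph(msa, blocks):
    """Build the elastic founder graph induced by an elastic segmentation.
    msa: list of m equal-length rows over the alphabet plus '-';
    blocks: list of (x, y) half-open column intervals covering the MSA."""
    def spell(x):
        return x.replace("-", "")
    V = []                              # V[i] = sorted distinct labels of block i
    for (x, y) in blocks:
        V.append(sorted({spell(row[x:y]) for row in msa}))
    E = set()
    for i in range(len(blocks) - 1):
        (x, y), (x2, y2) = blocks[i], blocks[i + 1]
        for row in msa:                 # a row witnesses each edge (v, w)
            E.add((i, spell(row[x:y]), spell(row[x2:y2])))
    return V, E
```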
Definition 7.
Elastic founder graph G(S) is repeat-free if each ℓ(v), for v ∈ V, occurs in G only as a prefix of labels ℓ(w), where w ∈ V is a node from the same block as v.

Recall Equation (1) from Sect. 5.2 and the preprocessing algorithm from Sect. 5.3. We redefine the valid range concept as follows. We say that an elastic segmentation S is valid if it induces a repeat-free elastic founder graph G(S) = (V, E). Elastic segment MSA[1..m, j′..j] is valid iff each substring spell(MSA[i, j′..j]), for 1 ≤ i ≤ m, occurs as many times as a prefix of the strings in {spell(MSA[i′, j′..j]) | 1 ≤ i′ ≤ m} as it occurs in {spell(MSA[i′, 1..n]) | 1 ≤ i′ ≤ m}.

The main change to the earlier algorithm is that the preprocessing for finding the largest value v(j) such that MSA[1..m, v(j)+1..j] is valid no longer works: Even if elastic segment MSA[1..m, v(j)+1..j] is valid, elastic segment MSA[1..m, j′..j] may not be valid, for some j′ ≤ v(j). Table 2 shows an example.

Instead of separate preprocessing and dynamic programming, we consider directly the recurrence for the elastic case. Let e(j′) be the score of a minimum scoring valid elastic segmentation S_1, S_2, ..., S_b of the prefix MSA[1..m, 1..j′], where the score is defined as max_{i: 1 ≤ i ≤ b} W(S_i). Then

    e(j) = min { max(j − j′, e(j′)) : 0 ≤ j′ < j, MSA[1..m, j′+1..j] is a valid elastic segment, and e(j′) < j′ + 1 },    (2)

with e(0) initialized to 0. When there is no valid choice for some j, we set e(j) = j + 1.

To test for a valid range [j′+1..j], we adjust the preprocessing algorithm in order to integrate it with the computation of the recurrence as follows:

1. A unidirectional BWT index [28] is built on the MSA rows concatenated into one long string, after the gap characters are removed and some separator symbols are added between the rows.
2. The search on each row of MSA is initiated for each j, decreasing j′ from j − 1.
3. When MSA[i, j] is accessed to alter the BWT interval, the old interval is retained if MSA[i, j] = -.
4.
Modification 3) can cause intervals to become nested, and this needs to be checked for the proper detection of valid ranges.

Recall that at any step of the preprocessing algorithm, we had a non-nested set I = {[i′_a..i_a]}_{a ∈ {2,...,m+1}} of intervals. We exploited the non-nestedness in the use of the bitvectors M (marking suffixes of the current column), B (BWT interval beginnings), and E (BWT interval endings) to detect if I contains only BWT intervals of suffixes of the current column of the MSA. This gave us the linear time algorithm. Now that the intervals can be nested, these bitvectors no longer work as intended. Instead, we resort to a generic method to compute the size of the union of the intervals in I. If this size is m, we know that the range in consideration is valid. This can be done in O(m log m) time, e.g. by sorting the interval endpoints and scanning them while maintaining how many active intervals there are at each endpoint. This union computation is repeated at each column.

Theorem 5.
The values e(j) of Eq. (2), for all j ∈ [1..n], can be computed in O(mn·e_max·log m) time, where e_max = max_j e(j). The optimal elastic segmentation defined by Eq. (2) yields a repeat-free elastic founder graph.

Proof. The unidirectional BWT index can be constructed in O(mn) time, and each left-extension takes constant time [28]. We can start comparing max(j − j′, e(j′)) from j′ = j − 1, decreasing j′ by one at each step and maintaining e(j) as the minimum value so far. Once the value j − j′ grows bigger than the current e(j), we know that the value of e(j) can no longer decrease. This means we can decrease j′ at most e(j) times. At each decrease of j′, we do m left-extensions (one for each row), and then check for validity by computing the union of the search intervals in O(m log m) time. This gives the claimed running time. Traceback from e(n) gives an optimal repeat-free elastic segmentation.

Let us first consider the non-succinct solution of Sect. 4 when extended to repeat-free elastic founder graphs. Recall that now the node labels inside the same block can be prefixes of one another. The case of short patterns not spanning a single node label is not affected, as the search would work correctly on any founder graph. Consider the case of long patterns, where the Aho–Corasick search is used for finding the candidate matches. Rather than picking an arbitrary candidate match, we select the one Q[i..j] with maximum j. We can then do the left-extensions to read Q[1..i−1] just like before using the tries: No node label can be a suffix of another node label (even inside the same block), and hence the trie leaves correspond to exactly one row each. The left-extensions are hence not branching, as the search always narrows down to one row (leaf) before continuing on the next trie. The right-extensions to read Q[j+1..q] do not reach a leaf of the corresponding trie (as we chose the candidate match to maximize j), so we do not need to worry about the possible branching caused by common prefixes.

The succinct solution in Sect. 6 is affected minimally: It suffices to index the reversed concatenation D = Π_{i ∈ {1,2,...,b}} Π_{v ∈ V_i, (v,w) ∈ E} ℓ(w)^r ℓ(v)^r and do the expanded backward search reading the query from left to right. Here X^r denotes the string x_m x_{m−1} ··· x_1 for X = x_1 x_2 ··· x_m. The key feature requiring this reversed direction is that the lexicographic ranges of the suffixes of D starting with ℓ(v)^r for each v ∈ V are distinct; this would not (on all inputs) hold for suffixes starting with ℓ(v) in the concatenation C (of Sect. 6) if ℓ(v) is a prefix of some ℓ(u), as it can be in the elastic case. The correctness follows verbatim with the reversed indexing and search direction, as the derivation in Sect. 6 only exploits this distinctness property.

The expansion procedure to make all nodes atomic needs to be modified slightly. The breadth-first iteration is designed to iterate the nodes in a topological order, but a breadth-first order is no longer necessarily topological in the new setting. We need to iterate in linear time in some topological order instead. It is straightforward to verify that Lemmas 7 and 8 still apply.

The simple column-by-column minimization algorithm might now fail to merge two Myhill–Nerode equivalent nodes from different columns. The general Wheeler DFA minimization algorithm of Alanko et al. [18] can be used instead, as it works for any Wheeler DFA.
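Returning to the validity test used in Theorem 5: it reduces to computing the size of a union of m intervals, and the endpoint-sorting method mentioned there can be sketched as follows (a hypothetical helper, O(m log m)):

```python
def union_size(intervals):
    """Size of the union of closed integer intervals [lo, hi], by sorting
    endpoints and scanning while maintaining the active-interval count."""
    events = []
    for lo, hi in intervals:
        events.append((lo, 0))          # interval opens at lo
        events.append((hi + 1, 1))      # and is closed after hi
    events.sort()                       # at ties, opens (0) come before closes (1)
    active = total = 0
    prev = None
    for pos, kind in events:
        if active > 0:                  # positions [prev, pos) are covered
            total += pos - prev
        active += 1 if kind == 0 else -1
        prev = pos
    return total
```

In the validity check, the range in consideration is valid exactly when the union of the m BWT intervals has size m.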
We show a reduction from the Orthogonal Vectors (OV) problem to the problem of matching a query string in an elastic founder graph. The OV problem asks, given two sets X and Y of n vectors each, whether there exist x ∈ X and y ∈ Y such that x · y = 0. We construct a string Q using X and a graph G using Y. Then, we show that Q has a match in G if and only if X and Y form a "yes"-instance of OV. We first construct string Q, and later we describe how to obtain graph G.

We build string Q by combining string gadgets Q_1, ..., Q_n, one for each vector in X, plus some additional characters. To build string Q_i, first we place four b characters, then we scan vector x_i ∈ X from left to right. When we encounter a 0, we place four 0 characters, and when we encounter a 1, we place four 1 characters. Finally, we place four e characters. To give an example, vector x_i = 101 results in string Q_i = bbbb111100001111eeee. The full string Q is then the concatenation Q = bbbb Q_1 Q_2 ... Q_n eeee. The reason behind these specific quantities will become clear when discussing the structure of the graph.

Figure 6: Sub-graph G_L. The last segment belongs to sub-graph G_M and shows the connection.
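Under the encoding just described, building Q is mechanical; the following sketch mirrors the gadget definition (build_Q is a hypothetical name, not from the paper's implementation):

```python
def build_Q(X):
    """Encode vector set X as the reduction's query string:
    each vector becomes bbbb, then 0000/1111 per bit, then eeee;
    the whole concatenation is wrapped in bbbb ... eeee."""
    def gadget(x):
        return "bbbb" + "".join("1111" if bit else "0000" for bit in x) + "eeee"
    return "bbbb" + "".join(gadget(x) for x in X) + "eeee"

print(build_Q([(1, 0, 1)]))  # bbbbbbbb111100001111eeeeeeee
```

For x = 101 this reproduces the gadget bbbb111100001111eeee from the text, wrapped in the outer bbbb and eeee runs.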
We build graph G by combining three different sub-graphs: G_L, G_M and G_R (for left, middle and right). We now describe the structure of each of these sub-graphs, first introducing their nodes, and then explaining how to connect them with edges. Our final goal is to build a graph structured in three logical rows. The first and the third rows are able to match any pattern, while the second matches only sub-patterns encoding vectors of set X that are orthogonal to at least one vector of set Y. The key is to structure the graph such that the pattern is forced to utilize the second row to obtain a full match.

Sub-graph G_L (Figure 6) consists of a starting segment with a single node labeled b, followed by n − 1 parts G_L^1, ..., G_L^{n−1}, in this order. Each G_L^i has d + 2 segments, and is obtained as follows. First, we place a segment containing only one node with label b, then we place d other segments, each one containing two nodes with labels 1111 and 0000. Finally, we place a segment containing two nodes with labels b and e. The nodes in each segment are connected to all nodes in the next segment, with the exception of the last segment of each G_L^i: in this case, the node with label 1111 and the one with label 0000 are connected only to the e-node of the next (and last) segment of that G_L^i.

Sub-graph G_R (Figure 8) is similar to sub-graph G_L, and it consists of n − 1 parts G_R^1, ..., G_R^{n−1}, followed by a segment with a single node labeled e. Part G_R^i has d + 2 segments, and is constructed almost identically to G_L^i. The differences are that, in the first segment of G_R^i, we place two nodes labeled b and e, while in the last segment we place only one node, which we label e. As in G_L, the nodes in each segment are connected to all nodes in the next segment, with the exception of the first segment of each G_R^i: in this case, the node labeled e has no outgoing edge.
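The repeated "connect all nodes of a segment to all nodes of the next segment" step can be sketched as plain edge enumeration (toy node ids and a hypothetical helper name, not the paper's construction code):

```python
def connect_segments(segments):
    """Given consecutive segments (lists of node ids), add every
    edge from each node of a segment to each node of the next one."""
    edges = set()
    for cur, nxt in zip(segments, segments[1:]):
        for u in cur:
            for v in nxt:
                edges.add((u, v))
    return edges

# three segments: one b-node, a {1111, 0000} pair, a {b, e} pair
print(sorted(connect_segments([[0], [1, 2], [3, 4]])))
# [(0, 1), (0, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
```

The exceptions described in the text (the last segment of each G_L^i and the first segment of each G_R^i) would then be handled by removing or withholding the corresponding edges.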
Figure 7: Sub-graph G_M for vectors y_1 = 100, y_2 = 011 and y_3 = 010. The dashed rectangles highlight the single G_M^j gadgets.

Figure 8: Sub-graph G_R. The first segment belongs to sub-graph G_M and shows the connection.

Sub-graph G_M (Figure 7) implements the main logic of the reduction, and it uses three building blocks, G_be, G_0 and G_1, which are organized in three rows, as shown in Figure 9. Sub-graph G_M has n parts, G_M^1, ..., G_M^n, one for each of the vectors y_1, ..., y_n in set Y. Each G_M^j is constructed, from left to right, as follows. First, we place a G_be gadget. Then, we scan vector y_j from left to right and, for each position h ∈ {1, ..., d}, we place a G_0 gadget if the h-th entry is y_j[h] = 0, or a G_1 gadget if y_j[h] = 1. Finally, we place another G_be gadget.

Figure 9: Gadgets G_be, G_0 and G_1. Each one of these three gadgets is organized in three rows.

For the edges, we first consider each gadget G_M^j separately. Let G_h and G_{h+1} be the gadgets encoding y_j[h] and y_j[h+1], respectively. We fully connect the nodes of G_h to the nodes of G_{h+1} row by row, respecting the structure of the segments. Then we connect, row by row, the b-nodes of the left G_be to the leftmost G_h, which encodes y_j[1], and the nodes of the rightmost G_h, which encodes y_j[d], to the e-nodes of the right G_be, again row by row. We repeat the same placement of the edges for every pair G_h, G_{h+1}, 1 ≤ h ≤ d − 1; this construction is shown in Figure 7.

To conclude the construction of G_M, we need to connect all the G_M^j gadgets together. Consider G_r^j, the right G_be of gadget G_M^j, and G_l^{j+1}, the left G_be of gadget G_M^{j+1}. The edges connecting these two gadgets are depicted in Figure 7, which shows that following a path we can either remain in the same row or move to the row below, but we cannot move to the row above. Moreover, sub-pattern bbbb can be matched only in the first and second rows, while sub-pattern eeee only in the second and third rows.

The final graph G is obtained by combining sub-graphs G_L, G_M and G_R. To this end, we connect the nodes in the last segment of G_L with the b-nodes in the first and second row of the left G_be gadget of G_M^1. Finally, we connect the e-nodes in the second and third row of the right G_be gadget of G_M^n with both the b-node and e-node in the first segment of G_R. Figures 6, 7 and 8 can be visualized together, in this order, as one big picture of the final graph G. In Figures 6 and 8 we placed the adjacent segment of G_M to show the connection.

As the reader can verify, the presented construction is a linear independent component (LIC) reduction [20], and hence it follows that one cannot index generic elastic founder graphs in polynomial time for efficient string search, unless the
Orthogonal Vectors Hypothesis or the
Strong Exponential Time Hypothesis (SETH) is false [20]. Thus, a property like repeat-freeness is required.
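For reference, the OV question that the matching instance encodes admits a trivial O(n²·d)-time check; the point of the reduction is that a fast index over elastic founder graphs would yield a truly sub-quadratic OV algorithm, contradicting the hypotheses above. A naive sketch:

```python
from itertools import product

def has_orthogonal_pair(X, Y):
    """Naive OV check: is there x in X and y in Y with x . y = 0?"""
    return any(all(a * b == 0 for a, b in zip(x, y))
               for x, y in product(X, Y))

print(has_orthogonal_pair([(1, 0, 1)], [(0, 1, 0)]))  # True: dot product is 0
```

In the reduction, this answer is "yes" exactly when the query Q built from X has a full match through the middle row of the graph built from Y.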
10 Implementation
We implemented the construction and indexing of repeat-free (elastic) founder graphs. The implementation is available at https://github.com/algbio/founderblockgraphs . Some proof-of-concept experiments can be found in the conference version of this paper [35].
Acknowledgments
This work was partly funded by the Academy of Finland (grants 309048, 322595 and 328877), Helsinki Institute for Information Technology (HIIT), and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO).
References

[1] Computational Pan-Genomics Consortium, "Computational pan-genomics: status, promises and challenges," Briefings in Bioinformatics, p. bbw089, 2016.

[2] T. Norri, B. Cazaux, D. Kosolobov, and V. Mäkinen, "Linear time minimum segmentation enables scalable founder reconstruction," Algorithms Mol. Biol., vol. 14, no. 1, pp. 12:1–12:15, 2019.

[3] B. Cazaux, D. Kosolobov, V. Mäkinen, and T. Norri, "Linear time maximum segmentation problems in column stream model," in String Processing and Information Retrieval - 26th International Symposium, SPIRE 2019, Segovia, Spain, October 7-9, 2019, Proceedings, ser. Lecture Notes in Computer Science, N. R. Brisaboa and S. J. Puglisi, Eds., vol. 11811. Springer, 2019, pp. 322–336.

[4] E. Ukkonen, "Finding founder sequences from a set of recombinants," in Algorithms in Bioinformatics, Second International Workshop, WABI 2002, Rome, Italy, September 17-21, 2002, Proceedings, 2002, pp. 277–286.

[5] P. Rastas and E. Ukkonen, "Haplotype inference via hierarchical genotype parsing," in Algorithms in Bioinformatics, 7th International Workshop, WABI 2007, Philadelphia, PA, USA, September 8-9, 2007, Proceedings, 2007, pp. 85–97.

[6] M. Alzamel, L. A. K. Ayad, G. Bernardini, R. Grossi, C. S. Iliopoulos, N. Pisanti, S. P. Pissis, and G. Rosone, "Degenerate string comparison and applications," in 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, ser. LIPIcs, L. Parida and E. Ukkonen, Eds., vol. 113. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018, pp. 21:1–21:14.

[7] G. Bernardini, P. Gawrychowski, N. Pisanti, S. P. Pissis, and G. Rosone, "Even faster elastic-degenerate string matching via fast matrix multiplication," in 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, ser. Leibniz International Proceedings in Informatics (LIPIcs), C. Baier, I. Chatzigiannakis, P. Flocchini, and S. Leonardi, Eds., vol. 132. Dagstuhl, Germany: Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019, pp. 21:1–21:15. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2019/10597

[8] J. Sirén, N. Välimäki, and V. Mäkinen, "Indexing graphs for path queries with applications in genome research," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 2, pp. 375–388, 2014.

[9] M. Burrows and D. Wheeler, "A block-sorting lossless data compression algorithm," Digital Equipment Corporation, Tech. Rep. 124, 1994.

[10] C. Thachuk, "Indexing hypertext," Journal of Discrete Algorithms, vol. 18, pp. 113–122, 2012.

[11] L. Huang, V. Popic, and S. Batzoglou, "Short read alignment with populations of genomes," Bioinformatics, vol. 29, no. 13, pp. 361–370, 2013.

[12] S. Maciuca, C. del Ojo Elias, G. McVean, and Z. Iqbal, "A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference," in Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings, ser. Lecture Notes in Computer Science, vol. 9838. Springer, 2016, pp. 222–233.

[13] E. Garrison, J. Sirén, A. Novak, G. Hickey, J. Eizenga, E. Dawson, W. Jones, S. Garg, C. Markello, M. Lin, and B. Paten, "Variation graph toolkit improves read mapping by representing genetic variation in the reference," Nature Biotechnology, vol. 36, 2018.

[14] D. Kim, J. Paggi, C. Park, C. Bennett, and S. Salzberg, "Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype," Nature Biotechnology, vol. 37, 2019.

[15] J. Sirén, E. Garrison, A. M. Novak, B. Paten, and R. Durbin, "Haplotype-aware graph indexes," Bioinformatics, vol. 36, no. 2, pp. 400–407, 2019.

[16] T. Gagie, G. Manzini, and J. Sirén, "Wheeler graphs: A framework for BWT-based data structures," Theor. Comput. Sci., vol. 698, pp. 67–78, 2017.

[17] D. Gibney and S. V. Thankachan, "On the hardness and inapproximability of recognizing Wheeler graphs," in 27th Annual European Symposium on Algorithms, ESA 2019, ser. LIPIcs, M. A. Bender, O. Svensson, and G. Herman, Eds., vol. 144. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019, pp. 51:1–51:16.

[18] J. Alanko, G. D'Agostino, A. Policriti, and N. Prezza, "Regular languages meet prefix sorting," in Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, S. Chawla, Ed. SIAM, 2020, pp. 911–930.

[19] M. Equi, R. Grossi, V. Mäkinen, and A. I. Tomescu, "On the complexity of string matching for graphs," in 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, ser. LIPIcs, C. Baier, I. Chatzigiannakis, P. Flocchini, and S. Leonardi, Eds., vol. 132. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019, pp. 55:1–55:15.

[20] M. Equi, V. Mäkinen, and A. I. Tomescu, "Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails," in SOFSEM 2021: Theory and Practice of Computer Science - 47th International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2021, Bolzano-Bozen, Italy, January 25-29, 2021, Proceedings, ser. Lecture Notes in Computer Science, T. Bures, R. Dondi, J. Gamper, G. Guerrini, T. Jurdzinski, C. Pahl, F. Sikora, and P. W. H. Wong, Eds., vol. 12607. Springer, 2021, pp. 608–622. [Online]. Available: https://doi.org/10.1007/978-3-030-67731-2_44

[21] A. Bowe, T. Onodera, K. Sadakane, and T. Shibuya, "Succinct de Bruijn graphs," in Algorithms in Bioinformatics - 12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, ser. Lecture Notes in Computer Science, B. J. Raphael and J. Tang, Eds., vol. 7534. Springer, 2012, pp. 225–235. [Online]. Available: https://doi.org/10.1007/978-3-642-33122-0_18

[22] C. Boucher, A. Bowe, T. Gagie, S. J. Puglisi, and K. Sadakane, "Variable-order de Bruijn graphs," in 2015 Data Compression Conference, DCC 2015, Snowbird, UT, USA, April 7-9, 2015, A. Bilgin, M. W. Marcellin, J. Serra-Sagristà, and J. A. Storer, Eds. IEEE, 2015, pp. 383–392. [Online]. Available: https://doi.org/10.1109/DCC.2015.70

[23] D. Belazzougui, T. Gagie, V. Mäkinen, M. Previtali, and S. J. Puglisi, "Bidirectional variable-order de Bruijn graphs," Int. J. Found. Comput. Sci., vol. 29, no. 8, pp. 1279–1295, 2018. [Online]. Available: https://doi.org/10.1142/S0129054118430037

[24] R. De La Briandais, "File searching using variable length keys," in Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, ser. IRE-AIEE-ACM '59 (Western). New York, NY, USA: Association for Computing Machinery, 1959, pp. 295–298. [Online]. Available: https://doi.org/10.1145/1457838.1457895

[25] A. V. Aho and M. J. Corasick, "Efficient string matching: An aid to bibliographic search," Commun. ACM, vol. 18, no. 6, pp. 333–340, 1975.

[26] U. Manber and E. W. Myers, "Suffix arrays: A new method for on-line string searches," SIAM J. Comput., vol. 22, no. 5, pp. 935–948, 1993. [Online]. Available: https://doi.org/10.1137/0222058

[27] T. Schnattinger, E. Ohlebusch, and S. Gog, "Bidirectional search in a string with wavelet trees and bidirectional matching statistics," Inf. Comput., vol. 213, pp. 13–22, 2012.

[28] D. Belazzougui, F. Cunial, J. Kärkkäinen, and V. Mäkinen, "Linear-time string indexing and analysis in small space," ACM Trans. Algorithms, vol. 16, no. 2, Mar. 2020. [Online]. Available: https://doi.org/10.1145/3381417

[29] D. Belazzougui and F. Cunial, "Fully-functional bidirectional Burrows-Wheeler indexes and infinite-order de Bruijn graphs," in 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, ser. LIPIcs, N. Pisanti and S. P. Pissis, Eds., vol. 128. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019, pp. 10:1–10:15.

[30] G. Jacobson, "Space-efficient static trees and graphs," in Proc. FOCS, 1989, pp. 549–554.

[31] D. Belazzougui, F. Cunial, J. Kärkkäinen, and V. Mäkinen, "Versatile succinct representations of the bidirectional Burrows-Wheeler transform," in European Symposium on Algorithms. Springer, 2013, pp. 133–144.

[32] R. Grossi, A. Gupta, and J. S. Vitter, "High-order entropy-compressed text indexes," in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA. ACM/SIAM, 2003, pp. 841–850. [Online]. Available: http://dl.acm.org/citation.cfm?id=644108.644250

[33] F. Cunial, J. Alanko, and D. Belazzougui, "A framework for space-efficient variable-order Markov models," Bioinformatics, vol. 35, no. 22, pp. 4607–4616, 2019.

[34] H. N. Gabow, J. L. Bentley, and R. E. Tarjan, "Scaling and related techniques for geometry problems," in Proceedings of the 16th Annual ACM Symposium on Theory of Computing, April 30 - May 2, 1984, Washington, DC, USA, R. A. DeMillo, Ed. ACM, 1984, pp. 135–143. [Online]. Available: https://doi.org/10.1145/800057.808675

[35] V. Mäkinen, B. Cazaux, M. Equi, T. Norri, and A. I. Tomescu, "Linear time construction of indexable founder block graphs," in 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference)