Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails

Abstract

We consider the following string matching problem on a node-labeled graph G=(V,E): given a pattern string P, decide whether there exists a path in G whose concatenation of node labels equals P. This is a basic primitive in various problems in bioinformatics, graph databases, or networks. The hardness results of Backurs and Indyk (FOCS 2016) imply that this problem cannot be solved in better than O(|E||P|) time, under the Orthogonal Vectors Hypothesis (OVH), and this holds even under various restrictions on the graph (Equi et al., ICALP 2019). In this paper we consider its offline version, namely the one in which we are allowed to index the graph in order to support time-efficient string matching queries. Indeed, it was tantalizing in the string matching community to believe that sub-quadratic time queries can be achieved, e.g. at the cost of a high-degree polynomial-time indexing. We disprove this belief, showing that, under OVH, no polynomial-time index can support querying P in time $O(|E|^{\delta}|P|^{\beta})$, with either δ<1 or β<1. We prove this tight bound employing a known self-reducibility technique, e.g. from the field of dynamic algorithms, which translates conditional lower bounds for an online problem to its offline version. As a side-contribution, we formalize this technique with the notion of linear independent-components reduction, allowing for a simple proof of our result. As another illustration of our technique, we also translate the quadratic conditional lower bound of Backurs and Indyk (STOC 2015) for the problem of matching a query string inside a text, under edit distance. We obtain an analogous tight quadratic lower bound for its offline version, improving the recent result of Cohen-Addad, Feuilloley and Starikovskaya (SODA 2019), but with a slightly different boundary condition.


Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu
Department of Computer Science, University of Helsinki
{massimo.equi,veli.makinen,alexandru.tomescu}@helsinki.fi
March 5, 2020

Introduction

The String Matching in Labeled Graphs (SMLG) problem is defined as follows.

Problem 1 (SMLG).
Input: A directed graph $G = (V, E, \ell)$, where each node $v \in V$ is labeled by a character $\ell(v)$, and a pattern string $P$.
Output: True if and only if there is a path $(v_1, v_2, \ldots, v_{|P|})$ in $G$ such that $P[i] = \ell(v_i)$ holds for all $1 \le i \le |P|$.

This is a natural generalization of the problem of matching a string inside a text, and it is a primitive in various problems in computational biology, graph databases, and graph mining. In genome research, the very first step of many standard analysis pipelines of high-throughput sequencing data is nowadays to align sequenced fragments of DNA on a labeled graph (a so-called pan-genome) that encodes all genomes of a population [41, 16, 32, 26]. In graph databases, query languages provide the user with the ability to select paths based on the labels of their nodes or edges [9, 24, 40, 37]. In graph mining, this is a basic ingredient related to computing graph kernels [30] or node similarity [17].

The SMLG problem can be solved in time $O(|V| + |E||P|)$ [7] in the comparison model. On acyclic graphs, bit-parallelism can be used for improving the time to $O(|V| + |E|\lceil |P|/w \rceil)$ [38] in the RAM model with word size $w = \Theta(\log |E|)$. It remained an open question whether a truly sub-quadratic time algorithm for it exists. However, the recent conditional lower bounds by Backurs and Indyk [11] for regular expression matching imply that the SMLG problem cannot be solved in sub-quadratic time, unless the so-called Orthogonal Vectors Hypothesis (OVH) is false. This result was strengthened by Equi et al. [20] by showing that the problem remains quadratic under OVH also on directed acyclic graphs (DAGs) that are even deterministic, in the sense that for every node, the labels of its out-neighbors are all distinct.

As mentioned above, in real-world applications one usually considers the offline version of the SMLG problem. Namely, we are allowed to index the labeled graph so that we can query for pattern strings in possibly sub-quadratic time. In the case when the graph is just a labeled (directed) path, the problem asks about indexing a text string, which is a fundamental problem in string matching. There exists a variety of indexes constructable in linear time supporting linear-time queries [18]. The same holds also when the graph is a tree [23]. A trivial indexing scheme for arbitrary graphs is to enumerate all the possibly exponentially many paths of the graph and index those as strings. So a natural question is whether we can at least index the graph in polynomial time to support sub-quadratic time queries. Note that the conditional lower bound for the online problem naturally refutes the possibility of an index constructable in sub-quadratic time to support sub-quadratic time queries. Even before the OVH-based reductions, another weak lower bound was known to hold conditioned on the hardness of indexing for set intersection queries [12] (see also Table 1). We discuss this connection to the Set Intersection Conjecture (SIC) [36, 28] in Appendix A.

The connections to SIC and to OVH constrain the possible construction and query time tradeoffs for SMLG, but they are not yet strong enough to prove the impossibility of building an index in polynomial time such that queries could be sub-quadratic, or even take time say $O(|E|^{1/2}|P|^{2})$. This would be a significant result. In fact, given the wide applicability of this problem, there have been many attempts to obtain such indexing schemes. Sirén, Välimäki, and Mäkinen [43] proposed an extension of the Burrows-Wheeler transform [14] for prefix-sorted graphs. Standard indexing techniques [29, 22, 35] can be applied on such a generalized Burrows-Wheeler transform to support linear time pattern search. The bottleneck of the approach is the prefix-sorting step, which requires finding shortest prefixes for all paths such that they distinguish the nodes from each other. The size of the transform is still exponential in the worst case. However, unlike the trivial indexing scheme, it is linear in the best case, and also linear in the average case under a realistic model for genomics applications [43]. There have been some advances in making the approach more practical [42, 32, 26], but the exponential bottleneck has remained. Since in real-world scenarios approximate search is required on the graph, there have also been advances in expanding sparse dynamic programming and chaining algorithms [33], as well as the seed-and-extend strategy [19, 39] to this setting.

The concept of prefix-sorted graphs was later formalized into a more general concept of Wheeler graphs [25]: conceptually these are a class of graphs that admit a generalization of the Burrows-Wheeler transform, and thus an index of size linear in the size of the graph, supporting string search in linear time in the size of the query pattern. Gibney and Thankachan showed that the Wheeler graph recognition problem is NP-complete [27]. Alanko et al. [5] give polynomial time solutions on some special cases and improve the prefix-sorting algorithm to work in near-optimal time in the size of the output. They also give an example where such output can be of exponential size even for acyclic deterministic finite automata (acyclic DFA). One could conjecture that conversion of a graph into an equivalent Wheeler graph is equally hard as indexing a graph for linear time string search, but as far as we know, such an equivalence result has not yet been established. Therefore the hardness of indexing graphs is largely still open.

In this paper we refute the existence of such a polynomial indexing scheme for graphs, under OVH. This contributes to a growing number of conditional lower bounds for offline string problems, such as the one for indexed jumbled pattern matching [6], conditioned on 3SUM-hardness, and the one for indexed approximate pattern matching under κ differences [15], conditioned on SETH.

Our result holds even for deterministic DAGs with labels from a binary alphabet. By introducing a super-source connected to all source nodes and moving labels to incoming edges, such graphs can be interpreted as acyclic non-deterministic finite automata (acyclic NFA) whose only non-deterministic state is the start state. It follows that determinisation of such simple NFAs cannot be done in polynomial time unless OVH is false (and thus also unless SETH is false). This corollary complements the current picture of the exponential gap between NFAs and DFAs.

Table 1 and Figure 1 summarize the complexity landscape around offline SMLG.

Graph              Indexing time        Query time                      Reference, Year
path               O(|E|)               O(|P|)                          classical [18]
tree               O(|E|)               O(|P|)                          [23], 2009
Wheeler graph      O(|E|)               O(|P|)                          [43, 25], 2014
DAG                O(|E|^α), α < 2      f(|P|)                          impossible under SIC [12], 2013
arbitrary          O(|E|^α), α ≤ δ      O(|E|^δ |P|^β), δ + β < 2       impossible under OVH [11], 2016
deterministic DAG  O(|E|^α), α ≤ δ      O(|E|^δ |P|^β), δ + β < 2       impossible under OVH [20], 2019
deterministic DAG  O(|E|^α), α ∈ R      O(|E|^δ |P|^β), δ + β < 2       impossible under OVH, this paper
arbitrary          O(|E|^α), α ∈ R      O(|E|^δ |P|^β), δ < 1 or β < 1  impossible under OVH, this paper

Table 1: Upper bounds (first three rows) and conditional lower bounds for offline SMLG on a graph G = (V, E) and a pattern P. On the fourth line, f(·) is an arbitrary function.

Figure 1 (two plots with axes δ and β, each marking the online O(|E||P|) algorithm; panel (a): α ≤ δ, panel (b): α ∈ R): The dashed areas of the plots represent the forbidden values of δ and β for $O(|E|^{\delta}|P|^{\beta})$-time queries, under OVH. Figure 1a shows the lower bound that follows from the online case [11, 20], and holds for α ≤ δ. Figure 1b depicts our lower bounds (tight, thanks to the online O(|E||P|)-time algorithm from [8]). In addition, these hold for any value of α.

In the Orthogonal Vectors (OV) problem we are given two sets $X, Y \subseteq \{0,1\}^d$ such that $|X| = |Y| = N$ and $d = \omega(\log N)$, and we need to decide whether there exist $x \in X$ and $y \in Y$ such that x and y are orthogonal, namely, $x \cdot y = 0$. OVH states that for any constant ε > 0, no algorithm can solve OV in time $O(N^{2-\varepsilon}\,\mathrm{poly}(d))$. Notice that the better known Strong Exponential Time Hypothesis (SETH) [31] implies OVH [44], so all our lower bounds hold also under SETH.

Our results are obtained using a technique used for example in the field of dynamic algorithms, see e.g. [4, 2]. Recall the reduction from k-SAT to OV from [44]: the n variables of the formula φ are split into two groups of n/2 variables each, all $2^{n/2}$ Boolean assignments are generated for each group, and these induce two sets X and Y of size $N = 2^{n/2}$ each, such that OV returns 'yes' on X and Y if and only if φ is satisfiable. Suppose one could index X to support $O(M^{2-\varepsilon}\,\mathrm{poly}(d))$-time queries for any set Y of M vectors, for some ε > 0. One can now adjust the splitting of the variables based on the hypothetical ε: the first part (corresponding to X) has $n\delta_{\varepsilon}$ variables, and the other part (corresponding to Y) has $n(1-\delta_{\varepsilon})$ variables. We can choose a $\delta_{\varepsilon}$ depending on ε such that querying each vector in Y against the index on X takes overall time $O(2^{n(1-\gamma)})$, for some γ > 0, contradicting SETH.

In this paper, instead of employing this technique inside the reduction for offline SMLG (as done in previous applications of this technique), we formalize the reason why it works through the notion of a linear independent-components reduction (lic). Such a reduction allows us to immediately argue that if a problem A is hard to index, and we have a lic reduction from A to B, then also B is hard to index (Lemma 1). Since OV is hard to index, it follows simply as a corollary that any problem to which OV reduces is hard to index. In order to get the best possible result for SMLG, we also show that a generalized version of OV is hard to index (Theorem 5). As such, we upgrade the idea of an "adjustable splitting" of the variables from a technique to a directly transferable result, once a lic reduction is shown to exist.

Examples of problems to which a lic reduction could be applied are those that arise from computing a distance between two elements. Popular examples are edit distance, dynamic time warping distance (DTWD), Fréchet distance, and longest common subsequence. All these problems have been shown to require quadratic time to be solved under OVH. The reductions proving these lower bounds for DTWD [1] and Fréchet distance [13] are indeed lic reductions, hence these problems automatically obtain a lower bound also for their offline version. More specifically, OVH implies that we cannot preprocess the first input of a DTWD or Fréchet distance problem in polynomial time and provide sub-quadratic time queries for the second input.

On the other hand, the final sequences used in the reductions for edit distance [10] and longest common subsequence [1] present some dependencies within each other, thus they would need to be slightly tweaked to make the definition of lic reduction apply. These cross dependencies only concern the size of the gadgets used in the reductions and not their structural properties, hence we are confident that the modifications needed to such gadgets require only a marginal effort.

To easily explain this idea and to better understand the utility of a lic reduction, let us consider edit distance. In a common offline variation of this problem, we are required to build a data structure for a long string T such that one can decide if a given query string P is within edit distance κ from a substring of T. It suffices to observe that, in the reduction of Backurs and Indyk [10] from OV to edit distance, this problem is utilized as an intermediate step, and up to this point the reduction from OV is indeed a lic reduction (see Section 2.2). Hence, we immediately obtain the following result.

Theorem 1. For any $\alpha > 0$, $\beta \ge 1$, and $\delta > 0$ such that $\beta + \delta < 2$, there is no algorithm preprocessing a string T in time $O(|T|^{\alpha})$, such that for any pattern string P we can find a substring of T at minimum edit distance with P, in time $O(|T|^{\delta}|P|^{\beta})$, unless OVH is false.
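As a point of reference for Theorem 1, the following is a minimal sketch of the classical quadratic dynamic program that computes, online and without any index, the minimum edit distance between P and any substring of T. It is not taken from the paper; the function name `min_substring_edit_distance` is an illustrative assumption, and unit-cost insertions, deletions and substitutions are assumed.

```python
def min_substring_edit_distance(text: str, pattern: str) -> int:
    """Minimum unit-cost edit distance between `pattern` and any substring of `text`.

    Runs in O(|text| * |pattern|) time and O(|pattern|) space.
    """
    m = len(pattern)
    # prev[i] = cheapest alignment of pattern[:i] with a substring of the
    # text ending at the previous text position (initially: the empty text).
    prev = list(range(m + 1))
    best = prev[m]  # aligning the whole pattern with the empty substring
    for ch in text:
        curr = [0] * (m + 1)  # curr[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitute
                prev[i] + 1,         # text character left unmatched
                curr[i - 1] + 1,     # pattern character left unmatched
            )
        best = min(best, curr[m])
        prev = curr
    return best
```

Under OVH, Theorem 1 says that no polynomial-time preprocessing of `text` can bring the per-query cost of this computation below the quadratic regime.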

This bound is tight because for δ + β = 2 there is a matching online algorithm [34]. Theorem 1 also strengthens the recent result of Cohen-Addad, Feuilloley and Starikovskaya [15], stating that an index built in polynomial time cannot support queries for approximate string matching in $O(|T|^{\delta})$ time, for any δ < 1, unless SETH is false. However, the boundary condition is different, since in their case $\kappa = O(\log |T|)$, while in our case $\kappa = \Theta(|P|)$.

Our approach for the SMLG problem is similar. In Section 3 we revisit the reduction from [21] and observe that it is a lic reduction. As such, we can immediately obtain the following result.

Theorem 2. For any $\alpha > 0$, $\beta \ge 1$, and $\delta > 0$ such that $\beta + \delta < 2$, there is no algorithm preprocessing a labeled graph $G = (V, E, \ell)$ in time $O(|E|^{\alpha})$ such that for any pattern string P we can solve the SMLG problem on G and P in time $O(|E|^{\delta}|P|^{\beta})$, unless OVH is false. This holds even if restricted to a binary alphabet, and to deterministic DAGs in which the sum of out-degree and in-degree of any node is at most three.

This lower bound is tight because for δ + β = 2 there is a matching online algorithm [7]. However, this bound does not disprove a hypothetical polynomial indexing algorithm with query time $O(|E|^{\delta}|P|^{2})$, for some 0 < δ < 1. Since graphs in practical applications are much larger than the pattern, such an algorithm would be quite significant. However, when the graph is allowed to have cycles, we also show that this is impossible under OVH.

(Footnote: We implicitly assumed here that the graph G is the part of the input on which to build the index, because it is the first input to SMLG. However, by exchanging G and P, it trivially holds that we also cannot polynomially index a pattern string P to support fast queries in the form of a labeled graph.)

Theorem 3. For any $\alpha > 0$, $\beta \ge 1$, and $0 < \delta < 1$, there is no algorithm preprocessing a labeled graph $G = (V, E, \ell)$ in time $O(|E|^{\alpha})$ such that for any pattern string P we can solve the SMLG problem on G and P in time $O(|E|^{\delta}|P|^{\beta})$, unless OVH is false.

We obtain Theorem 3 by slightly modifying the reduction of [21] with the introduction of certain cycles, which are necessary to allow for query patterns longer than the graph size. We leave as an open question whether the lower bound from Theorem 3 holds also for DAGs.
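For orientation, the online quadratic baseline against which these lower bounds are measured can be sketched as follows. This is a minimal illustration and not the algorithm of the cited reference; the name `smlg_match` and the input encoding (a dict of node labels plus a list of directed edges) are assumptions made here. Nodes may be revisited, so on graphs with cycles the pattern may be longer than the graph, and the empty pattern is treated as matching by convention.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def smlg_match(
    labels: Dict[Hashable, str],
    edges: Iterable[Tuple[Hashable, Hashable]],
    pattern: str,
) -> bool:
    """Decide whether some path in the graph spells `pattern` with its node labels.

    Each pattern character triggers one scan of the out-edges of the current
    frontier, giving O(|V| + |E| * |P|) time overall.
    """
    if not pattern:
        return True  # convention assumed here: the empty pattern always matches
    successors: Dict[Hashable, List[Hashable]] = {v: [] for v in labels}
    for u, v in edges:
        successors[u].append(v)
    # frontier = nodes at which a path spelling the current pattern prefix can end
    frontier: Set[Hashable] = {v for v, c in labels.items() if c == pattern[0]}
    for ch in pattern[1:]:
        frontier = {
            v for u in frontier for v in successors[u] if labels[v] == ch
        }
        if not frontier:
            return False
    return bool(frontier)
```

On a two-node cycle labeled `a` and `b`, for instance, the pattern `ababab` matches even though it is longer than the graph, which is the situation the added cycles are meant to allow.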

Open Problem 1. Do there exist $\alpha > 0$, $\beta \ge 1$, $0 < \delta < 1$, and an algorithm preprocessing a labeled (deterministic) DAG $G = (V, E, \ell)$ in time $O(|E|^{\alpha})$ such that for any pattern string P we can solve the SMLG problem on G and P in time $O(|E|^{\delta}|P|^{\beta})$?

All problems considered in this paper are such that their input is naturally partitioned in two. For a problem P, we will denote by $P_X \times P_Y$ the set of all possible inputs for P. For a particular input $(p_x, p_y) \in P_X \times P_Y$, we will denote by $|p_x|$ and $|p_y|$ the length of each of $p_x$ and $p_y$, respectively. Intuitively, $p_x$ represents what we want to build the index on, while $p_y$ is what we want to query for. We start by formalizing the concept of indexability.

Definition 1 (Indexability). Problem P is (I, Q)-indexable if for every $p_x \in P_X$ we can preprocess $p_x$ in time $I(|p_x|)$ such that for every $p_y \in P_Y$ we can solve P on $(p_x, p_y)$ in time $Q(|p_x|, |p_y|)$.

We further refine this notion into that of polynomial indexability, by specifying the degree of the polynomial costs of building the index and of performing the queries.

Definition 2 (Polynomial indexability). Problem P is (α, δ, β)-polynomially indexable with parameter k if P is (I, Q)-indexable and $I(|p_x|) = O(k^{O(1)}|p_x|^{\alpha})$ and $Q(|p_x|, |p_y|) = O(k^{O(1)}|p_x|^{\delta}|p_y|^{\beta})$. If k = O(1), then we say that P is (α, δ, β)-polynomially indexable.

The introduction of parameter k is needed to be consistent with OVH, since when proving a lower bound conditioned on OVH, the reduction is allowed to be polynomial in the vector dimension d. As we will see, we will set k = d.

We now introduce linear independent-components reductions, which we show below in Lemma 1 to maintain (α, δ, β)-polynomial indexability.

Definition 3 (lic reduction). Problem A has a linear independent-components (lic) reduction with parameter k to problem B, indicated as $A \le_{lic}^{k} B$, if the following two properties hold:
i) Correctness: There exists a reduction from A to B modeled by functions $r_x$, $r_y$ and $s$. That is, for any input $(a_x, a_y)$ for A, we have $r_x(a_x) = b_x$, $r_y(a_y) = b_y$, $(b_x, b_y)$ is a valid input for B, and s solves A given the output $B(b_x, b_y)$ of an oracle to B, namely $s(B(r_x(a_x), r_y(a_y))) = A(a_x, a_y)$.
ii) Parameterized linearity: Functions $r_x$, $r_y$ and $s$ can be computed in linear time in the size of their input, multiplied by $k^{O(1)}$.

Lemma 1. Given problems A and B and constants $\alpha > 0$, $\delta > 0$, $\beta \ge 1$, if $A \le_{lic}^{k} B$ holds, and B is (α, δ, β)-polynomially indexable, then A is (α, δ, β)-polynomially indexable with parameter k.

Proof. Let $a_x \in A_X$ be the first input of problem A. The linear independent-components reduction computes the first input of problem B as $b_x = r_x(a_x)$ in time $O(k^{O(1)}|a_x|)$. This means that $|b_x| = O(k^{O(1)}|a_x|)$, since the size of the data structure that we build with the reduction cannot be greater than the time spent for performing the reduction itself. Problem B is (α, δ, β)-polynomially indexable, hence we can build an index on $b_x$ in time $O(|b_x|^{\alpha})$ in such a way that we can perform queries for every $b_y$ in time $O(|b_x|^{\delta}|b_y|^{\beta})$. Now given any input $a_y$ for A we can compute its corresponding $b_y = r_y(a_y)$ via the reduction in time $O(k^{O(1)}|a_y|)$ and answer a query for it using the index that we built on $b_x$. Again, notice that $|b_y| = O(k^{O(1)}|a_y|)$. The cost for such a query is $O(k^{O(1)}|a_y| + |b_x|^{\delta}|b_y|^{\beta}) = O(k^{O(1)}|a_y| + k^{O(1)}|a_x|^{\delta}|a_y|^{\beta})$, which, since δ > 0, is the same as $O(k^{O(1)}|a_x|^{\delta}|a_y|^{\beta})$ when β ≥ 1. Notice that the indexing time is $O(|b_x|^{\alpha}) = O(k^{O(1)}|a_x|^{\alpha})$. Hence A is (α, δ, β)-polynomially indexable with parameter k, when β ≥ 1.

We begin by stating, with our formalism, a known strengthening of the hardness of indexing reduction presented at the beginning of Section 1.2 (note that it also follows as a special case of Theorem 5 below).

Theorem 4 (Folklore). If OV is (α, δ, β)-polynomially indexable with parameter d, and β + δ < 2, then OVH fails.

The value of a parameterized lic reduction can now be apprehended: once a parameterized lic reduction is shown to exist, the indexing lower bound follows directly.
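The way Lemma 1 transfers an index through a lic reduction can be sketched in a few lines. This is only an illustration: `index_via_lic` and the toy problems below (case-insensitive word lookup reduced to exact word lookup) are hypothetical and chosen for brevity, not taken from the paper. An "index" is modeled as a function that preprocesses the first input and returns a query function for the second.

```python
from typing import Callable, FrozenSet, List

# An index builder takes the first input and returns a query function.
IndexBuilder = Callable[[object], Callable[[object], object]]


def index_via_lic(r_x, r_y, s, build_index_b: IndexBuilder) -> IndexBuilder:
    """Turn an index for problem B into an index for problem A.

    r_x and r_y map the two inputs of A to the two inputs of B independently,
    and s maps B's answer back to A's answer, as in a lic reduction.
    """
    def build_index_a(a_x):
        query_b = build_index_b(r_x(a_x))  # preprocess b_x = r_x(a_x) once

        def query_a(a_y):
            # Each query pays for r_y, one query on B's index, and s.
            return s(query_b(r_y(a_y)))

        return query_a

    return build_index_a


# Toy problem B (hypothetical): exact membership of a word in a word list.
def build_exact_index(words: List[str]):
    table: FrozenSet[str] = frozenset(words)
    return lambda word: word in table


# Toy problem A (hypothetical): case-insensitive membership, reduced to B by
# lower-casing each input on its own; the answer is passed through unchanged.
build_caseless_index = index_via_lic(
    r_x=lambda words: [w.lower() for w in words],
    r_y=lambda word: word.lower(),
    s=lambda answer: answer,
    build_index_b=build_exact_index,
)
```

The point of the design is that `r_x` never sees the query and `r_y` never sees the indexed input, which is what lets the preprocessing on B be reused across all queries to A.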

Corollary 1.

Any problem P such that OV ≤ dlic P holds is not ( α, δ, β )-polynomially indexable, forany α > , β ≥ , δ > with β + δ < , unless OVH is false.Proof.

Assume by contradiction that P is ( α, δ, β )-polynomially indexable. Apply Lemma 1 toprove that OV is ( α, δ, β )-polynomially indexable with parameter d , and β + δ <

2; this contradictsTheorem 4.For a simple and concrete application of Corollary 1, consider the following problem, in which d ( S , S ) denotes the edit distance between strings S and S . Problem 2 ( PATTERN ) . Input : Two strings T and P . Output : min S substring of T d ( S, P ).Backurs and Indyk [10] reduce OV to PATTERN by constructing a string T based solely on theﬁrst input X to OV and a string P based solely on the second input Y to OV , such that if thereare two orthogonal vectors then the answer to PATTERN on T and P is below a certain value,and if there are not, then the answer is equal to another speciﬁc value. Each of T and P can beconstructed in time O ( d O (1) N ) = O ( d O (1) ( dN )). This is a lic reduction with parameter d . Directlyapplying Corollary 1, we obtain Theorem 1. Corollary 1 will suﬃce to prove Theorem 2. However, in order to prove that no query time O ( | E | δ | P | β ) is possible for any δ <

1, we need a strengthening of Theorem 4. As such, we in-troduce the generalized (

N, M )- Orthogonal Vectors problem, as follows:7 roblem 3 (( N, M )- OV ) . Input : Two sets

X, Y ⊆ { , } d , such that | X | = N and | Y | = M . Output : T rue if and only if there exists ( x, y ) ∈ X × Y such that x · y = 0.The theorem below is the desired generalization of Theorem 4, since it implies, for example,that we cannot have O ( N / M )-time queries after polynomial-time indexing. To the best of oureﬀorts, we could not ﬁnd a proof of this result in the literature, and hence we give one here. Itis based on the same idea of an “adjustable splitting” into subvectors, a part of which is indexed,while the other part is queried. However, some technical subtleties arise from the combination ofall parameters α, δ, β . Theorem 5. If ( N, M ) - OV is ( α, δ, β )-polynomially indexable with parameter d , and either δ < or β < , then OVH fails. That is, under

OVH , we cannot support O ( N δ M β ) -time queries for ( N, M ) - OV , for either δ < or β < , even after polynomial-time preprocessing.Proof. Let X and Y be the input for OV and assume that their length is n . Our strategy is tosplit this instance of OV into many ( N, M )- OV sub-problems and show that a too eﬃcient indexingscheme for ( N, M )- OV applied to such sub-problems would lead to an online algorithm for OV running in sub-quadratic time, hence contradicting OVH . The key is to adjust the size of such(

N, M )- OV sub-problems to ﬁt our needs. Let us begin by partitioning set X into subsets of N vectors each, and set Y into subsets of M vectors each, as shown in Figure 2. The instances of Xx ... x N x N +1 ... x N ... x n − N +1 ... x n Yy ... y M y M +1 ... y M ... y n − M +1 ... y n X X X d nN e Y Y Y d nM e Figure 2: Two sets of n vectors and their partitioning into sub-sets of X for indexing, and sub-setsof Y for querying.( N, M )- OV sub-problems that we want to consider are all those pairs of vector sets ( X i , Y j ) inwhich X i is a subset of X and Y j is a subset of Y . Solving all the ( X i , Y j ) instances solves theoriginal problem. Now we proceed to index sub-sets X i and to analyze how the time complexityof the original problem looks like when expressed in terms of the ( N, M )- OV sub-problems. Laterwe show how we can obtain a sub-quadratic time algorithm for OV by choosing speciﬁc values for N and M . The idea of splitting the two sets into smaller groups was also used in [3] to obtain a fast randomized algorithmfor OV , based on the polynomial method, and therein the groups always had equal size. N, M )- OV is ( α, δ, β )-polynomially indexable with parameter d , wecan build index Idx ( X i ) for subset X i of N vectors in time O ( d O (1) ( dN ) α ), and additionally we cananswer a query for any subset Y j of M vectors using index Idx ( X i ) in time O ( d O (1) ( dN ) δ ( dM ) β ).Hence, given index Idx ( X i ), we can solve sub-problems ( X i , Y j ) for a ﬁxed i and j (1 ≤ j ≤ (cid:100) nM (cid:101) )by performing (cid:100) nM (cid:101) queries, one for each subset Y j of Y . Repeating this process for all X i coversall possible pairs ( X i , Y j ), and since we have (cid:100) nN (cid:101)(cid:100) nM (cid:101) such pairs, the total cost for solving OV is: O (cid:16) d O (1) ( dN ) α nN + d O (1) ( dN ) δ ( dM ) β nN nM (cid:17) = O (cid:16) d O (1) (cid:16) N α − n + N δ − M β − n (cid:17)(cid:17) . 
(1)In order to have a sub-quadratic-time algorithm for OV we need both of the terms of the sum aboveto be sub-quadratic. Namely, our time complexity should be O (cid:16) d O (1) (cid:16) n − ε (cid:48) + n − ε (cid:17)(cid:17) , for some ε, ε (cid:48) >

0. Notice that in order to prove

OVH to be wrong it is enough to ﬁnd one speciﬁc value for ε and one for ε (cid:48) such that the following two conditions hold:(a) : N α − n = O ( n − ε (cid:48) )(b) : N δ − M β − n = O ( n − ε )As a ﬁrst observation, notice that we need also to enforce 1 ≤ N ≤ n and 1 ≤ M ≤ n . This isbecause every X i and every Y j must contain at least one vector in order to be an instance of ( N, M )- OV , and trivially their size must not exceed the size n of the original OV instance. Moreover, N and M must also be integers. This last requirement might cause some complications during ouranalysis, and due to this reason we will take advantage of a useful trick. We introduce new variables˜ N and ˜ M so that we can make them assume real values. Our actual N and M would be the ceilingof ˜ N and ˜ M . Putting all together, we want that for every n ∈ N , α, δ, β > δ < β < ε > , ε (cid:48) > , N, M, ˜ N , ˜ M such that:(a) N α − n = O ( n − ε (cid:48) )(b) N δ − M β − n = O ( n − ε )(˜a) ˜ N α − n = n − ε (cid:48) (˜b) ˜ N δ − ˜ M β − n = n − ε (c) N = (cid:100) ˜ N (cid:101) , M = (cid:100) ˜ M (cid:101) (d) 1 ≤ ˜ N ≤ n, ≤ ˜ M ≤ n Notice that forcing 1 ≤ ˜ N ≤ n also ensures 1 ≤ N ≤ n , since we are taking the ceiling N = (cid:100) ˜ N (cid:101) .The same holds for ˜ M and M .We start our case analysis by identifying two cases for parameter α , namely α (cid:54) = 1 and α = 1.These are eventually broken down into speciﬁc sub-cases for parameters δ and β . The strategy isto prove that if conditions (˜a), (˜b), (c), (d), (e) hold, then also conditions (a) and (b) hold. Forsimplicity, we report here only the most interesting cases in which α (cid:54) = 0, δ (cid:54) = 1 and β (cid:54) = 1. Thecomplete analysis of the remaining cases can be found in Appendix B. Case 1 : α (cid:54) = 1. In this case we obtain the following constraint on ˜ N from condition (˜a):˜ N α − n = n − ε (cid:48) ⇔ ˜ N = n − ε (cid:48) α − . 
(2)Now, given ε (cid:48) , we can compute ˜ N using this equation so that we satisfy condition (˜a). In doing sowe need to make sure that condition (d) is also respected. To this end, we need to check that ε (cid:48) ≤ − ε (cid:48) α − ≤

1. Let us start with the left inequality.1 − ε (cid:48) α − ≥ ⇔ (1 − ε (cid:48) ≥ α − >

0) or (1 − ε (cid:48) ≤ α − < ⇔ ( ε (cid:48) ≤ α >

1) or ( ε (cid:48) ≥ α <

1) (3)For the right inequality, we ﬁrst handle the case in which ε (cid:48) ≤ α > − ε (cid:48) α − ≤

1. Since α > α − > − ε (cid:48) α − ≤ ⇔ ε (cid:48) ≥ − α .Hence, the ﬁnal constraint for ε (cid:48) is 2 − α ≤ ε (cid:48) ≤

1, and we know that there exists valid values for ε (cid:48) since α > ⇒ − α < ε (cid:48) ≥ α <

1. We ﬁnd ourselves in asymmetric situation in which α < α − < − ε (cid:48) α − ≤ ⇔ ε (cid:48) ≤ − α .Putting all together we have 1 ≤ ε (cid:48) ≤ − α , and the existence of valid values for ε (cid:48) is guaranteedby the fact that α < ⇒ − α > N . To analyze the other conditions, we need toconsider three sub-cases. Here we present the more challenging one, which is in turn split into twomore sub-cases. The reader can ﬁnd the others in Appendix B. Case 1.1 : δ (cid:54) = 1 and β (cid:54) = 1. Now condition (˜b) yields the following:˜ N δ − ˜ M β − n = n − ε ⇔ ˜ M = ˜ N − δβ − n ε − β . (4)We apply the substitution ˜ N = n − ε (cid:48) α − that we obtained from equation (2). Hence:˜ M = n − ε (cid:48) α − − δβ − n ε − β = n − ε (cid:48) α − − δβ − − εβ − . We apply condition (d) obtaining the following constraint:0 ≤ − ε (cid:48) α − − δβ − − εβ − ≤ Case 1.1.1 : β − < ⇔ β <

1. We extract the constraint on ε from the two inequalities in(5) above. We start by analysing the left inequality.1 − ε (cid:48) α − − δβ − − εβ − ≥ ε ≥ − ε (cid:48) α − − δ ) . From the second inequality instead we get:1 − ε (cid:48) α − − δβ − − εβ − ≤ ε − − ε (cid:48) α − − δ ) ≤ − βε ≤ − β + 1 − ε (cid:48) α − − δ )Since ε >

0, we need to be sure that the right term of this last inequality is strictly greater than0. Since 1 − β > − ε (cid:48) α − > − δ <

0. Notice that10 − ε (cid:48) α − → ε (cid:48) →

1. This means that we can always choose ε (cid:48) as close to 1 as needed to make1 − β + − ε (cid:48) α − (1 − δ ) > ε , ε (cid:48) , ˜ N and ˜ M in such a way that conditions(˜a), (˜b) and (d) are veriﬁed. Then we choose N = (cid:100) N (cid:101) and M = (cid:100) M (cid:101) so that condition (c) isveriﬁed. We analyse in depth condition (b) since it is more complicated; condition (a) can beproven applying the same technique. We ﬁrst remark the following property: Fact 1. ∀ n, a, b ∈ R . (cid:100) n a (cid:101) b = O ( n ab ). Proof.

For $b = 0$ the statement is trivially true. If $b > 0$, we have $\lceil n^a \rceil^b \le (n^a + 1)^b = O(n^{ab})$. If $b < 0$, we have $\lceil n^a \rceil^b \le (n^a)^b = n^{ab}$, since $\lceil n^a \rceil \ge n^a$.

Now we can show that

$$N^{\delta-1} M^{\beta-1} n^2 = \left\lceil n^{\frac{1-\varepsilon'}{\alpha-1}} \right\rceil^{\delta-1} \left\lceil n^{\frac{1-\varepsilon'}{\alpha-1}\cdot\frac{1-\delta}{\beta-1} - \frac{\varepsilon}{\beta-1}} \right\rceil^{\beta-1} n^2 = O\!\left( n^{-\frac{1-\varepsilon'}{\alpha-1}(1-\delta)}\; n^{\frac{1-\varepsilon'}{\alpha-1}(1-\delta) - \varepsilon}\; n^2 \right) = O(n^{2-\varepsilon}),$$

where the first step is justified by Fact 1. We conclude that both conditions (a) and (b) hold.

Case 1.1.2: $\beta - 1 > 0 \Leftrightarrow \beta > 1$. This case is symmetric to the previous one and implies that we are in the situation $\delta < 1$. When extracting the constraints on $\varepsilon$ from (5), we obtain the same inequalities but with the opposite direction, because this time we multiply by $\beta - 1 > 0$:

$$1 - \beta + \frac{1-\varepsilon'}{\alpha-1}(1-\delta) \le \varepsilon \le \frac{1-\varepsilon'}{\alpha-1}(1-\delta).$$

Given that $\varepsilon > 0$, we need to verify that there is room to choose such an $\varepsilon$, that is, $\frac{1-\varepsilon'}{\alpha-1}(1-\delta) > 0$. We know that the first factor of this product is between 0 and 1. Hence, we can always choose an $\varepsilon'$ such that $\frac{1-\varepsilon'}{\alpha-1} > 0$. Moreover, in this sub-case we have $\delta < 1$, hence $1 - \delta$ is strictly positive. Hence, the quantity $\frac{1-\varepsilon'}{\alpha-1}(1-\delta)$ is strictly positive, which means that there always exists an $\varepsilon$ such that condition (d) holds. Assuming condition (c), conditions (a) and (b) can be proved to hold in the same manner as in the previous sub-case.

In conclusion, depending on $\alpha$, $\delta$ and $\beta$ we find ourselves in one of the listed cases. We showed that in each one of those we can always find values for $\varepsilon$ and $\varepsilon'$ such that there exist integer values for $N$ and $M$ that provide an algorithm for OV running in time $O(n^{2-\varepsilon} + n^{2-\varepsilon'})$, proving OVH to be false.
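As a sanity check on the exponent bookkeeping of Case 1.1.1, the following Python sketch picks concrete exponents using exact rational arithmetic. The function name `choose_exponents`, the variable `a` (standing for $\frac{1-\varepsilon'}{\alpha-1}$), and the particular choices made (halving the admissible range of `a`, taking a midpoint for $\varepsilon$) are illustrative assumptions and not part of the proof. The code only confirms that the resulting exponents satisfy condition (d) and reproduce the totals $n^{2-\varepsilon'}$ and $n^{2-\varepsilon}$.

```python
from fractions import Fraction as F

def choose_exponents(alpha, delta, beta):
    """Pick exponents for Case 1.1.1 (alpha != 1, delta != 1, beta < 1).

    Returns (eps_prime, eps, a, m), where N~ = n**a and M~ = n**m.
    The concrete choices are one admissible option, assumed for illustration.
    """
    alpha, delta, beta = F(alpha), F(delta), F(beta)
    assert alpha > 0 and alpha != 1 and delta != 1 and beta < 1

    # a = (1 - eps') / (alpha - 1) must lie in (0, 1] while keeping eps' > 0.
    caps = [F(1)]
    if alpha > 1:
        caps.append(1 / (2 * (alpha - 1)))           # keeps eps' = 1 - a*(alpha-1) > 0
    if delta > 1:
        caps.append((1 - beta) / (2 * (delta - 1)))  # keeps the upper bound on eps > 0
    a = min(caps)
    eps_prime = 1 - a * (alpha - 1)

    # Constraint (5) with beta - 1 < 0:
    #   a*(1 - delta) <= eps <= 1 - beta + a*(1 - delta)
    lower = a * (1 - delta)
    upper = 1 - beta + lower
    eps = (max(lower, F(0)) + upper) / 2             # strictly positive, inside the range
    m = (lower - eps) / (beta - 1)                   # exponent of M~ after substituting N~
    return eps_prime, eps, a, m
```

For instance, `choose_exponents(3, F(3, 2), F(1, 2))` yields rational exponents for which the indexing cost exponent $1 + a(\alpha-1)$ equals $2-\varepsilon'$ and the query cost exponent $2 + a(\delta-1) + m(\beta-1)$ equals $2-\varepsilon$.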

Corollary 2. Any problem P such that $(N, M)$-OV $\le_{dlic}$ P holds is not $(\alpha, \delta, \beta)$-polynomially indexable, for any $\alpha > 0$, $\beta \ge 1$, $0 < \delta < 1$, unless OVH is false.

Recall the following conditional lower bound for SMLG from Equi et al. [21].

Theorem 6 ([21]). For any $\varepsilon > 0$, SMLG on labeled deterministic DAGs cannot be solved in either $O(|E|^{1-\varepsilon}|P|)$ or $O(|E||P|^{1-\varepsilon})$ time unless OVH fails. This holds even if restricted to a binary alphabet, and to DAGs in which the sum of out-degree and in-degree of any node is at most three.

Figure 3: Non-deterministic graph $G$. The dashed thick edges are not present in the acyclic graph from [21], and must be added to handle $(N, M)$-OV instances with $M > N$.

Given an OV instance with sets $X$ and $Y$, the reduction from [21] builds a graph $G$ using solely $X$, and a pattern $P$ using solely $Y$, both in linear time $O(dN)$, such that $P$ has a match in $G$ if and only if there exists a pair of orthogonal vectors. This shows that the two conditions of the linear independent-components reduction property hold, thus OV $\le_{dlic}$ SMLG. Directly applying Corollary 1, we obtain Theorem 2.

Footnote: Notice that [21] originally built $P$ based on $X$, and $G$ based on $Y$. Since it is immaterial for correctness, and in order to keep in line with the notation in this paper, we assumed the opposite here.

Next, we show that the constraint $\beta + \delta < 2$ can be dropped. If we are given $(N, M)$-OV instances with $M > N$, then the reduction from [21] no longer holds, because the pattern $P$ is too large to fit inside the DAG $G$. As such, we need to make a minor adjustment to $G$. For this, we must give some additional details of that reduction. For our purposes, it is enough to explain the construction of a non-deterministic graph from [21, Section 2.3].

Pattern $P$ is over the alphabet $\Sigma = \{\mathtt{b}, \mathtt{e}, \mathtt{0}, \mathtt{1}\}$, has length $|P| = O(dM)$, and can be built in $O(dM)$ time from the second set of vectors $Y = \{y_1, \ldots, y_M\}$. Namely, we define

$$P = \mathtt{bb}\,P_{y_1}\,\mathtt{e}\;\mathtt{b}\,P_{y_2}\,\mathtt{e}\;\ldots\;\mathtt{b}\,P_{y_M}\,\mathtt{ee},$$

where $P_{y_i}$ is a string of length $d$ that is associated with each $y_i \in Y$, for $1 \le i \le M$. The $h$-th symbol of $P_{y_i}$ is either $\mathtt{0}$ or $\mathtt{1}$, for each $h \in \{1, \ldots, d\}$, such that $P_{y_i}[h] = \mathtt{1}$ if and only if $y_i[h] = 1$.

Starting from the first set of vectors $X$, we define the directed graph $G_W = (V_W, E_W, L_W)$, which can be built in $O(dN)$ time and consists of $N$ connected components $G_W^{(j)}$, one for each vector $x_j \in X$. Component $G_W^{(j)}$ can be constructed so that it contains an occurrence of a subpattern $P_{y_i}$ if and only if $x_j \cdot y_i = 0$. In addition, we need a universal gadget $G_U = (V_U, E_U, L_U)$ of $2N - 2$ components $G_U^{(k)}$, where each component can match any of the subpatterns $P_{y_i}$. We actually need two copies $G_{U1}$ and $G_{U2}$ of such universal gadgets, a "top" one and a "bottom" one, respectively. All the gadgets are connected as indicated in Figure 3 and the resulting graph $G$ has total size $O(dN)$.

The intuition is that a prefix of $P$ is handled by the "top" universal gadgets $G_{U1}$, a possible match of a subpattern $P_{y_i}$ of $P$ by one of the "middle" gadgets $G_W^{(j)}$, and a suffix of $P$ by the "bottom" universal gadgets, because $P$ has a $\mathtt{bb}$ prefix and an $\mathtt{ee}$ suffix. As mentioned above, by construction this graph does not handle $(N, M)$-OV instances with $M > N$. However, we can easily fix this by adding a cycle in each of the "top" and "bottom" universal gadgets, so that a longer pattern will match a universal gadget in this cycle as many times as needed to fit inside the graph. More precisely, we can add an edge from the $\mathtt{e}$-node to the right of $G_{U1}^{(1)}$ back to the $\mathtt{b}$-node to the left of $G_{U1}^{(1)}$, and likewise from the $\mathtt{e}$-node to the right of $G_{U2}^{(2N-2)}$ back to the $\mathtt{b}$-node to the left of $G_{U2}^{(2N-2)}$ (see Figure 3). It can easily be checked that it still holds that $P$ has a match in the resulting graph $G$ if and only if there are two orthogonal vectors, no matter the relationship between $N$ and $M$. Applying Corollary 2, we obtain Theorem 3.
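The pattern side of the reduction is simple enough to sketch directly. The Python snippet below builds $P$ from $Y$ following the definition above and adds a brute-force orthogonality check that can serve as a reference answer. The function names `build_pattern` and `has_orthogonal_pair` are hypothetical, vectors are assumed to be lists of 0/1 integers, and the graph gadgets of [21] are deliberately not reproduced here.

```python
def build_pattern(Y):
    """Build P = bb P_y1 e b P_y2 e ... b P_yM ee from a list of 0/1 vectors Y."""
    # Each vector y contributes the block b P_y e, where P_y spells y over {0, 1}.
    blocks = ["b" + "".join(str(bit) for bit in y) + "e" for y in Y]
    # One extra leading b and one extra trailing e give the bb prefix and ee suffix.
    return "b" + "".join(blocks) + "e"

def has_orthogonal_pair(X, Y):
    """Brute-force OV check in O(N * M * d) time, usable only as a reference."""
    return any(all(a * b == 0 for a, b in zip(x, y)) for x in X for y in Y)
```

With $M$ vectors of dimension $d$, the resulting string has length $M(d+2) + 2$, in line with $|P| = O(dM)$.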
References

[1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59–78, 2015.
[2] Amir Abboud, Aviad Rubinstein, and R. Ryan Williams. Distributed PCP theorems for hardness of approximation in P. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 25–36. IEEE Computer Society, 2017. URL: https://doi.org/10.1109/FOCS.2017.12, doi:10.1109/FOCS.2017.12.
[3] Amir Abboud, Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In Proceedings of the Twenty-sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 218–230, Philadelphia, PA, USA, 2015. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=2722129.2722146.
[4] Amir Abboud and Virginia Vassilevska Williams. Popular conjectures imply strong lower bounds for dynamic problems. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, pages 434–443. IEEE Computer Society, 2014. URL: https://doi.org/10.1109/FOCS.2014.53, doi:10.1109/FOCS.2014.53.
[5] Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 911–930. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611975994.55, doi:10.1137/1.9781611975994.55.
[6] Amihood Amir, Timothy M. Chan, Moshe Lewenstein, and Noa Lewenstein. On hardness of jumbled indexing. In Javier Esparza, Pierre Fraigniaud, Thore Husfeldt, and Elias Koutsoupias, editors, Automata, Languages, and Programming - 41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8-11, 2014, Proceedings, Part I, volume 8572 of Lecture Notes in Computer Science, pages 114–125. Springer, 2014. URL: https://doi.org/10.1007/978-3-662-43948-7_10, doi:10.1007/978-3-662-43948-7_10.
[7] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. In WADS'97, Halifax, LNCS 1272, pages 160–173, 1997.
[8] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. J. Algorithms, 35(1):82–99, 2000.
[9] Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1–1:39, February 2008. URL: http://doi.acm.org/10.1145/1322432.1322433, doi:10.1145/1322432.1322433.
[10] Arturs Backurs and Piotr Indyk. Edit Distance Cannot Be Computed in Strongly Subquadratic Time (Unless SETH is False). In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 51–58, New York, NY, USA, 2015. ACM. URL: http://doi.acm.org/10.1145/2746539.2746612, doi:10.1145/2746539.2746612.
[11] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 457–466, 2016.
[12] Philip Bille. Personal Communication at Dagstuhl Seminar on Indexes and Computation over Compressed Structured Data, June 2013.
[13] Karl Bringmann. Why walking the dog takes time: Fréchet distance has no strongly subquadratic algorithms unless SETH fails. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, pages 661–670. IEEE, 2014.
[14] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[15] Vincent Cohen-Addad, Laurent Feuilloley, and Tatiana Starikovskaya. Lower bounds for text indexing with mismatches and differences. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1146–1164. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975482.70, doi:10.1137/1.9781611975482.70.
[16] The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118–135, 2018. URL: http://dx.doi.org/10.1093/bib/bbw089, doi:10.1093/bib/bbw089.
[17] Alessio Conte, Gaspare Ferraro, Roberto Grossi, Andrea Marino, Kunihiko Sadakane, and Takeaki Uno. Node Similarity with q-Grams for Real-World Labeled Networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 1282–1291, 2018. URL: https://doi.org/10.1145/3219819.3220085, doi:10.1145/3219819.3220085.
[18] Maxime Crochemore and Wojciech Rytter. Jewels of stringology. World Scientific, 2002. URL: https://doi.org/10.1142/4838, doi:10.1142/4838.
[19] Hannes P. Eggertsson, Hakon Jonsson, Snaedis Kristmundsdottir, Eirikur Hjartarson, Birte Kehr, Gisli Masson, Florian Zink, Kristjan E. Hjorleifsson, Aslaug Jonasdottir, Adalbjorg Jonasdottir, Ingileif Jonsdottir, Daniel F. Gudbjartsson, Pall Melsted, Kari Stefansson, and Bjarni V. Halldorsson. Graphtyper enables population-scale genotyping using pangenome graphs. Nature Genetics, 49(11):1654–1660, 2017. doi:10.1038/ng.3964.
[20] Massimo Equi. Pattern matching in labeled graphs. Master's thesis, University of Pisa, Italy, 2018. URL: https://etd.adm.unipi.it/theses/available/etd-09102018-185610/unrestricted/MasterThesis_MassimoEqui.pdf.
[21] Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, volume 132 of LIPIcs, pages 55:1–55:15. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ICALP.2019.55, doi:10.4230/LIPIcs.ICALP.2019.55.
[22] P. Ferragina and G. Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552–581, 2005.
[23] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1):4:1–4:33, 2009.
[24] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 1433–1445, 2018. URL: https://doi.org/10.1145/3183713.3190657, doi:10.1145/3183713.3190657.
[25] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theor. Comput. Sci., 698:67–78, 2017. URL: https://doi.org/10.1016/j.tcs.2017.06.016, doi:10.1016/j.tcs.2017.06.016.
[26] Erik Garrison, Jouni Sirén, Adam M. Novak, Glenn Hickey, Jordan M. Eizenga, Eric T. Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F. Lin, Benedict Paten, and Richard Durbin. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 36:875, August 2018. doi:10.1038/nbt.4227.
[27] Daniel Gibney and Sharma V. Thankachan. On the hardness and inapproximability of recognizing Wheeler graphs. In Michael A. Bender, Ola Svensson, and Grzegorz Herman, editors, 27th Annual European Symposium on Algorithms, ESA 2019, volume 144 of LIPIcs, pages 51:1–51:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ESA.2019.51, doi:10.4230/LIPIcs.ESA.2019.51.
[28] Isaac Goldstein, Moshe Lewenstein, and Ely Porat. On the hardness of set disjointness and set intersection with bounded universe. In Pinyan Lu and Guochuan Zhang, editors, 30th International Symposium on Algorithms and Computation, ISAAC 2019, volume 149 of LIPIcs, pages 7:1–7:22. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ISAAC.2019.7, doi:10.4230/LIPIcs.ISAAC.2019.7.
[29] R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2006.
[30] Shohei Hido and Hisashi Kashima. A linear-time graph kernel. In Wei Wang, Hillol Kargupta, Sanjay Ranka, Philip S. Yu, and Xindong Wu, editors, ICDM 2009, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009, pages 179–188. IEEE Computer Society, 2009.
[31] Russell Impagliazzo and Ramamohan Paturi. On the Complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001. doi:10.1006/jcss.2000.1727.
[32] Daehwan Kim, Joseph M. Paggi, Chanhee Park, Christopher Bennett, and Steven L. Salzberg. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology, 37(8):907–915, 2019. doi:10.1038/s41587-019-0201-4.
[33] Veli Mäkinen, Alexandru I. Tomescu, Anna Kuosmanen, Topi Paavilainen, Travis Gagie, and Rayan Chikhi. Sparse dynamic programming on DAGs with small width. ACM Trans. Algorithms, 15(2):29:1–29:21, 2019.
[34] William J. Masek and Michael S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20(1):18–31, 1980. doi:10.1016/0022-0000(80)90002-1.
[35] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2, 2007. URL: https://doi.org/10.1145/1216370.1216372, doi:10.1145/1216370.1216372.
[36] Mihai Patrascu and Liam Roditty. Distance Oracles beyond the Thorup-Zwick Bound. SIAM J. Comput., 43(1):300–311, 2014. URL: https://doi.org/10.1137/11084128X, doi:10.1137/11084128X.
[37] Eric Prud'hommeaux and Andy Seaborne. SPARQL query language for RDF. World Wide Web Consortium, Recommendation REC-rdf-sparql-query-20080115, January 2008.
[38] Mikko Rautiainen, Veli Mäkinen, and Tobias Marschall. Bit-parallel sequence-to-graph alignment. Bioinformatics, 35(19):3599–3607, 2019. URL: https://doi.org/10.1093/bioinformatics/btz162, doi:10.1093/bioinformatics/btz162.
[39] Mikko Rautiainen and Tobias Marschall. GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment. bioRxiv, 2019. doi:10.1101/810812.
[40] Marko A. Rodriguez. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages, Pittsburgh, PA, USA, October 25-30, 2015, pages 1–10, 2015. URL: https://doi.org/10.1145/2815072.2815073, doi:10.1145/2815072.2815073.
[41] Korbinian Schneeberger, Jörg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short reads against multiple genomes. Genome Biology, 10:R98, 2009.
[42] Jouni Sirén. Indexing variation graphs. In Sándor P. Fekete and Vijaya Ramachandran, editors, Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments, ALENEX 2017, Barcelona, Spain, Hotel Porta Fira, January 17-18, 2017, pages 13–27. SIAM, 2017. URL: https://doi.org/10.1137/1.9781611974768.2, doi:10.1137/1.9781611974768.2.
[43] Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 11(2):375–388, March 2014. URL: http://dx.doi.org/10.1109/TCBB.2013.2297101, doi:10.1109/TCBB.2013.2297101.
[44] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357–365, 2005. URL: https://doi.org/10.1016/j.tcs.2005.09.023, doi:10.1016/j.tcs.2005.09.023.

A Connection to SIC

Given sets $S_1, S_2, \ldots, S_n \subseteq [1..u]$, where $u = \log^c n$ for sufficiently large $c$, the Set Intersection Conjecture (SIC) [36] is that there is no index of size $O(n^{2-\varepsilon})$ for any $\varepsilon > 0$ that can answer in constant time whether $S_i$ and $S_j$ intersect or not (i.e. there is no improvement over the table of all precomputed solutions).

The reduction of [12] is as follows: build a simple DAG with one copy of the sets as source nodes and another copy as sink nodes. Then add nodes in between corresponding to the elements of the sets. Connect the source node corresponding to $S_i$ to all nodes corresponding to elements $v \in S_i$, and all nodes corresponding to $v \in S_i$ to the sink corresponding to $S_i$, for all $i$. Label sources and sinks with their set identifier, and nodes in between with some common letter, say A. Since the graph size is $O(n \log^c n)$, a truly sub-quadratic size index supporting string queries of the form $P = i\,\mathtt{A}\,j$ even, say, in exponential time in $|P|$ would prove SIC false. Modifying the relationship between universe size $u$ and number of sets $n$ gives rise to several refined lower bounds for the tradeoff between index construction and query time [28], which directly transfer to the graph indexing problem through the simple connection stated above.

B Missing cases of the proof of Theorem 5

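Case 2: $\alpha = 1$. Condition (ã) simply becomes $n = n^{2-\varepsilon'}$, which is verified for $\varepsilon' = 1$. We now split the analysis of condition (b̃) into two sub-cases.

Case 2.1: $\delta < 1$ (any $\beta$). We can rewrite condition (b̃) as:

$$\tilde{N} = \tilde{M}^{\frac{1-\beta}{\delta-1}}\, n^{\frac{\varepsilon}{1-\delta}},$$

since $\delta < 1 \Rightarrow \delta - 1 \neq 0$. If we choose $\tilde{M} = 1$ we respect condition (d) and we obtain $\tilde{N} = n^{\frac{\varepsilon}{1-\delta}}$ for any value of $\beta$. Hence, we can first choose a value for $\varepsilon$ and later use this equation to obtain the right value for $\tilde{N}$ that will satisfy condition (b̃). Nevertheless, we cannot just pick any value for $\varepsilon$. Indeed, we also need to guarantee that condition (d) holds. This can be achieved by verifying that $0 \le \frac{\varepsilon}{1-\delta} \le 1$. Since $\delta < 1$ and $\varepsilon > 0$, we have $\frac{\varepsilon}{1-\delta} > 0$. Moreover, $\frac{\varepsilon}{1-\delta} \le 1 \Leftrightarrow \varepsilon \le 1 - \delta$, which means that any $\varepsilon$ such that $0 < \varepsilon \le 1 - \delta$ satisfies condition (d). We know that there exists such an $\varepsilon$ since $1 - \delta > 0$.

Condition (a) is easily verified by $\alpha = 1$. Since $N = \lceil \tilde{N} \rceil = \lceil n^{\frac{\varepsilon}{1-\delta}} \rceil$ and $M = \lceil \tilde{M} \rceil = 1$, and noticing that $\delta - 1 < 0$, we can analyse condition (b) as follows:

$$N^{\delta-1} M^{\beta-1} n^2 = \left\lceil n^{\frac{\varepsilon}{1-\delta}} \right\rceil^{\delta-1} n^2 \le \left( n^{\frac{\varepsilon}{1-\delta}} \right)^{\delta-1} \cdot n^2 = O\!\left( n^{\frac{\varepsilon}{1-\delta}(\delta-1)}\, n^2 \right) = O(n^{2-\varepsilon}),$$

where the inequality holds because $\lceil x \rceil \ge x$ and the exponent $\delta - 1$ is negative. Hence, condition (b) is verified and so all the conditions hold.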
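The bound of Case 2.1 can be spot-checked numerically. The sketch below is only an illustration under assumed sample values ($\delta = 0.5$, $\varepsilon = 0.25$, a few values of $n$); the helper name `query_cost_bound_holds` and the small floating-point slack are assumptions made here, and a handful of passing samples is of course no substitute for the argument above.

```python
import math

def query_cost_bound_holds(n, delta, eps, slack=1e-9):
    """Check N^(delta-1) * n^2 <= n^(2-eps) for N = ceil(n^(eps/(1-delta))), M = 1."""
    assert delta < 1 and 0 < eps <= 1 - delta
    n_tilde = n ** (eps / (1 - delta))   # real-valued choice N~ from condition (b~)
    N = math.ceil(n_tilde)               # integer rounding required by condition (c)
    lhs = N ** (delta - 1) * n ** 2      # total query cost with M = 1
    rhs = n ** (2 - eps)
    return lhs <= rhs * (1 + slack)      # slack absorbs floating-point rounding
```

Rounding $\tilde{N}$ up can only decrease the left-hand side here, because the exponent $\delta - 1$ is negative.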

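Case 2.2: $\beta < 1$ (any $\delta$). This case is symmetric to the previous one. Indeed, we now rewrite condition (b̃) as:

$$\tilde{M} = \tilde{N}^{\frac{1-\delta}{\beta-1}}\, n^{\frac{\varepsilon}{1-\beta}},$$

where $\beta < 1 \Rightarrow \beta - 1 \neq 0$. This time we choose $\tilde{N} = 1$, from which we obtain $\tilde{M} = n^{\frac{\varepsilon}{1-\beta}}$ for any value of $\delta$. Again, we will use this equation to find the right value for $\tilde{M}$ once we have chosen $\varepsilon$. When choosing such $\varepsilon$, we will have to respect the constraint $0 \le \frac{\varepsilon}{1-\beta} \le 1$. As before, any $\varepsilon$ such that $0 < \varepsilon \le 1 - \beta$ satisfies condition (d), and $\beta < 1$ guarantees that such an $\varepsilon$ exists.

As in the previous case, condition (a) is easily verified by $\alpha = 1$. For verifying condition (b) we choose $\varepsilon, \varepsilon', \tilde{N}, \tilde{M}$ such that conditions (ã), (b̃) and (d) are verified. Then we choose $N = \lceil \tilde{N} \rceil = 1$ and $M = \lceil \tilde{M} \rceil = \lceil n^{\frac{\varepsilon}{1-\beta}} \rceil$ so that condition (c) is verified. The analysis of condition (b) is analogous to the previous case and, since $\beta - 1 < 0$, yields $N^{\delta-1} M^{\beta-1} n^2 \le \left( n^{\frac{\varepsilon}{1-\beta}} \right)^{\beta-1} \cdot n^2 = O(n^{2-\varepsilon})$, which verifies condition (b).

Case 1.2: $\delta < 1$ and $\beta = 1$. In this case condition (b̃) simplifies to

$$\tilde{N}^{\delta-1} n^2 = n^{2-\varepsilon} \;\Leftrightarrow\; \tilde{N} = n^{\frac{\varepsilon}{1-\delta}},$$

where $1 - \delta > 0$ since $\delta < 1$. Condition (ã) and condition (b̃) both concern $\tilde{N}$, and by combining them we obtain:

$$n^{\frac{\varepsilon}{1-\delta}} = n^{\frac{1-\varepsilon'}{\alpha-1}} \;\Leftrightarrow\; \frac{\varepsilon}{1-\delta} = \frac{1-\varepsilon'}{\alpha-1} \;\Leftrightarrow\; \varepsilon = \frac{1-\varepsilon'}{\alpha-1}(1-\delta).$$

We already know that $0 < \frac{1-\varepsilon'}{\alpha-1} \le 1$, which guarantees that $0 < \varepsilon \le 1 - \delta$ and also verifies condition (d). Indeed, condition (d) requires $0 \le \frac{\varepsilon}{1-\delta} \le 1$, but this is already kept in check by the fact that $\frac{\varepsilon}{1-\delta} = \frac{1-\varepsilon'}{\alpha-1}$. Since $\delta < 1$, we have $1 - \delta > 0$, and hence we can conclude that all conditions (ã), (b̃) and (d) hold.

Using Fact 1 we can prove that when choosing $N$ as in (c) condition (a) holds:

$$N^{\alpha-1} n = \left\lceil n^{\frac{1-\varepsilon'}{\alpha-1}} \right\rceil^{\alpha-1} n = O\!\left( n^{\frac{1-\varepsilon'}{\alpha-1}(\alpha-1)}\, n \right) = O(n^{2-\varepsilon'}).$$

Observing that $\beta = 1$ makes condition (b) simplify to $N^{\delta-1} n^2 = O(n^{2-\varepsilon})$, we can perform a similar analysis, using $\delta - 1 < 0$, to obtain $N^{\delta-1} n^2 \le \left( n^{\frac{\varepsilon}{1-\delta}} \right)^{\delta-1} \cdot n^2 = O(n^{2-\varepsilon})$, which verifies condition (b).

Case 1.3: $\delta = 1$ and $\beta < 1$. Similarly to the previous case, from condition (b̃) we get:

$$\tilde{M}^{\beta-1} n^2 = n^{2-\varepsilon} \;\Leftrightarrow\; \tilde{M} = n^{\frac{\varepsilon}{1-\beta}}.$$

Here, condition (d) is equivalent to $0 \le \frac{\varepsilon}{1-\beta} \le 1$, which is guaranteed by choosing $\varepsilon$ such that $0 < \varepsilon \le 1 - \beta$ β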