Hardness of Approximation of (Multi-)LCS over Small Alphabet
arXiv [cs.CC]
Amey Bhangale * Diptarka Chakraborty † Rajendra Kumar ‡ June 25, 2020
Abstract
The problem of finding a longest common subsequence (LCS) is one of the fundamental problems in computer science, which finds application in fields such as computational biology, text processing, information retrieval, data compression etc. It is well known that (the decision version of) the problem of finding the length of a LCS of an arbitrary number of input sequences (which we refer to as the Multi-LCS problem) is NP-complete. Jiang and Li [SICOMP'95] showed that if Max-Clique is hard to approximate within a factor of $s$ then Multi-LCS is also hard to approximate within a factor of $\Theta(s)$. By the NP-hardness of the problem of approximating Max-Clique by Zuckerman [ToC'07], for any constant $\delta > 0$, the length of a LCS of an arbitrary number of input sequences of length $n$ each cannot be approximated within an $n^{1-\delta}$-factor in polynomial time unless P = NP. However, the reduction of Jiang and Li assumes the alphabet size to be $\Omega(n)$. So far no hardness result is known for the problem of approximating Multi-LCS over a sub-linear sized alphabet. On the other hand, it is easy to get a $1/|\Sigma|$-factor approximation for strings over an alphabet $\Sigma$.

In this paper, we make significant progress towards proving hardness of approximation over small alphabets by showing a polynomial-time reduction from the well-studied densest $k$-subgraph problem with perfect completeness to approximating Multi-LCS over an alphabet of size $\mathrm{poly}(n/k)$. As a consequence, from the known hardness result for the densest $k$-subgraph problem (e.g. [Manurangsi, STOC'17]) we get that no polynomial-time algorithm can give an $n^{-o(1)}$-factor approximation of Multi-LCS over an alphabet of size $n^{o(1)}$, unless the Exponential Time Hypothesis is false.

* University of California Riverside, USA. Email: [email protected]
† National University of Singapore, Singapore. Author is supported in part by NUS ODPRT Grant, WBS No. R-252-000-A94-133. Email: [email protected]
‡ IIT Kanpur, India and National University of Singapore. Author is supported in part by the National Research Foundation Singapore under its AI Singapore Programme [Award Number: AISG-RP-2018-005]. Email: [email protected]
Introduction
Finding a longest common subsequence (LCS) of a given set of strings over some alphabet is one of the fundamental problems of computer science. The computational problem of finding (the length of) a LCS has been intensively studied for the last five decades (see [16] and the references therein). This problem finds many applications in the fields of computational biology, data compression, pattern recognition, text processing and others. LCS is often considered between two strings, and in that case it is one of the classic string similarity measures (see [5]). The general case, when the number of input strings is unrestricted, is also very interesting and well-studied. To avoid any confusion we refer to this general version of the LCS problem as the Multi-LCS problem. One of the major applications of Multi-LCS is to find similar regions of a set of DNA sequences. Multi-LCS is also a special case of the multiple sequence alignment and consensus subsequence discovery problem (e.g. [27]). Interested readers may refer to the chapter entitled "Multi String Comparison-the Holy Grail" of the book [13] for a comprehensive study on this topic. Other applications of Multi-LCS include text processing, syntactic pattern recognition [22] etc.

Using a basic dynamic programming algorithm [30] we can find a LCS between two strings of length $n$ in quadratic time. However, the general version, i.e., the Multi-LCS problem, is known to be NP-hard [23] even for the binary alphabet. This problem remains NP-hard even with certain restrictions on the input strings (e.g. [7]). For $m$ input strings a generalization of the basic dynamic programming algorithm finds a LCS in time $O(mn^m)$. Recently, Abboud, Backurs and Williams [2] showed that an $O(n^{m-\varepsilon})$ time algorithm (for any $\varepsilon > 0$) for this problem would refute the Strong Exponential Time Hypothesis (SETH) even for alphabets of size $O(m)$.

Due to the computational hardness of exact computation of a LCS, an interesting question is what is the best approximation factor that we can achieve within a reasonable time bound. A $c$-approximate solution (for some $0 < c \le 1$) of a LCS is a common subsequence of length at least $c \cdot |LCS|$, where $|LCS|$ denotes the length of a LCS. For the Multi-LCS problem, Jiang and Li [18] showed that if Max-Clique is hard to approximate within a factor of $s$ then Multi-LCS is also hard to approximate within a factor of $\Theta(s)$. By the NP-hardness of the problem of approximating Max-Clique by Zuckerman [31], for any constant $\delta > 0$, the length of a LCS of an arbitrary number of input sequences of length $n$ each cannot be approximated within an $n^{1-\delta}$-factor in polynomial time unless P = NP. However, the result of Jiang and Li [18] is only true for alphabets of size $\Omega(n)$.
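To make the two baselines mentioned so far concrete, here is a minimal Python sketch (not taken from the paper) of the textbook quadratic dynamic program for two strings, together with the easy $1/|\Sigma|$-factor approximation for any number of strings. The function names `lcs_length` and `single_symbol_baseline` are illustrative assumptions. The baseline works because some symbol must occur at least $|LCS|/|\Sigma|$ times in a LCS, and that many copies of it form a common subsequence.

```python
def lcs_length(a: str, b: str) -> int:
    """Quadratic-time dynamic program for the LCS length of two strings."""
    prev = [0] * (len(b) + 1)          # row for the previous prefix of a
    for x in a:
        cur = [0]                      # LCS with the empty prefix of b is 0
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def single_symbol_baseline(strings):
    """Best common subsequence that repeats a single symbol.

    Illustrative sketch of the trivial 1/|Sigma|-approximation: for each
    symbol, the number of copies usable in every string is its minimum
    count over the inputs; return the best such run.
    """
    common = set(strings[0]).intersection(*strings[1:])
    if not common:
        return ""
    best = max(sorted(common), key=lambda ch: min(s.count(ch) for s in strings))
    return best * min(s.count(best) for s in strings)
```

For example, `single_symbol_baseline(["abcab", "bacba", "aabbc"])` returns a run of two identical symbols, which is within a factor $1/3$ of any common subsequence over this three-letter alphabet.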
For smaller alphabets (even for size sublinear in $n$) we do not know any such hardness result. Jiang and Li [18] conjectured that Multi-LCS even for the binary alphabet is MAX-SNP-hard (see [26] for the definition of MAX-SNP-hardness). To the best of our knowledge no progress has been made so far in the direction of showing any conditional hardness for smaller alphabets. On the other hand, it is very easy to get a $1/|\Sigma|$-approximation algorithm for the Multi-LCS problem over any alphabet $\Sigma$. The algorithm just outputs the best subsequence among the common subsequences that consist of a single repeated symbol.

In this paper, we make significant progress towards showing hardness of approximation of Multi-LCS by refuting the existence of a polynomial time constant factor approximation algorithm under the Exponential Time Hypothesis (ETH).

Theorem 1.1.
There exists a growing function $f(n) = n^{o(1)}$ such that assuming ETH, there is no polynomial time $f(n)$-factor approximation algorithm for the Multi-LCS problem over $n^{o(1)}$-sized alphabet.

This rules out any efficient poly-logarithmic factor approximation algorithm for the Multi-LCS problem over any $n^{o(1)}$-sized alphabet. We show the above theorem by providing a polynomial time reduction from the well-studied densest $k$-subgraph problem with perfect completeness and its gap version $\gamma$-D$k$S (for the definition see Section 2).

Theorem 1.2.
Let kn = β ( n ) γ ( n ) for β < γ ≤ . If there is no polynomial time algorithm that solves ( γ / -D k S( k, n ), then there is no polynomial time algorithm that solves γ -approximate Multi-LCS problem oversome alphabet of size O ( β ) . The above reduction together with the ETH-based hardness result for the densest k -subgraph problemgiven by Manurangsi [24] implies Theorem 1.1. We refer to Appendix 1.2 for the previous works related tothe LCS problem and the densest k -subgraph problem. Our reduction starts with the reduction from the Max-Clique problem to Multi-LCS given by [18]. Givena graph G on n vertices the reduction outputs a Multi-LCS instance I over an alphabet { a , a , . . . , a n } ofsize n with n strings. The reduction has a guarantee that the maximum LCS size of I is equal to the sizeof the maximum clique in G .A natural way to reduce the alphabet size is to replace each symbol a i in a string with a string S i ∈ Σ m over a smaller alphabet Σ . Let us denote this new instance by I ′ . The hope is that the only way to get a largeLCS in I ′ is to match the corresponding strings whenever the respective symbols in I are matched. But thiswishful thinking is not true when the alphabet size is much smaller than the original alphabet size as onemight get a large common subsequence by matching parts of strings S i , S j corresponding to the differentsymbols a i , a j in the original strings.We get away with this issue by using a special collection of strings { S , S , . . . , S n } with the guaranteethat for every pair i = j , LCS ( S i , S j ) is much smaller than m . We can construct such a set deterministicallyby using the known deterministic construction of the so called long-distance synchronization strings [9, 14].There is also a much simpler randomized construction (see Theorem 3.1). 
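The alphabet-reduction idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: vertices are numbered from 0, the strings $X_i$, $X'_i$ follow the construction of [18] as the paper recalls it in Section 3, and the blocks are sampled uniformly at random rather than taken from synchronization strings. All function names (`jiang_li_strings`, `random_blocks`, `substitute`, `max_pairwise_lcs`) are hypothetical.

```python
import random
from itertools import combinations


def lcs_length(a, b):
    """Quadratic dynamic program for the LCS length of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def is_subsequence(s, t):
    """True if s can be obtained from t by deleting elements."""
    it = iter(t)
    return all(ch in it for ch in s)


def jiang_li_strings(n, edges):
    """Return the 2n strings X_0, X'_0, ..., X_{n-1}, X'_{n-1} over vertex symbols."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    strings = []
    for i in range(n):
        others = [v for v in range(n) if v != i]
        later = sorted(v for v in adj[i] if v > i)     # neighbours with larger index
        earlier = sorted(v for v in adj[i] if v < i)   # neighbours with smaller index
        strings.append(others + [i] + later)           # X_i
        strings.append(earlier + [i] + others)         # X'_i
    return strings


def random_blocks(n, m, alphabet, rng):
    """One random length-m block per vertex over the small alphabet."""
    return [[rng.choice(alphabet) for _ in range(m)] for _ in range(n)]


def substitute(strings, blocks):
    """Replace every vertex symbol v by its block S_v."""
    return [[c for v in s for c in blocks[v]] for s in strings]


def max_pairwise_lcs(blocks):
    """Largest LCS between two distinct blocks (should be much smaller than m)."""
    return max(lcs_length(a, b) for a, b in combinations(blocks, 2))
```

On a small graph one can check that the vertex sequence of a clique is a common subsequence of all `jiang_li_strings` outputs, and that after `substitute` the concatenation of the corresponding blocks is a common subsequence of the new instance. `max_pairwise_lcs` reports how far a sampled collection is from the desired "pairwise LCS much smaller than $m$" guarantee.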
It is easy to see that if the original strings have a LCS of size $t$, then the new Multi-LCS instance $I'$ over alphabet $\Sigma$ has a LCS of size at least $tm$.

The interesting direction is to prove the converse, i.e., if the LCS of $I'$ is large then the LCS of $I$ is also large. We do not know if this is true in general. So we rely on the starting problem of Max-Clique from which the instance $I$ (and hence $I'$) was created. We show that if $I'$ has a large LCS, then we can find a large subgraph of $G$ which has non-trivial density (instead of finding a large clique). Thus, the reduction relies on hardness of approximation of the D$k$S problem with perfect completeness. Then we use the result of Manurangsi [24] which shows that given a graph $G$ with a guarantee that there is a clique of size $k$, there is no polynomial time algorithm which finds a subgraph of $G$ of size $k$ with density at least $\gamma(n)$ for some $\gamma(n) = o(1)$, assuming the ETH.

Finding a LCS between two strings is an important problem in computer science. Wagner and Fischer [30] gave a quadratic time algorithm, which is in fact a prototypical example of dynamic programming. The running time was later improved to (slightly) sub-quadratic, more specifically $O\left(n^2 \frac{\log\log n}{\log^2 n}\right)$ [12, 25]. Abboud, Backurs and Williams [2] showed that a truly sub-quadratic algorithm ($O(n^{2-\varepsilon})$ for some $\varepsilon > 0$) would imply a $2^{(1-\delta)n}$ time algorithm for CNF-satisfiability, contradicting the Strong Exponential Time Hypothesis (SETH). They in fact showed that for $m$ input strings an algorithm with running time $O(n^{m-\varepsilon})$ would refute SETH. Abboud et al. [3] later further strengthened the barrier result by showing that even shaving an arbitrarily large polylog factor from $n^2$ would have the plausible, but hard-to-prove, consequence that NEXP does not have non-uniform $NC^1$ circuits. In the case of approximation algorithms for LCS over arbitrarily large alphabets, a simple sampling based technique achieves an $O(n^{-x})$-approximation in $O(n^{2-2x})$ time. Very recently, an $O(n^{-0.497956})$ factor approximation (breaking the $O(\sqrt{n})$ barrier) linear time algorithm was provided by Hajiaghayi et al. [15]. For binary alphabets another very recent result breaks the $1/2$-approximation factor barrier in subquadratic time [29]. (Note, a $1/|\Sigma|$-approximation over any alphabet $\Sigma$ is trivial.) The only hardness (or barrier) results for approximating LCS in subquadratic time are presented in [1, 4].

For the general case (which we also refer to as Multi-LCS), when the number of input strings is unrestricted, the decision version of the problem is known to be NP-complete [23] even for the binary alphabet. The problem remains NP-complete even with further restrictions, like bounded run-length, on the input strings [7]. As cited earlier, Jiang and Li [18] (along with the result of Zuckerman [31]) showed that for every constant $\delta > 0$, there is no polynomial time algorithm that achieves an $n^{1-\delta}$-approximation factor, unless P = NP. One interesting aspect of the reduction in [18] is that in any input string any particular symbol appears at most twice. It is worth mentioning that if we restrict ourselves to input strings where every symbol appears exactly once, then we can find a LCS in polynomial time. The algorithm is just an extension of the dynamic programming algorithm that finds a longest increasing subsequence of an input sequence. It is also not difficult to show that the decision version of the Multi-LCS problem with the above restriction on the input strings can be solved even in non-deterministic logarithmic space. To see this, consider a LCS as a certificate. Then the verification algorithm makes a single pass over the certificate, and checks whether every two consecutive symbols in the certificate appear in the same order in all the input strings.
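The special case just described, where every symbol appears exactly once in each input string, can be sketched as follows. This is an illustrative Python sketch under that assumption, not code from the paper; `verify_certificate` mirrors the single-pass check on consecutive certificate symbols, and `longest_common_chain` is one possible longest-increasing-subsequence style dynamic program. Both names are hypothetical.

```python
def verify_certificate(cert, strings):
    """Check a candidate common subsequence when symbols are unique per string.

    Only consecutive pairs of the certificate are compared, as in the
    single-pass verifier; uniqueness of symbols makes this sufficient.
    """
    positions = [{ch: idx for idx, ch in enumerate(s)} for s in strings]
    if any(ch not in pos for ch in cert for pos in positions):
        return False
    return all(pos[a] < pos[b]
               for a, b in zip(cert, cert[1:])
               for pos in positions)


def longest_common_chain(strings):
    """LCS length when every symbol appears exactly once in every string.

    best[x] is the length of the longest common subsequence ending in x;
    y may precede x only if y comes before x in all input strings.
    """
    positions = [{ch: idx for idx, ch in enumerate(s)} for s in strings]
    common = [ch for ch in strings[0] if all(ch in pos for pos in positions)]
    best = {}
    for idx, x in enumerate(common):
        best[x] = 1 + max(
            (best[y] for y in common[:idx]
             if all(pos[y] < pos[x] for pos in positions)),
            default=0,
        )
    return max(best.values(), default=0)
```

The verifier only needs the positions of two symbols at a time, which is what keeps its working memory logarithmic in the input size; the dictionaries above are a convenience of the sketch, not part of that space bound.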
Clearly, the above verification algorithm uses only logarithmic space. Since we know that each symbol appears exactly once in a string, the above verification algorithm correctly decides whether the given certificate is a valid LCS or not.

Densest k-subgraph problem

Our starting point of the reduction is the hardness of approximating the densest $k$-subgraph problem. In the densest $k$-subgraph problem (D$k$S), we are given a graph $G(V, E)$ and an integer $1 \le k \le |V|$. The task is to find a subgraph of $G$ of size $k$ with maximum density. Various approximation algorithms are known for D$k$S [10, 21], and the current best known is by [6], which gives an $n^{1/4+\varepsilon}$-approximation algorithm for any constant $\varepsilon > 0$.

A special case of D$k$S is when it is guaranteed that $G$ has a clique of size $k$ and the task is to find a subgraph of size $k$ with density at least $\gamma$ for $0 < \gamma \le 1$. In this perfect completeness case, Feige and Seltser [11] gave an algorithm which finds a $k$-sized subgraph with density $(1-\varepsilon)$ in time $n^{O((1+\log\frac{n}{k})/\varepsilon)}$.

There are several inapproximability results known for D$k$S based on worst-case assumptions. Khot [19] ruled out a PTAS assuming NP $\not\subseteq$ BPTIME$(2^{n^{\varepsilon}})$ for some constant $\varepsilon > 0$. Raghavendra and Steurer [28] showed that D$k$S is hard to approximate to within any constant ratio assuming the Unique Games Conjecture where the constraint graph satisfies a small set expansion property.

Assuming the Exponential Time Hypothesis, Braverman et al. [8] showed that for some constant $\varepsilon > 0$, there is no polynomial time algorithm which, when given a graph with a $k$-clique, finds a $k$-sized subgraph with density $(1-\varepsilon)$. This result was significantly improved by Manurangsi [24], who showed that assuming ETH, no polynomial time algorithm can distinguish between the cases when $G$ has a clique of size $k$ and when every $k$-sized subgraph has density at most $n^{-1/(\log\log n)^c}$ for some constant $c > 0$.

Note, here the size of a subgraph refers to the number of vertices present in that subgraph.

Preliminaries
Notations: We use $[n]$ to denote the set $\{1, 2, \dots, n\}$. For any string $S$ we use $|S|$ to denote its length. By abuse of notation, for any set $V$ we also use the notation $|V|$ to denote the size of $V$. For any string $S$ of length $n$ and two indices $i, j \in [n]$, $S[i, j]$ denotes the substring of $S$ that starts at index $i$ and ends at index $j$. We use $\alpha(n), \beta(n), \gamma(n)$ to denote that $\alpha, \beta, \gamma$ are allowed to depend on $n$.

Given $m$ sequences $S_1, \dots, S_m$ of length $n$ over an alphabet $\Sigma$, the longest common subsequence is the longest sequence $S$ such that $\forall i \in [m]$, $S$ is a subsequence of $S_i$.

We will refer to the computational problem of finding or deciding the length of a LCS as the Multi-LCS problem. In this paper, we consider the decision variant of this problem: Given an integer $\ell \le n$, we have to decide whether a LCS has length at least $\ell$, or less than $\ell$. For the approximation, we consider the following gap-version of this problem.

Problem 2.1.
For any $0 < \kappa < 1$, the $\kappa$-approximate Multi-LCS problem is defined as: Given sequences $S_1, \dots, S_m$ of length $n$ over an alphabet $\Sigma$ and an integer $\ell$, the goal is to distinguish between the following two cases
• YES instance: A LCS of $S_1, \dots, S_m$ has length greater than or equal to $\ell$.
• NO instance: A LCS of $S_1, \dots, S_m$ has length less than $\kappa \cdot \ell$.

We use the following definition of alignment.

Definition 2.1 (Alignment). Given two strings $S_1$ and $S_2$ of lengths $n$ and $m$ respectively, an alignment $\sigma$ is a function from $[n]$ to $[m] \cup \{*\}$ which satisfies: $\forall i \in [n]$, if $\sigma(i) \ne *$ then $S_1[i] = S_2[\sigma(i)]$, and for any $i$ and $j$, if $\sigma(i) \ne *$ and $\sigma(j) \ne *$ then for $i > j$, $\sigma(i) > \sigma(j)$.

For an alignment $\sigma$ between two strings $S_1$ and $S_2$ we say $\sigma$ aligns some subsequence $T_1 = S_1[i_1] S_1[i_2] \cdots S_1[i_\ell]$ of $S_1$ with some subsequence $T_2 = S_2[j_1] S_2[j_2] \cdots S_2[j_\ell]$ of $S_2$ if and only if for all $p \in [\ell]$, $\sigma(i_p) \in \{j_1, j_2, \dots, j_\ell\}$.

The Exponential Time Hypothesis (ETH) was introduced by Impagliazzo and Paturi [17]. It refutes the possibility of getting a much faster algorithm to decide satisfiability of a 3-CNF formula (also referred to as the 3-SAT problem) than the trivial brute force method.

Hypothesis 1 (ETH). There is no $2^{o(n)}$ time algorithm for the 3-SAT problem over $n$ variables.

Densest k-Subgraph problem and related hardness results

For any graph, the density is defined as the ratio of the number of edges present in it to the number of edges in the complete graph on the same number of vertices. So given a graph $G = (V, E)$, the density of $G$ is $\frac{2|E|}{|V|^2 - |V|}$.

The Densest $k$-Subgraph (D$k$S) problem is the following: Given a graph $G$ on $n$ vertices and a positive integer $k \le n$, the goal is to find a subgraph of $G$ with $k$ vertices which has maximum density.

In this paper we will consider the following gap-version of densest $k$-subgraph, which in the literature is sometimes referred to as densest $k$-subgraph with perfect completeness.

Problem 2.2.
For any γ ≤ , γ -D k S( k, n ) is defined as: Given a graph G on n vertices and a positiveinteger k ≤ n , the goal is to distinguish between the following two cases• YES instance: There exists a clique of size k .• NO instance: All subgraphs of size k have density at most γ .We say that an algorithm solves γ -D k S( k, n ) if given any input it can distinguish whether the input is aYES instance or a NO instance. If the algorithm is randomized then it should succeed with probability atleast / .In this paper we use the following hardness result by Manurangsi [24]. Theorem 2.1 ([24]) . There exists a constant c > such that assuming the Exponential Time Hypothesis,for all constants ε > , there is no polynomial time algorithm for γ -D k S( k, n ) where γ = n − O (cid:16) n ) c (cid:17) and kn ∈ (cid:20) n − ε , n − Ω (cid:16) n (cid:17) (cid:21) . In this section we provide a reduction from the densest k -subgraph problem to the problem of approximatingMulti-LCS and prove Theorem 1.2. Note that, Theorem 1.2 and Theorem 2.1 together immediately implyTheorem 1.1 by plugging γ ( n ) = n − O (cid:16) n ) c (cid:17) , β ( n ) = γ ( n ) . Remark . If we want to get the hardness of Multi-LCS for a constant sized alphabet using Theorem 1.2then k must be Ω( n ) . However, when k = Ω( n ) Theorem 2.1 does not imply any hardness result. In fact,when k = Ω( n ) , there is a polynomial time algorithm for (1 − ε ) -D k S ( k, n ) for any constant ε > [11].Therefore our reduction will not give any hardness for constant sized alphabet. However, if one can improveTheorem 2.1 for k/n = 1 / poly (log n ) and γ ( n ) = 1 / poly (log n ) , then our main reduction in Theorem 1.2will imply Multi-LCS hardness for poly (log n ) sized alphabet!Our reduction involves two steps: First, we use the reduction from the Max-Clique problem to theMulti-LCS problem over large alphabet given in [18]. 
Next we perform alphabet reduction by replacing each character by a "short" string over a small-sized alphabet.

Revisiting the reduction from Max-Clique to Multi-LCS. We first recall the reduction from [18]. We are given a graph $G = (V, E)$ on $n$ vertices and an integer $k \le n$. Fix an arbitrary labeling on the vertices of $V$ as $v_1, \dots, v_n$. For every vertex $v_i$, partition its neighbors into two subsets: $N_<(v_i)$ contains all the neighboring vertices $v_j$ with $j < i$; and $N_>(v_i)$ contains all the neighboring vertices $v_j$ with $j > i$.

Consider an alphabet $\Sigma$ containing a separate symbol for each vertex. We use $v_i$ to denote both the vertex and its corresponding symbol in $\Sigma$. Now for each vertex $v_i \in V$, construct the following two strings $X_i$ and $X'_i$:

$$X_i = v_1 \dots v_{i-1}\, v_{i+1} \dots v_n\; v_i\; v_{i_r} \dots v_{i_s} \quad\text{and}\quad X'_i = v_{i_p} \dots v_{i_q}\; v_i\; v_1 \dots v_{i-1}\, v_{i+1} \dots v_n$$

where $N_>(v_i) = \{v_{i_r}, \dots, v_{i_s}\}$ with $i_r < \dots < i_s$, and $N_<(v_i) = \{v_{i_p}, \dots, v_{i_q}\}$ with $i_p < \dots < i_q$. The following proposition is immediate from the above construction.

Proposition 3.1 ([18]). If there is a clique of size $c$ in $G$, then there is a common subsequence of $X_1, \dots, X_n, X'_1, \dots, X'_n$ of length $c$.

Proposition 3.2 ([18]). For any common subsequence $S$ of $X_1, \dots, X_n, X'_1, \dots, X'_n$, all the $v_i$'s present in $S$ form a clique in $G$.

The proofs of these propositions follow from the facts that any common subsequence is of the form $v_{i_1}, v_{i_2}, \dots, v_{i_t}$ where $i_1 < i_2 < \dots < i_t$ and that there must be an edge between $v_{i_j}$ and $v_{i_{j'}}$ for $1 \le j < j' \le t$.

For some parameter $\alpha(n) < 1$, let $\{S_1, \dots, S_n\}$ be a set of strings of length $m$ over some alphabet $\Sigma'$ such that for all $i \ne j$,

$$|LCS(S_i, S_j)| \le \alpha m.$$

We will fix the value of $m$ and $|\Sigma'|$ later. The following theorem (Theorem 1 of [20]) shows that if we pick strings from $\Sigma'^m$ uniformly at random then for $|\Sigma'| = O(1/\alpha^2)$, with high probability the sampled strings will satisfy the above desired property.

Theorem 3.1 ([20]).
For every ε > there exists c > such that for large enough sized alphabet Σ ′ forany m if two strings S , S are picked uniformly at random from Σ ′ m then P r h(cid:12)(cid:12)(cid:12) | LCS ( S , S ) | − m p | Σ ′ | (cid:12)(cid:12)(cid:12) ≥ ε m p | Σ ′ | i ≤ e − cm/ √ | Σ ′ | . Now by suitably choosing ε, m the following lemma directly follows from a union bound over everypair of n chosen strings. Lemma 3.1. For any α ∈ (0 , , and n ∈ N there exists an alphabet Σ ′ of size O ( α − ) such that forany m ≥ cα − log n (for some suitably chosen constant c > ), if we choose a set of strings S , · · · , S n uniformly at random from Σ ′ m then with probability at least − /n for each i = j , | LCS ( S i , S j ) | ≤ αm . The above lemma gives us a randomized reduction. However we can deterministically find such acollection (with a slight loss in the parameters) using the known construction of synchronization strings .The proof of the following Lemma is deferred to Appendix A. Lemma 3.2. For any α ∈ (0 , , and n ∈ N there exists an alphabet Σ ′ of size O ( α − ) such that for any m > α − log n , there is a deterministic construction of a set of strings S , · · · , S n ∈ Σ ′ m such that foreach i = j , | LCS ( S i , S j ) | ≤ αm . Moreover, all the strings can be generated in time O ( α − nm ) .Remark . One advantage of using the randomized construction is the alphabet size (as well as the lengthof strings); randomized construction has only a quadratic loss whereas the deterministic construction has acubic loss in the alphabet size. However this will not matter much for the parameters we need to prove ourmain theorem.Now let us continue with the description of our reduction. We replace each v j ∈ Σ by the string S j .After the replacement we get the following two strings Y i and Y ′ i respectively from X i and X ′ i . Y i = S . . . S i − S i +1 . . . S n S i S i r . . . S i s and Y ′ i = S i p . . . S i q S i S . . . S i − S i +1 . . . 
S n Note, Y i and Y ′ i ’s are over the alphabet Σ ′ . For notational convenience we use S N >i to denote the substring S i r . . . S i s , and S N
Lemma 3.4 (Soundness) . Let α ∈ (0 , / and β = √ α . If graph G is a NO instance of γ - D k S (everysubgraph of size k has density less than γ ), then a LCS of Y , . . . , Y n , Y ′ , . . . , Y ′ n has length at most βmn . Let L be an (arbitrary) LCS of Y , · · · , Y n , Y ′ , · · · , Y ′ n of size greater than βmn . By the construction Y n = S . . . S n (since N > ( v n ) = ∅ ). So we can partition the subsequence L as Z , · · · , Z n where ∀ i ∈ [ n ] Z i is a subsequence of S i . ( Z i can be an empty string). Now consider all the Z i of length at least βm , andlet W denote the set of all such Z i ’s, i.e., W = { Z i | | Z i | ≥ βm } . Suppose L is the string formed byremoving all Z i 6∈ W from L . Clearly, | L | ≥ | L | − βmn ≥ βmn .For all i, j ∈ [ n ] such that i < j , define C [ i, j ] as: C [ i, j ] := { Z t ∈ W | i ≤ t ≤ j } . Note, W = C [1 , n ] .Next we show that either the size of C [1 , n ] is small or there exists a subgraph in G which has large density.Let us consider the set of vertices V H := { v t | Z t ∈ W} . So | V H | = |W| ≥ | L | m − βn ≥ βn . If we couldshow that the subgraph H of G induced by the set of vertices V H has high density (ideally, a clique), thenthat will imply Lemma 3.4.Now consider an (arbitrary) alignment between L and Y , · · · , Y n , Y ′ , · · · , Y ′ n . Let us denote the align-ment between L and Y i ( Y ′ i ) by σ i ( σ ′ i ). From now on whenever we will talk about alignment we will referto these particular alignments ( σ i or σ ′ i depending on strings under consideration) without specifying themexplicitly. Consider a Z t ∈ W . We say Z t is ε -aligned (for some ε ∈ [0 , ) with some substring S ′ of some Y i (or Y ′ i ) if and only if either the first or the last ε fraction of symbols of Z t is aligned by the alignment σ ′ i (or σ ′ i ) with some subsequence of S ′ . Throughout this proof we will set ε = 1 / . 
Note that, if we partition Y i into (any) two parts Y li and Y ri then Z i is / - aligned to at least one of Y li and Y ri , and this justifies oursetting of parameter ε .By following the argument of the proof of Proposition 3.2 given in [18], it is possible to show that if σ aligns all Z t with some subsequence of S t in all strings Y i (and Y ′ i ), then the subgraph H induced byvertices in V H has high density (actually forms a clique). Unfortunately we do not know whether all the Z t ’s are aligned with their corresponding S t ’s in all the Y i ’s (and Y ′ i ’s). Following are the different cases ofmapping Z i ∈ W with Y i :1. Z i is / -aligned with the substring S . . . S i − of Y i .2. Z i is / -aligned with S i +1 . . . S n S i S N >i of Y i and there exists a j > i such that a symbol of Z j in L is aligned with some symbol of S j in the substring S i +1 . . . S n S i .3. Z i is / -aligned with the substring S i +1 . . . S n S i S N >i in Y i and there exists no j > i such that asymbol of Z j ∈ W is aligned with some symbol of S j in the substring S i +1 . . . S n .Similarly, we will also consider the mapping with Y ′ i ’s. We will categorize first and second case as sparsecase and the third one as the dense case . Next we analyze these cases.7 .1.1 Sparse Case: Improper mapping leads to small LCS locally Let us recall that Y i = S . . . S i − S i +1 . . . S n S i S N >i and Y ′ i = S N i such that | C [ i, j ] | ≤ αβ ( j − i + 1) .Proof. Suppose Z i is / -aligned with S . . . S i − of Y i . Let j be the largest index less than i such that asymbol in Z j is aligned (by σ i ) with some symbol in S j in Y i (if there does not exist such a j then take j = 0 ). Note, by the definition of / -alignment at least first βm/ symbols of Z i are mapped (by σ i ) in S . . . S i − . Recall, the definition of / -alignment ensures the mapping of the first or the last half fractionof symbols. However in this case if Z i ’s last βm/ symbols are mapped in S . . . 
S i − then the whole Z i isactually mapped in S . . . S i − , which is even stronger than what we state.By the properties of strings S k ’s specified in Lemma 3.2, the first βm/ symbols of Z i require at least β α blocks from { S j , S j +1 , . . . , S i − } to map completely (see Figure 1). S i − S t Z i L Y i ≥ β α blocksFigure 1: Z i is / -aligned with S . . . S i − where t > j Similarly each element of C [ j + 1 , i − also requires at least βα blocks from { S j , S j +1 , . . . , S i − } .However any two Z p , Z p +1 ∈ C [ j + 1 , i ] may share a block (more specifically, the last block used for Z p and the first block used for Z p +1 ) for mapping. So, we get β α + ( βα − | C [ j + 1 , i − | ≤ i − j ⇒ β α | C [ j + 1 , i ] | ≤ i − j. Note, βα − ≥ β α as α ≤ / (recall, β = √ α ), and C [ j + 1 , i − ∪ { Z i } = C [ j + 1 , i ] .Similarly, suppose Z i is / -aligned with S i +1 . . . S n of Y ′ i . Let j be the smallest index greater than i such that a symbol of Z j is aligned (by σ ′ i ) with some symbol of S j in Y ′ i (if there does not exist any j thentake j = n + 1 ). Using an argument similar to the above, we get β α + ( βα − | C [ i + 1 , j − | ≤ j − i ⇒ β α | C [ i, j − | ≤ j − i. Claim 3.2. Suppose (by the alignment σ i ) Z i ∈ W is / -aligned with S i +1 . . . S n S i S N >i of Y i , and thereexists a j > i such that a symbol of Z j in L is aligned with some symbol of S j in the substring S i +1 . . . S n S i .Then there exists r such that i < r ≤ j and | C [ i, r − | ≤ αβ ( r − i ) . imilarly, suppose (by the alignment σ ′ i ) Z i ∈ W is / -aligned with S N i of Y i and there exists a j > i such that a symbolof Z j in L is aligned (by σ i ) with some symbol of S j in the substring S i +1 . . . S n S i . Let us choose r to bethe smallest j with the above condition. 
By the argument used in the proof of Claim 3.1, Z i requires at least β α blocks from { S i +1 , S i +2 , · · · , S r } , and every element in C [ i + 1 , r − requires at least βα blocks from { S i +1 , S i +2 , · · · , S r } . Again, any two Z p , Z p +1 ∈ C [ i, r − may share a block (more specifically, the lastblock used for Z p and the first block used for Z p +1 ) for mapping. So we get β α + | C [ i + 1 , r − | ( βα − ≤ r − i ⇒ β α | C [ i, r − | ≤ r − i. Similarly, suppose Z i is / -aligned with S N iH := { v t ∈ V H | t > i } and V Suppose (by the alignment σ i ) Z i ∈ W is / -aligned with S i +1 . . . S n S i S N >i in Y i , and thereexists no j > i such that a symbol of Z j ∈ W is aligned with some symbol of S j in the substring S i +1 . . . S n .Then | V >iH \ N > ( v i ) | + β α | V >iH \ N > ( v i ) | ≤ n − i ) + 1 . Proof. Z i is / -aligned with S i +1 . . . S n S i S N >i of Y i . So to align all Z r ∈ C [ i +1 , n ] (note, | C [ i +1 , n ] | = | V >iH | ) at most n − i ) + 1 blocks of S p ’s are available. Since for no j > i a symbol of Z j ∈ W is alignedwith some symbol of S j in S i +1 . . . S n , each Z r such that v r ∈ V >iH \ N > ( v i ) requires at least βα blocks of S p ’s to map. Any two Z r , Z r +1 such that v r , v r +1 ∈ V >iH \ N > ( v i ) may share a block (more specifically, thelast block used for Z p and the first block used for Z p +1 ) for mapping. Recall for our choice of parameters α, β , βα − ≥ β α . So we get | V >iH \ N > ( v i ) | + β α | V >iH \ N > ( v i ) | ≤ n − i ) + 1 . Similarly, we consider the mapping of Z i in the string Y ′ i .9 q p q p i i j i j q j C [1 , n ] Shaded region is included in T Considering s = 3 , ( i , j ) , ( i , j ) , ( i , j ) is a se-ries of pairs to cover C [ p , q ] where i = p and j = q Figure 2: T as a union of disjoint subsets Claim 3.4. Suppose (by the alignment σ ′ i ) Z i ∈ W is / -aligned with S N i such that | C [ i,j ] | j − i +1 ≤ αβ , and then add all Z k ∈ C [ i, j ] inthe set T . 
(If no such j exists then do not add anything to T .)3. Define a new set W ′ = W \ T .Let L be the string formed by removing all Z i 6∈ W ′ from L . Let us also define a set of vertices V ′ H = { v t | Z t ∈ W ′ } . (Note, V ′ H ⊆ V H .) Now we will argue that the set V H has not shrunk by much after removingthe sparse blocks and each vertex in V ′ H has high degree in the subgraph H , which eventually implies thatthe subgraph H has high density. Claim 3.5. | V ′ H | ≥ | V H | − αβ n .Proof. Let us consider the set T . We can write T as a union of disjoint subsets as T = C [ p , q ] ∪ C [ p , q ] ∪· · · ∪ C [ p r , q r ] for some integer r ∈ [ n ] , such that ∀ ≤ ℓ ≤ r − C [ q ℓ , p ℓ +1 ] = ∅ (see Figure 2).10ow if we could show that for each ℓ ∈ [ r ] , | C [ p ℓ , q ℓ ] | ≤ αβ ( q ℓ − p ℓ ) , then |T | = r X ℓ =1 | C [ p ℓ , q ℓ ] | ≤ αβ r X ℓ =1 ( q ℓ − p ℓ ) ≤ αβ n where the last inequality is true since p < q < p < q < · · · < p r < q r . So to conclude the proof of theclaim next we show that for all ℓ ∈ [ r ] | C [ p ℓ , q ℓ ] | ≤ αβ ( q ℓ − p ℓ ) .It is immediate from the construction of the set T that there exists a sequence of pair of indices ( i , j ) , · · · , ( i s , j s ) (for some positive integer s ) where i = p ℓ and j s = q ℓ , such that for all t ∈ [ s ] while processing Z i t we add blocks of C [ i t , j t ] in T , and C [ p ℓ , q ℓ ] = S t ∈ [ s ] C [ i t , j t ] . We can furtherassume that there exists no t ′ ∈ [ s ] such that C [ i t ′ , j t ′ ] ⊆ S t ∈ [ s ] \{ t ′ } C [ i t , j t ] . (In words it means that C [ i , j ] , · · · , C [ i s , j s ] is a minimal sequence of subsets whose union is C [ i , j s ] .) Due to this assumptionwe can write that i ≤ j ≤ i ≤ j ≤ · · · ≤ i s ≤ j s − and ∀ t ∈ [ s − , i t +2 ≥ j t + 1 (see Figure 2). 
So, | C [ p ℓ , q ℓ ] | ≤ s X t =1 | C [ i t , j t ] | ≤ αβ s X t =1 ( j t − i t + 1)= 2 αβ h s + ( j s − i ) + s − X t =1 ( j t − i t +1 ) i ≤ αβ h s + ( j s − i ) + ( j s − − i − ( s − i ≤ αβ h j s − i ) i where second last inequality uses the fact that ∀ t ∈ [ s − , i t +2 ≥ j t + 1 and last inequality uses the factthat j s ≥ j s − + 1 and i ≥ i + 1 . Hence we conclude that | C [ p ℓ , q ℓ ] | ≤ αβ ( q ℓ − p ℓ ) , and this completesthe proof. Claim 3.6. For each vertex v i ∈ V ′ H , | V H T N ( v i ) | ≥ | V H | − αβ n .Proof. By the construction of W ′ , for each Z i ∈ W ′ we know that there exists no j > i (or < i ) suchthat | C [ i,j ] | j − i +1 ≤ αβ (or | C [ j,i ] | i − j +1 ≤ αβ ). Then by Claim 3.1 and Claim 3.2 it follows that all Z i ∈ W ′ satisfypreconditions of both Claim 3.3 and Claim 3.4. Otherwise by Claim 3.1 and Claim 3.2 we know that thereexists a j > i (or < i ) such that | C [ i,j ] | j − i +1 ≤ αβ (or | C [ j,i ] | i − j +1 ≤ αβ ). For j > i when we process Z i to constructthe set T we add all the blocks of C [ i, j ] , and for j < i when we process Z j we add all the blocks of C [ j, i ] . So it must be the case that the alignment σ i between L and Y i , / -aligns Z i to the substring S i +1 . . . S n S i S N >i and there exists no j > i such that Z j ∈ W aligns with S j in the substring S i +1 . . . S n .Also, σ ′ i / -aligns Z i to the substring S N iH \ N > ( v i ) | + β α | V >iH \ N > ( v i ) | ≤ n − i ) + 1 , and by Claim 3.4 | V Proof of Lemma 3.4. For the sake of contradiction let us assume that the LCS is of size at least βmn .Recall, we have already seen that | V H | ≥ βn . Now we consider the following two cases depending on thesize of V H . Case 1: (When | V H | ≤ βγ n ) Suppose | V H | ≤ βγ n ( = k ). Let V ′ ⊇ V H be an arbitrary set of size exactly βγ n . Let H ′ be the subgraph induced by the vertices V ′ . 
Using Claim 3.5 and Claim 3.6, we can lowerbound the density of the subgraph H ′ by: P v ∈ V ′ H (cid:16) | V H | − αβ n (cid:17)(cid:0) | V ′ | (cid:1) ≥ (cid:16) β − αβ (cid:17) n · (cid:16) β − αβ (cid:17) n βγ n · βγ n ≥ (cid:18) γ − αγβ (cid:19) . As we set α = β / , we get that the density of the subgraph induced by V ′ is at least ( γ/ . Case 2: (When | V H | > βγ n ) If | V H | > βγ n , the density of the subgaph H induced by V H is lower boundedby: P v ∈ V ′ H (cid:16) | V H | − αβ n (cid:17)(cid:0) | V H | (cid:1) ≥ | V ′ H | (cid:16) | V H | − αβ n (cid:17) | V H | ( | V H | − ≥ (cid:16) | V H | − αβ n (cid:17) | V H | = (cid:18) − αnβ | V H | (cid:19) ≥ (1 − γ (since | V H | > βγ n and we set α = β / ) ≥ ( γ/ (since γ ≤ ) . Now since density of the subgraph is at least ( γ/ , it follows from the following simple claim that thereexists a subgraph of H of size βγ n which has density at least ( γ/ .12 laim 3.7. Suppose a graph G = ( V, E ) has edge density c , then for any ≤ k ≤ | V | , there exists asubgraph of size k with density at least c .Proof. Let n = | V | . Pick a subset H ⊆ V of size exactly k uniformly at random. For a fixed edge e in G ,the probability that the edge e is present in the subgraph induced by H is exactly ( n − k − )( nk ) . Since G has c · (cid:0) n (cid:1) edges, by linearity of expectation, the expected number of edges in the subgraph induced by H is equal to c · (cid:0) n (cid:1) · ( n − k − )( nk ) = c · (cid:0) k (cid:1) . Therefore, the expected density of the subgraph is exactly equal to c . Hence, by anaveraging argument, there exists a subgraph of G of size k with density at least c .In both the cases, we have shown that there exists a subgraph of size βγ n (= k ) with density at least ( γ/ , which is a contradiction to the fact that we started with a NO instance of γ -D k S (cid:16) βγ n, n (cid:17) . Thereforein this case, the size of LCS must be at most βmn . 
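The averaging argument of Claim 3.7 can be checked on a small example. The following is a minimal sketch, not part of the reduction: the graph, the helper names `density` and `average_k_subset_density`, and the vertex labelling are illustrative assumptions. It checks the counting identity $\binom{n}{2}\binom{n-2}{k-2} = \binom{k}{2}\binom{n}{k}$, and that the average density over all $k$-subsets equals the density $c$ of the whole graph, so that some $k$-subset has density at least $c$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def density(vertices, edges):
    """Edge density of the subgraph induced by `vertices` (needs >= 2 vertices)."""
    vs = set(vertices)
    inside = sum(1 for u, v in edges if u in vs and v in vs)
    return Fraction(inside, comb(len(vs), 2))


def average_k_subset_density(n, edges, k):
    """Exact average density over all k-subsets of the vertex set {0, ..., n-1}."""
    subsets = list(combinations(range(n), k))
    return sum(density(s, edges) for s in subsets) / len(subsets)


# Hypothetical example graph on 6 vertices (chosen only for illustration).
n = 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2), (4, 5)]
c = density(range(n), edges)

for k in range(2, n + 1):
    # Counting identity behind the expectation computation.
    assert comb(n, 2) * comb(n - 2, k - 2) == comb(k, 2) * comb(n, k)
    # The expected density of a uniformly random k-subset equals c ...
    assert average_k_subset_density(n, edges, k) == c
    # ... hence the densest k-subset has density at least c.
    assert max(density(s, edges) for s in combinations(range(n), k)) >= c
```

Exact rational arithmetic is used so that the equality of the average density and $c$ is tested without floating-point error. The enumeration is exponential in $n$ and is meant only as a sanity check on tiny graphs.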
Proof of Theorem 1.2: If there is no polynomial-time algorithm to distinguish between the YES and NO instances of $\gamma$-D$k$S$\left(\frac{\beta}{\gamma} n, n\right)$, then by Lemma 3.3 and Lemma 3.4 there is no polynomial-time algorithm to distinguish between the cases when the LCS of $Y_1, \cdots, Y_n, Y'_1, \cdots, Y'_n$ is of size $\frac{\beta}{\gamma} mn$ vs. $\beta mn$. Also note that if we use Lemma 3.2 to construct the strings $S_i$, then the alphabet size is $\alpha^{-O(1)} = \beta^{-O(1)}$. This proves the main theorem.

In this paper we show hardness of constant factor approximation of the Multi-LCS problem with input of length $n$ over an $n^{o(1)}$-sized alphabet, assuming the Exponential Time Hypothesis (ETH). This is the first hardness result for approximating the Multi-LCS problem over a sublinear-sized alphabet. To prove our result we provide a reduction from the densest $k$-subgraph problem with perfect completeness, and then use the known hardness results for the latter problem from [24]. One interesting fact is that if one could show hardness of the $\gamma$-D$k$S$(k, n)$ problem for $k = \Theta\left(\frac{n}{\mathrm{poly}\log n}\right)$ and $\gamma = (\log n)^{-c}$ for some $c > 0$, then our reduction would directly imply constant factor hardness for Multi-LCS over a poly-logarithmic sized alphabet under ETH.

Acknowledgements. The authors would like to thank the anonymous reviewers for providing helpful comments on an earlier version of this paper, and especially for pointing out a small technical mistake in the proof of Lemma 3.4. The authors would also like to thank Pasin Manurangsi for pointing out that for certain regimes no hardness result is known for the densest $k$-subgraph problem.

References
[1] Amir Abboud and Arturs Backurs. Towards hardness of approximation for polynomial time problems. In , pages 11:1–11:26, 2017.
[2] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures.
In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59–78, 2015.
[3] Amir Abboud, Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams. Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 375–388, 2016.
[4] Amir Abboud and Aviad Rubinstein. Fast and deterministic constant factor approximation algorithms for LCS imply new circuit lower bounds. In , pages 35:1–35:14, 2018.
[5] Lasse Bergroth, Harri Hakonen, and Timo Raita. A survey of longest common subsequence algorithms. In Pablo de la Fuente, editor, Seventh International Symposium on String Processing and Information Retrieval, SPIRE 2000, A Coruña, Spain, September 27-29, 2000, pages 39–48. IEEE Computer Society, 2000.
[6] Aditya Bhaskara, Moses Charikar, Eden Chlamtac, Uriel Feige, and Aravindan Vijayaraghavan. Detecting high log-densities: an $O(n^{1/4})$ approximation for densest $k$-subgraph. In Leonard J. Schulman, editor, Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, pages 201–210. ACM, 2010.
[7] Guillaume Blin, Laurent Bulteau, Minghui Jiang, Pedro J. Tejada, and Stéphane Vialette. Hardness of longest common subsequence for sequences with bounded run-lengths. In Combinatorial Pattern Matching - 23rd Annual Symposium, CPM 2012, Helsinki, Finland, July 3-5, 2012. Proceedings, pages 138–148, 2012.
[8] Mark Braverman, Young Kun Ko, Aviad Rubinstein, and Omri Weinstein. ETH hardness for densest-$k$-subgraph with perfect completeness. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1326–1341. SIAM, 2017.
[9] Kuan Cheng, Bernhard Haeupler, Xin Li, Amirbehshad Shahrasbi, and Ke Wu.
Synchronization strings: Highly efficient deterministic constructions over small alphabets. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2185–2204, 2019.
[10] Uriel Feige, David Peleg, and Guy Kortsarz. The dense $k$-subgraph problem. Algorithmica, 29(3):410–421, 2001.
[11] Uriel Feige and Michael Seltser. On the densest $k$-subgraph problem. 1997.
[12] Szymon Grabowski. New tabulation and sparse dynamic programming based techniques for sequence similarity problems. Discrete Applied Mathematics, 212:96–103, 2016.
[13] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
[14] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: explicit constructions, local decoding, and applications. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 841–854, 2018.
[15] MohammadTaghi Hajiaghayi, Masoud Seddighin, Saeed Seddighin, and Xiaorui Sun. Approximating LCS in linear time: Beating the $\sqrt{n}$ barrier. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1181–1200, 2019.
[16] D.S. Hirschberg. Recent results on the complexity of common subsequence problems. In Time Warps, String Edits, and Macromolecules, D. Sankoff and J.B. Kruskal, ed., Addison-Wesley, pages 323–328, 1983.
[17] Russell Impagliazzo and Ramamohan Paturi. On the complexity of $k$-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001.
[18] Tao Jiang and Ming Li. On the approximation of shortest common supersequences and longest common subsequences. SIAM J. on Computing, 24(5):1122–1139, 1995.
[19] Subhash Khot. Ruling out ptas for graph min-bisection, dense $k$-subgraph, and bipartite clique.
SIAM Journal on Computing, 36(4):1025–1071, 2006.
[20] Marcos Kiwi, Martin Loebl, and Jiří Matoušek. Expected length of the longest common subsequence for large alphabets. Advances in Mathematics, 197(2):480–498, 2005.
[21] G Kortsarz and D Peleg. On choosing a dense subgraph. In Proceedings of the 1993 IEEE 34th Annual Foundations of Computer Science, pages 692–701. IEEE Computer Society, 1993.
[22] S. Lu and K. S. Fu. A sentence-to-sentence clustering procedure for pattern analysis. IEEE Transactions on Systems, Man, and Cybernetics, 8(5):381–389, May 1978.
[23] David Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(2):322–336, April 1978.
[24] Pasin Manurangsi. Almost-polynomial ratio ETH-hardness of approximating densest $k$-subgraph. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 954–961. ACM, 2017.
[25] William J. Masek and Michael S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20(1):18–31, 1980.
[26] Christos H. Papadimitriou and Mihalis Yannakakis. Optimization, approximation, and complexity classes. J. Comput. Syst. Sci., 43(3):425–440, 1991.
[27] Pavel A. Pevzner. Multiple alignment with guaranteed error bounds and communication cost. In Combinatorial Pattern Matching, Third Annual Symposium, CPM 92, Tucson, Arizona, USA, April 29 - May 1, 1992, Proceedings, pages 205–213, 1992.
[28] Prasad Raghavendra and David Steurer. Graph expansion and the unique games conjecture. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 755–764. ACM, 2010.
[29] Aviad Rubinstein and Zhao Song. Reducing approximate longest common subsequence to approximate edit distance. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 1591–1600. SIAM, 2020.
[30] Robert A. Wagner and Michael J. Fischer.
The string-to-string correction problem. J. ACM, 21(1):168–173, January 1974.
[31] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(6):103–128, 2007.

A Derandomized version of Lemma 3.1

To achieve a deterministic reduction we need to construct the set of strings $S_1, \cdots, S_n$ deterministically in time $\mathrm{poly}(n)$. For that purpose we use the notion of synchronization strings from the literature on insertion-deletion codes [9, 14].

Definition A.1 ($c$-long-distance $\varepsilon$-synchronization string). A string $S \in \Sigma^n$ is called a $c$-long-distance $\varepsilon$-synchronization string for some parameter $\varepsilon \in (0, 1)$ if, for every $1 \le i < j \le i' < j' \le n$ with $i' - j \le n \cdot \mathbb{1}_{(j + j' - i - i') > c \log n}$, we have $|LCS(S[i, j], S[i', j'])| \le \varepsilon (j + j' - i - i')$, where $\mathbb{1}_{(j + j' - i - i') > c \log n}$ is the indicator function for $(j + j' - i - i') > c \log n$.

Note that in the definition of a $c$-long-distance $\varepsilon$-synchronization string in [9] the authors used the notion of edit distance instead of LCS. More specifically, they required that the edit distance between $S[i, j]$ and $S[i', j']$ is at least $(1 - \varepsilon)(|S[i, j]| + |S[i', j']|)$. However, both notions can be used interchangeably, since for any two strings $S, S'$, $|LCS(S, S')| = \frac{|S| + |S'| - ED(S, S')}{2}$, where the edit distance $ED(S, S')$ is defined as the minimum number of insertion and deletion operations required to transform $S$ to $S'$. One may note that the edit distance is generally defined to allow the substitution operation as well. Here we do not allow substitutions, which is why the above relation between the LCS and the edit distance of two strings $S, S'$ holds.
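The relation between LCS and the insertion/deletion-only edit distance can be checked mechanically. The sketch below is an illustration under stated assumptions: the function names `lcs_length` and `indel_distance` are hypothetical, and both use textbook dynamic programs rather than anything specific to [9]. It verifies the relation in the form $ED(S, S') = |S| + |S'| - 2\,|LCS(S, S')|$ on all pairs of short binary strings.

```python
from itertools import product


def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """Minimum number of insertions and deletions (no substitutions) turning a into b."""
    prev = list(range(len(b) + 1))  # turning "" into b[:j] costs j insertions
    for i, x in enumerate(a, 1):
        cur = [i]  # turning a[:i] into "" costs i deletions
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# Exhaustive check over all pairs of strings of length at most 4 over {a, b}.
words = ["".join(w) for n in range(5) for w in product("ab", repeat=n)]
for s in words:
    for t in words:
        assert indel_distance(s, t) == len(s) + len(t) - 2 * lcs_length(s, t)
```

Every symbol outside a fixed longest common subsequence must be deleted from one string or inserted from the other, which is why each symbol of the LCS is counted twice on the right-hand side. Allowing substitutions would break the relation: "ab" and "ba" are at distance 2 either way, but "a" and "b" are at distance 1 with substitutions and 2 without.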
We would liketo mention that in [9] authors also used this particular version of the edit distance notion (i.e., withoutsubstitution operation).Several constructions of such long-distance synchronization strings are given in [9, 14] with differentparameters. However we restate one of the theorems from [9] that we find useful for our purpose. Theorem A.1 (Rephrasing of Theorem 5.4 of [9]) . For any n ∈ N and parameter ε ∈ (0 , , there is adeterministic construction of an ε − -long-distance ε -synchronization string S ∈ Σ n for some alphabet Σ ofsize O ( ε − ) . Moreover, for any i ∈ [ n ] the substring S [ i, i + log n ] can be computed in time O ( ε − log n ) . Now using the above we will provide deterministic construction of set of strings S , · · · , S n with ourdesired property. Lemma 3.2. For any α ∈ (0 , , and n ∈ N there exists an alphabet Σ ′ of size O ( α − ) such that for any m > α − log n , there is a deterministic construction of a set of strings S , · · · , S n ∈ Σ ′ m such that foreach i = j , | LCS ( S i , S j ) | ≤ αm . Moreover, all the strings can be generated in time O ( α − nm ) .Proof. For a specified α and n , set ε = α/ . Then use the construction from Theorem A.1 to get an ε − -long-distance ε -synchronization string S of length nm , for any m > ε − log n . The bound on m is required to satisfy the condition that ( j + j ′ − i − i ′ ) > c log n of Definition A.1. (Note, in ourcase ( j + j ′ − i − i ′ ) = 2 m and c = ε − .) Then divide the string S into m length blocks. Finallychoose alternate blocks as S , · · · , S n . More specifically, S = S [1 , m ] , S = S [2 m + 1 , m ] , · · · , S n = S [(2 n − m + 1 , (2 n − m ] . Now the bound on | LCS ( S i , S j ) | for any i = jj