A Data-Structure for Approximate Longest Common Subsequence of A Set of Strings

Abstract

Given a set of $k$ strings $I$, their longest common subsequence (LCS) is the string of maximum length that is a subsequence of every string in $I$. A data-structure for this problem preprocesses $I$ so that the LCS of a set of query strings $Q$ with the strings of $I$ can be computed faster. Since the problem is NP-hard for arbitrary $k$, we allow an error that permits some characters to be replaced by other characters. We define the approximation version of the problem with an extra input $m$, which is the length of the regular expression (regex) that describes the input; the approximation factor is the logarithm of the number of possibilities in the regex returned by the algorithm, divided by the logarithm of the number of possibilities in the regex with the minimum number of possibilities. Then, we use a tree data-structure to achieve sublinear-time LCS queries. We also explain how the idea can be extended to the longest increasing subsequence (LIS) problem.


arXiv preprint [cs.DS], August

Sepideh Aghamolaei
Sharif University of Technology, Tehran, Iran

Keywords: Longest Common Subsequence · Shortest Common Super-sequence · Regular Expressions · Approximation Algorithms.

Summarizing input strings in a preprocessing step for string matching algorithms has been used before [4]. We use a similar approach for the longest common subsequence problem.

A special case of this problem is matching regular expressions, which has been studied for the case of two strings [5].

For two strings, a finite state machine with $O(n^{1/3} \log^{7} n)$ states that separates them exists, which improves on earlier sublinear bounds [3]. However, that paper does not provide an algorithm for building it.

Longest Common Substring and Shortest Common Super-sequence.

For a given set of strings, their longest common subsequence is the longest string that is a subsequence of all of them. This problem is also NP-hard and W[1]-hard [9,6,11]. For the longest common substring problem, data-structures such as prefix/suffix trees and tries are known for matching one string with a set of query strings.

The shortest common super-sequence of a set of $k$ strings of length $n$ is the shortest string that contains all of the strings as subsequences. This problem is NP-hard for alphabets of size 5 [10], and it can be solved in $O(n^k)$ time using dynamic programming [9,6]. The problem is also W[1]-hard, so it is not expected to admit an $O(f(k)\, n^{O(1)})$-time algorithm [11].

Multifurcating Phylogenetic Tree

A multifurcating phylogenetic tree of a set of biological strings is a tree with nodes of arbitrary degree, where each intermediate node corresponds to a common ancestor and each leaf is a living organism. Placing the input strings at the leaves of the tree and their common super-sequences at the intermediate nodes gives a tree on the strings.
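As a minimal sketch of how an intermediate node with two leaf children could be labelled, the following computes a shortest common super-sequence of two strings by the standard dynamic program. The function names `shortest_common_supersequence` and `is_subsequence` are illustrative choices and do not come from the paper; the two-string case is an assumption made for brevity, and nodes of higher degree would need the $k$-string version.

```python
def shortest_common_supersequence(a: str, b: str) -> str:
    """Return one shortest string containing both a and b as subsequences."""
    n, m = len(a), len(b)
    # dp[i][j] = length of a shortest common super-sequence of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dp[i][j] = m - j          # only b[j:] remains
            elif j == m:
                dp[i][j] = n - i          # only a[i:] remains
            elif a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from (0, 0) to rebuild one optimal super-sequence.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.append(a[i:])
    out.append(b[j:])
    return "".join(out)


def is_subsequence(s: str, t: str) -> bool:
    """Check that s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)
```

The label of a parent node can then be checked against its children with `is_subsequence`; the table costs $O(n^2)$ time and space for two strings of length $n$.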

Agglomerative Clustering Tree

A method of constructing phylogenetic trees is to use a hierarchical clustering algorithm and report the resulting tree. Using single-linkage clustering, at each step the two closest points are clustered together and an intermediate node representing them is added to the tree; this proceeds until there are $k$ clusters. If $k = 1$, this is equivalent to Kruskal's algorithm for the minimum spanning tree.

String Matching using Regular Expressions.

Regular expressions are specifications of regular languages using a set of characters called the alphabet and a set of wildcards. They use symbols such as parentheses for grouping, | for choosing one of the characters (or groups) on either of its sides, and . for representing an arbitrary character. A wildcard character denotes repetitions of other character groups: for example, ∗ means an arbitrary number of repetitions, + means at least one repetition, and ? means 0 or 1 repetitions of the character, or of the group of characters inside the parentheses, just before the wildcard. In regular expressions, another notation {i}, representing $i$ repetitions for any $i \in \{0\} \cup \mathbb{N}$, can also be used.

String matching with wildcards was discussed in [5] for two strings, where the authors use hashing to solve the problem in the massively parallel model [2].

The goal is to compute a regular expression $R$ of length $m$ for a string $S$ of length $n$, such that the number of strings in the language of $R$ is minimized. Assume that the length of a regular expression made of strings separated by | is the maximum of their lengths, and that all characters that do not appear in the output string have length 0.

Example.

Consider $S = \text{banana}$ and $|R| = 2$; then b(a|n){5} has $2^5$ possible cases and ((b|n)a){3} has $2^3$ possible cases.

If {i} is not allowed and no error is allowed, the regular expression with the minimum length can be computed by DFA minimization, followed by writing the equivalent regular expression, in $O(ns \log n)$ time for an alphabet of size $s$ using Hopcroft's algorithm [7].

The DFA for a single string can be built with $n + 2$ states, where state $q_i$ goes to state $q_{i+1}$ on character $S_i$, for $i = 0, \dots, n-1$, and to a trap state $q_{n+1}$ on any other character. The start state is $q_0$ and the terminal state is $q_n$.

For a set of strings, we first need to build their DFA, which has size $O(n^k)$, assuming each state records a prefix of every string. Building the DFA takes $O(skn^k)$ time, for choosing a character and a string to add the next character at each state. Computing the regular expression of this state machine takes $O(skn^k \log n)$ time.

We define an $\alpha$-approximation of the wildcard compression with $X$ possibilities for a string $S$ as a regular expression with $X^{\alpha}$ possible cases for $S$. A similar definition can be used for a set of strings $I$ instead of one string $S$.

The regular expression .{n}, with $s^n$ possible cases, has an unbounded approximation factor for a string that repeats a character $n$ times (the optimum has a single possible case, so $X^{\alpha} = 1$ for every $\alpha$). So, there is no obvious bound on the approximation factor of the problem.

Example.

A string with all distinct characters can be summarized as $(S_1 | \cdots | S_m)(S_{m+1} | \cdots | S_{2m}) \cdots (S_{\lfloor n/m \rfloor + 1} | \cdots | S_n)$, which has $O((\min\{m, s\})^{n/m})$ possibilities. This is an exponential number of possibilities in the input size, unless $s = O(1)$, $n/m = O(1)$. A shorter representation is .{n}, with $s^n$ possibilities. The second representation is a $\left(\min\left\{m, \frac{n \log m}{m \log s}\right\}\right)$-approximation.

Algorithm.

The following operations are repeatedly applied to the string, or to the DFA of the language of that string, to get the formula:

– The operation of unifying two characters, i.e. replacing two characters such as a and b with (a|b), is equivalent to merging their states. The newly equivalent states can now be merged together. Two states are equivalent if their incoming and outgoing states are the same.
– For a sequence of $2^i$ consecutive states representing a string $s'$: if a sequence of consecutive states is equal to $s'\{j\}$, for an integer $j$, repeated $t$ times, it can be replaced with $s'\{j+t\}$, and the strings between the repetitions are merged by |.

In each iteration, the operation $o_i$ that minimizes $\frac{\log X_i}{\Delta_i}$ is chosen, where $X_i$ is the number of possibilities after the operation $o_i$ divided by the number of possibilities before it, and $\Delta_i$ is the amount of length reduction achieved by $o_i$. The algorithm ends when the length of the string reaches $m$. Also, the second operation has priority over the first one.

Theorem 1.

The approximation factor of the wildcard simplification algorithm is 2.

Proof. By checking subsequences of length $2^i$, $i = 0, \dots, \log n$, instead of length $i$, $i = 1, \dots, n$, the length of the repeated sequence is approximated within a factor of at most 2. Finding a solution of the same size would require adding as many new characters as the length of the string. So, the first operation is within a factor 2 of the optimal solution.

If the second operation concatenates strings that appear separately in the optimal solution, it can add as much as the length of the concatenated string to the length of the solution, but since that is also a copy of the current string, the approximation factor is again 2.

Any operation that is blocked after a sequence of operations can still be done with approximation factor 2. Based on the greedy choice of the algorithm, the ratio between the number of possibilities $X_i$ and the length $s_i$ of the string used in the regular expression of the best operation in the optimal solution is less than or equal to the one chosen by the algorithm in the first step, i.e. $X'_1/s'_1 \le X_1/s_1$. If the first operation changes the string used in an operation from the optimal solution, the ratio $X_i/s_i$ will decrease or remain the same, since the length of the string $s'_i$ is at least as much as the length of the optimal string $s_i$ and the approximation factor 2 gives $X'_i \le X_i^{2}$. The algorithm needs to check all the copies of a string at the same time for the greedy choice to work.

Assuming the number of operations resulting in the optimal solution is $d$, the sum of the lengths of the repeated strings $\sigma_i$, $i = 1, \dots, d$, is $\mathrm{len}(\sigma_1, \dots, \sigma_d) \le n - m$, where $\mathrm{len}()$ is the length of the union of the strings. By replacing each operation with a greedy choice, the length of each string either remains the same or increases, and their intersections are at least as much as their lengths, so the number of operations of the algorithm is at most $d$. Applying the bound for each operation, the total number of possibilities of the algorithm would be $\le \prod_{i=1}^{d} X_i^{2}$, which is a 2-approximation of the optimal solution, i.e. $\prod_{i=1}^{d} X_i$. □

Example.

For the string "banana" from the previous example, one application of the second operation gives b(an){2}a, and then one application of the first operation gives b(a|n){5}. The number of possible cases is $2^5$, so it is a $5/3$-approximation, since $2^5 = (2^3)^{5/3}$.

A Simple Algorithm for Strings of Equal Length.

In case the input set $I$ consists of strings of length $n$, it is enough to keep the set of possible characters at each location. This can be done by computing the distinct characters at each location, by sorting or by marking the elements of the alphabet, in $O(\min\{skn, kn \log k\})$ time.
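The per-location summary described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the helper names `positional_summary`, `summary_to_regex` and `count_possibilities` are hypothetical, and Python sets (hashing) stand in for the sorting or alphabet-marking step mentioned in the text.

```python
import re
from math import prod


def positional_summary(strings):
    """For each location, return the sorted distinct characters seen there."""
    if not strings:
        return []
    n = len(strings[0])
    if any(len(s) != n for s in strings):
        raise ValueError("all strings must have the same length")
    # zip(*strings) yields one column of characters per location.
    return [sorted(set(column)) for column in zip(*strings)]


def summary_to_regex(summary):
    """Write the summary as a regex: a literal, or a group of alternatives."""
    parts = []
    for chars in summary:
        escaped = [re.escape(c) for c in chars]
        parts.append(escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")")
    return "".join(parts)


def count_possibilities(summary):
    """Number of strings in the language of the summary regex."""
    return prod(len(chars) for chars in summary)
```

For instance, the hypothetical input `["banana", "bandit", "canary"]` gives the regex `(b|c)an(a|d)(i|n|r)(a|t|y)` with 36 possible cases, of which only three are actual input strings.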

Matching Regular Expressions

For a set of input strings and a nearest neighbor query, we can first build a wildcard compression, the shortest common super-sequence, or any other regular expression, then build a finite state machine for it, and use it at query time.
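A minimal sketch of this preprocess-then-query pattern is given below. The pattern string is a hypothetical summary chosen for illustration, and Python's `re` module, which compiles the pattern into its own internal matcher rather than an explicit DFA, is assumed here as a stand-in for the finite state machine.

```python
import re

# Hypothetical wildcard compression of the input; illustrative only.
SUMMARY_PATTERN = "b(a|n){5}"

# Preprocessing: compile the summary once, before any query arrives.
SUMMARY_MATCHER = re.compile(SUMMARY_PATTERN)


def matches_summary(query: str) -> bool:
    """Query time: test whether the whole query is in the summary's language."""
    return SUMMARY_MATCHER.fullmatch(query) is not None
```

The summary accepts "banana" but also strings such as "bnnnnn", which is the error that the approximation factor accounts for.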

The LCS problem does not have the mergeability property [1], i.e. merging the solutions of subsets of the input does not give the solution to the LCS of the whole input. For example, consider the set $I = \{aac, caa, cbb, bbc\}$, and assume the partitions are $\{aac, caa\}$ and $\{cbb, bbc\}$; then the LCSs of the partitions are $\{aa, bb\}$, whose LCS is empty, while the solution is the string "c".

While the wildcard compressions are not mergeable either, they are composable [8], meaning the solutions to subsets of the input give an approximate solution to the original problem. In the worst case, the wildcard characters are at different locations, so they have no intersection with each other and the number of possibilities is the product of the numbers of possibilities of the subproblems. So, the approximation factor of the composable solution is the sum of the approximation factors of the subproblems.

References

1. Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J., Wei, Z., Yi, K.: Mergeable summaries. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. pp. 23–34 (2012)
2. Beame, P., Koutris, P., Suciu, D.: Communication steps for parallel query processing. In: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI symposium on Principles of database systems. pp. 273–284. ACM (2013)
3. Chase, Z.: A new upper bound for separating words. arXiv preprint arXiv:2007.12097 (2020)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to algorithms (3rd edition). MIT Press and McGraw-Hill (2009)
5. Hajiaghayi, M., Saleh, H., Seddighin, S., Sun, X.: Massively parallel algorithms for string matching with wildcards. arXiv preprint arXiv:1910.11829 (2019)
6. Hakata, K., Imai, H.: The longest common subsequence problem for small alphabet size between many strings. In: International Symposium on Algorithms and Computation. pp. 469–478. Springer (1992)
7. Hopcroft, J.: An n log n algorithm for minimizing states in a finite automaton. In: Theory of machines and computations, pp. 189–196. Elsevier (1971)
8. Indyk, P., Mahabadi, S., Mahdian, M., Mirrokni, V.S.: Composable core-sets for diversity and coverage maximization. In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. pp. 100–108 (2014)
9. Irving, R.W., Fraser, C.B.: Two algorithms for the longest common subsequence of three (or more) strings. In: Annual Symposium on Combinatorial Pattern Matching. pp. 214–229. Springer (1992)
10. Maier, D.: The complexity of some problems on subsequences and supersequences. Journal of the ACM (JACM) (2), 322–336 (1978)
11. Pietrzak, K.: On the parameterized complexity of the fixed alphabet shortest common supersequence and longest common subsequence problems. Journal of Computer and System Sciences 67