A Fast Randomized Algorithm for Finding the Maximal Common Subsequences
Jin Cao
Nokia Bell Labs
[email protected]
Dewei Zhong
Rutgers University
[email protected]
ABSTRACT

Finding the common subsequences of L multiple strings has many applications in the areas of bioinformatics, computational linguistics, and information retrieval. A well-known result states that finding a Longest Common Subsequence (LCS) for L strings is NP-hard, i.e., the computational complexity is exponential in L. In this paper, we develop a randomized algorithm, referred to as RandomMCS, for finding a random instance of a Maximal Common Subsequence (MCS) of multiple strings. A common subsequence is maximal if inserting any character into the subsequence no longer yields a common subsequence. A special case of MCS is LCS, where the length is the longest. We show that the complexity of our algorithm is linear in L, and it is therefore suitable for large L. Furthermore, we study the occurrence probability of a single instance of MCS, and demonstrate via both theoretical and experimental studies that the longest subsequence from multiple runs of RandomMCS often yields a solution to LCS.

KEYWORDS

Longest Common Subsequence, Maximal Common Subsequence, randomized algorithm, string pattern discovery
1 INTRODUCTION

Data discovery and pre-processing in many data science projects often require laborious effort and creativity from the data scientist. Developing methods that can automatically generate insights from raw data is an important topic in automated machine learning [6], in order to eliminate the human bottleneck and make machine learning available to non-experts. As string or text is a common form of data representation, comparing strings so that information regarding what is common and what is unique among the strings can be extracted and summarized is an important pre-processing task.

A subsequence of a string S is a character sequence that can be derived from S by deleting some characters without changing the order of the remaining characters. Consider the case of L strings where L is large. A common subsequence of the L strings can be thought of as a common pattern shared by all strings. Unlike substrings, subsequences are not required to occupy consecutive positions within the original strings.

For string comparison, we consider two types of common subsequences of the L strings. The Longest Common Subsequence (LCS) is a subsequence common to all the L strings that has maximal length. A Maximal Common Subsequence (MCS) is a common subsequence that is maximal in the sense that inserting any character into the subsequence no longer yields a common subsequence. By definition, an LCS is an MCS with the maximal length. Furthermore, there may exist many MCSs of different lengths, and many LCSs of the same maximal length. For example, for the two strings 'fabecd' and 'acdef', the set of MCSs is {'f', 'acd', 'ae'}, where 'acd' is the LCS.

Finding the LCS of multiple strings has important applications in many areas, including bioinformatics, computational linguistics, and information retrieval [1, 3, 18]. The problem is, however, NP-hard [15] as the number of strings L becomes large.
Much of the literature addresses the simple case of two or three strings [8, 10, 16]. Several methods have been proposed to improve the computational efficiency for the general case of L strings, either by using parallelization [4, 14, 19] or by assuming a special string structure [9]. Reviews of various methods can be found in [2, 12].

In this paper, we attack the problem of string comparison from the angle of MCS instead of LCS. The problem of finding MCS is much less studied compared to LCS, and all methods in the existing literature consider only the case of two strings. For example, methods are presented in [10] to find MCS and constrained MCS. A dynamic programming approach is presented in [7] to find the shortest MCS. More recently, [17] proposes a computationally efficient way to find an MCS, but that method can only find one MCS.

We develop a fast randomized algorithm to find MCS solutions of L strings and show that its computational complexity is linear in L, thus much more amenable to the analysis of a large number of strings than algorithms developed for LCS. Furthermore, since each run of our algorithm returns a random MCS and the LCS is the longest MCS, we can run our algorithm multiple times and take the longest MCS among the returned solutions to approximate the LCS. We study this both theoretically and empirically. Our main contributions are summarized as follows:

• We develop a randomized algorithm, referred to as RandomMCS, for finding a random MCS solution of multiple strings.
• We extend an existing algorithm for finding the MCS of two strings [17] to the case of L strings.
• For a set of L strings with common length n, we show that the computational complexity of our RandomMCS algorithm is O(n²L) and that of our extension of the algorithm in [17] is O(nL log n); both are linear in the number of strings L.
• We carry out simulation studies to understand the performance of our proposed approach.
• We analyze the occurrence probability of an MCS solution returned from RandomMCS.
• We demonstrate via both theoretical analysis and experimental studies that the longest subsequence from multiple runs of our algorithm often yields an LCS.

The rest of the paper is organized as follows. In Section 2, we present the relevant background for our work. In Section 3, we propose our method and illustrate it using a toy example. In Section 4, we analyze the occurrence probability of a specific MCS and show that the computational complexity of our algorithm is linear in the number of strings L. We carry out simulations to understand the performance of our algorithm empirically in Section 5, and present an application of our work to Automated Machine Learning (AutoML) in Section 6. We conclude and discuss future work in Section 7.

2 BACKGROUND

In this section, we first formally define the Longest Common Subsequence (LCS) and the Maximal Common Subsequence (MCS) for L strings. We then discuss previous work on finding LCS and MCS.

We introduce the following notation used throughout the paper. We denote the empty string by '' and the empty set by ∅. To make the presentation clear, we put quotes around single characters to differentiate them from variables, but sometimes omit the quotes for strings with multiple characters. We use calligraphic letters to indicate sets, e.g., A, M. Throughout the paper, strings are represented using upper-case letters. We use ⊕ to represent string join, and reserve the letter L to indicate the number of strings under consideration.

In the following, we are given a set A of L strings {A_1, A_2, ..., A_L}, where each A_l is a string with n_l characters, represented by A_l = a_{1,l} a_{2,l} ... a_{n_l,l}.

Definition 2.1.
A sequence of characters C is a common subsequence for (the strings in) A if C is contained in each A_l, l = 1, 2, ..., L, in the same character order.

To avoid confusion, we differentiate a subsequence from a substring, where a substring is a consecutive block of characters from a string. For a subsequence, we often concatenate its characters and use a string to represent it.

Definition 2.2.
Define LCS(A) as the longest common subsequence contained in each string A_l in A, l = 1, ..., L.

Definition 2.3.
Define MCS(A) as a subsequence contained in each string A_l in A with the property that the addition of any character to MCS(A) no longer yields a common subsequence for A.

Example.
The solution set of MCS for A = {TEGAP, GAEPR} is {GAP, EP}. Of these two solutions, 'GAP' is the LCS.

Dynamic programming is a common technique for finding LCS. For example, consider the LCS of two strings of length n, X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_n. If x_n = y_n, then LCS(X, Y) = LCS(X_{n-1}, Y_{n-1}) ⊕ x_n. If x_n ≠ y_n, then LCS(X, Y) = max(LCS(X_{n-1}, Y), LCS(X, Y_{n-1})), where X_{n-1} and Y_{n-1} denote the first n-1 characters of X and Y respectively. It can be shown that the complexity of finding the LCS of two strings by dynamic programming is O(n²). For the general case of L strings, the extension of the dynamic programming algorithm has a time complexity of O(n^L), consistent with the fact that the problem is NP-hard [15]. An algorithm with running time O((r + n) log n) is proposed in [11], where r is the total number of ordered pairs of positions at which the two sequences match; in the worst case r can be O(n²).

There are several proposed methods for finding MCS. It has been shown in [7] that the problem of finding all shortest MCSs of L strings is NP-hard for large L. All proposed algorithms focus only on two strings, and no computationally effective methods have been proposed for the general case of L strings. Our algorithm targets the general case.

3 A RANDOMIZED ALGORITHM FOR FINDING THE MCS OF L STRINGS

3.1 Intuition
Our algorithm is inspired by Lemma 2 from [17], which states a necessary and sufficient condition for a subsequence W to be maximal for two strings. We extend the lemma to the case of L strings. In the following, we denote the set of L strings of interest by A = {A_1, ..., A_L}.

Definition 3.1.
For a string A, define |A| as the number of characters in A. For each k = 1, ..., |A|, define A(0, k] as the prefix of A from position 1 to k, and define A(k, |A|] as the suffix of A from position (k + 1) to |A|. Define A(0, k] = '' for k = 0 and A(k, |A|] = '' for k = |A|, where '' is the empty string.

Definition 3.2.
Let W be a subsequence contained in a string A. Then for any k = 0, ..., |W|, define Middle(A, W, k) as the remaining substring obtained from A by deleting both the shortest prefix containing W(0, k] and the shortest suffix containing W(k, |W|].

Example.
The following gives a simple example of this function. Middle('TEGAP', 'E', k = 0) is 'T', since when W = 'E' and k = 0, the shortest prefix of 'TEGAP' containing W(0, k] = '' is '', and the shortest suffix containing W(k, |W|] = 'E' is 'EGAP' (this example is also shown in the first line of Cell 3 of Figure 2).

Theorem 3.3. For any common subsequence W of A, W is maximal if and only if for every 0 ≤ k ≤ |W|, the set of L substrings Middle(A_l, W, k), derived from A_l, l = 1, ..., L, are disjoint (i.e., do not share any common characters).

Proof. If W is maximal, then for each k = 0, 1, ..., |W|, the L substrings Middle(A_l, W, k), derived from A_l, l = 1, ..., L, have to be disjoint. If this were not true, then there would exist a common character c shared by the L substrings Middle(A_l, W, k), l = 1, ..., L. By (string) joining W(0, k], c, and W(k, |W|], we could then construct a longer common subsequence that contains W, which contradicts the condition that W is maximal. The converse holds since the disjointness condition directly validates the maximality of W. □

The contrapositive of the above theorem can be stated as follows.

Theorem 3.4. For any common subsequence W of A, W is not maximal if and only if there exists a k, 0 ≤ k ≤ |W|, such that the set of L substrings Middle(A_l, W, k), derived from A_l, l = 1, ..., L, share at least one common character.

Theorems 3.3 and 3.4 are in fact the basis of our algorithm, since they can be used to constructively obtain an MCS. Suppose we start with W as the empty string. According to Theorem 3.4, if W is not maximal, then we can find a character that is common to the set of L strings A to add to W. This step can be performed iteratively until W becomes maximal, i.e., until the sets of L substrings Middle(A_l, W, k), each derived from A_l, become disjoint so that we can no longer insert characters into W. To obtain many instances of MCSs, we randomize the character insertion into W, which is the essence of our algorithm.

To formally present our algorithm, we first need to define some supporting functions.
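As a concrete illustration (ours, not from the paper), the disjointness test of Theorem 3.3 can be sketched in Python. Here `middle` implements Definition 3.2 via greedy scans, and the function names are our own:

```python
def middle(A, W, k):
    """Middle(A, W, k) of Definition 3.2: drop from A the shortest prefix
    containing W[:k] as a subsequence and the shortest suffix containing W[k:]."""
    i = 0
    for ch in W[:k]:            # greedy left-to-right scan -> shortest prefix
        i = A.index(ch, i) + 1
    j = len(A)
    for ch in reversed(W[k:]):  # greedy right-to-left scan -> shortest suffix
        j = A.rindex(ch, 0, j)
    return A[i:j] if i < j else ""

def is_maximal(strings, W):
    """Theorem 3.3: W is maximal iff for every k = 0..|W| the substrings
    Middle(A_l, W, k) share no common character."""
    return all(
        not set.intersection(*(set(middle(A, W, k)) for A in strings))
        for k in range(len(W) + 1)
    )

A = ["TEGAP", "GAEPR"]
print(is_maximal(A, "GAP"), is_maximal(A, "EP"), is_maximal(A, "P"))  # True True False
```

This matches the running example: 'GAP' and 'EP' are maximal, while 'P' is not (it can still be extended).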
Definition 3.5.
Define commonChar(A) as the function that returns the set of common characters shared by every string in a given string set A.

Example. Suppose A = {TEGAP, GAEPR}; the function will return a set of 4 characters, {E, G, A, P}, as they are all shared by the two strings. Suppose A = {abccde, gfchca, dfcca}; then the function will return the set {a, c}. Note that in this case the character 'c' appears at least two times in every string. This frequency information can be used in our algorithm when we randomly select a character from the common set, so that high-frequency characters are more likely to be selected.
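The minimum-frequency weighting just described can be sketched as follows (our illustration; the helper name is ours):

```python
from collections import Counter

def common_char_freqs(strings):
    """Common characters of all strings, each mapped to the minimum number
    of times it occurs in any one string (usable as selection weights)."""
    counts = [Counter(s) for s in strings]
    common = set.intersection(*(set(c) for c in counts))
    return {ch: min(c[ch] for c in counts) for ch in common}

print(common_char_freqs(["abccde", "gfchca", "dfcca"]))  # {'a': 1, 'c': 2} (key order may vary)
```

The returned weights could, for instance, be passed to `random.choices` to bias the character selection toward high-frequency characters.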
Definition 3.6. Given a set of L strings A = {A_1, A_2, ..., A_L} and a common subsequence W, define the function BreakPoints(A, W) that returns the set of location indices k at which some new character c can be inserted into W so that the updated subsequence is still common to all strings in A. That is, the updated common subsequence is the string join of W(0, k], c, and W(k, |W|].

A pseudo-code implementation of the function BreakPoints(A, W) is shown in Algorithm 1.

Algorithm 1
Function BreakPoints
Input: A set of L strings A and a common subsequence W
Output: The list of indices in W where new characters can potentially be inserted to create an updated common subsequence
1: position ← ∅
2: for k in 0 : |W| do
3:   for l in 1 : L do m_l ← Middle(A_l, W, k)
4:   if commonChar({m_1, ..., m_L}) ≠ ∅ then
5:     position ← position ∪ {k}
6: return position

Example.
The following gives examples of this function. For the given A = {TEGAP, GAEPR} and a subsequence W = 'A', BreakPoints(A, W) will return the set {0, 1}. This is because when k = 0, according to Definition 3.2, Middle('TEGAP', 'A', 0) = 'TEG' and Middle('GAEPR', 'A', 0) = 'G'. Since the common character 'G' is shared by 'TEG' and 'G', the test for common characters in line 4 of Algorithm 1 succeeds. Likewise, when k = 1, Middle('TEGAP', 'A', 1) = 'P' and Middle('GAEPR', 'A', 1) = 'EPR' share the common character 'P'. Therefore BreakPoints(A, W) returns the set {0, 1}.

On the contrary, for A = {TEGAP, GAEPR} and W = 'GAP', BreakPoints(A, W) returns the empty set. This is because for each k = 0, 1, 2, 3, Middle('TEGAP', 'GAP', k) and Middle('GAEPR', 'GAP', k) do not share any common characters.

Algorithm 2 presents the pseudo-code of our algorithm for finding a random solution of MCS. The function is written in a recursive fashion and has an optional starting value of W, which we explain further in Section 3.4. The termination condition of the algorithm is expressed in line 2, which validates W as an MCS by Theorem 3.3. Lines 3–7 apply Theorem 3.4 (the contrapositive of Theorem 3.3) to constructively search for possible common characters with which to update the previous common subsequence W. In line 5, when we randomly select a character from the common set, we can use the minimum frequencies discussed in the example following Definition 3.5 as optional weights. We have found via the simulation studies in Section 5 that this performs better for finding long MCSs.

Algorithm 2
A randomized algorithm, RandomMCS, to find a single MCS for L strings A = {A_1, ..., A_L}
Input: A set of strings A
Optional input: An initial starting value of W, with default W = '' (the empty string)
Output: A random MCS M of A
1: position ← BreakPoints(A, W)
2: if position = ∅ then return W
3: k ← a random element (index value) of the set position
4: A′ ← {Middle(A_l, W, k), l = 1, ..., L}
5: c ← a random character from the set commonChar(A′), with or without optional frequency weighting
6: W ← W(0, k] ⊕ c ⊕ W(k, |W|]
7: return RandomMCS(A, W)
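The procedure of Algorithms 1 and 2 can be sketched compactly in Python (our illustration, written iteratively rather than recursively; function names are ours, and characters are selected uniformly — a frequency-weighted variant would replace `random.choice` accordingly):

```python
import random

def middle(A, W, k):
    """Definition 3.2: strip from A the shortest prefix containing W[:k]
    and the shortest suffix containing W[k:]."""
    i = 0
    for ch in W[:k]:
        i = A.index(ch, i) + 1
    j = len(A)
    for ch in reversed(W[k:]):
        j = A.rindex(ch, 0, j)
    return A[i:j] if i < j else ""

def common_chars(strings):
    """Definition 3.5: characters shared by every string."""
    return set.intersection(*(set(s) for s in strings))

def break_points(strings, W):
    """Algorithm 1: indices k where some common character can be inserted."""
    return [k for k in range(len(W) + 1)
            if common_chars([middle(A, W, k) for A in strings])]

def random_mcs(strings, W=""):
    """Algorithm 2: grow W by random insertions until it is maximal
    (termination justified by Theorem 3.3)."""
    while True:
        positions = break_points(strings, W)
        if not positions:
            return W
        k = random.choice(positions)
        c = random.choice(sorted(common_chars([middle(A, W, k) for A in strings])))
        W = W[:k] + c + W[k:]

A = ["TEGAP", "GAEPR"]
assert random_mcs(A) in {"GAP", "EP"}  # each run returns one of the two MCSs
```

Running `random_mcs(A)` repeatedly and keeping the longest result approximates the LCS, as analyzed in Section 4; passing a non-empty starting value of W yields the constrained variant of Section 3.4.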
3.3 Illustration with a Toy Example

We illustrate our RandomMCS algorithm for finding a random MCS solution using a toy example consisting of two simple strings: {TEGAP, GAEPR}. We show two runs of the algorithm, with different MCS solution outputs, in Figures 1 and 2 respectively. The solutions differ due to the inherent randomness in the algorithm design.

Each figure consists of cells that show a certain state of the algorithm through the iterations, linked by arrows illustrating the state progression. To make the presentation clear, we label each cell with an index value shown in its upper right corner. Characters in red within each cell represent the current value of the common subsequence W, which is updated through the progression to produce a final MCS solution. The small red frames around characters indicate the prefix and suffix to be eliminated when computing Middle(A, W, k) for a certain value of k (see Definition 3.2); i.e., Middle(A, W, k) is the remaining characters excluding those in the red frames. The outgoing branches from a cell represent the candidate indices k = 0, ..., |W| of the current common subsequence W, in an attempt to update W by inserting new characters (line 2 of Algorithm 1). A branch expires if the condition in line 4 of Algorithm 1 is not satisfied, that is, if no common characters are found to perform the update.

Figure 1: Illustration of the RandomMCS algorithm for the case of two strings, in a run producing GAP as the MCS (the state progresses W = {} → {P} → {GP} → {GAP} over Steps 1–5).

Figure 2: Illustration of the RandomMCS algorithm for the case of two strings, in a different run producing EP as the MCS (the state progresses W = {} → {E} → {EP} over Steps 1–4).

In Figure 1, we want to find an MCS for the list {TEGAP, GAEPR} shown in Cell 1. Notice that the two strings share 4 common characters: 'E', 'G', 'A', 'P'. Initialize W = ''. In Step 1, we choose one of the four characters, 'P', as the first character to be inserted into W, and update W = 'P'. We now move to Cell 2, where 'P' is marked red. Since the length |W| = |'P'| = 1, there are two places to insert characters into W, k = 0, 1, corresponding to the two branches from Cell 2 that result in Cell 3 and Cell 4, respectively.

We discuss Cell 4 first, which corresponds to the case of W = 'P' and k = 1. In this case, since Middle('TEGAP', W, 1) = '' and Middle('GAEPR', W, 1) = 'R' do not share any common characters, the cell expires (recall that the red frames indicate the prefix and suffix to be removed when calculating Middle(A, W, k)). On the other hand, in Cell 3, where W = 'P' and k = 0, Middle('TEGAP', 'P', k = 0) = 'TEGA' and Middle('GAEPR', 'P', k = 0) = 'GAE' share both 'E' and 'G' as common characters. The progression continues, and we select the character 'G' to be inserted into W at position 0, resulting in an updated W = 'GP'. In summary, at the end of Step 2, BreakPoints(A, 'P') = {0}, and the common character 'G' is randomly selected to obtain an updated common subsequence W = 'GP'.

By the same token, from Cell 3, since W = 'GP', there are three outgoing branches for k = 0, 1, 2. In Step 3, BreakPoints(A, W) = {1}, which implies that Cells 5 and 7 expire and only Cell 6 continues to the next step. In Cell 6, the character 'A' is selected, so the updated common subsequence is now W = 'GAP'. In Step 4, BreakPoints(A, W) returns the empty set, which marks the end of the algorithm, yielding 'GAP' as the returned MCS output.

Figure 2 shows a different realization of our algorithm for the same string pair. The first difference from Figure 1 occurs in Cell 2, where the character 'E' is added to the common subsequence W instead of 'P'. Next, in Step 2, the character 'P' is selected, resulting in a final MCS output of 'EP'.

3.4 Constrained MCS

A constrained MCS is an MCS that must include a predefined subsequence W. It is in fact straightforward to modify our algorithm to obtain a constrained MCS, simply by using W as the starting value (the optional input in the pseudo-code shown in Algorithm 2). This is due to the nature of our algorithm design, as it incrementally inserts new characters to update an existing common subsequence until it becomes maximal. For instance, consider the constrained MCS problem for the input string set A = {TEGAP, GAEPR}, where the result has to contain 'GP'.
Using 'GP' as the optional input in Algorithm 2, the derivation process is identical to Figure 1 when Cell 3 is used as the starting point. Branches from Cell 3 finally lead to 'GAP' as the MCS output.

We note that [17] presented an algorithm for the constrained MCS problem in the case of two strings; there, however, the modification required of the base algorithm (which derives a single MCS solution) is significant.

4 ANALYSIS OF THE RandomMCS ALGORITHM

In this section, we analyze the performance of the RandomMCS algorithm. First, for each MCS solution, we study the probability of that solution being returned by one run of the algorithm. We treat LCS as a special instance of MCS and discuss the probability of an LCS being returned. Next, we analyze the computational complexity of our algorithm and compare it with previous approaches. As previous approaches for finding MCS apply only to two strings, we also propose an extension of a previous solution to the case of multiple strings.
4.1 Occurrence Probability

As a set of strings may have many MCSs, we denote the set of MCSs by M. Since one run of our RandomMCS algorithm yields exactly one random MCS M from the set M, a natural question to ask is: what is the probability of M being returned from a single run?

Theorem 4.1. For a given MCS M in the solution set M, the probability of M being returned as the solution from RandomMCS depends only on M and the solution set M. For a given subsequence W, let M(W) be the set of MCSs that contain W as a subsequence. Then for any M ∈ M(W), the probability of M being returned as the solution from constrained RandomMCS depends only on M and M(W). This implies that the probability is conditionally independent of the set of L strings A.

Proof. Notice that each character insertion into an existing common subsequence W (lines 3–6 of Algorithm 2) is carried out by two random selections. The first is the choice of a breakpoint position k (line 3), and the second is the choice of a common character c (line 5). Both random selections depend only on the current W and the set of MCSs. Therefore, the random selections are conditionally independent of the original set of strings given M. Hence the result. □

Example.
We evaluate the occurrence probability of each MCS being returned from one run of RandomMCS using the examples in Figures 1 and 2, where the set of strings under consideration is {TEGAP, GAEPR}. The solution set of MCS is {GAP, EP}. Starting with an empty string W, notice that there are 4 common characters {E, G, A, P} in the beginning, each with the same probability 1/4 of being selected. If the first selected character is 'G' or 'A', the final MCS produced must be 'GAP'. Likewise, the MCS is 'EP' when the first selected character is 'E'. But when the first character is 'P', the returned solution depends on the second selected character: the possible first two characters are then {GP, AP, EP}, each with probability 1/3, and both 'GP' and 'AP' lead to 'GAP'. In total, the probability of 'GAP' is 1/4 + 1/4 + 1/4 · 2/3 = 2/3 and that of 'EP' is 1/4 + 1/4 · 1/3 = 1/3. In this case, we can see that our algorithm favors the longer MCS (the LCS), since it has a higher probability.
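The branching calculation above can be checked with exact arithmetic (our sketch of the example, not part of the paper):

```python
from fractions import Fraction

quarter, third = Fraction(1, 4), Fraction(1, 3)
# First character: uniform over {'E','G','A','P'}. 'G' and 'A' force 'GAP',
# 'E' forces 'EP'; after 'P', the two-character prefixes {'GP','AP','EP'}
# are equally likely and only 'EP' leads to the MCS 'EP'.
p_gap = quarter + quarter + quarter * (2 * third)
p_ep = quarter + quarter * third
print(p_gap, p_ep)  # 2/3 1/3
```

The two probabilities sum to 1, as they must, since every run returns exactly one of the two MCSs.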
Theorem 4.2. Let C be an upper bound on the number of unique common characters of the string set A, i.e., |commonChar(A)| ≤ C. If M ∈ M is an MCS that has a distinguishing subsequence of length bounded by D, and the character in line 5 of Algorithm 2 is selected uniformly at random, then P(M) ≥ C^(-D). This implies that the occurrence probability of M is bounded below.

Proof. Let S be a distinguishing subsequence for the MCS M with length bounded by D, which implies that M is the only MCS containing S. Therefore, if S is selected as the common subsequence after at most D character insertions into the initial empty string, then M will be returned as the output MCS of the RandomMCS algorithm. Hence the probability of returning M is bounded below by the probability of selecting S as the common subsequence after at most D character insertions into the initial empty string. If the characters are chosen uniformly, this probability is bounded below by C^(-D). □

For our toy example, where the string set is {TEGAP, GAEPR} and the solution set of MCS is {GAP, EP}, the number of unique common characters is C = 4. In addition, either 'G' or 'A' is a distinguishing subsequence for the MCS 'GAP'; therefore, the probability of 'GAP' is bounded below by 2 · C^(-D) = 2 · 4^(-1) = 1/2. This is obviously a loose lower bound, since we have shown above that the actual probability is 2/3.

For an MCS M, if the occurrence probability of M is bounded below by a value p, then with enough independent runs of the RandomMCS algorithm we can recover M with high probability. In fact, for an arbitrarily small ε, if we set T = ⌈log ε / log(1 − p)⌉, then P(M does not appear in T runs) ≤ ε.

As LCS is a special case of MCS, this implies that if the condition of Theorem 4.2 holds for an LCS, then we can recover the LCS with high probability given enough runs of the algorithm. Heuristic arguments suggest that our algorithm favors longer MCSs, as a longer MCS contains more characters and offers more positions (from Algorithm 1) to be selected into W. In the extreme case where an MCS is formed by multiple occurrences of a single distinct character, it will not be returned unless that character is selected at the very first step. In Section 5, we study empirically the occurrence probability of an MCS and correlate it with its length.
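The bound T = ⌈log ε / log(1 − p)⌉ translates directly into code (our sketch; the function name is ours):

```python
import math

def runs_needed(p, eps):
    """Number of independent RandomMCS runs T so that an MCS with
    occurrence probability >= p is missed with probability at most eps,
    i.e. (1 - p)**T <= eps."""
    return math.ceil(math.log(eps) / math.log(1.0 - p))

# With the loose bound p = 1/2 for 'GAP' and eps = 0.001:
print(runs_needed(0.5, 0.001))  # 10, since (1/2)**10 = 1/1024 < 0.001
```

In other words, even the loose bound of Theorem 4.2 implies that a handful of runs recovers 'GAP' with high probability in the toy example.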
4.2 Computational Complexity

Theorem 4.3. For a set of L strings A = {A_1, A_2, ..., A_L}, let n_l be the string length of A_l, l = 1, ..., L, and define n = min(n_1, n_2, ..., n_L) as the minimum string length. Then the time complexity of one run of Algorithm RandomMCS (Algorithm 2) to find an MCS solution for A is O(n (n_1 + · · · + n_L)). Therefore, when all strings are of equal length n, the time complexity is O(n²L).

Proof. It is easy to show that the computational complexities of BreakPoints (Algorithm 1) and commonChar (Definition 3.5) are O(n (n_1 + · · · + n_L)) and O(nL), respectively. The algorithm RandomMCS repeats the BreakPoints evaluation at most n times. Hence the result. □

The above theorem states that the time complexity of our algorithm is linear in the number of strings L, as opposed to exponential in L for algorithms that find the LCS. It is therefore much more amenable to the case of a large number of strings.

4.3 Comparison with Previous Approaches

We compare our approach to previous approaches for finding MCSs.
All previous approaches for finding MCSs were developed for the case of two strings [7, 10, 17]. The recent algorithm in [17] can be extended to the case of multiple strings in the following manner. The original algorithm maintains a sequence of index pairs that tracks the matches between the two strings. We extend this technique and maintain a sequence of L-tuple indices that tracks the matches among the L strings. These L-tuple indices break the original strings into blocks, and additions to the sequence of L-tuples are searched for within the matched blocks. We present the pseudo-code in the appendix.

For two strings of equal length n, [17] has the highest efficiency among all proposed algorithms for finding an MCS of two strings, with complexity O(n log n). Our extension to the case of L strings (see the appendix) also enjoys the highest efficiency, with complexity O(Ln log n). However, since the algorithm maintains a certain order when traversing the strings, it can find only one MCS (or two MCSs if we reverse the order of the strings), which may not be desirable when there are multiple MCSs. [7] focuses on finding an MCS first, and obtains all MCSs and the LCS of two strings of lengths m and n with complexity O(mn(m + n)). [10] developed an algorithm for the constrained LCS of two strings of lengths m and n with complexity O(mn). [11] provides an algorithm that computes the LCS of two strings with complexity O(n log n), but only in the special best-case scenario of a short LCS. Table 1 summarizes the computational complexities of these different methods.

Algorithm                      | Target          | Complexity
RandomMCS                      | MCS, L strings  | O(n²L)
Our extension to Sakai (2019)  | MCS, L strings  | O(Ln log n)
Sakai (2019) [17]              | MCS, 2 strings  | O(n log n)
Fraser & Irving (1995) [7]     | MCSs, 2 strings | O(mn(m + n))
Hirschberg (1975) [10]         | CLCS, 2 strings | O(mn)
Hunt & Szymanski (1977) [11]   | LCS, 2 strings  | ≥ O(n log n)

Table 1: Comparison of computational complexity (CLCS stands for constrained LCS)
5 SIMULATION STUDIES

In this section, we perform simulation studies to understand the performance of our RandomMCS algorithm. First, we would like to understand empirically whether the longest MCS from multiple runs of RandomMCS yields a solution to the LCS. Second, we study empirically the computational complexity of our algorithm.
5.1 Comparison with LCS for a Small Number of Strings

In this setting, our simulations are run with the number of strings varying from 2 to 4, string lengths ranging from 20 to 50, and alphabet sizes from 5 to 100. For this experiment, we use the basic dynamic programming method to compute the LCS, run our RandomMCS algorithm 1000 times, and compare the longest returned MCS with the true LCS. The reason we stop at 4 strings is the explosion of the computational time needed to find the LCS by dynamic programming when the number of strings reaches 5.

To simplify the evaluation, strings are generated using random characters from the alphabet. We also consider two kinds of randomization when implementing RandomMCS. For the first kind, when we randomly insert a character into a common subsequence (line 5 of Algorithm 2), we choose the character uniformly from the common set. For the second kind, we use frequency weighting to select the character, with weight proportional to the minimum number of times the character appears in each string.

Some sample results are as follows. In one configuration, with string length n = 50 and an alphabet of size B = 6, both the LCS algorithm and the longest MCS solution from 1000 iterations of RandomMCS yield the same string of length 15; the longest-MCS computation took 3 seconds, while the LCS computation took 8 seconds. With n = 50 and alphabet size B = 50, the longest MCS from our algorithm also yields the same result as the true LCS. In fact, we have not encountered a case where they disagree. Furthermore, the 1000 repetitions are unnecessary for finding the LCS with our algorithm, as the true LCS tends to have a high occurrence probability (close to 40%) in many instances. Finally, we do not find significant differences in performance between the two types of random selection.
When the number of strings L gets large, existing algorithms forfinding LCS fails to work well due to the high computational time.We use the following approach to evaluate our algorithm in thisinstance. Our simulation is designed in such a way that finding thelongest common subsequence is challenging.Our simulation generates L = S , S , S , S with increas-ing lengths 3, 6, 9, and 12 respectively, from an alphabet size of15. Next, we insert these subseqeunces into a string of 60 char-acters in the following way. First we randomly pick 3 indices tosituate S , then we randomly pick 6 indices to situate S from theremaining 57 indices, then we randomly pick 9 indices to situate S from the remaining 51 indices, and finally we randomly pick 12indices to situate S S from the remaining 42 indices. This way allthe subseqeunces S , S , S , S will be intermingled in each stringwhich makes the problem of finding LCS challenging. Notice thatthe total number of characters in S , S , S , S is 30. In the last stepof the string generation, we insert 30 random characters into theremaining 30 slots, with an expanded alphabet size of 30 (whichincludes the original alphabet set of size 15 for S , S , S , S ).For two random strings with a common length n where char-acters are randomly generated from an alphabet, let the expectedlength of their LCS be e . It has been shown that lim n →∞ e / n < L such random strings will decrease to 0 exponentially fastwith L . Since in the last step where we generated 30 completelyrandom characters, with a large L , we expect the common sub-sequence from these 30 random characters will be negligble (orempty). Therefore, by design, we expect the long subsequences in S , S , S , S will remain as MCS and S will be LCS since it is thelongest.The result of our simulation is as follows. With 200 runs of RandomMCS , the empirical estimate of the occurrence probabilitiesfor each S i , i = , . . . ,
4, is: 0.27 for S4, 0.23 for S3, 0.11 for S2, and zero for S1. The reason that S1 is no longer an MCS is the intermingling of S1, S2, S3, S4 among themselves during the placement process: the mixing creates spurious common subsequences, and S1 is short enough to be absorbed by other MCS solutions. In fact, it is absorbed into one of the returned MCS solutions of length 4 (so one extra character was included), which has an occurrence probability of 2%. The intermingling also creates other MCS solutions, which account for the remaining 38% of the returned MCS solutions, with lengths ranging from 4 to 11. We also varied the alphabet size in the experiment and found that the intermingling decreases with a larger alphabet size, making it easier to locate the LCS.

To understand the impact of frequency weighting in the random character selection (line 5 of Algorithm 2) and of the number of distinct characters in the long common subsequence on the performance of Random-MCS, we perform the following 2-by-2 experiment. We use two settings for the weights, uniform or frequency-based, and two configurations for S4 (the longest common subsequence, of length 12): a single repeated character, or the original 8 distinct characters generated at random. Table 2 shows the occurrence probability of S4 among the 200 returned MCS solutions.

                       uniform weights   frequency-based weights
single-character S4    0%                5%
8-character S4         28%               27%

Table 2: Occurrence probabilities for S4, the longest common subsequence by design

It is clear that when S4 consists of identical characters (i.e., an alphabet of size 1), there is a significant drop in the probability of locating the LCS. Nonetheless, random character selection with frequency weighting still performs much better: with uniform weights the algorithm fails to discover S4 at all, and the longest returned MCS has length 9. This is because, with uniform weights, the single character making up the long LCS is merely one of many characters to be selected at random, and this character is also shared by many other MCSs.
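For concreteness, the maximality property discussed throughout this section can be checked directly from its definition. The following brute-force sketch (ours, for illustration only; it is far slower than Random-MCS) tests whether inserting any character at any position of w still yields a common subsequence:

```python
def is_subsequence(w, s):
    """True if w is a subsequence of s."""
    it = iter(s)
    return all(ch in it for ch in w)

def is_common_subsequence(w, strings):
    return all(is_subsequence(w, s) for s in strings)

def is_maximal(w, strings, alphabet):
    """Check the MCS definition: w is a common subsequence, and inserting
    any character at any position breaks the common-subsequence property."""
    if not is_common_subsequence(w, strings):
        return False
    for i in range(len(w) + 1):
        for ch in alphabet:
            if is_common_subsequence(w[:i] + ch + w[i:], strings):
                return False
    return True
```

On the introduction's example strings 'fabecd' and 'acdef', this confirms that 'f', 'ae', and 'acd' are maximal while 'ac' is not (it extends to 'acd').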
We also observe that the time to run Random-MCS 200 times is about 150 sec in this experiment, and that the running time grows proportionally with the number of strings, which is consistent with a time complexity linear in L. Our empirical results indicate that the LCS typically has a non-negligible occurrence probability among all solutions of MCS and thus is very likely to be found by running Random-MCS repeatedly. However, the performance depends on the nature of the LCS and on how the random search is carried out in the algorithm.
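The repeated-run strategy evaluated above is easy to express as a wrapper. Here `random_mcs` stands for one run of the single-solution sampler (Algorithm 2, not reproduced in this section), so the sketch below works against any callable with that contract:

```python
from collections import Counter

def longest_of_runs(random_mcs, strings, runs=200):
    """Run a single-solution MCS sampler `runs` times, keep the longest
    result as the LCS candidate, and report empirical occurrence
    frequencies of each distinct returned solution."""
    solutions = [random_mcs(strings) for _ in range(runs)]
    counts = Counter(solutions)
    best = max(solutions, key=len)
    return best, {w: c / runs for w, c in counts.items()}
```

Because the LCS is itself an MCS, a non-negligible occurrence probability p implies the chance of missing it in `runs` independent draws decays as (1 − p)^runs.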
In this section, we illustrate how the methods we developed for finding MCSs can be applied to string pre-processing. Developing automated methods for data pre-processing is an important topic in automated machine learning, or AutoML, where the objective is to automate the end-to-end process of applying machine learning to real-world problems [6]. We demonstrate how our method can be used to develop a good understanding of string columns in tabular data and to extract important features for downstream machine learning tasks.
Tabular data is a common form of data representation: it is organized by rows and columns, where rows represent individual records and columns are the associated attributes. For large data tables with many rows and columns, it is difficult to obtain a good understanding of the data content without laborious manual examination. For columns with string values, we can apply our methods to understand the patterns that are common across all column values and to extract important information or features for downstream machine learning.

The dataset we use for demonstration contains broadband home-router data records of customers from a network carrier during a 30-day period. It consists of 27 columns and 238,330 rows, where the columns are device ID and type, associated network node and type, customer information, and time series of several KPIs. Among the 27 columns, 8 are either strings or DateTime. For each of these columns, we apply our algorithm to uncover the longest common subsequence from 100 runs of the Random-MCS algorithm. The resulting patterns are shown in Table 3, where we post-processed the common subsequences and represented them as regular-expression-like patterns in which * (the asterisk) indicates any number of characters. With this information, the contents of the string columns become much more apparent.

Colname            Pattern
network.type       2*CN*
software.version   *
day                2015-12-*
customer.attr1     *
pop.location       POP-*
linecard.id        2*CN*–*–*
sid                BB*
device.id          *0*-Home Hub *0 Type *-+*+*

Table 3: String patterns discovered using our algorithm
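The post-processing step from a common subsequence to a *-pattern is not spelled out in the text; one simple way to do it (our own sketch, using a greedy left-to-right embedding of the subsequence into each column value) is:

```python
def match_positions(w, s):
    """Greedy left-to-right embedding of subsequence w into string s;
    returns the matched position of each character of w."""
    pos, i = [], 0
    for ch in w:
        i = s.index(ch, i)
        pos.append(i)
        i += 1
    return pos

def to_pattern(w, values):
    """Render common subsequence w as a pattern over the column values,
    inserting '*' wherever some value has extra characters in the gap."""
    embeds = [match_positions(w, v) for v in values]
    out = []
    for k, ch in enumerate(w):
        # start of the gap preceding character k in each embedding
        starts = [p[k - 1] + 1 if k else 0 for p in embeds]
        if any(p[k] > s for p, s in zip(embeds, starts)):
            out.append("*")
        out.append(ch)
    if any(p[-1] < len(v) - 1 for p, v in zip(embeds, values)):
        out.append("*")
    return "".join(out)
```

For example, a common subsequence `2CN` over hypothetical values such as `2xCN7` and `2yCNz` yields the pattern `2*CN*`, the form shown in the network.type row of Table 3.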
We can often use the extracted column string patterns in tabular data to engineer new features. It is clear from Table 3 that some columns have a clear pattern while others do not. For example, neither software.version nor customer.attr1 has a common pattern. On the other hand, the device.id column shows a clear pattern: it can be represented as the string join of 6 sub-fields, each a combination of common characters shared across the values and a varying substring indicated by the asterisk (*). These sub-fields can be extracted to represent possibly more informative features for characterizing device.id. This feature-extraction step can be automated once patterns are found, and the extracted features can be used for downstream machine learning. In fact, our methods can also be applied to auto-detect field separators in an ASCII file and then extract the columns.

In this paper, we develop a randomized algorithm, referred to as
Random-MCS, for finding the maximal common subsequences (MCS) of multiple strings. We show that the complexity of our algorithm is linear in the number of strings L. Furthermore, we demonstrate via both theoretical and experimental studies that the longest subsequence from multiple runs of Random-MCS often yields a solution to the LCS. As future work, we want to improve the probability bound for a single MCS solution and extend our algorithm to the case where the set of strings is polluted with dirty data.
A EXTENSION OF ALGORITHM 1 IN SAKAI (2019) [17] TO THE CASE OF L STRINGS
Here I≺(A, c, i) denotes the least index such that c does not appear in A(I≺(A, c, i), i], and I≻(A, c, i) denotes the greatest index such that c does not appear in A(i, I≻(A, c, i)]. The idea is to cut the strings into segments backward and determine the MCS forward. The index vectors idxP and idxR hold the previous indices and the rear indices: idxP[j] and idxR[j] determine a segment of the j-th string, so idxP and idxR together cut a segment from every string. Algorithm 4 (Common) returns −1 if no character is common to all L segments, and returns c and j if the common character c first appears in the j-th string. Algorithm 3 (OneMCS) finds a specific MCS for L strings with complexity O(nL log(n)). Inspired by [17], we extend the algorithm from 2 strings to L strings; readers who find OneMCS hard to follow may wish to read [17] first.
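Under our reading of the definitions of I≺ and I≻ above, with 0-based indices and sentinel values −1 and |A| − 1 when c does not occur in the scanned range (both conventions are our assumptions), the two index functions can be sketched as follows. A linear scan is shown for clarity; the O(nL log n) bound quoted above presumes binary search over precomputed occurrence lists instead.

```python
def I_prec(A, c, i):
    """Least l such that c does not appear in A(l, i]: equivalently,
    the last occurrence of c at or before position i (-1 if none)."""
    for j in range(i, -1, -1):
        if A[j] == c:
            return j
    return -1

def I_succ(A, c, i):
    """Greatest g such that c does not appear in A(i, g]: equivalently,
    one before the first occurrence of c after position i
    (len(A) - 1 if there is none)."""
    for j in range(i + 1, len(A)):
        if A[j] == c:
            return j - 1
    return len(A) - 1
```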
REFERENCES
[1] TK Attwood and JBC Findlay. 1994. Fingerprinting G-protein-coupled receptors. Protein Engineering, Design and Selection 7, 2 (1994), 195–203.
[2] Lasse Bergroth, Harri Hakonen, and Timo Raita. 2000. A survey of longest common subsequence algorithms. In Proceedings Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000). IEEE, 39–48.
[3] Guillaume Bourque and Pavel A Pevzner. 2002. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Research 12, 1 (2002), 26–36.
[4] Yixin Chen, Andrew Wan, and Wei Liu. 2006. A fast parallel algorithm for finding the longest common sequence of multiple biosequences. BMC Bioinformatics 7, 4 (2006), S4.
[5] Václav Chvátal and David Sankoff. 1975. Longest common subsequences of two random sequences. Journal of Applied Probability 12, 2 (1975), 306–315.
[6] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems. 2962–2970.
[7] Campbell B. Fraser and Robert W. Irving. 1995. Approximation Algorithms for the Shortest Common Supersequence. Nordic J. of Computing 2, 3 (Sept. 1995), 303–325. http://dl.acm.org/citation.cfm?id=642129.642130
[8] Koji Hakata and Hiroshi Imai. 1992. Algorithms for the longest common subsequence problem. Genome Informatics.
[9] Optimization Methods and Software 10, 2 (1998), 233–260.
[10] Daniel S Hirschberg. 1975. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 6 (1975), 341–343.
[11] James W Hunt and Thomas G Szymanski. 1977. A fast algorithm for computing longest common subsequences. Commun. ACM 20, 5 (1977), 350–353.
[12] G. Kawade, S. Sahu, S. Upadhye, N. Korde, and M. Motghare. 2017. An analysis on computation of longest common subsequence algorithm. 982–987. https://doi.org/10.1109/ISS1.2017.8389325
Algorithm 3 OneMCS: returns a single solution for MCS
Input: a list of strings A = {A_1, ..., A_L}
Output: a single MCS W

  initialize W = [∧, $]; let W_p and W_r be vectors of length L; k = 0
  for i in 1..L: Ŵ[i] = {0, |A_i| − 1}
  while k < |W| − 1
      for i in 1..L: W_p[i] = Ŵ[i][k]; W_r[i] = Ŵ[i][k + 1]
      first = 0
      while Common(A, W_p, W_r) == −1
          for j in 1..L: Ŵ[j][k + 1] = W_r[j] − 1; W_r[j] = Ŵ[j][k + 1]
          if ∃ j such that W_r[j] == W_p[j]: first = 1; break
      if first == 1
          for j in 1..L: Ŵ[j][k + 1] = I≻(A_j, W[k + 1], W_p[j] + 1)
          k = k + 1
      else
          (idx, c) = Common(A, W_p, W_r)
          W = W[1..k] ⊕ c ⊕ W[k + 1..|W|]
          for j in 1..L
              if j == idx: Ŵ[j] = Ŵ[j][1..k] ⊕ (W_r[j] − 1) ⊕ Ŵ[j][k + 1..]
              else: Ŵ[j] = Ŵ[j][1..k] ⊕ (I≺(A_j, c, W_r[j]) − 1) ⊕ Ŵ[j][k + 1..]
  return W

Algorithm 4 Common: returns the common character and its index
Input: a list of strings A = {A_1, ..., A_L}; previous indices idxP; rear indices idxR
Output: a pair (j, c) if a common character exists; the degenerate segment if one occurs; −1 otherwise

  for j in 1..L
      if idxP[j] >= idxR[j]: return (idxP[j], idxR[j])
  for j in 1..L
      c = A_j[idxR[j]]
      found = true
      for i in 1..L with i ≠ j
          if I≺(A_i, c, idxR[i]) <= idxP[i]: found = false; break
      if found: return (j, c)
  return −1

[13] Marcos Kiwi, Martin Loebl, and Jiří Matoušek. 2005. Expected length of the longest common subsequence for large alphabets. Advances in Mathematics. IEEE, 354–363.
[15] David Maier. 1978. The complexity of some problems on subsequences and supersequences. Journal of the ACM (JACM) 25, 2 (1978), 322–336.
[16] William J Masek and Michael S Paterson. 1980. A faster algorithm computing string edit distances. Journal of Computer and System Sciences 20, 1 (1980), 18–31.
[17] Yoshifumi Sakai. 2019. Maximal common subsequence algorithms. Theoretical Computer Science (2019).
[18] Alexey Sorokin. 2016. Using longest common subsequence and character models to predict word forms. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. 54–61.
[19] Qingguo Wang, Dmitry Korkin, and Yi Shang. 2010. A fast multiple longest common subsequence (MLCS) algorithm.