All instantiations of the greedy algorithm for the shortest superstring problem are equivalent
Maksim S. Nikolaev
Steklov Institute of Mathematics at St. Petersburg, Russian Academy of Sciences
Abstract
In the Shortest Common Superstring problem (SCS), one needs to find the shortest superstring for a set of strings. While SCS is NP-hard and MAX-SNP-hard, the Greedy Algorithm “choose two strings with the largest overlap; merge them; repeat” achieves a constant factor approximation that is known to be at most 3.5 and conjectured to be equal to 2. The Greedy Algorithm is not deterministic, so its instantiations with different tie-breaking rules may have different approximation factors. In this paper, we show that it is not the case: all factors are equal. To prove this, we show how to transform a set of strings so that all overlaps are different whereas their ratios stay roughly the same.

We also reveal connections between the original version of SCS and the following one: find a superstring minimizing the number of occurrences of a given symbol. It turns out that the latter problem is equivalent to the original one.
Theory of computation → Approximation algorithms analysis
Keywords and phrases superstring, shortest common superstring, approximation, greedy algorithms, greedy conjecture
In the Shortest Common Superstring problem (SCS), one is given a set of strings and needs to find the shortest string that contains each of them as a substring. Applications of this problem include genome assembly [12, 8] and data compression [3, 2, 9]. We refer the reader to the surveys [4, 7] for an overview of SCS as well as its applications and algorithms.

While SCS is NP-hard [3] and even MAX-SNP-hard [1], the Greedy Algorithm (GA) “choose two strings with the largest overlap; merge them; repeat” achieves a constant factor approximation that is proven to be less than or equal to 3.5 [6]. This factor is at least 2 (consider a dataset $S = \{c(ab)^n, (ab)^n c, (ba)^n\}$), and the 30-year-old Greedy Conjecture [9, 10, 11, 1] claims that this bound is accurate, that is, that GA is 2-approximate.

GA is not deterministic, as we do not specify how to break ties when several pairs of strings have the maximum overlap. For this reason, different instantiations of GA may produce different superstrings for the same input and hence they may have different approximation factors. In fact, if $S$ contains only strings of length 2 or less, or if $S$ is a set of $k$-substrings of an unknown string, then there are instantiations of GA [5] that find the exact solution, whereas in general GA fails to do so.

The original Greedy Conjecture states that any instantiation of GA is 2-approximate. As this is still widely open, it is natural to try to prove the conjecture at least for some instantiations. This could potentially be easier not just because this is a weaker statement, but also because a particular instantiation of GA may decide how to break ties by asking an almighty oracle. In this paper, we show that this weak form of the Greedy Conjecture is in fact equivalent to the original one.
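To make the role of tie-breaking concrete, here is a minimal Python sketch of GA with a pluggable tie-breaking rule. The names `overlap`, `greedy` and `pick` are illustrative and do not come from the paper, the three-string dataset is a toy input chosen for this sketch, and the code assumes that no input string is a substring of another.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def overlap(s: str, t: str) -> int:
    """Length of the longest proper suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy(strings: List[str], pick: Callable[[List[Pair]], Pair]) -> str:
    """Greedy Algorithm; `pick` chooses one pair among those with maximum overlap."""
    pool = list(strings)
    while len(pool) > 1:
        pairs = [(i, j) for i in range(len(pool))
                 for j in range(len(pool)) if i != j]
        best = max(overlap(pool[i], pool[j]) for i, j in pairs)
        ties = [(i, j) for i, j in pairs if overlap(pool[i], pool[j]) == best]
        i, j = pick(ties)  # the only non-deterministic choice of GA
        merged = pool[i] + pool[j][best:]
        pool = [p for k, p in enumerate(pool) if k not in (i, j)] + [merged]
    return pool[0]


# Two instantiations that differ only in how they break ties.
toy = ["ab", "ba", "bb"]
first_tie = greedy(toy, lambda ties: ties[0])
last_tie = greedy(toy, lambda ties: ties[-1])
print(first_tie, len(first_tie))
print(last_tie, len(last_tie))
```

On this toy input the first rule returns a superstring of length 5 and the second one a superstring of length 4, so two instantiations can disagree on a fixed input even though, by the result of this paper, their worst-case approximation factors coincide.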
More precisely, we show that if some instantiation of GA is $\lambda$-approximate, then all instantiations are $\lambda$-approximate.

To prove this, we introduce the so-called Disturbing Procedure that, for a given dataset $S = \{s_1, \dots, s_n\}$, a parameter $m \gg n$, and a sequence of greedy non-trivial merges (merges of strings with a non-empty overlap), constructs a new dataset $S' = \{s'_1, \dots, s'_n\}$ such that, for all $i \ne j$, $s'_i$ is roughly $m$ times longer than $s_i$, the overlap of $s'_i$ and $s'_j$ is roughly $m$ times longer than the overlap of $s_i$ and $s_j$, and the mentioned greedy sequence of non-trivial merges for $S$ is the only such sequence for $S'$.

We also find the following curious relation between SCS and its version where one needs to find a superstring with the smallest number of occurrences of a given symbol: if there is a $\lambda$-approximate algorithm for the latter problem, then there is a $\lambda$-approximate algorithm for the former one, and vice versa.

Let $|s|$ be the length of a string $s$ and $\mathrm{overlap}(s, t)$ be the overlap of strings $s$ and $t$, that is, the longest string $y$ such that $s = xy$ and $t = yz$. In this notation, the string $xyz$ is a merge of strings $s$ and $t$. By $\varepsilon$ we denote the empty string. By $\mathrm{OPT}(S)$ we denote the optimal superstring for the dataset $S$.

Without loss of generality we may assume that the set of input strings $S$ contains no string that is a substring of another. This assumption implies that in any superstring all strings occur in some order: if one string begins before another, then it also ends before. Hence, we can consider only superstrings that can be obtained from some permutation $(s_{\sigma(1)}, \dots, s_{\sigma(n)})$ of $S$ after merging adjacent strings. The length of such a superstring $s(\sigma)$ is simply

$$|s(\sigma)| = \sum_{i=1}^{n} |s_i| - \sum_{i=1}^{n-1} |\mathrm{overlap}(s_{\sigma(i)}, s_{\sigma(i+1)})|. \tag{1}$$

Let $A$ be an instantiation of GA (we denote this by $A \in \mathrm{GA}$).
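Formula (1) can be checked mechanically on a small instance. The following Python sketch merges the strings of a permutation from left to right, compares the result with formula (1), and finds the shortest superstring by brute force over all permutations. The helper names (`overlap`, `merge_in_order`, `length_by_formula`, `shortest_length`) are illustrative and not from the paper; the input is the dataset $\{c(ab)^n, (ab)^n c, (ba)^n\}$ mentioned earlier with $n = 2$, and the code relies on the assumption that no string is a substring of another.

```python
from itertools import permutations
from typing import List, Sequence


def overlap(s: str, t: str) -> int:
    """Length of the longest proper suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def merge_in_order(strings: Sequence[str]) -> str:
    """Merge adjacent strings of the given order into one superstring."""
    result = strings[0]
    for prev, cur in zip(strings, strings[1:]):
        # The current result ends with `prev`, so only the non-overlapping
        # tail of `cur` has to be appended.
        result += cur[overlap(prev, cur):]
    return result


def length_by_formula(strings: Sequence[str]) -> int:
    """Right-hand side of formula (1) for the given order."""
    total = sum(len(s) for s in strings)
    saved = sum(overlap(a, b) for a, b in zip(strings, strings[1:]))
    return total - saved


def shortest_length(strings: List[str]) -> int:
    """Brute-force minimum of formula (1) over all permutations."""
    return min(length_by_formula(p) for p in permutations(strings))


dataset = ["cabab", "ababc", "baba"]  # c(ab)^2, (ab)^2 c, (ba)^2
for order in permutations(dataset):
    assert len(merge_in_order(order)) == length_by_formula(order)
print(shortest_length(dataset))
```

Brute force is exponential in the number of strings, so this is only meant as a sanity check of formula (1) on tiny inputs.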
By $\sigma_A$ we denote the permutation corresponding to a superstring $A(S)$ constructed by $A$, and by $(l_A(1), r_A(1)), \dots, (l_A(n-1), r_A(n-1))$ we denote the order of merges: strings $s_{l_A(i)}$ and $s_{r_A(i)}$ are merged at step $i$. By the definition of GA we have

$$|\mathrm{overlap}(s_{l_A(i)}, s_{r_A(i)})| \ge |\mathrm{overlap}(s_{l_A(j)}, s_{r_A(j)})|, \quad \forall i < j < n,$$

and if, for some $i$, $|\mathrm{overlap}(s_{l_A(i)}, s_{r_A(i)})| = 0$, then the same holds for any $i' > i$. We denote the first such $i$ by $T_A$; this is the first trivial merge (that is, one with the empty overlap), after which all the merges are trivial. Note that just before step $T_A$, all the remaining strings have empty overlaps, so the resulting superstring is just a concatenation of them in some order, and this order does not affect the length of the result.

Figure 1. (a) Strings $s_i = abc$ and $s_j = bcd$ from $S$. (b) The resulting strings $s'_i = \$\$\$a\$\$\$\$b\$\$\$\$c\$$ and $s'_j = \$\$b\$\$\$\$c\$\$\$\$d$ after disturbing; here, $m = 4$, $T_A = 3$, $\alpha_i = 1$, $\beta_i = 2$, $\alpha_j = 2$ and $\beta_j = T_A$; since $\alpha_j = \beta_i = 2$, we may conclude that $s_i$ and $s_j$ were merged by $A$ at step 2.

Here, we describe the mentioned procedure that gets rid of ties. Consider a dataset $S$, an instantiation $A \in \mathrm{GA}$, a sentinel \$ (a symbol that does not occur in $S$), and a parameter $m$ whose value will be determined later. For every string $s_i = c_1 c_2 \dots c_{n_i} \in S$ define a string

$$s'_i = \$^{m - \alpha_i} c_1 \$^m c_2 \$^m c_3 \$^m \dots \$^m c_{n_i} \$^{T_A - \beta_i}, \tag{2}$$

where
- $\alpha_i$ is the number of the step such that $r_A(\alpha_i) = i$, if such a step exists and is less than $T_A$, and $\alpha_i = T_A$ otherwise; note that if $\alpha_i < T_A$, then $s_i$ is the right part of a non-trivial merge at step $\alpha_i$;
- $\beta_i$ is the number of the step such that $l_A(\beta_i) = i$, if such a step exists and is less than $T_A$, and $\beta_i = T_A$ otherwise; note that if $\beta_i < T_A$, then $s_i$ is the left part of a non-trivial merge at step $\beta_i$.

Basically, we insert the string $\$^m$ before every character of $s_i$ and then remove some \$'s from the beginning of the string and add some \$'s to its end (see Fig. 1). The purpose of this removal and addition is to slightly disturb overlaps of equal length, so that there are no longer any ties in non-trivial merges.

We denote the resulting set of disturbed strings $\{s'_1, \dots, s'_n\}$ by $S'$, and all entities related to this dataset we denote by adding a prime (for example, $\sigma'_A$). Let us derive some properties of $S'$.

▶ Lemma 1.
For all $i \ne j$, $k \ne l$:
1. if $|\mathrm{overlap}(s_i, s_j)| = k > 0$, then $|\mathrm{overlap}(s'_i, s'_j)| = (m + 1)k - \alpha_j + T_A - \beta_i$;
2. if $|\mathrm{overlap}(s_i, s_j)| = 0$, then $|\mathrm{overlap}(s'_i, s'_j)| = \min\{T_A - \beta_i, m - \alpha_j\}$;
3. if $m > n$, then disturbing preserves the order on overlaps of different lengths, that is, if $|\mathrm{overlap}(s_i, s_j)| > |\mathrm{overlap}(s_k, s_l)|$, then $|\mathrm{overlap}(s'_i, s'_j)| > |\mathrm{overlap}(s'_k, s'_l)|$.

Proof.
Let $\mathrm{overlap}(s_i, s_j)$ be $c_1 c_2 \dots c_k$. Consider the string $u = \$^{m - \alpha_j} c_1 \$^m \dots \$^m c_k \$^{T_A - \beta_i}$. Clearly, $u$ is the overlap of $s'_i$ and $s'_j$, and $|u| = (m - \alpha_j) + k + (k - 1)m + (T_A - \beta_i) = (m + 1)k - \alpha_j + T_A - \beta_i$. Also, if $|\mathrm{overlap}(s_i, s_j)| = 0$, then $\mathrm{overlap}(s'_i, s'_j) = \$^{\min\{T_A - \beta_i,\, m - \alpha_j\}}$.

To prove the last statement, note that $\alpha_j + \beta_i \le T_A < m$ and

$$|\mathrm{overlap}(s'_i, s'_j)| > (m + 1)\,|\mathrm{overlap}(s_k, s_l)| + T_A \ge |\mathrm{overlap}(s'_k, s'_l)|.$$ ◀

▶ Lemma 2.
Let $B \in \mathrm{GA}$. Then $T_A = T'_A = T'_B$ and the first $T_A - 1$ merges are the same for both instantiations.

Proof.
We prove by induction that $l_A(t) = l'_A(t) = l'_B(t)$ and $r_A(t) = r'_A(t) = r'_B(t)$ for all $t < T_A$.

Case $t = 1$. As $A$ is greedy, $k := |\mathrm{overlap}(s_{l_A(1)}, s_{r_A(1)})| \ge |\mathrm{overlap}(s_i, s_j)|$ for all $i \ne j$, $(i, j) \ne (l_A(1), r_A(1))$. Hence

$$|\mathrm{overlap}(s'_i, s'_j)| \le (m + 1)k - \alpha_j + T_A - \beta_i < (m + 1)k - 1 + T_A - 1 = |\mathrm{overlap}(s'_{l_A(1)}, s'_{r_A(1)})|,$$

and $l'_A(1) = l'_B(1) = l_A(1)$ as well as $r'_A(1) = r'_B(1) = r_A(1)$.

Suppose that the statement holds for all $t \le t' < T_A - 1$. Note that at moment $t = t' + 1$ the sum $\alpha_j + \beta_i$ is strictly greater than $2t$ unless $(i, j) = (l_A(t), r_A(t))$. Similarly to the base case, we have

$$|\mathrm{overlap}(s'_i, s'_j)| \le (m + 1)k_t - \alpha_j + T_A - \beta_i < (m + 1)k_t - t + T_A - t = |\mathrm{overlap}(s'_{l_A(t)}, s'_{r_A(t)})|,$$

where $k_t = |\mathrm{overlap}(s_{l_A(t)}, s_{r_A(t)})|$, and the induction step is proven.

Now note that starting from step $T_A$ all the remaining strings in $S$ have empty overlaps, and hence so do the remaining strings in $S'$, as for all of them $\beta_i = T_A$ and the minimum in Lemma 1.2 is equal to zero. Thus, $T_A = T'_A = T'_B$ and the lemma is proven. ◀

▶ Corollary 3.
As all non-trivial merges coincide, $|A(S')| = |B(S')|$.

▶ Theorem 4.
If some instantiation $A$ of GA achieves a $\lambda$-approximation, then so does any other instantiation.

Proof.
Assume the opposite and consider $B \in \mathrm{GA}$ as well as a dataset $S$ such that $|B(S)| > \lambda\,|\mathrm{OPT}(S)|$. Let $S' = S'(B, m)$ be the corresponding disturbed dataset, where $m \ge n$ will be specified later.

Note that $|s'_i|/m \to |s_i|$ and $|\mathrm{overlap}(s'_i, s'_j)|/m \to |\mathrm{overlap}(s_i, s_j)|$ as $m$ approaches infinity, thanks to Lemma 1.1–2. Then $|\mathrm{OPT}(S')|/m \to |\mathrm{OPT}(S)|$, since

$$\frac{1}{m}\,|\mathrm{OPT}(S')| = \frac{1}{m} \min_{\sigma} \left( \sum_{i=1}^{n} |s'_i| - \sum_{i=1}^{n-1} |\mathrm{overlap}(s'_{\sigma(i)}, s'_{\sigma(i+1)})| \right) \to \min_{\sigma} \left( \sum_{i=1}^{n} |s_i| - \sum_{i=1}^{n-1} |\mathrm{overlap}(s_{\sigma(i)}, s_{\sigma(i+1)})| \right) = |\mathrm{OPT}(S)|.$$

Similarly, $|B(S')|/m \to |B(S)|$, because by Lemma 2 the non-trivial merges of $B$ on $S'$ are the same as on $S$, and hence $|A(S')|/m \to |B(S)|$ by Corollary 3.

As $|B(S)| - \lambda\,|\mathrm{OPT}(S)| > 0$, we can choose $m$ so that $|B(S')| - \lambda\,|\mathrm{OPT}(S')|$ as well as $|A(S')| - \lambda\,|\mathrm{OPT}(S')|$ are positive. Hence $A$ is not $\lambda$-approximate. ◀

▶ Corollary 5.
To prove (or disprove) the Greedy Conjecture, it is sufficient to consider datasets satisfying one of the following three properties:
(a) there are no ties between non-empty overlaps, that is, datasets where all the instantiations of the greedy algorithm work the same;
(b) there are no empty overlaps: $\mathrm{overlap}(s_i, s_j) \ne \varepsilon$, $\forall i \ne j$;
(c) all overlaps are (pairwise) different: $|\mathrm{overlap}(s_i, s_j)| \ne |\mathrm{overlap}(s_k, s_l)|$ for all $i \ne j$, $k \ne l$, $(i, j) \ne (k, l)$.

Proof. (a)
Follows directly from the proof of Theorem 4, as we can always use the dataset $S'$ instead of $S$.
(b) Append \$ to each string of $S'$. Then every two strings have a non-empty overlap that at least contains \$, and in general $T_A = T'_A = T'_B$ from Lemma 2 does not hold ($T'_A$ and $T'_B$ are always $n - 1$), but the first $T_A - 1$ merges are still the same, and after them all the remaining strings have overlaps of length 1, so the lengths of the final solutions are the same as well.
(c) Append $\$^{n(T_A - \beta_i)}$ to each string of $S'$ instead of $\$^{T_A - \beta_i}$. Then

$$|\mathrm{overlap}(s'_i, s'_j)| = (m + 1)k - \alpha_j + nT_A - n\beta_i,$$

provided $m$ is large enough, and $\alpha_j + n\beta_i \ne \alpha_k + n\beta_l$ if $(i, j) \ne (k, l)$. Repeating the proofs of Lemmas 1 and 2 with this version of $S'$, we obtain statement (c). ◁

Consider the following problem: given a dataset $S$ and a symbol important.

This problem is similar to SCS: here, we also need to find the “shortest” superstring, but the length of the string is understood in terms of the number of occurrences of the important symbol.

By $|s|$ we denote the “length”, that is, the number of occurrences of $s$, and all the remaining entities related to this problem we denote by this subscript. As before, the length of a superstring $s(\sigma)$ obtained from some permutation $\sigma$ is

$$|s(\sigma)| = \sum_{i=1}^{n} |s_i| - \sum_{i=1}^{n-1} |\mathrm{overlap}(s_{\sigma(i)}, s_{\sigma(i+1)})|. \tag{3}$$

GA also works as GA, but instead of merging pairs of strings with longest overlaps, it merges pairs with longest ones.

A natural question arises: is this problem easier than SCS? The following theorem claims that these problems are almost equivalent.

▶ Theorem 6. (a)
If there is a $\lambda$-approximate $A \in \mathrm{GA}$, then GA is $\lambda$-approximate.
(b) If GA is $\lambda$-approximate, then all the instantiations of GA that merge the longest overlaps among the longest ones are $\lambda$-approximate.

Proof. (a)
Let $S$ be a dataset for SCS and let \$ be a symbol that does not occur in the strings of $S$. Transform the strings of $S$ by adding \$ before every symbol (for example, $abc$ turns into $\$a\$b\$c$), denote the resulting dataset by $S_\$$ and consider it as a dataset for SCS with $|s_\$| = |s|$ and $|\mathrm{overlap}(s_\$, t_\$)| = |\mathrm{overlap}(s, t)|$ for every $s \ne t$. Thus, every superstring for $S$ of length $k$ corresponds to the superstring for $S_\$$ of length $k$. Therefore, every approximate (greedy or not) algorithm for SCS induces such an algorithm (greedy or not, respectively) for SCS, and if there is a $\lambda$-approximate instantiation of GA, then there is the corresponding $\lambda$-approximate instantiation of GA, and by Theorem 4 the entire GA is $\lambda$-approximate.
(b) Assume the opposite and consider $A \in \mathrm{GA}$ satisfying the property from the statement of the theorem, and a dataset $S$ such that $|A(S)| > \lambda\,|\mathrm{OPT}(S)|$. We construct a new dataset $S^m$ by replacing every occurrence of $S$ with $m$.

Let $(l(1), r(1)), \dots, (l(n-1), r(n-1))$ be the order of merges produced by $A$. This order satisfies the following property: if $i < j$, then $|\mathrm{overlap}(s_{l(i)}, s_{r(i)})| \ge |\mathrm{overlap}(s_{l(j)}, s_{r(j)})|$, and if $|\mathrm{overlap}(s_{l(i)}, s_{r(i)})| = |\mathrm{overlap}(s_{l(j)}, s_{r(j)})|$, then also $|\mathrm{overlap}(s_{l(i)}, s_{r(i)})| \ge |\mathrm{overlap}(s_{l(j)}, s_{r(j)})|$.

The key idea behind the proof is that for $m \gg n$ this order is greedy. Indeed,

$$|\mathrm{overlap}(s^m, t^m)| = |\mathrm{overlap}(s, t)| + (m - 1)\,|\mathrm{overlap}(s, t)|,$$

so if $|\mathrm{overlap}(s, t)| > |\mathrm{overlap}(u, v)|$, then $|\mathrm{overlap}(s^m, t^m)| > |\mathrm{overlap}(u^m, v^m)|$ for large $m$, and if $|\mathrm{overlap}(s, t)| = |\mathrm{overlap}(u, v)|$ and $|\mathrm{overlap}(s, t)| \ge |\mathrm{overlap}(u, v)|$, then $|\mathrm{overlap}(s^m, t^m)| \ge |\mathrm{overlap}(u^m, v^m)|$ for any $m$.
Hence the greedy order for $S$ becomes the greedy order for $S^m$, and the solution $A(S)^m$ obtained from $A(S)$ by replacing all the occurrences of $m$ becomes the greedy solution for $S^m$. Note that $A(S^m)$ need not be equal to $A(S)^m$.

Since $|\mathrm{OPT}(S^m)|/m \to |\mathrm{OPT}(S)|$ and $|A(S)^m|/m \to |A(S)|$, we may choose $m$ such that $|A(S)^m| - \lambda\,|\mathrm{OPT}(S^m)| > 0$. As then $|A(S)^m| - \lambda\,|\mathrm{OPT}(S^m)| > 0$, there is an instantiation of GA that is not $\lambda$-approximate. Then, Theorem 4 implies that GA is not $\lambda$-approximate. ◀

The proposed connection between lengths and symbol frequencies reveals a potential frequency structure of SCS: may we think about this problem in terms of frequencies rather than lengths? We know that there is a correspondence between greedy and greedy superstrings, but are the greedy solutions 2-approximate and vice versa? May we treat the greedy solutions not only as reasonably short ones, but also as solutions that contain a reasonably small number of occurrences of each symbol uniformly?
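The correspondence between lengths and symbol counts used in the proof of Theorem 6(a) can be illustrated on a toy pair of strings. The Python sketch below puts the sentinel \$ before every symbol and compares ordinary lengths with numbers of occurrences of \$. The helper names `add_sentinel` and `overlap` are hypothetical, the example strings are chosen only for illustration, and reading the “length” as a count of \$ is the interpretation assumed here.

```python
def overlap(s: str, t: str) -> str:
    """Longest proper suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return t[:k]
    return ""


def add_sentinel(s: str, sentinel: str = "$") -> str:
    """Insert the sentinel before every symbol: abc -> $a$b$c."""
    return "".join(sentinel + c for c in s)


s, t = "abc", "bcd"
s_d, t_d = add_sentinel(s), add_sentinel(t)

# The number of sentinels in a transformed string equals the original length,
# and the same holds for the overlaps of the transformed strings.
print(s_d, s_d.count("$"), len(s))
print(overlap(s_d, t_d), overlap(s_d, t_d).count("$"), len(overlap(s, t)))
```

The sketch checks a single pair only; the general statement is the one argued in the proof of Theorem 6(a).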
Acknowledgments
Many thanks to Alexander Kulikov for valuable discussions and proofreading the text.
References
[1] Avrim Blum, Tao Jiang, Ming Li, John Tromp, and Mihalis Yannakakis. Linear approximation of shortest superstrings. In STOC 1991, pages 328–336. ACM, 1991.
[2] John Gallant. String compression algorithms. PhD thesis, Princeton, 1982.
[3] John Gallant, David Maier, and James A. Storer. On finding minimal length superstrings. J. Comput. Syst. Sci., 20(1):50–58, 1980.
[4] Theodoros P. Gevezes and Leonidas S. Pitsoulis. The shortest superstring problem, pages 189–227. Springer, 2014.
[5] Alexander Golovnev, Alexander S. Kulikov, Alexander Logunov, Ivan Mihajlin, and Maksim Nikolaev. Collapsing superstring conjecture. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
[6] Haim Kaplan and Nira Shafrir. The greedy algorithm for shortest superstrings. Inf. Process. Lett., 93(1):13–17, 2005.
[7] Marcin Mucha. A tutorial on shortest superstring approximation. Online; accessed 10 February 2021.
[8] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U.S.A., 98(17):9748–9753, 2001.
[9] James A. Storer. Data compression: methods and theory. Computer Science Press, Inc., 1987.
[10] Jorma Tarhio and Esko Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theor. Comput. Sci., 57(1):131–145, 1988.
[11] Jonathan S. Turner. Approximation algorithms for the shortest common superstring problem. Inf. Comput., 83(1):1–20, 1989.
[12] Michael S. Waterman.