Intractability of the Minimum-Flip Supertree problem and its variants
Sebastian Böcker, Quang Bao Anh Bui, Francois Nicolas, Anke Truss
Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, Jena, Germany, {sebastian.boecker,quangbaoanh.bui,francois.nicolas,anke.truss}@uni-jena.de
Abstract.
Computing supertrees is a central problem in phylogenetics. The supertree method that is by far the most widely used today was introduced in 1992 and is called Matrix Representation with Parsimony analysis (MRP). Matrix Representation using Flipping (MRF), which was introduced in 2002, is an interesting variant of MRP: MRF is arguably more relevant than MRP, and various efficient implementations of MRF have been presented. From a theoretical point of view, implementing MRF or MRP amounts to solving NP-hard optimization problems. The aim of this paper is to study the approximability and the fixed-parameter tractability of the optimization problem corresponding to MRF, namely Minimum-Flip Supertree. We prove strongly negative results.
When studying the evolutionary relatedness of current taxa, the discovered relations are usually represented as rooted trees, called phylogenies. Phylogenies for various taxa sets are routinely inferred from various kinds of molecular and morphological data sets. A subsequent problem is computing supertrees [4], i.e., amalgamating phylogenies for non-identical but overlapping taxon sets to obtain more comprehensive phylogenies. Constructing supertrees is easy if no contradictory information is contained in the data [1]. However, incompatible input phylogenies are the rule rather than the exception in practice. The major problem for supertree methods is thus dealing with incompatibilities.

The supertree method that is by far the most widely used today was independently proposed by Baum [3] and Ragan [26] in 1992; it is called Matrix Representation with Parsimony analysis (MRP) [4]. From a theoretical point of view, implementing MRP means designing an algorithm for an NP-hard optimization problem [14, 18], so the running times of MRP algorithms are sometimes prohibitive for large data sets.

In 2002, Chen et al. proposed a variant of MRP [11], which was later called Matrix Representation using Flipping (MRF) [9]. MRF is arguably more relevant than MRP [4] (see also [12, 16]), and various efficient implementations of MRF have been presented [10, 13, 16]. However, as in the case of MRP, implementing MRF means designing an algorithm for an NP-hard optimization problem [12], namely Minimum-Flip Supertree. The aim of the present paper is to study the approximability and the fixed-parameter tractability [17] of Minimum-Flip Supertree. We prove strongly negative results.
For each finite set $X$, the cardinality of $X$ is denoted $|X|$. The ring of integers is denoted $\mathbb{Z}$. Define $\overline{\mathbb{Z}} = \mathbb{Z} \cup \{-\infty, +\infty\}$.

Let $S$ be a finite set. A (rooted) phylogeny for $S$ is a subset $T$ of the power set of $S$ that satisfies the following properties: $\emptyset \in T$, $S \in T$, $\{s\} \in T$ for all $s \in S$, and $X \cap Y \in \{\emptyset, X, Y\}$ for all $X$, $Y \in T$. The elements of $S$ are the leaves of $T$. The elements of $T$ are the clusters of $T$. The most natural representation of $T$ is, of course, a rooted graph-theoretic tree with $|T| - 1$ nodes. Given two phylogenies $T_1$ and $T_2$ for $S$, $T_1$ is a subset of $T_2$ if, and only if, the graph representation of $T_1$ can be obtained from the graph representation of $T_2$ by contracting (internal) edges. If $T_1$ is a subset of $T_2$ and if we assume that hard polytomies never occur then $T_2$ is at least as informative as $T_1$.

A bipartite graph is a triple $G = (C, S, E)$, where $C$ and $S$ are two finite sets and $E$ is a subset of $C \times S$. The elements of $E$ are the edges of $G$. The elements of $(C \times S) \setminus E$ are the non-edges of $G$. For each $c \in C$, $N_G(c)$ denotes the neighborhood of $c$ in $G$: $N_G(c) = \{s \in S : (c, s) \in E\}$.

Let $M(G)$ denote the set of all quintuples $(s, c, s', c', s'') \in S \times C \times S \times C \times S$ such that $(c, s) \in E$, $(c, s') \in E$, $(c', s') \in E$, $(c', s'') \in E$, $(c, s'') \notin E$, and $(c', s) \notin E$. The latter conditions state that the bipartite graph depicted in [25, Figure 4] is an induced subgraph of $G$. A perfect phylogeny for $G$ is a phylogeny $T$ for $S$ such that $N_G(c)$ is a cluster of $T$ for every $c \in C$. We say that $G$ is M-free [5, 11–13, 22] (or $\Sigma$-free [4, 9, 25]) if the following three equivalent conditions are met:
1. for all $c$, $c' \in C$, $N_G(c) \cap N_G(c') \in \{\emptyset, N_G(c), N_G(c')\}$,
2. $M(G)$ is empty, and
3. there is a perfect phylogeny for $G$.

Put $T_G = \{\emptyset, S\} \cup \{N_G(c) : c \in C\} \cup \{\{s\} : s \in S\}$. If $G$ is M-free then $T_G$ satisfies the following two properties:
1. $T_G$ is a perfect phylogeny for $G$.
2. $T_G$ is a subset of any perfect phylogeny for $G$.

Modelization.
In our model, $S$ is a set of species (or more generally taxa) and $C$ is a set of binary characters. For each $(c, s) \in C \times S$, $(c, s) \in E$ means that species $s$ possesses character $c$ and $(c, s) \notin E$ means that species $s$ does not possess character $c$. Character data come from the morphological and/or molecular properties of the taxa [21]. The assumption of the model is that for all $c \in C$ and all $s$, $s' \in S$, the following two assertions are equivalent:
1. Both species $s$ and $s'$ possess character $c$.
2. Some common ancestor of species $s$ and $s'$ possesses character $c$.

A phylogeny for $S$ satisfies the assumption of the model if, and only if, it is a perfect phylogeny for $G$.

A bipartite draft-graph (or weighted bipartite fuzzy graph [6]) is a triple $H = (C, S, F)$, where $C$ and $S$ are two finite sets and $F$ is a function from $C \times S$ to $\overline{\mathbb{Z}}$. The function $F$ is the weight function of $H$. The range of $F$ is called the weight range of $H$. An edge of $H$ is an element $e \in C \times S$ such that $F(e) \ge 1$. A joker-edge of $H$ is an element $e \in C \times S$ such that $F(e) = 0$. A non-edge of $H$ is an element $e \in C \times S$ such that $F(e) \le -1$. For each $e \in C \times S$, the magnitude of $F(e)$ is the edit cost of $e$ in $H$.

An edition of $H$ is a bipartite graph $G$ of the form $G = (C, S, E)$ for some subset $E \subseteq C \times S$. A conflict between $G$ and $H$ is an element $e \in C \times S$ that satisfies one of the following two conditions:
1. $e$ is an edge of $G$ and $e$ is a non-edge of $H$ or
2. $e$ is a non-edge of $G$ and $e$ is an edge of $H$.

The sum of the edit costs in $H$ over all conflicts between $G$ and $H$ is denoted $\Delta(G, H)$:
\[
\Delta(G, H) = \sum_{e \in E} \max\{0, -F(e)\} + \sum_{e \in (C \times S) \setminus E} \max\{0, F(e)\}.
\]

The following minimization problem and its (parameterized) decision version generalize several previously studied problems:

Name: Minimum M-Free Edition or Min Edit.
Input: A bipartite draft-graph $H$.
Solution: An M-free edition $G$ of $H$.
Measure: $\Delta(G, H)$.

Name: M-free Edition or Edit.
Input: A bipartite draft-graph $H$ and an integer $k \ge 0$.
Question: Is there an M-free edition $G$ of $H$ such that $\Delta(G, H) \le k$?
Parameter: $k$.

For each subset $X \subseteq \overline{\mathbb{Z}}$, define Min Edit-$X$ as the restriction of Min Edit to those bipartite draft-graphs whose weight ranges are subsets of $X$, and similarly, define Edit-$X$ as the restriction of Edit to those instances $(H, k)$ such that the weight range of $H$ is a subset of $X$. Notably, Min Edit-$\{-1, 0, +1\}$ is the Minimum-Flip Supertree problem and its restriction Min Edit-$\{-1, +1\}$ is the Minimum-Flip Consensus Tree problem [4, 5, 9–13, 16, 22].
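The definitions above translate almost literally into code. The following is a minimal Python sketch, not an implementation from the cited literature: the function names and the data layout (a set `E` of `(c, s)` pairs, a dictionary `F` keyed by `(c, s)`) are illustrative assumptions. It tests M-freeness through condition 1, builds $T_G$, and evaluates $\Delta(G, H)$; infinite weights can be passed as `math.inf`.

```python
from itertools import combinations


def neighborhoods(C, S, E):
    """Map each character c to N_G(c) = {s in S : (c, s) in E}."""
    return {c: frozenset(s for s in S if (c, s) in E) for c in C}


def is_m_free(C, S, E):
    """Condition 1: any two neighborhoods are disjoint or nested."""
    N = neighborhoods(C, S, E)
    for c, d in combinations(C, 2):
        common = N[c] & N[d]
        if common and common != N[c] and common != N[d]:
            return False
    return True


def canonical_phylogeny(C, S, E):
    """T_G: the empty set, S, every neighborhood, and every singleton."""
    N = neighborhoods(C, S, E)
    singletons = {frozenset([s]) for s in S}
    return {frozenset(), frozenset(S)} | set(N.values()) | singletons


def edit_cost(C, S, F, E):
    """Delta(G, H) for the edition G = (C, S, E) of H = (C, S, F)."""
    total = 0
    for c in C:
        for s in S:
            w = F[(c, s)]
            # An edge of G pays max(0, -w); a non-edge of G pays max(0, w).
            total += max(0, -w) if (c, s) in E else max(0, w)
    return total
```

The pairwise test takes time quadratic in $|C|$; it is meant as an executable restatement of the definition rather than as the faster recognition algorithm of [20, 21].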
Modelization.
Incomplete and/or possibly erroneous character data sets are naturally modeled by bipartite draft-graphs: joker-edges represent incompleteness and edit costs allow parsimonious error correction.
Supertrees.
The most interesting feature of Min Edit is that it can be thought of as a supertree construction problem and, more precisely, as the optimization problem underlying MRF [4, 9, 11, 12].
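To make the link with supertrees concrete, here is a hedged Python sketch of how a collection of input phylogenies could be turned into a bipartite draft-graph with weight range $\{-1, 0, +1\}$. The encoding is an assumption borrowed from the usual matrix representation of trees, since the text above does not spell it out: each non-trivial cluster of each input phylogeny becomes a character, taxa inside the cluster get weight $+1$, the other taxa of the same phylogeny get $-1$, and taxa absent from that phylogeny get the joker weight $0$. The function name and the input format are hypothetical.

```python
def mrf_draft_graph(phylogenies):
    """Build (C, S, F) from a list of (leaf_set, clusters) pairs.

    Assumed encoding: one character per cluster X with 2 <= |X| < |leaf_set|.
    """
    S = sorted(set().union(*(leaves for leaves, _ in phylogenies)))
    C, F = [], {}
    for i, (leaves, clusters) in enumerate(phylogenies):
        for X in sorted(clusters, key=lambda X: (len(X), sorted(X))):
            if len(X) < 2 or len(X) == len(leaves):
                continue  # skip the empty set, singletons, and the root
            c = (i, tuple(sorted(X)))  # hypothetical character identifier
            C.append(c)
            for s in S:
                if s in X:
                    F[(c, s)] = 1    # taxon inside the cluster
                elif s in leaves:
                    F[(c, s)] = -1   # taxon of this tree outside the cluster
                else:
                    F[(c, s)] = 0    # taxon missing from this tree: joker
    return C, S, F
```

Under this assumed encoding, an M-free edition of the resulting draft-graph yields a phylogeny for the union of the input taxon sets, and each conflict corresponds to one flipped matrix entry.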
Min Edit-$X$ has been studied for several subsets $X \subseteq \overline{\mathbb{Z}}$ [4, 5, 9, 10, 12, 13, 16, 20–22, 25], sometimes implicitly. Let $H = (C, S, F)$ be a bipartite draft-graph and let $k$ be a non-negative integer.

Put $\overline{\mathbb{Z}}_+ = \{n \in \overline{\mathbb{Z}} : n \ge 0\}$ and $\overline{\mathbb{Z}}_- = \{n \in \overline{\mathbb{Z}} : n \le 0\}$. If $H$ has no non-edge, or equivalently, if the weight range of $H$ is a subset of $\overline{\mathbb{Z}}_+$ then the complete bipartite graph $K = (C, S, C \times S)$ is an M-free edition of $H$ such that $\Delta(K, H) = 0$. In the same way, if $H$ has no edge then the empty bipartite graph $K = (C, S, \emptyset)$ is an M-free edition of $H$ such that $\Delta(K, H) = 0$. Hence, Min Edit-$\overline{\mathbb{Z}}_+$ and Min Edit-$\overline{\mathbb{Z}}_-$ are trivial problems.

Now, consider the case where the weight range of $H$ is a subset of $\{-\infty, +\infty\}$. The bipartite graph
\[
G = (C, S, \{e \in C \times S : F(e) = +\infty\})
\]
is an edition of $H$ such that $\Delta(G, H) = 0$; for every edition $G'$ of $H$, $G' \ne G$ implies $\Delta(G', H) = +\infty$. Therefore, solving Min Edit on $H$ reduces to deciding whether $G$ is M-free, which can be achieved in $O(|C| |S|)$ time [20, 21]. Hence, Min Edit-$\{-\infty, +\infty\}$ can be solved in polynomial time because it reduces to the recognition problem associated with the class of M-free bipartite graphs. More generally, Min Edit-$\{-\infty, 0, +\infty\}$ can also be solved in polynomial time because it reduces to the sandwich problem [19] associated with the class of M-free bipartite graphs: in the case where the weight range of $H$ is a subset of $\{-\infty, 0, +\infty\}$, Min Edit can be solved on $H$ in $\widetilde{O}(|C| |S|)$ time [25].

Put $I = \{-1, +\infty\}$, $D = \{-\infty, +1\}$, and $U = \{-1, +1\}$. Min Edit-$I$, Min Edit-$D$, and Min Edit-$U$ (also known as Minimum-Flip Consensus Tree) are the three unweighted edge-modification problems [24] associated with the class of M-free bipartite graphs: Min Edit-$I$ is the insertion (or completion) problem and Min Edit-$D$ is the deletion problem. Edit-$I$, Edit-$D$, and Edit-$U$ are NP-complete [12].

Put $\overline{\mathbb{Z}}^* = \overline{\mathbb{Z}} \setminus \{0\}$. Min Edit-$\overline{\mathbb{Z}}^*$ is the restriction of Min Edit to those bipartite draft-graphs that have no joker-edge. The most positive result concerning Min Edit is that Edit-$\overline{\mathbb{Z}}^*$ is FPT: in the case where the weight range of $H$ is a subset of $\overline{\mathbb{Z}}^*$, deciding whether $(H, k)$ is a yes-instance of Edit (and if so, computing an M-free edition $G$ of $H$ such that $\Delta(G, H) \le k$) can be achieved in $O(6^k |C| |S|)$ time [12]. Better FPT algorithms have been presented for the special cases Min Edit-$I$ [12], Min Edit-$D$ [12], and Min Edit-$U$ [5, 22]. In particular, Edit-$U$ has a polynomial kernel [22].

Exact algorithms based on Integer Linear Programming [13], as well as heuristics [10, 16], have been tested for Min Edit-$\{-1, 0, +1\}$ (also known as Minimum-Flip Supertree).

The aim of the present paper is to complete the study of
Min Edit by proving:
Theorem 1.
For all $\alpha$, $\beta \in \overline{\mathbb{Z}}$ such that $-\alpha < 0 < \beta$ and $(\alpha, \beta) \ne (+\infty, +\infty)$, the following two statements hold:
1. Edit-$\{-\alpha, 0, \beta\}$ is W[2]-hard and
2. if there exists a real constant $\rho \ge 1$ such that Min Edit-$\{-\alpha, 0, \beta\}$ is $\rho$-approximable in polynomial time then P = NP.

The intractabilities of Min Edit-$\{-1, 0, +\infty\}$, Min Edit-$\{-\infty, 0, +1\}$, and Min Edit-$\{-1, 0, +1\}$ (also known as Minimum-Flip Supertree) follow from Theorem 1.

Our proof of Theorem 1 requires the introduction of some material and results from the literature [8]. For all $x$, $y$, $z$, let $\langle x, y \mid z \rangle$ denote the unique phylogeny for $\{x, y, z\}$ having $\{x, y\}$ as a cluster:
\[
\langle x, y \mid z \rangle = \{\emptyset, \{x\}, \{y\}, \{z\}, \{x, y\}, \{x, y, z\}\}.
\]
A resolved triplet is a phylogeny of the form $\langle x, y \mid z \rangle$ for some pairwise distinct $x$, $y$, $z$. Given a phylogeny $T$ for some superset of $\{x, y, z\}$, we say that $\langle x, y \mid z \rangle$ fits $T$ if there exists a cluster $X$ of $T$ such that $X \cap \{x, y, z\} = \{x, y\}$.

Name: Minimum Resolved Triplets Inconsistency or Min RTI.
Input: A finite set $S$ and a set $\mathcal{R}$ of resolved triplets with leaves in $S$.
Solution: A phylogeny $T$ for $S$.
Measure: The number of those elements of $\mathcal{R}$ that do not fit $T$.

Name: Resolved Triplets Inconsistency or RTI.
Input: A finite set $S$, a set $\mathcal{R}$ of resolved triplets with leaves in $S$, and an integer $k \ge 0$.
Question: Is there a phylogeny $T$ for $S$ such that at most $k$ elements of $\mathcal{R}$ do not fit $T$?
Parameter: $k$.

Theorem 2 (Byrka, Guillemot, and Jansson 2010 [8]).
1. RTI is W[2]-hard.
2. If there exists a real constant $\rho \ge 1$ such that Min RTI is $\rho$-approximable in polynomial time then P = NP.

The idea behind the proof of Theorem 1 is the following: given an instance $(S, \mathcal{R})$ of Min RTI, computing a "good" solution of Min RTI on $(S, \mathcal{R})$ amounts to computing a "good" MRF supertree for the phylogenies in $\mathcal{R}$.

Proof of Theorem 1.1.
Theorem 1.1 is deduced from Theorem 2.1: we show that RTI FPT-reduces to Edit-$\{-\alpha, 0, \beta\}$. Put $\gamma = \min\{\alpha, \beta\}$. Note that $\gamma$ is a positive integer.

Let $(S, \mathcal{R}, k)$ be an arbitrary instance of RTI. The reduction maps $(S, \mathcal{R}, k)$ to an instance $(H, \gamma k)$ of Edit-$\{-\alpha, 0, \beta\}$, where $H$ is as follows. Let $C = \{1, 2, \ldots, |\mathcal{R}|\}$. Write $\mathcal{R}$ in the form $\mathcal{R} = \{\langle x_c, y_c \mid z_c \rangle : c \in C\}$. Let $F$ be the function from $C \times S$ to $\overline{\mathbb{Z}}$ given by:
\[
F(c, s) =
\begin{cases}
\beta & \text{if } s \in \{x_c, y_c\}, \\
0 & \text{if } s \notin \{x_c, y_c, z_c\}, \\
-\alpha & \text{if } s = z_c
\end{cases}
\]
for all $(c, s) \in C \times S$. Let $H = (C, S, F)$.

Clearly, $(H, \gamma k)$ is computable from $(S, \mathcal{R}, k)$ in polynomial time. It remains to prove that $(S, \mathcal{R}, k)$ is a yes-instance of RTI if, and only if, $(H, \gamma k)$ is a yes-instance of Edit.

If.
Assume that $(H, \gamma k)$ is a yes-instance of Edit. Then, there exists an M-free edition $G$ of $H$ such that $\Delta(G, H) \le \gamma k$. Let $C'$ denote the set of all $c \in C$ such that $(c, s)$ is a conflict between $G$ and $H$ for at least one $s \in \{x_c, y_c, z_c\}$. There are at least $|C'|$ conflicts between $G$ and $H$, and each of them has edit cost $\alpha$ or $\beta$, hence at least $\gamma$. We thus have $\gamma |C'| \le \Delta(G, H)$, and so $|C'| \le k$. Let $T$ be a perfect phylogeny for $G$. For each $c \in C \setminus C'$, we have $N_G(c) \cap \{x_c, y_c, z_c\} = \{x_c, y_c\}$, and thus $\langle x_c, y_c \mid z_c \rangle$ fits $T$. Hence, $T$ is a phylogeny for $S$ such that at most $k$ elements of $\mathcal{R}$ do not fit $T$. Therefore, $(S, \mathcal{R}, k)$ is a yes-instance of RTI.
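The construction of $H$ and the argument of the "If" direction can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than part of the proof: triplets are given as tuples `(x, y, z)` standing for $\langle x, y \mid z \rangle$, an edition is a set `E` of `(c, s)` pairs, and all function names are hypothetical. The phylogeny extracted from an M-free edition is $T_G$, which is one possible choice of perfect phylogeny.

```python
def rti_to_edit(S, triplets, k, alpha=1, beta=1):
    """Map an RTI instance (S, R, k) to the Edit instance (H, gamma * k)."""
    gamma = min(alpha, beta)
    C = list(range(1, len(triplets) + 1))
    F = {}
    for c, (x, y, z) in zip(C, triplets):
        for s in S:
            if s in (x, y):
                F[(c, s)] = beta      # x_c and y_c should be neighbors of c
            elif s == z:
                F[(c, s)] = -alpha    # z_c should not be a neighbor of c
            else:
                F[(c, s)] = 0         # every other leaf is a joker-edge
    return (C, list(S), F), gamma * k


def fits(triplet, T):
    """True if some cluster X of T satisfies X ∩ {x, y, z} = {x, y}."""
    x, y, z = triplet
    return any(X & {x, y, z} == {x, y} for X in T)


def phylogeny_of_edition(C, S, E):
    """T_G for the edition G = (C, S, E); a phylogeny when G is M-free."""
    clusters = {frozenset(), frozenset(S)} | {frozenset([s]) for s in S}
    for c in C:
        clusters.add(frozenset(s for s in S if (c, s) in E))
    return clusters


def misfits(triplets, T):
    """Number of triplets that do not fit T (the measure of Min RTI)."""
    return sum(1 for t in triplets if not fits(t, T))
```

For the two contradictory triplets $\langle a, b \mid c \rangle$ and $\langle b, c \mid a \rangle$ with $\alpha = \beta = 1$, an edition that keeps the first neighborhood and enlarges the second one to $S$ has a single conflict, and exactly one triplet fails to fit the resulting phylogeny.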
Only if.
Assume that $(S, \mathcal{R}, k)$ is a yes-instance of RTI. Then, there exists a phylogeny $T$ for $S$ such that at most $k$ elements of $\mathcal{R}$ do not fit $T$. Let $C'$ denote the set of all $c \in C$ such that $\langle x_c, y_c \mid z_c \rangle$ does not fit $T$. For each $c \in C \setminus C'$, let $X_c$ be a cluster of $T$ such that $X_c \cap \{x_c, y_c, z_c\} = \{x_c, y_c\}$. If $\alpha \le \beta$ then let $X_c = S$ for each $c \in C'$; if $\beta < \alpha$ then let $X_c = \{x_c\}$ for each $c \in C'$. Put $G = \left(C, S, \bigcup_{c \in C} \{c\} \times X_c\right)$.
1. $G$ is an edition of $H$.
2. $T$ is a perfect phylogeny for $G$ because $N_G(c) = X_c$ is a cluster of $T$ for all $c \in C$. Therefore, $G$ is M-free.
3. Let $\Gamma$ denote the set of all conflicts between $G$ and $H$. If $\alpha \le \beta$ then $\Gamma = \{(c, z_c) : c \in C'\}$; if $\beta < \alpha$ then $\Gamma = \{(c, y_c) : c \in C'\}$. The edit cost in $H$ of every conflict between $G$ and $H$ equals $\gamma$. Therefore, we have $\Delta(G, H) = \gamma |\Gamma| = \gamma |C'| \le \gamma k$.

Hence, $(H, \gamma k)$ is a yes-instance of Edit. ⊓⊔

Proof of Theorem 1.2.
Let $\rho$ be a real number greater than or equal to 1. It follows from the proof of Theorem 1.1 that if $G$ is a $\rho$-approximate solution of Min Edit on $H$ then any perfect phylogeny for $G$ is a $\rho$-approximate solution of Min RTI on $(S, \mathcal{R})$. Therefore, if Min Edit is $\rho$-approximable in polynomial time then Min RTI is also $\rho$-approximable in polynomial time. It is now clear that Theorem 1.2 follows from Theorem 2.2. ⊓⊔

To conclude, let us contrast Theorem 1 with two recent results.

The Maximum Parsimony (MP) problem [2] is the NP-hard optimization problem [14, 18] underlying MRP, as Min Edit is the optimization problem underlying MRF. Although Min Edit is NP-hard to approximate within any constant factor by Theorem 1.2, MP is approximable within a constant factor in polynomial time [2].

Min Edit and Weighted Fuzzy Cluster Editing (WFCE) [6] are closely related: WFCE is the draft-graph edition problem corresponding to the class of $P_3$-free graphs. Edit is W[2]-hard by Theorem 1.1 but WFCE has recently been shown to be fixed-parameter tractable [23] (see also [7, 15]).
References
1. A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput., 10(3):405–421, 1981.
2. N. Alon, B. Chor, F. Pardi, and A. Rapoport. Approximate maximum parsimony and ancestral maximum likelihood. IEEE/ACM Trans. Comput. Biology Bioinform., 7(1):183–187, 2010.
3. B. R. Baum. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon, 41(1):3–10, 1992.
4. O. R. P. Bininda-Emonds, editor. Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, volume 4 of Computational Biology Series. Kluwer Academic, 2004.
5. S. Böcker, Q. B. A. Bui, and A. Truss. An improved fixed-parameter algorithm for minimum-flip consensus trees. In M. Grohe and R. Niedermeier, editors, Proc. of International Workshop on Parameterized and Exact Computation (IWPEC 2008), volume 5018 of Lect. Notes Comput. Sci., pages 43–54. Springer-Verlag, 2008.
6. H. L. Bodlaender, M. R. Fellows, P. Heggernes, F. Mancini, C. Papadopoulos, and F. Rosamond. Clustering with partial information. Theor. Comput. Sci., 411(7-9):1202–1211, 2010.
7. N. Bousquet, J. Daligault, and S. Thomassé. Multicut is FPT. In Proc. of ACM Symposium on Theory of Computing (STOC 2011), pages 459–468. ACM, 2011.
8. J. Byrka, S. Guillemot, and J. Jansson. New results on optimizing rooted triplets consistency. Discrete Appl. Math., 158(11):1136–1147, 2010.
9. D. Chen, L. Diao, O. Eulenstein, D. Fernández-Baca, and M. J. Sanderson. Flipping: A supertree construction method. In M. F. Janowitz, F.-J. Lapointe, F. R. McMorris, B. Mirkin, and F. S. Roberts, editors, Bioconsensus: DIMACS Working Group Meetings on Bioconsensus, volume 61 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 135–160. Amer. Math. Soc., 2003.
10. D. Chen, O. Eulenstein, D. Fernández-Baca, and J. G. Burleigh. Improved heuristics for minimum-flip supertree construction. Evol. Bioinform. Online, 2:347–356, 2006.
11. D. Chen, O. Eulenstein, D. Fernández-Baca, and M. Sanderson. Supertrees by flipping. In Proc. of Conference on Computing and Combinatorics (COCOON 2002), volume 2387 of Lect. Notes Comput. Sci., pages 391–400. Springer-Verlag, 2002.
12. D. Chen, O. Eulenstein, D. Fernández-Baca, and M. Sanderson. Minimum-flip supertrees: Complexity and algorithms. IEEE/ACM Trans. Comput. Biology Bioinform., 3(2):165–173, 2006.
13. M. Chimani, S. Rahmann, and S. Böcker. Exact ILP solutions for phylogenetic minimum flip problems. In Proc. of ACM Conf. on Bioinformatics and Computational Biology (ACM-BCB 2010), pages 147–153. ACM, 2010.
14. W. Day, D. Johnson, and D. Sankoff. The computational complexity of inferring rooted phylogenies by parsimony. Math. Biosci., 81(1):33–42, 1986.
15. E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theor. Comput. Sci., 361(2-3):172–187, 2006.
16. O. Eulenstein, D. Chen, J. G. Burleigh, D. Fernández-Baca, and M. J. Sanderson. Performance of flip supertree construction with a heuristic algorithm. Syst. Biol., 53(2):299–308, 2004.
17. J. Flum and M. Grohe. Parameterized Complexity Theory. Springer-Verlag, 2006.
18. L. Foulds and R. L. Graham. The Steiner problem in phylogeny is NP-complete. Adv. Appl. Math., 3(1):43–49, 1982.
19. M. C. Golumbic, H. Kaplan, and R. Shamir. Graph sandwich problems. J. Algorithms, 19(3):449–473, 1995.
20. D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21(1):19–28, 1991.
21. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
22. C. Komusiewicz and J. Uhlmann. A cubic-vertex kernel for flip consensus tree. In Proc. of Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2008), volume 2 of Leibniz International Proceedings in Informatics, pages 280–291. Dagstuhl, 2008.
23. D. Marx and I. Razgon. Fixed-parameter tractability of multicut parameterized by the size of the cutset. In Proc. of ACM Symposium on Theory of Computing (STOC 2011), pages 469–478. ACM, 2011.
24. A. Natanzon, R. Shamir, and R. Sharan. Complexity classification of some edge modification problems. Discrete Appl. Math., 113(1):109–128, 2001.
25. I. Pe'er, T. Pupko, R. Shamir, and R. Sharan. Incomplete directed perfect phylogeny. SIAM J. Comput., 33(3):590–607, 2004.
26. M. A. Ragan. Phylogenetic inference based on matrix representation of trees.