Phylogenetic mixtures on a single tree can mimic a tree of another topology

arXiv [q-bio.PE]

Running head: PHYLOGENETIC MIXTURES ON A SINGLE TREE CAN MIMIC ANOTHER TOPOLOGY

Frederick A. Matsen and Mike Steel

Biomathematics Research Centre
University of Canterbury
Private Bag 4800
Christchurch, New Zealand

Corresponding Author: Frederick A. Matsen
phone: +64 3 364 2987 x7431
fax: +64 3 364 2587
email: [email protected]

Keywords: Phylogenetics; Mixture Model; Sequence Evolution; Model Identifiability

Abstract

Phylogenetic mixtures model the inhomogeneous molecular evolution commonly observed in data. The performance of phylogenetic reconstruction methods where the underlying data is generated by a mixture model has stimulated considerable recent debate. Much of the controversy stems from simulations of mixture model data on a given tree topology for which reconstruction algorithms output a tree of a different topology; these findings were held up to show the shortcomings of particular tree reconstruction methods. In so doing, the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the “correct” method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.

It is now well known that molecular evolution is heterogeneous, i.e. that it varies across time and position (Simon et al., 1996). A classic example is stems and loops of ribosomal RNA: the evolution of one side of a stem is strongly constrained to match the complementary side, whereas for loops different constraints exist (Springer and Douzery, 1996). Heterogeneous evolution between genes is also widespread, where even the general features of evolutionary history for neighboring genes may differ wildly (Ochman et al., 2000). Presently it is not uncommon to use concatenated sequence data from many genes for phylogenetic inference (Phillips et al., 2004), which can lead to very high levels of apparent heterogeneity (Baldauf et al., 2000). Furthermore, empirical evidence using the covarion model shows that sometimes more subtle partitions of the data can exist, for which separate analysis is difficult (Wang et al., 2007).

This heterogeneity is typically formulated as a mixture model (Pagel and Meade, 2004).
Mathematically, a phylogenetic mixture model is simply a weighted average of site pattern frequencies derived from a number of phylogenetic trees, which may be of the same or different topologies. Even though many phylogenetics programs accept aligned sequences as input, the only data actually used in the vast majority of phylogenetic algorithms is the derived site pattern frequencies. Thus, in these algorithms, any record of position is lost and heterogeneous evolution appears identical to homogeneous evolution under an appropriate phylogenetic mixture model. For simplicity, we call a mixture of site pattern frequencies from two trees (which may be of the same or different topology) a mixture of two trees; when the two trees have the same underlying topology, the mixture will be called a mixture of branch length sets on a tree.

Mixture models have proven difficult for phylogenetic reconstruction methods, which have historically sought to find a single process explaining the data. For example, it has been shown that mixtures of two different tree topologies can mislead MCMC-based tree reconstruction (Mossel and Vigoda, 2005). It is also known that there exist mixtures of branch length sets on one tree which are indistinguishable from mixtures of branch length sets on a tree of a different topology (Steel et al., 1994; Štefankovič and Vigoda, 2007a,b). Recently, simulations of mixture models from “heterotachous” (changing rates through time) evolution have been shown to cause reconstruction methods to fail (Ruano-Rubio and Fares, 2007).

The motivation for our work is the observation that both theory and simulations have shown that in certain parameter regimes, phylogenetic reconstruction methods return a tree topology different from the one used to generate the mixture data. The parameter regime in this class of examples is similar to that shown in Figure 1, with two neighboring pendant edges which alternate being long and short.
After mixing and reconstruction, these edges may no longer be adjacent on the reconstructed tree. We call this mixed branch repulsion. This phenomenon has been observed extensively in simulation (Kolaczkowski and Thornton, 2004; Spencer et al., 2005; Philippe et al., 2005; Gadagkar and Kumar, 2005) and it has been proved that certain distance and maximum likelihood methods are susceptible to this effect (Chang, 1996; Štefankovič and Vigoda, 2007a,b). Up to this point such results have been interpreted as pathological behavior of the reconstruction algorithms, which has led to a heated debate about which reconstruction methods perform best in this situation (Steel, 2005; Thornton and Kolaczkowski, 2005). Implicit in this debate is the assumption that a mixture of trees on one topology gives different site pattern frequencies than that of an unmixed tree of a different topology. This leads to the natural question of how similar these two site pattern frequencies can be.

Here we demonstrate that mixtures of two sets of branch lengths on a tree of one topology can exactly mimic the site pattern frequencies of a tree of a different topology under the two-state symmetric model. In fact, there is a precisely characterizable (codimension two) region of parameter space where such mixtures exist. Consider two quartet trees of topology 12|34, as shown in Figure 1. Label the pendant branches 1 through 4 according to the taxon labels, and label the internal edge with 5. The first branch length set will be written $t_1, \ldots, t_5$ and the second $s_1, \ldots, s_5$. Now, if k , . . . , k satisfy the following system of inequalities k > k > k > > k , − k k − k k + − k k − k k > , k + k k k · k + k k k > t and s , mixing weights, and positive numbers ℓ , . . . , ℓ such that if for i = 1 , . . . , k i = exp ( − t i − s i )) and t i ≥ ℓ i , the corresponding mixture of two 12 | | d i denote the difference between the branch lengths for edge $i$, i.e. $t_i - s_i$. Then (perhaps after changing the arbitrary numbering of the taxa) either d > d > d > > d or d > > d > d > d must be satisfied in order for mixed branch repulsion to occur. Thus, for example, in one set of branch lengths the pendant edge for taxon 1 should be long and the pendant edge for taxon 2 should be short, while in the other set of branch lengths these roles should be reversed. On the other hand, the branch lengths for taxa 3 and 4 should be both long for one set and both short for the other. Additionally, at least one of the two internal branch lengths needs to be relatively short. There are other more complex criteria, but the above is necessary for exact mixed branch repulsion to occur. However, as noted below, exact mixed branch repulsion is not necessary to “fool” model based methods.

We believe that this similarity between site pattern frequencies generated by mixtures of branch lengths on one tree and corresponding unmixed frequencies on a different tree is what is leading to the mixed branch repulsion observed in theory and simulation. Furthermore, it is possible that even the simple case presented here is directly relevant to reconstructions from data.
First, it is not uncommon to simplify the genetic code from the four standard bases to two (pyrimidines versus purines) in order to reduce the effect of compositional bias when working with genome-scale data on deep phylogenetic relationships (Phillips et al., 2004). Second, when working on such relationships concatenation of genes is common (Baldauf et al., 2000), for which a phylogenetic mixture is the expected result. Finally, the region of parameter space bringing about mixed branch repulsion may become more extensive as the number of concatenated genes increases. Therefore in concatenated gene analysis it may be worthwhile considering incongruence in terms of branch lengths and not just in terms of topology (Rokas et al., 2003; Jeffroy et al., 2006), as highly incongruent branch lengths may produce artifactual results upon concatenation. Other methods may be useful in this setting, such as gene order data, gene presence/absence, or coalescent-based methods to infer the most likely species tree from a collection of gene trees.

Mixed branch repulsion may be more difficult to detect than the usual model mis-specification issues; in the cases presented here the mis-specified single tree model fits the data perfectly. In contrast, although using the wrong mutation model for reconstruction using maximum likelihood can lead to incorrect tree topologies (Goremykin et al., 2005), the resulting model mis-specification can be seen from a poor likelihood score. In the mixtures presented here, there is no way of telling when one is in the mixed regime on one topology or an unmixed regime on another topology. Furthermore, any model selection technique (including likelihood ratio tests, the Akaike Information Criterion and the Bayesian Information Criterion) which chooses a simple model given equal likelihood scores would, in this case, choose a simple unmixed model.
Thereby it would select a tree that is different from the historically correct tree if the true process was generated by a mixture model.

The derivation of the zone resulting in mixed branch repulsion is a conceptually simple application of two of the pillars of theoretical phylogenetics: the Hadamard transform and phylogenetic invariants (Hendy and Penny, 1989; Semple and Steel, 2003; Felsenstein, 2004). The Hadamard transform is a closed form invertible transformation (expressed in terms of the discrete Fourier transform) for gaining the expected site pattern frequencies from the branch lengths and topology of a tree or vice versa. Phylogenetic invariants characterize when a set of site pattern frequencies could be the expected site pattern frequencies for a tree of a given topology. They are identities in terms of the discrete Fourier transform of the site pattern frequencies. Therefore, to derive the above equations, we simply insert the Hadamard formulae for the Fourier transform of pattern probabilities into the phylogenetic invariants, then check to make sure the resulting branch lengths are positive.

Similar considerations lead to an understanding of when it is possible to mix two branch length sets on a tree to reproduce the site pattern frequencies of a tree of the same topology (Proposition 3 of Appendix). For a quartet, two cases are possible. First, a pair of neighboring pendant branch lengths can be equal between the two branch length sets of the mixture. Alternatively, the sum of one pair of neighboring pendant branch lengths and the difference of the other pair can be equal. For trees larger than quartets, the allowable mixtures are determined by these restrictions on the quartets (results to appear elsewhere).
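The two quartet cases can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the authors' derivation: it takes the Fourier transform of a quartet to be the product of edge fidelities along the connecting paths and evaluates the two quartet invariants given in the Appendix. The fidelity convention exp(-2 x length), the function names, and all branch lengths are assumptions made for the example.

```python
import math

def fourier_12_34(lengths):
    """Fourier transforms q_A for a 12|34 quartet.

    lengths = (t1, t2, t3, t4, t5): pendant edges of taxa 1-4, then the internal edge.
    """
    f1, f2, f3, f4, f5 = (math.exp(-2.0 * t) for t in lengths)  # assumed fidelity convention
    return {
        "12": f1 * f2, "34": f3 * f4,
        "13": f1 * f5 * f3, "14": f1 * f5 * f4,
        "23": f2 * f5 * f3, "24": f2 * f5 * f4,
        "1234": f1 * f2 * f3 * f4,
    }

def mix(q, r, alpha):
    """Mixture of two transforms; valid because the transform is linear."""
    return {A: alpha * q[A] + (1.0 - alpha) * r[A] for A in q}

def residuals(m):
    """Left-hand sides of the two 12|34 invariants; both vanish on a 12|34 tree."""
    return (m["1234"] - m["12"] * m["34"],
            m["13"] * m["24"] - m["14"] * m["23"])

# Hypothetical branch length sets chosen for illustration.
t = (0.30, 0.10, 0.50, 0.20, 0.15)
equal_pair = (0.30, 0.10, 0.05, 0.40, 0.25)    # t1 = s1 and t2 = s2
sum_and_diff = (0.15, 0.25, 0.40, 0.10, 0.30)  # t1 + t2 = s1 + s2, t3 - t4 = s3 - s4
generic = (0.20, 0.25, 0.40, 0.15, 0.30)       # neither case holds

def max_residual(s, alphas=(0.1, 0.5, 0.9)):
    """Largest invariant violation over several mixing weights."""
    q, r = fourier_12_34(t), fourier_12_34(s)
    return max(abs(x) for a in alphas for x in residuals(mix(q, r, a)))

print(max_residual(equal_pair))    # zero up to rounding
print(max_residual(sum_and_diff))  # zero up to rounding
print(max_residual(generic))       # clearly nonzero
```

In this sketch the invariants hold in both special cases for every mixing weight tried, and fail for the generic pair of branch length sets.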
For pairs of branch lengths satisfying these criteria, any choice of mixing weights will produce site pattern frequencies satisfying the phylogenetic invariants.

Intuitively, one might expect that when two sets of branch lengths mix to mimic a tree of the same topology, some sort of averaging property would hold for the branch lengths. This is true for pairwise distances in the tree but need not be the case for individual branches, as demonstrated by Figure 2. In fact, it is possible to mix two sets of branch lengths on a tree to mimic a tree of the same topology such that a resulting pendant branch length is arbitrarily small while the corresponding branch length in either of the branch length sets being mixed stays above some arbitrarily large fixed value.

The results in this paper shed some light on the geometry of phylogenetic mixtures (Kim, 2000). As is well known, the set of phylogenetic trees of a given topology forms a compact subvariety of the space of site pattern frequencies (Sturmfels and Sullivant, 2005). The first part of our work demonstrates that there are pairs of points in one such subvariety such that a line between those two points intersects a distinct subvariety (see Figure 3). Therefore the convex hull of one subvariety has a region of intersection with distinct subvarieties. This is stronger than the recently derived result by Štefankovič and Vigoda (2007a,b) that the convex hulls of the varieties intersect. The second part of our work shows that there exist pairs of points in a subvariety such that the line between those points intersects the subvariety. Furthermore, it demonstrates that when such a line between two points intersects the subvariety in a third point, then a subinterval of the line is contained in the subvariety.

This geometric perspective can aid in understanding practical problems of phylogenetic estimation.
The question of when maximum likelihood selects the “wrong” topology given mixture data was initiated by Chang (1996) who found a one-parameter space of such examples under the two-state symmetric (CFN) model. Recently Štefankovič and Vigoda (2007a) found a two-parameter space of such examples for the CFN model, and a one-dimensional space of examples for the Jukes-Cantor DNA (JC) and Kimura two and three parameter (K2P, K3P) models. A potential criticism of these results is that because the set of examples has lower dimension than the ambient parameter space one is unlikely to encounter them in practice.

However, a simple geometric argument can show that the dimension of the set of all such pathological examples is equal to the dimension of the parameter space for all four of these models. To see why this holds we first recall the definition of the Kullback-Leibler divergence of probability distribution $q$ from a second distribution $p$:
\[ \delta_{KL}(p, q) = \sum_i p_i \log \frac{p_i}{q_i}. \]
The $p$ vector is typically thought of as a data vector and the $q$ vector is typically the model data. Maximum likelihood seeks to find the model data vector $q$ which minimizes $\delta_{KL}(p, q)$. Let $V_{12|34}$ be the set of all data vectors which correspond exactly to trees of topology 12|34, and similarly for $V_{13|24}$. For $V = V_{12|34}$ or $V_{13|24}$ let $\delta_{KL}(p, V)$ denote the divergence of $p$ from the “closest” point in $V$, i.e. the minimum of $\delta_{KL}(p, v)$ where $v$ ranges over $V$. We show in Lemma 8 that this function exists and is continuous across the set of probability vectors $p$ with all components positive.

Now, pick any of the above group-based models, and let $y$ be a corresponding pathological mixture on 12|34 for that model supplied by Theorem 2 of Štefankovič and Vigoda (2007a). Maximum likelihood chooses topology 13|24 over 12|34 for a data vector $p$ exactly when $\delta_{KL}(p, V_{13|24})$ is less than $\delta_{KL}(p, V_{12|34})$, therefore $\delta_{KL}(y, V_{13|24}) < \delta_{KL}(y, V_{12|34})$. By the properties of continuous functions, this inequality also holds for all probability vectors $y'$ close to $y$ which also have all components positive. Therefore ML will choose 13|24 over 12|34 for all such $y'$. Because the transformation taking branch length and mixing weight parameters to expected site pattern frequencies is continuous, one can change branch lengths and mixing weight arbitrarily by a small amount and still have ML choose 13|24 for the resulting data. This gives the required full-dimensional space of examples.

We now indicate how our results fit into previous work on identifiability and discuss prospects for generalization. For four-state models with extra symmetries such as the Jukes-Cantor DNA model and the Kimura two-parameter model it is known that there exist linear phylogenetic invariants which imply identifiability of the topology for mixture model data (Štefankovič and Vigoda, 2007a). The topology is also identifiable for phylogenetic mixtures in which each underlying process is described by an infinite state model (Mossel and Steel, 2004; Mossel and Steel, 2005); such processes may be relevant to data involving rare (homoplasy-free) genomic changes. Therefore the pathologies observed here could not occur for those models. Furthermore, Allman and Rhodes (2006) have shown generic identifiability (i.e. identifiability for “almost all” parameter regimes) when the number of states exceeds the number of mixture classes. As stated above, the set of examples presented here has dimension two less than the ambient space (even though the conditions of the Allman and Rhodes work are not satisfied). However, we note that even when tree topology is generically identifiable (but not globally identifiable) for some model, arguments similar to the above can show that there exist positive-volume regions where the data is closer to that from a tree of a different topology than a tree of the same topology.

A related though distinct question concerns identifiability under mixture models when the data partitions are known. For example, we may have a number of independent sequence data sets for the same set of taxa, perhaps corresponding to different genes. In this setting it may be reasonable to assume that the sequence sites within each data set evolve under the same branch lengths (perhaps subject to some i.i.d. rates-across-sites distribution), but that the branch lengths between the data sets may vary.
The underlying tree topology may be the same or different across the data sets, however let us first consider the case where there is a common underlying topology. In the case where each data set consists of sequences of length one we are back in the setting of phylogenetic mixtures considered above. However, for longer blocks of sequences, we might hope to exploit the knowledge that the sequences within each block have evolved under a common mechanism. If the sequence length within any one data set becomes large we will be able to infer the underlying tree for that data set correctly, so the interesting question is what happens when the data sets provide only ‘mild’ support for their particular reconstructed tree. Assume that all (or nearly all) of the data sets contain sufficiently many sites so that the tree reconstruction method $M$ positively favors the true tree over any particular alternative tree. By this we mean that $M$ returns the true tree with a probability that is greater by a factor of at least $1 + \epsilon$ (with $\epsilon > 0$) than the probability that $M$ returns each particular different tree. Then it is easily shown that a majority rule selection procedure applied to the reconstructed trees across the $k$ independent data sets will correctly return the true underlying tree topology with a probability that goes to 1 as $k$ grows. Note that this claim holds generally, not just for the two-state symmetric model. Of course it is also possible that the underlying tree may differ across data sets (in the case of genes perhaps due to lineage sorting; Degnan and Rosenberg, 2006), in which case the reconstruction question becomes more complex.

In a forthcoming article (Matsen, Mossel, and Steel 2007) we further investigate identifiability of mixture models. Using geometric methods we make some progress towards understanding how “common” non-identifiable mixtures should be for the symmetric and non-symmetric two-state models; for mixtures of many trees they appear to be quite common. A new combinatorial theorem implies identifiability for certain types of mixture models when branch lengths are clock-like. A simple argument shows identifiability for rates-across-sites models. We also investigate mixed branch repulsion for larger trees.

Many interesting questions remain. First of all, is exact mixed branch repulsion an issue for any nontrivial model on four states? Also, what is the zone of parameter space for which a mixture of branch lengths on a tree is closer (in some meaningful way) to the expected site pattern frequencies of a tree of different topology than to those for a tree of the original topology? How often does mixed branch repulsion present itself given “random” branch lengths? Considering the rapid pace of development in this field we do not expect these questions to be open for long.

Acknowledgments

The authors would like to thank Cecile Ané, Andrew Roger, Jack Sullivan, and an anonymous reviewer for comments which greatly improved the paper.
Dennis Wong provided advice on the figures, and David Bryant’s Maple code was used to check results. Funding for this work was provided by the Allan Wilson Centre for Molecular Ecology and Evolution, New Zealand.

References

Allman, E. S. and J. A. Rhodes. 2006. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J. Comput. Biol. 13:1101–1113.
Baldauf, S. L., A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle. 2000. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972–977.
Chang, J. T. 1996. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134:189–215.
Degnan, J. H. and N. A. Rosenberg. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2:762–768.
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Press, Sunderland, MA.
Gadagkar, S. R. and S. Kumar. 2005. Maximum likelihood outperforms maximum parsimony even when evolutionary rates are heterotachous. Mol. Biol. Evol. 22:2139–2141.
Goremykin, V. V., B. Holland, K. I. Hirsch-Ernst, and F. H. Hellwig. 2005. Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Mol. Biol. Evol. 22:1813–1822.
Hendy, M. D. and D. Penny. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297–309.
Jeffroy, O., H. Brinkmann, F. Delsuc, and H. Philippe. 2006. Phylogenomics: the beginning of incongruence? Trends Genet. 22:225–231.
Kim, J. 2000. Slicing hyperdimensional oranges: the geometry of phylogenetic estimation. Mol. Phylogenet. Evol. 17:58–75.
Kolaczkowski, B. and J. W. Thornton. 2004. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431:980–984.
Matsen, F. A., E. Mossel, and M. Steel. 2007. Mixed-up trees: the structure of phylogenetic mixtures. Submitted to Bull. Math. Biol. arXiv:0705.4328 [q-bio.PE].
Mossel, E. and M. Steel. 2004. A phase transition for a random cluster model on phylogenetic trees. Math. Biosci. 187:189–203.
Mossel, E. and M. Steel. 2005. How much can evolved characters tell us about the tree that generated them? Pages 384–412 in Mathematics of Evolution and Phylogeny (O. Gascuel, ed.). Oxford University Press.
Mossel, E. and E. Vigoda. 2005. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science 309:2207–2209.
Moulton, V. and M. Steel. 2004. Peeling phylogenetic ‘oranges’. Adv. Appl. Math. 33:710–727.
Ochman, H., J. G. Lawrence, and E. A. Groisman. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405:299–304.
Pagel, M. and A. Meade. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53:571–581.
Philippe, H., Y. Zhou, H. Brinkmann, N. Rodrigue, and F. Delsuc. 2005. Heterotachy and long-branch attraction in phylogenetics. BMC Evol. Biol. 5:50.
Phillips, M. J., F. Delsuc, and D. Penny. 2004. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 21:1455–1458.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804.
Ruano-Rubio, V. and M. Fares. 2007. Artifactual phylogenies caused by correlated distribution of substitution rates among sites and lineages: the good, the bad, and the ugly. Syst. Biol. 56:68–82.
Semple, C. and M. Steel. 2003. Phylogenetics. vol. 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford.
Simon, C., L. Nigro, J. Sullivan, K. Holsinger, A. Martin, A. Grapputo, A. Franke, and C. McIntosh. 1996. Large differences in substitutional pattern and evolutionary rate of 12S ribosomal RNA genes. Mol. Biol. Evol. 13:923–932.
Spencer, M., E. Susko, and A. J. Roger. 2005. Likelihood, parsimony, and heterogeneous evolution. Mol. Biol. Evol. 22:1161–1164.
Springer, M. S. and E. Douzery. 1996. Secondary structure and patterns of evolution among mammalian mitochondrial 12S rRNA molecules. J. Mol. Evol. 43:357–373.
Steel, M. 2005. Should phylogenetic models be trying to “fit an elephant”? Trends Genet. 21:307–309.
Steel, M. A., L. A. Szekely, and M. D. Hendy. 1994. Reconstructing trees when sequence sites evolve at variable rates. J. Comput. Biol. 1:153–163.
Sturmfels, B. and S. Sullivant. 2005. Toric ideals of phylogenetic invariants. J. Comput. Biol. 12:204–228.
Thornton, J. W. and B. Kolaczkowski. 2005. No magic pill for phylogenetic error. Trends Genet. 21:310–311.
Štefankovič, D. and E. Vigoda. 2007a. Phylogeny of mixture models: Robustness of maximum likelihood and non-identifiable distributions. J. Comput. Biol. 14:156–189.
Štefankovič, D. and E. Vigoda. 2007b. Pitfalls of heterogeneous processes for phylogenetic reconstruction. Syst. Biol. 56:113–124.
Wang, H. C., M. Spencer, E. Susko, and A. Roger. 2007. Testing for covarion-like evolution in protein sequences. Mol. Biol. Evol. 24:294–305.

Appendix

In this section we provide more precise statements and proofs of the propositions in the text. The proofs will be presented in the reverse of the order in which they were stated in the main text: first the fact that it is possible to mix two sets of branch lengths on a tree to mimic a tree of the same topology, then that it is possible to mix branch lengths to mimic a tree of a distinct topology.

As stated in the main text, the general strategy of the proofs is simple: use the Hadamard transform to calculate Fourier transforms of site pattern probabilities and then insert these formulas into the phylogenetic invariants. These steps would become very messy except for a number of simplifications. First, because the discrete Fourier transform is linear, a transform of a mixture is simply a mixture of the corresponding transforms. Second, the fact that the original trees satisfy a set of phylogenetic invariants reduces the complexity of the mixed invariants. Finally, the product of the exponentials of the branch lengths appears in all formulas, and division leads to a substantial simplification.

First we remind the reader of the main tools and fix notation. Note that for the entire paper we will be working with the two-state symmetric (also known as Cavender-Farris-Neyman) model.
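The first simplification can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the two-state symmetric model with a uniform state distribution at an internal node, specifies each edge by its probability of a state change, and computes the Fourier transform directly as a signed sum over site patterns rather than through a Hadamard matrix. The function names (`pattern_probs`, `fourier`) and the numerical change probabilities are hypothetical, chosen only for illustration.

```python
from itertools import product

def pattern_probs(change):
    """Site pattern probabilities for the quartet 12|34.

    change[0..3] are the state-change probabilities on the pendant edges
    of taxa 1..4 and change[4] is that of the internal edge (assumed inputs).
    """
    def step(child, parent, edge):
        return change[edge] if child != parent else 1.0 - change[edge]

    probs = {}
    for x in product((0, 1), repeat=4):
        total = 0.0
        # Sum over the unobserved states u, v at the two internal nodes.
        for u, v in product((0, 1), repeat=2):
            p = 0.5 * step(v, u, 4)  # uniform state at u, then the internal edge
            p *= step(x[0], u, 0) * step(x[1], u, 1)
            p *= step(x[2], v, 2) * step(x[3], v, 3)
            total += p
        probs[x] = total
    return probs

def fourier(probs, subset):
    """Signed sum over patterns: sign is the parity of state-1 taxa in subset."""
    return sum(p * (-1) ** sum(x[i] for i in subset) for x, p in probs.items())

# Even-order subsets of the four taxa (0-based indices).
EVEN_SUBSETS = [(), (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1, 2, 3)]

p1 = pattern_probs([0.30, 0.05, 0.10, 0.20, 0.08])
p2 = pattern_probs([0.05, 0.30, 0.25, 0.15, 0.12])
alpha = 0.3
mixture = {x: alpha * p1[x] + (1 - alpha) * p2[x] for x in p1}

for A in EVEN_SUBSETS:
    lhs = fourier(mixture, A)
    rhs = alpha * fourier(p1, A) + (1 - alpha) * fourier(p2, A)
    assert abs(lhs - rhs) < 1e-12  # transform of the mixture = mixture of transforms
```

Because the transform is a fixed linear combination of the pattern probabilities, the same agreement holds for any mixing weight and any pair of trees, which is what lets the proofs work with mixtures of transforms directly.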

The Hadamard transform and phylogenetic invariants

For a given edge $e$ of branch length $\gamma(e)$ we will denote
\[ \theta(e) = \exp(-2\gamma(e)) \tag{1} \]
which ranges between zero and one for positive branch lengths. We call this number the “fidelity” of the edge, as it quantifies the quality of transmission of the ancestral state across the edge. For $A \subseteq \{1, \ldots, n\}$ of even order, let $q_A = (H_{n-1}\bar{p})_A$ be the Fourier transform of the split probabilities, where $H_n$ is the $2^n$ by $2^n$ Hadamard matrix (Semple and Steel, 2003).

Quartet trees will be designated by their splits, i.e. 13|24 refers to a quartet with taxa labeled 1 and 3 on one side of the quartet and taxa 2 and 4 on the other.

By the first identity in the proof of Theorem 8.6.3 of Semple and Steel (2003) one can express the Fourier transform of the split probabilities in terms of products of fidelities. That is, for any subset $A \subseteq \{1, \ldots, n\}$ of even order,
\[ q_A = \prod_{e \in \mathcal{P}(T,A)} \theta(e) \tag{2} \]
where $\mathcal{P}(T,A)$ is the set of edges which lie in the set of edge-disjoint paths connecting the taxa in $A$ to each other. This set is uniquely defined (again, see Semple and Steel (2003)).

From this equation, we can derive values for the fidelities from the Fourier transforms of the split probabilities. In particular, it is simple to write out the fidelity of a pendant edge on a quartet. For example,
\[ \theta_1 = \sqrt{\frac{\theta_1\theta_2 \cdot \theta_1\theta_3\theta_5}{\theta_2\theta_3\theta_5}} = \sqrt{\frac{q_{12}\,q_{13}}{q_{23}}} \]
for a tree of topology 12|34. In general, we have the following lemma:

Lemma 1. If $a$, $b$, and $c$ are distinct pendant edge labels on a quartet such that $a$ and $b$ are adjacent, then the fidelity of the pendant edge $a$ is
\[ \sqrt{\frac{q_{ab}\,q_{ac}}{q_{bc}}}. \]

A similar calculation leads to an analogous lemma for the internal edge:

Lemma 2. The fidelity of the internal edge of an $ab|cd$ quartet tree is
\[ \sqrt{\frac{q_{ac}\,q_{bd}}{q_{ab}\,q_{cd}}}. \]

This paper will also make extensive use of the method of phylogenetic invariants. These are polynomial identities in the Fourier transform of the split probabilities which are satisfied for a given tree topology. Invariants are understood in a very general setting (see Sturmfels and Sullivant (2005)), however here we only require invariants for the simplest case: a quartet tree with the two-state symmetric model. In particular, for the quartet tree $ab|cd$, the two phylogenetic invariants are
\[ q_{abcd} - q_{ab}\,q_{cd} = 0 \tag{3} \]
\[ q_{ac}\,q_{bd} - q_{ad}\,q_{bc} = 0. \tag{4} \]
A $q$-vector mimics the Fourier transforms of site pattern frequencies of a nontrivial tree exactly when it satisfies the phylogenetic invariants and has corresponding edge fidelities (given by Lemmas 1 and 2) between zero and one.

This paper is primarily concerned with the following situation: a mixture of two sets of branch lengths on a quartet tree which mimics the site pattern frequencies of an unmixed tree. We fix the following notation: the two branch length sets will be called $t_i$ and $s_i$, the corresponding fidelities will be called $\theta_i$ and $\psi_i$, and the Fourier transforms of the site pattern frequencies will be labeled with $q$ and $r$, respectively. The internal edge of the quartet will carry the label $i = 5$, and the pendant edges are labeled according to their terminal taxa (e.g. $i = 2$ is the edge terminating in the second taxon). The mixing weight will be written $\alpha$, and we make the convention that the mixture will take the $t_i$ branch length set with probability $\alpha$ and $s_i$ with probability $1 - \alpha$.

Mixtures mimicking a tree of the same topology

In this section we describe conditions on mixtures such that a nontrivial mixture of two branch length sets on 12|34 can give the same probability distribution as a single tree of the same topology.

Mixing two branch length sets on a 12|34 quartet tree with the above notation leads to the following form of invariant (3) for a resulting tree also of topology 12|34:
\[ (\alpha + 1 - \alpha)(\alpha q_{1234} + (1-\alpha) r_{1234}) - (\alpha q_{12} + (1-\alpha) r_{12})(\alpha q_{34} + (1-\alpha) r_{34}) = 0. \tag{5} \]
Multiplying out terms then collecting, there will be an $\alpha^2 (q_{1234} - q_{12} q_{34})$ term which is zero by the phylogenetic invariants for the 12|34 topology. Similarly, the terms with $(1-\alpha)^2$ vanish. Dividing by $\alpha(1-\alpha)$, which we assume to be nonzero, equation (5) becomes
\[ q_{1234} + r_{1234} - (q_{12} r_{34} + r_{12} q_{34}) = 0. \]
Applying invariant (3) for the 12|34 topology and simplifying leads to the following equivalent form of (5):
\[ (q_{12} - r_{12})(q_{34} - r_{34}) = 0. \tag{6} \]
The same sorts of moves lead to the second invariant of the mixed tree:
\[ q_{13} r_{24} + r_{13} q_{24} - (q_{14} r_{23} + r_{14} q_{23}) = 0. \tag{7} \]
The fact that $\alpha$ doesn’t appear in these equations already delivers an interesting fact: if a mixture of two branch length sets in this setting satisfies the phylogenetic invariants for a single $\alpha$, then it does so for all $\alpha$. Geometrically, this means that if the line between two points on the subvariety cut out by the phylogenetic invariants intersects the subvariety nontrivially then it sits entirely in the subvariety.

We can gain more insight by considering these equations in terms of fidelities. Direct substitution using (2) into (6) gives
\[ (\theta_1\theta_2 - \psi_1\psi_2)(\theta_3\theta_4 - \psi_3\psi_4) = 0. \]
This equation will be satisfied exactly when the branch lengths satisfy
\[ t_1 + t_2 = s_1 + s_2 \quad \text{or} \quad t_3 + t_4 = s_3 + s_4. \tag{8} \]
The corresponding substitution into (7) and then division by $\theta_2\theta_4\theta_5\psi_2\psi_4\psi_5$ gives after simplification
\[ \left( \frac{\theta_1}{\theta_2} - \frac{\psi_1}{\psi_2} \right) \left( \frac{\theta_3}{\theta_4} - \frac{\psi_3}{\psi_4} \right) = 0. \]
This equation will be satisfied exactly when the branch lengths satisfy
\[ t_1 - t_2 = s_1 - s_2 \quad \text{or} \quad t_3 - t_4 = s_3 - s_4. \tag{9} \]
To summarize,

Proposition 3.

The mixture of two | quartet trees with pendant branchlengths t i and s i satisﬁes the | phylogenetic invariants for the binarysymmetric model exactly (up to renumbering) when either t = s and t = s ,or t + t = s + s and t − t = s − s . As described above this proposition makes no reference to the mixingweight α .In quartets where t = s and t = s , the resulting tree will also havependant branch lengths t and t : Proposition 4.

A mixture of two $12|34$ quartet trees with branch lengths $t_i$ and $s_i$ which satisfies $t_1 = s_1$ and $t_2 = s_2$ will have resulting pendant branch lengths for the first and second taxa equal to $t_1$ and $t_2$, respectively.

Proof. Let the fidelities of the edges leading to taxa one and two be denoted $\mu_1$ and $\mu_2$. We have by Lemma 1 with $a = 1$, $b = 2$ and $c = 3$,
\[ \mu_1 = \sqrt{ \frac{ (\alpha \theta_1 \theta_2 + (1-\alpha) \psi_1 \psi_2) \cdot (\alpha \theta_1 \theta_5 \theta_3 + (1-\alpha) \psi_1 \psi_5 \psi_3) }{ \alpha \theta_2 \theta_5 \theta_3 + (1-\alpha) \psi_2 \psi_5 \psi_3 } }. \]
The fraction under the square root is equal to $\theta_1^2$ after substituting $\psi_1 = \theta_1$ and $\psi_2 = \theta_2$, which are implied by the hypothesis. The same calculation implies that $\mu_2 = \theta_2$.

In the rest of this section we note that anomalous branch lengths can emerge from mixtures of trees mimicking a tree of the same topology.

Proposition 5.

It is possible to mix two sets of branch lengths on a tree to mimic a tree of the same topology such that one resulting pendant branch length is arbitrarily small while the corresponding branch length in either of the branch length sets being mixed stays above some arbitrarily large fixed value.

Proof. To get such an anomalous mixture, set $\theta_1 = \psi_1$, $\theta_2 = \psi_5$, $\theta_3 = \psi_3$, $\theta_4 = \psi_4$, $\theta_5 = \psi_2$, and $\alpha = .5$. The equations (8) and (9) are satisfied because $\theta_3 = \psi_3$ and $\theta_4 = \psi_4$, and therefore $t_3 = s_3$ and $t_4 = s_4$. This implies that the mixture will indeed satisfy the phylogenetic invariants.

Now, because again the Fourier transform of a mixture is the mixture of the Fourier transforms, using Lemma 1 and simplifying gives
\[ \mu_1 = \frac{ \theta_1 \, |\theta_2 + \theta_5| }{ 2 \sqrt{\theta_2 \theta_5} }. \tag{10} \]
Now note that by making the ratio $\theta_2 / \theta_5$ small, it is possible to have $\mu_1$ be close to one although $\theta_1$ can be small. This setting corresponds (via (1)) to the first branch length of the resulting tree going to zero although the trees used to make the mixture may have long first branch lengths. It can be checked by calculations analogous to (10) that the other fidelities of the tree resulting from mixing will be, in order, $\sqrt{\theta_2 \theta_5}$, $\theta_3$, $\theta_4$, $\sqrt{\theta_2 \theta_5}$. These are clearly strictly between zero and one, so the resulting tree will have positive branch lengths.

Mixtures mimicking a tree of a different topology

In this section we answer the question of what branch lengths on a quartet can mix to mimic a quartet of a different topology.
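Before the formal statement, the phenomenon can be checked numerically. The following Python sketch is not part of the original analysis: it assumes the fidelity convention $\theta = \exp(-2t)$, indexes Fourier coordinates by even-sized subsets of the taxa, and takes the internal edge to have index 5. The helper names `fourier` and `mix` are hypothetical. It mixes the two rounded $12|34$ branch length sets of Table 1, example (a), and evaluates the two $13|24$ invariants together with the internal branch length condition. Because the table entries are rounded, the invariants should vanish only approximately.

```python
import math


def fourier(pendant, internal):
    """Fourier coordinates of a 12|34 quartet (two-state symmetric model).

    Assumption: fidelity = exp(-2 * branch length); keys are even-sized
    subsets of the taxa {1, 2, 3, 4}, with "" the empty set.
    """
    th = [math.exp(-2.0 * t) for t in pendant]
    th5 = math.exp(-2.0 * internal)
    return {
        "": 1.0,
        "12": th[0] * th[1],
        "34": th[2] * th[3],
        "13": th[0] * th5 * th[2],
        "14": th[0] * th5 * th[3],
        "23": th[1] * th5 * th[2],
        "24": th[1] * th5 * th[3],
        "1234": th[0] * th[1] * th[2] * th[3],
    }


def mix(alpha, q, r):
    """The Fourier transform of a mixture is the mixture of the transforms."""
    return {key: alpha * q[key] + (1.0 - alpha) * r[key] for key in q}


# Rounded values from Table 1, example (a): (pendant 1..4, internal).
ALPHA = 0.748646
TREE_1 = ([1.772261, 0.25, 0.949306, 0.846574], 0.366516)
TREE_2 = ([0.25, 1.353637, 0.4, 0.5], 0.213387)

m = mix(ALPHA, fourier(*TREE_1), fourier(*TREE_2))

# Invariants of the 13|24 topology; both should be close to zero.
inv_a = m["1234"] - m["13"] * m["24"]
inv_b = m["12"] * m["34"] - m["14"] * m["23"]

# Internal branch length condition; should be strictly positive.
gap = m["13"] * m["24"] - m["14"] * m["23"]

# Pendant branch length of taxon 1 in the resulting tree, using the
# three-leaf fidelity formula mu_1^2 = m_12 * m_13 / m_23.
mu1 = math.sqrt(m["12"] * m["13"] / m["23"])
t1_mixed = -0.5 * math.log(mu1)

# Ratios k_i = psi_i / theta_i of second-tree to first-tree fidelities.
k = [math.exp(2.0 * (t - s)) for t, s in zip(TREE_1[0], TREE_2[0])]

print("13|24 invariants:", inv_a, inv_b)
print("internal branch condition:", gap)
print("mixed pendant 1 length:", t1_mixed)
print("k_1..k_4:", k)
```

Under these assumptions the ratios `k` come out ordered as $k_1 > k_3 > k_4 > 1 > k_2$, and the recovered pendant length for taxon 1 can be compared with the unmixed row of Table 1.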

Proposition 6. Let $k_1, \ldots, k_4$ satisfy the following inequalities:
\[ k_1 > k_3 > k_4 > 1 > k_2 > 0, \tag{11} \]
\[ \frac{1 - k_1^2}{k_1} \, \frac{1 - k_4^2}{k_4} + \frac{1 - k_2^2}{k_2} \, \frac{1 - k_3^2}{k_3} > 0, \tag{12} \]
\[ \frac{k_1 + k_4}{1 + k_1 k_4} \cdot \frac{k_2 + k_3}{1 + k_2 k_3} > 1. \tag{13} \]
Then there exists $\pi$ such that for any $\pi < k_5 < \pi^{-1}$ sufficiently close to either $\pi$ or $\pi^{-1}$ there exists a mixing weight such that for any $t_1, \ldots, t_5$ and $s_1, \ldots, s_5$ satisfying $\pi = \exp(-2(t_5 + s_5))$ and $k_i = \exp(-2(t_i - s_i))$ for $i = 1, \ldots, 5$, the corresponding mixture of two $12|34$ trees will satisfy the phylogenetic invariants for a single tree of the $13|24$ topology. The resulting internal branch length is guaranteed to be positive, and the pendant branch lengths will be positive as long as the pendant branch lengths being mixed are sufficiently large.

Proof. Let $m$ denote the Fourier transform vector of the site pattern frequencies of the mixture. The invariants for a tree of topology $13|24$ are (by (3) and (4))
\[ m_{1234} - m_{13} m_{24} = 0 \tag{14} \]
\[ m_{12} m_{34} - m_{14} m_{23} = 0. \tag{15} \]
As before, we insert the mixture of the Fourier transforms of the pattern frequencies into the invariants. For the first invariant,
\[ (\alpha + 1 - \alpha)(\alpha q_{1234} + (1-\alpha) r_{1234}) - (\alpha q_{13} + (1-\alpha) r_{13})(\alpha q_{24} + (1-\alpha) r_{24}) = 0. \]
Multiplying, this is equivalent to
\[ \alpha^2 (q_{1234} - q_{13} q_{24}) + \alpha(1-\alpha) \left( q_{1234} + r_{1234} - (q_{13} r_{24} + r_{13} q_{24}) \right) + (1-\alpha)^2 (r_{1234} - r_{13} r_{24}) = 0. \tag{16} \]
A similar calculation with the second invariant leads to
\[ \alpha^2 (q_{12} q_{34} - q_{14} q_{23}) + \alpha(1-\alpha) \left( q_{12} r_{34} + r_{12} q_{34} - (q_{14} r_{23} + r_{14} q_{23}) \right) + (1-\alpha)^2 (r_{12} r_{34} - r_{14} r_{23}) = 0. \tag{17} \]
Rather than (16) and (17) themselves, we can take (16) and the difference of (16) and (17). Because the $q$ and $r$ vectors come from a tree with topology $12|34$, they satisfy $q_{1234} = q_{12} q_{34}$ and $q_{13} q_{24} = q_{14} q_{23}$ and the equivalent equations for the $r$. Thus the difference of (16) and (17) can be simplified to (assuming $\alpha(1-\alpha) \neq 0$)
\[ q_{1234} + r_{1234} - (q_{13} r_{24} + r_{13} q_{24}) = q_{12} r_{34} + r_{12} q_{34} - (q_{14} r_{23} + r_{14} q_{23}). \tag{18} \]
We would like to ensure that the tree coming from the mixture has nonzero internal branch length. By Lemma 2 this is equivalent to showing that
\[ m_{13} m_{24} > m_{14} m_{23}. \tag{19} \]
Substituting in for the mixture fidelities and simplifying results in
\[ \alpha^2 (q_{13} q_{24} - q_{14} q_{23}) + \alpha(1-\alpha) \left( q_{13} r_{24} + r_{13} q_{24} - (q_{14} r_{23} + r_{14} q_{23}) \right) + (1-\alpha)^2 (r_{13} r_{24} - r_{14} r_{23}) > 0. \]
The first and last terms of this expression vanish because the $q$ and $r$ satisfy the $12|34$ phylogenetic invariants coming from (3) and (4). Simplifying leads to
\[ q_{13} r_{24} + r_{13} q_{24} > q_{14} r_{23} + r_{14} q_{23}. \tag{20} \]
Define $k_i = \psi_i / \theta_i$ for $i = 1, \ldots, 5$ and $\rho = \alpha / (1 - \alpha)$. Note that
\[ 0 < \theta_i < \min(k_i^{-1}, 1) \quad \text{and} \quad 0 < k_i < \infty \tag{21} \]
is equivalent to $0 < \theta_i < 1$ and $0 < \psi_i < 1$. Define
\[ \chi_{12} = k_1 k_2 + k_3 k_4, \quad \chi_{13} = k_1 k_3 + k_2 k_4, \quad \chi_{14} = k_1 k_4 + k_2 k_3, \quad \chi_{1234} = 1 + k_1 k_2 k_3 k_4. \]
Later we will make use of the fact that the $\chi$ are invariant under the action of the Klein four group.

Using these definitions, direct substitution using (2) into (16), (18), and (20) and some simplification shows that the set of equations
\[ \rho^2 (1 - \theta_5^2) + \rho (\chi_{1234} - \theta_5 \psi_5 \chi_{13}) + (1 - \psi_5^2)(\chi_{1234} - 1) = 0 \tag{22} \]
\[ \chi_{1234} - \chi_{12} = \theta_5 \psi_5 (\chi_{13} - \chi_{14}) \tag{23} \]
\[ \chi_{13} > \chi_{14} \tag{24} \]
is equivalent to equations (14), (15) and (19).

Equation (23) is simply satisfied by setting
\[ \theta_5 \psi_5 = \frac{\chi_{1234} - \chi_{12}}{\chi_{13} - \chi_{14}}. \tag{25} \]
However, in doing so, we must require that this ratio is strictly between zero and one. The fact that it must be less than one can be written
\[ \chi_{1234} + \chi_{14} < \chi_{12} + \chi_{13} \tag{26} \]
which by a short calculation is equivalent to (13). Later it will be shown that other equations imply that (25) is greater than zero.

Assign variables $A$, $B$, and $C$ in the standard way such that (22) can be written $A \rho^2 + B \rho + C = 0$. The $A$ and $C$ terms are strictly positive, thus the existence of a $0 < \rho < \infty$ satisfying this equation implies
\[ B < 0 \quad \text{and} \quad B^2 - 4AC > 0. \tag{27} \]
On the other hand, (27) implies the existence of a $0 < \rho < \infty$ satisfying (22). Note that using (25), $B < 0$ is equivalent to
\[ \chi_{1234} - \frac{\chi_{1234} - \chi_{12}}{\chi_{13} - \chi_{14}} \, \chi_{13} < 0. \]
Multiplying by $\chi_{13} - \chi_{14}$, which is positive by (24), this inequality is equivalent to
\[ \chi_{12} \chi_{13} < \chi_{1234} \chi_{14} \tag{28} \]
which by a short calculation is equivalent to (12). The conclusion then is that the existence of a $\rho \geq 0$ satisfying (22) is equivalent to (28) and $B^2 - 4AC > 0$. Moreover, (24) and (28) together imply $\chi_{12} < \chi_{1234}$. Therefore, according to (25) the product $\theta_5 \psi_5$ is greater than zero given (24). For convenience, set $\pi = \theta_5 \psi_5$, which as described is determined by $k_1, \ldots, k_4$. Now, $\theta_5$ being less than one and $\psi_5$ being less than one are equivalent to
\[ \pi < k_5 < \pi^{-1}. \tag{29} \]
In summary, the problem of finding branch lengths and a mixing parameter such that the derived variables satisfy (14), (15) and (19) is equivalent to finding $k_i$ and $\theta_i$ satisfying (12), (13), (21), (24), (25), (29) and $B^2 - 4AC > 0$, that is,
\[ (\chi_{1234} - \pi \chi_{13})^2 - 4 (1 - \pi / k_5)(1 - \pi k_5)(\chi_{1234} - 1) > 0. \tag{30} \]
Note that $\chi_{1234} = \pi \chi_{13}$ is impossible using (23) and (28). Therefore (30) can be satisfied while fixing the other variables by taking $k_5$ close to $\pi$ or $\pi^{-1}$ while satisfying (29).

Now we show that (possibly after relabeling) equation (11) is equivalent to (24) in the presence of the other inequalities.
Recall that the $\chi$ are invariant under the action of the Klein group acting on the indices of $k_i$. Because the invariants are equivalent to equations which can be expressed in terms of the $\chi$ with $\theta_5$ and $\psi_5$, we can assume that $k_1 \geq k_2$ and $k_1 \geq k_3$ by renumbering via an element of the Klein group.

Now, subtract $\chi_{12} \chi_{14}$ from (28) to find
\[ \chi_{12} (\chi_{13} - \chi_{14}) < (\chi_{1234} - \chi_{12}) \chi_{14}. \]
Rearranging (26), it is clear that this implies that
\[ \chi_{12} < \chi_{14}. \tag{31} \]
Inserting the definition of the $\chi$ into (24) and (31) shows that these equations are equivalent to
\[ 0 < (k_1 - k_2)(k_3 - k_4) \quad \text{and} \quad 0 < (k_1 - k_3)(k_4 - k_2). \tag{32} \]
We have assumed by symmetry that $k_1 \geq k_2$ and $k_1 \geq k_3$; now (32) shows that $k_1$ can't be equal to either $k_2$ or $k_3$. Also, (32) shows that $k_3 > k_4$ and $k_4 > k_2$. All of these inequalities put together imply that $k_1 > k_3 > k_4 > k_2$, which directly implies (24).

Furthermore, another rearrangement of (26) using the inequality (31) leads to $\chi_{1234} < \chi_{13}$. This after substitution gives $(1 - k_1 k_3)(1 - k_2 k_4) < 0$, which does not allow all of the $k_i$ to be either less than or greater than one.

Note that (12) excludes the case $k_1 > k_3 > 1 > k_4 > k_2$; this leaves $k_1 > 1 > k_3 > k_4 > k_2$ and $k_1 > k_3 > k_4 > 1 > k_2$. We can assume the latter without loss of generality by exchanging the $\theta_i$ and the $\psi_i$ (which corresponds to replacing $k_i$ with $k_i^{-1}$) and renumbering.

So far we have described how to find values for the branch lengths so that the invariants (3) and (4) and the internal branch length inequality (19) are satisfied. However, we also need to check that the resulting pendant branch lengths for the tree are positive. Here we describe how this can be achieved by taking a lower bound on the values of $t_i$.

Assume edges $a$ and $b$ are adjacent on the $12|34$ trees being mixed, and $a$ and $c$ are adjacent on the resulting $13|24$ tree. Then, by Lemma 1 and (2), the fidelity of the pendant $a$ edge is
\[ \sqrt{ \frac{ (\alpha \theta_a \theta_b + (1-\alpha) \psi_a \psi_b)(\alpha \theta_a \theta_5 \theta_c + (1-\alpha) \psi_a \psi_5 \psi_c) }{ \alpha \theta_b \theta_5 \theta_c + (1-\alpha) \psi_b \psi_5 \psi_c } }. \]
In order to assure that the resulting pendant branch length for edge $a$ is positive, we must show that the above fidelity is less than one. This is equivalent to showing that $\theta_a$ must satisfy
\[ \theta_a < \sqrt{ \frac{ \alpha + (1-\alpha) k_b k_5 k_c }{ (\alpha + (1-\alpha) k_a k_b)(\alpha + (1-\alpha) k_a k_5 k_c) } } \tag{33} \]
for all such $a$, $b$, $c$ triples. Thus this equation along with (21) implies upper bounds for $\theta_a$; by the definition of fidelities these translate to lower bounds for $t_a$. This concludes the proof.

Note that the proof actually completely characterizes (up to relabeling) the set of branch lengths and mixing weights such that the resulting mixture mimics a tree of different topology.

Proposition 7. If two sets of branch lengths on the $12|34$ tree mix to mimic a tree of the topology $13|24$ then up to relabeling the associated $k_i$ must satisfy the inequalities (11), (12), (13), and (29); the $\theta_i$ must satisfy the inequalities (21) and (33). The two required equalities are that the product $\theta_5 \psi_5$ must satisfy (25), and the associated $\rho$ must satisfy (22).

Kullback-Leibler lemma

Lemma 8. Assume some group-based model $G$ and let $\Delta$ be the probability simplex for distributions on four taxa under $G$. Let $V \subset \Delta$ be the set of all site-pattern frequencies for some quartet tree under $G$. Then
\[ \delta_{KL}(p, V) := \min_{v \in V} \delta_{KL}(p, v) \]
exists and is continuous for all $p$ in the interior of $\Delta$.

Proof. Note that $\delta_{KL}(p, q)$ is a continuous function when probability distributions $p$ and $q$ have no components zero, i.e. they sit in the interior $\mathring{\Delta}$ of the probability simplex $\Delta$. We will show that for any $p \in \mathring{\Delta}$ there exists an open neighborhood $U$ of $p$ such that $\delta_{KL}(p', V)$ exists and is continuous for all $p' \in U$. Given $p$ let $p_{\min}$ be the smallest component $p_i$ of $p$. Let
\[ U = \left\{ p' \in \mathring{\Delta} : p'_i > p_{\min}/2 \text{ for all } i \right\}. \]
Then choose $\varepsilon > 0$ such that
\[ \log(p_{\min}/2) + \tfrac{1}{2} p_{\min} \log(1/\varepsilon) > \sup_{p' \in U} \inf_{q \in V} \delta_{KL}(p', q) \]
(the right-hand side is finite, being bounded above by $\sup_{p' \in U} \delta_{KL}(p', q^*)$ for any point $q^* \in V$ with no components zero). Let $B = \{ q \in V : q_i \geq \varepsilon \text{ for all } i \}$. $V$ is a compact set (Moulton and Steel, 2004), therefore $B \subset \mathring{\Delta}$ is compact as well. Now for any $p' \in U$ and $q' \in V - B$,
\[ \delta_{KL}(p', q') = \sum_i p'_i \log p'_i + \sum_i p'_i \log(1/q'_i) > \log(p_{\min}/2) + \tfrac{1}{2} p_{\min} \log(1/\varepsilon) > \inf_{q \in V} \delta_{KL}(p', q) \]
so the infimum cannot be achieved outside $B$. Consequently,
\[ \inf_{q \in V} \delta_{KL}(p', q) = \min_{q \in B} \delta_{KL}(p', q) \]
for all $p' \in U$. Thus the right hand side exists; continuity follows from standard analytic arguments.

Table 1: Rounded branch lengths for the examples in Figure 1. The top division of the table is example (a); the bottom is example (b). The top two lines in each division are the branch lengths forming the mixture and the third line gives the branch lengths for the unmixed tree.

example   weight     pendant 1   pendant 2   pendant 3   pendant 4   internal
(a)       0.748646   1.772261    0.25        0.949306    0.846574    0.366516
(a)       0.251354   0.25        1.353637    0.4         0.5         0.213387
(a)       1.         0.888101    0.905792    0.648625    0.654236    0.086051
(b)       0.936064   1.838398    0.2         1.397309    0.411489    0.062429
(b)       0.063936   0.2         0.543932    0.2         0.2         0.055312
(b)       1.         1.011471    0.375718    0.794529    0.305338    0.360827

Figure 1: Mixtures of two sets of branch lengths on a tree of a given topology can have exactly the same site pattern frequencies as a tree of a different topology under the two-state symmetric model. The notation in the diagram showing $x * T_1 + (1 - x) * T_1' = T_2$ means that the indicated mixture of the two branch length sets $T_1$ and $T_1'$ shown in the diagram gives the same expected site pattern frequencies as the tree $T_2$. The diagrams show two examples of this "mixed branch repulsion;" the general criteria for such mixtures are explained in the text. The branch length scale in the diagrams is given by the line segment indicating the length of a branch with 0.5 substitutions per site. Note that the mixing weights in this example have been rounded.

Figure 2: Mixtures of two sets of branch lengths on a tree of a given topology can have exactly the same site pattern frequencies as a tree of the same topology under the two-state symmetric model. The criterion for the occurrence of this phenomenon is explained in the text and an example is shown in the figure.
Note in particular that the branch lengths need not average: for example, the branch length for the pendant edge leading to taxon 1 virtually disappears after mixing.

Figure 3: A geometric depiction of the main result. The ambient space is a projection of the seven-dimensional probability simplex of site pattern frequencies for trees on four leaves. The gray sheet is a subset of a two-dimensional subvariety of the site pattern frequencies for trees of the 12 | | 24 topology. The horizontal line represents the possible mixtures for the two sets of branch lengths for the $12|34$ topology in Figure 1a. The fact that these two sets of branch lengths can mix to make a tree of topology 13 ||