Gromov meets Phylogenetics - new Animals for the Zoo of Biocomputable Metrics on Tree Space
arXiv [math.MG]
Volkmar Liebscher
April 22, 2015, 13:14:54
Abstract
We present a new class of metrics for unrooted phylogenetic $X$-trees derived from the Gromov-Hausdorff distance for (compact) metric spaces. These metrics can be computed efficiently by linear or quadratic programming. They are also robust under NNI operations. The local behaviour of the metrics shows that they differ from all previously introduced metrics. The performance of the metrics is briefly analysed on random weighted and unweighted trees as well as on random caterpillars.

The idea for this paper came from a talk by Michelle Kendall at the Portobello conference 2015, see [21]. Basically, she postulated that the biological information is essentially encoded in the collection of distances between the MRCA (most recent common ancestor) of two taxa and the root. If the trees were ultrametric, we could equivalently just collect the distances between all pairs of taxa. That leads to our rationale:

Instead of trees we compare the induced metric spaces.

This approach is feasible since, by the work of Buneman [7, 8] (see also [34] for the unweighted case), we can identify tree-induced metrics among all metrics by the famous four point conditions.

In fact, this rationale must already have been behind the invention of the $\ell_1$ and $\ell_2$ path difference distances [33, 27]. Below we also introduce an $\ell_\infty$ version of those metrics.

For (compact) metric spaces there is the well-known Gromov-Hausdorff distance
\[ D_{GH}((X,d),(X',d')) = \inf_{\varphi,\varphi'} \rho_H(\varphi(X),\varphi'(X')) \tag{1} \]
where the infimum is taken over all isometric embeddings $\varphi,\varphi'$ of $X, X'$ into a common metric space, and $\rho_H$ is the Hausdorff metric on the compact subsets of that space.

By our rationale, this definition induces a metric on the space of all weighted trees. However, it cannot distinguish trees which yield isometric metric spaces, i.e. trees which differ only by a permutation of the labels. Since our aim is to compare trees with the same taxon sets we have to adapt the metric (1) to our situation. That makes the definition more complicated (see section 2) since we have to match the leaf labels, but the idea of embeddings remains. Fortunately, it is exactly this adaptation which makes our metric efficiently computable. Simply, we must substitute the Hausdorff metric in (1). Since there are several reasonable candidates for that, we even derive three different metrics. In all these cases, the value of the metric is the solution of a linear or quadratic program.

Clearly, our approach is more general and abstract than the other definitions of phylogenetic metrics to be discussed soon. Those use much more of the internal structure of trees. Usually, more abstract approaches have more potential to generalise and to adapt to special situations. Still, this has to be worked out in the present situation.

For mathematical reasons, it is very convenient to include also semimetrics on the taxon set in the definition. This situation may occur in phylogeny if we do not resolve the topology by all singleton splits, see for instance [30].

What about other phylogenetic metrics? The simplest one, though not the oldest one, seems to be the Robinson-Foulds distance [29, 30]. It is easy and efficient to compute, in linear time [12] or even in sublinear time by approximation [26]. However, it has little power to discriminate trees, since many trees with similar biological meaning have a distance equal to the diameter of the unweighted tree space. Much nearer to biology seems to be a variant of the Robinson-Foulds distance, the weighted matching distance. It captures similarity of splits, which entails a lot of biology, and is still computable in subcubic time [3, 22].

A quite natural, biologically motivated way of capturing tree similarity is provided by the tree rearrangement metrics. There are different basic transformations giving rise to the NNI-distance [28], SPR-distance and TBR-distance. Unfortunately, computation of those distances is NP-hard and only feasible for small trees [10, 1, 5]. A fixed parameter approach to compute the (rooted) SPR distance, for example, was given in [32]. Even more at the heart of evolution is the maximum parsimony distance [16]. Still, it is NP-hard to compute that distance, even over binary unweighted phylogenetic trees [16, 20].

A good alternative to the tree rearrangement metrics is the quartet distance [15]. It is much more biologically plausible than the Robinson-Foulds distance and also efficiently computable [6].

For weighted phylogenetic trees there is the euclidean type (geodesic) distance on tree space introduced by [2]. The crucial observation was that in a natural way tree space is a CAT(0) space (a space of nonpositive curvature) in the sense of Gromov. Essentially this property implies uniqueness of geodesics. It was an open problem for some years how to compute the geodesic distance on tree space efficiently. Yet, by [24] we now have a polynomial time algorithm. The CAT(0) idea was used again in [11] to develop metrics for ultrametric spaces. Again, efficient computation of the geodesics is possible for at least one of the metrics. As observed in that work, different but natural parametrisations may yield different geodesics.

Recently, [21] returned to the idea of [33], [27] and [2] in application to weighted rooted trees, considering all distances of MRCAs of pairs of taxa to the root. She also proposes to weight different MRCAs by their depth with respect to the root. That idea may be similar to the weighted matching distance [3, 22].

A good review of recent developments in polynomial time computable metrics on unweighted phylogenetic trees is contained in [4]. Complete Java implementations are provided there as well. For simplicity, we implemented our metrics in R first.

After this short overview, we would like to introduce the notion of a biocomputable metric.
That should be a metric on phylogenetic tree space which is computable in polynomial time and which is able to capture biological similarity. Preferably, it should also be defined for weighted phylogenetic trees. So, let's see how Gromov's idea of joint embeddings helps to reach that goal . . .

For a set $X$ denote by $M(X)$ the set of all semimetrics on $X$, i.e. all $\rho: X\times X\to\mathbb{R}_{\ge 0}$ such that for all $x,y,z\in X$ we have $\rho(x,x)=0$, $\rho(x,y)=\rho(y,x)$ and $\rho(x,y)\le\rho(x,z)+\rho(z,y)$. Frequently, we describe such a semimetric in an equivalent fashion by $\rho:\binom{X}{2}\to\mathbb{R}_{\ge 0}$ where $\binom{X}{2}=\{\{x,y\}: x,y\in X, x\ne y\}$. Accordingly, $M_{>}(X)$ denotes the set of all metrics on $X$. Further, let $\mathbb{M}=\{(X,\rho): \#X<\infty, \rho\in M(X)\}$ denote the set of all finite semimetric spaces.

Isometries $\varphi:(X,\rho)\to(Y,\rho')$ preserve the semimetrics, i.e. for all $x,y\in X$ we have $\rho(x,y)=\rho'(\varphi(x),\varphi(y))$.

Frequently we need identical copies of our taxon set $X$. Under slight abuse of notation, we will denote them $X'=\{x': x\in X\}$ and $X''=\{x'': x\in X\}$.

Definition 1. Let $X$ be a finite set. Then we define three functions $D_1, D_2, D_\infty$ on $M(X)\times M(X)$ by
\[ D_1(\rho,\rho') = \inf_{Y,\varphi,\psi}\ \sum_{x\in X}\bar d(\varphi(x),\psi(x)) \]
\[ D_2(\rho,\rho') = \inf_{Y,\varphi,\psi}\ \Big(\sum_{x\in X}\bar d(\varphi(x),\psi(x))^2\Big)^{1/2} \]
\[ D_\infty(\rho,\rho') = \inf_{Y,\varphi,\psi}\ \max_{x\in X}\bar d(\varphi(x),\psi(x)) \]
where the infimum is taken over all $(Y,\bar d)\in\mathbb{M}$ and all isometries $\varphi:(X,\rho)\to(Y,\bar d)$, $\psi:(X,\rho')\to(Y,\bar d)$.

Remark 1. $D_\infty$ is nearest to the Gromov-Hausdorff distance, which we should implement via
\[ D_{GH}(\rho,\rho') = \inf_{Y,\varphi,\psi}\ \max_{x\in X}\ \min_{y\in X}\bar d(\varphi(x),\psi(y)). \tag{2} \]
On the other hand, we think that the $\ell_1$-like metric $D_1$ is kind of natural for trees. The euclidean geometry which is the basis of $D_2$ might be good for having unique geodesics. This feature is very convenient and at the heart of the proposals of [2] and [11].

Let us simplify the optimisation problems present in the definitions of $D_i$ a bit. In fact, it is enough to have just one model space $Y$. For $\rho,\rho'\in M(X)$ define the space $E(\rho,\rho')$ of their extensions
\[ E(\rho,\rho') = \big\{\bar d\in M(X\cup X') : \forall x,y\in X:\ \bar d(x,y)=\rho(x,y),\ \bar d(x',y')=\rho'(x,y)\big\}. \]
Further, $\|\cdot\|_i$ denotes the usual $\ell_i$-norm on $\mathbb{R}^X$.

Lemma 1.
For $i=1,2,\infty$
\[ D_i(\rho,\rho') = \inf_{\bar d\in E(\rho,\rho')}\ \big\|(\bar d(x,x'))_{x\in X}\big\|_i \tag{3} \]

Proof. Note that $\le$ holds trivially. On the other hand, for $(Y,\tilde d)\in\mathbb{M}$ and isometries $\varphi:(X,\rho)\to(Y,\tilde d)$, $\psi:(X,\rho')\to(Y,\tilde d)$ define $\bar d:\binom{X\cup X'}{2}\to\mathbb{R}_{\ge 0}$ by
\[ \bar d(x,y) = \rho(x,y) = \tilde d(\varphi(x),\varphi(y)) \]
\[ \bar d(x',y') = \rho'(x,y) = \tilde d(\psi(x),\psi(y)) \]
\[ \bar d(x,y') = \tilde d(\varphi(x),\psi(y)) \]
for all $x,y\in X$. Now $\tilde d\in M(Y)$ implies $\bar d\in M(X\cup X')$. The $\ge$ in (3) now follows from
\[ \big\|(\bar d(x,x'))_{x\in X}\big\|_i = \big\|(\tilde d(\varphi(x),\psi(x)))_{x\in X}\big\|_i . \]

Observe that the previous lemma is at the heart of the computation of the distances, since it reduces that computation to the minimisation of a convex function over the convex set $E(\rho,\rho')$.

Lemma 2. For $i=1,2,\infty$ there exists a $d^*_i\in E(\rho,\rho')\subset M(X\cup X')$ such that
\[ D_i(\rho,\rho') = \big\|(d^*_i(x,x'))_{x\in X}\big\|_i . \]

Proof. Clearly, the sublevel sets of the convex function $\bar d\mapsto\|(\bar d(x,x'))_{x\in X}\|_i$ are compact on the convex space $E(\rho,\rho')$. Thus there must exist a minimal point of that function.

Theorem 1. $D_i$, $i=1,2,\infty$, are complete metrics on $M(X)$.

Proof. Symmetry is clear.

If $D_i(\rho,\rho')=0$ choose $d^*_i\in E(\rho,\rho')$ according to the previous lemma. Obviously, we obtain $d^*_i(x,x')=0$ for all $x\in X$. The triangle inequality implies for all $x,y\in X$
\[ \rho(x,y) = d^*_i(x,y) = d^*_i(x',y') = \rho'(x,y) \]
such that $\rho=\rho'$.

Now let $\rho,\rho',\rho''\in M(X)$ and $i$ be arbitrary. Using again the above lemma we choose $d_1\in M(X\cup X')$ extending $\rho,\rho'$ and $d_2\in M(X'\cup X'')$ extending $\rho',\rho''$ such that
\[ D_i(\rho,\rho') = \big\|(d_1(x,x'))_{x\in X}\big\|_i, \qquad D_i(\rho',\rho'') = \big\|(d_2(x',x''))_{x\in X}\big\|_i . \]
There is a $d\in M(X\cup X'\cup X'')$ extending both $d_1$ and $d_2$, i.e. $d|_{X\cup X'}=d_1$ and $d|_{X'\cup X''}=d_2$; for instance, one may glue along $X'$ by setting $d(x,z'')=\min_{y\in X}\big(d_1(x,y')+d_2(y',z'')\big)$. We see now
\[ D_i(\rho,\rho'') \le \big\|(d(x,x''))_{x\in X}\big\|_i \le \big\|(d(x,x')+d(x',x''))_{x\in X}\big\|_i \le \big\|(d(x,x'))_{x\in X}\big\|_i + \big\|(d(x',x''))_{x\in X}\big\|_i \]
\[ = \big\|(d_1(x,x'))_{x\in X}\big\|_i + \big\|(d_2(x',x''))_{x\in X}\big\|_i = D_i(\rho,\rho') + D_i(\rho',\rho''). \]
Completeness will be proved later in Lemma 8.

As already said in the introduction, we are mainly interested in metrics on tree space. Let $G=(V,E,q)$ be a weighted connected graph, i.e. $E\subseteq\binom{V}{2}$ and $q:E\to\mathbb{R}_{\ge 0}$. Then we define the induced semimetric on $V$ by
\[ d^q_G(x,y) = \inf\{\operatorname{len}(p) : p \text{ path from } x \text{ to } y\} \tag{4} \]
As usual,
\[ \operatorname{len}(x_0 x_1 \dots x_m) = \sum_{i=1}^m q(\{x_{i-1},x_i\}) \]
is the length of the path $(x_0 x_1 \dots x_m)$. For unweighted graphs $(V,E)$ we choose $q(\{x,y\})=1$ for all $\{x,y\}\in E$.

So let the tree space $T(X)$ be the set of all weighted unrooted generalised phylogenetic $X$-trees.
A weighted unrooted generalised phylogenetic $X$-tree is a quadruple $(V,E,q,\mu)$, where $\mu:X\to V$ is a (not necessarily injective) map such that $(V,E)$ is the minimal tree containing $\mu(X)$ and $q:E\to\mathbb{R}_{>0}$ is a weight function. Phylogenetic $X$-trees without weights are included by giving all edges (after contraction) the weight 1 and by requiring $\mu$ to be injective. The corresponding subspace will be denoted $T_1(X)$. The set of binary (bifurcating) phylogenetic $X$-trees is denoted $T_2(X)$. Now we define for $\tau,\tau'\in T(X)$, under abuse of notation,
\[ D_i(\tau,\tau') = D_i\big(d_\tau|_{\binom{X}{2}},\ d_{\tau'}|_{\binom{X}{2}}\big) \]
where $\rho=d_\tau|_{\binom{X}{2}}\in M(X)$ is induced by the tree $\tau$ and $\rho'=d_{\tau'}|_{\binom{X}{2}}$ by $\tau'$ via (4). Again, all three are metrics on tree space. This can be seen from the following result, provided in essence by [7].

Lemma 3. For $\rho\in M(X)$ there exists an unrooted generalised phylogenetic $X$-tree $\tau\in T(X)$ with $\rho=d_\tau|_{\binom{X}{2}}$ if and only if for all $x,y,z,w\in X$ the four point condition
\[ \rho(x,y)+\rho(z,w) \le \max\big(\rho(x,z)+\rho(y,w),\ \rho(x,w)+\rho(y,z)\big) \tag{5} \]
is fulfilled.

Proof. Identifying points $x,y\in X$ with $\rho(x,y)=0$ we can assume that $\rho$ is a metric. That (5) is necessary and sufficient for the existence of $\tau$ was shown in [7]. The splits of $\tau$ are identified by situations where strict inequality holds in (5). Minimality of the vertex set of $\tau$ (according to the definition) implies that different edges in $\tau$ induce different splits. The weight of the edge corresponding to a split by (5) is computed directly from the difference of the right and the left hand side in (5). Thus $\tau\in T(X)$ is uniquely determined.

Let us compute an example.

Example 1.
We want to compare for X = { A, B, C, D } the two unweigthed X − trees τ = AB • • CD and τ ′ = AC • • BD with corresponding distances ρ, ρ ′ .We want to derive possible extensions of ρ, ρ ′ by verifying that for some δ A , δ B , δ C , δ D ≥ the graph distances on the weighted graph G = AB •• CD A ′ C ′ •• B ′ D ′ δ A δ C δ B δ D reproduce both ρ and ρ ′ . One obvious choice is δ A = 0 , δ B = 1 , δ C = 1 , δ D = 0 ,i.e. G = B •• C A = A ′ C ′ •• B ′ D = D ′ s consistent. Obviously, we embedded now both τ and τ ′ into the metric spaceof the graph G . We see D ∞ ≤ , D ≤ √ and D ≤ . In fact equalityholds, but this we can prove only later in Example 2. Additionally, we obtain
Lemma 4. For $\lambda\ge 0$ (with $\lambda\le 1$ in the third relation), $i=1,2,\infty$, and $\rho_j\in M(X)$, $j=1,2,3,4$, the following are true:
1. $D_i(\lambda\rho_1,\lambda\rho_2)=\lambda D_i(\rho_1,\rho_2)$.
2. $D_i(\rho_1+\rho_2,\rho_3+\rho_4)\le D_i(\rho_1,\rho_3)+D_i(\rho_2,\rho_4)$.
3. $D_i(\lambda\rho_1+(1-\lambda)\rho_2,\rho_3)\le\lambda D_i(\rho_1,\rho_3)+(1-\lambda)D_i(\rho_2,\rho_3)$.

Proof. The first relation follows from $\lambda\bar d\in E(\lambda\rho_1,\lambda\rho_2)\iff\bar d\in E(\rho_1,\rho_2)$. The second relation follows from $\bar d_1+\bar d_2\in E(\rho_1+\rho_2,\rho_3+\rho_4)$ for all $\bar d_1\in E(\rho_1,\rho_3)$ and $\bar d_2\in E(\rho_2,\rho_4)$. The third relation is just a consequence of the first two, applied to $\rho_3=\lambda\rho_3+(1-\lambda)\rho_3$.

Clearly,

Lemma 5. $D_1$ and $D_\infty$ can be computed by solving a linear program. For the computation of $D_2$ a quadratic program has to be solved.

Proof. This follows immediately from Lemma 1.

So, we are sure that we can compute the distance in a computing time polynomially bounded in $n=\#X$ [19]. In the naïve way, the linear (quadratic) program has the $n^2$ variables $\epsilon_{xy}=\bar d(x,y')$ and $O(n^3)$ restrictions coming essentially from the triangle inequalities in triangles of the form $x,y,z'$ or similar. But we can do the computation more efficiently. The essential observation is that the objective function depends on the unknown values $(\bar d(x,x'))_{x\in X}$ only. The reformulation of the constraints is provided by the following theorem. It will be proved later in section A.

Theorem 2 (quadrangle inequalities). Let $\rho,\rho'\in M(X)$ and $(\delta_x)_{x\in X}\in\mathbb{R}^X_{\ge 0}$ be given. Then there exists a $\bar d\in M(X\cup X')$ with
\[ \bar d(x,y)=\rho(x,y),\quad x,y\in X \]
\[ \bar d(x',y')=\rho'(x,y),\quad x,y\in X \]
\[ \bar d(x,x')=\delta_x,\quad x\in X \]
if and only if for all $x\ne y\in X$ the following inequalities are fulfilled:
\[ \delta_x+\delta_y \ge |\rho(x,y)-\rho'(x,y)|, \qquad |\delta_x-\delta_y| \le \rho(x,y)+\rho'(x,y). \tag{6} \]

Thus we have just $n$ variables $\delta_x=\bar d(x,x')$ and $O(n^2)$ constraints, one pair for each rectangle $x,y,y',x'$, in the optimisation problems (3). Formally, $D_i(\rho,\rho')$ solves the program
\[ \|\delta\|_i \to \min \quad\text{under}\quad
\begin{cases}
\delta_x \ge 0 & x\in X\\
\delta_x+\delta_y \ge |\rho(x,y)-\rho'(x,y)| & x\ne y\in X\\
|\delta_x-\delta_y| \le \rho(x,y)+\rho'(x,y) & x\ne y\in X
\end{cases} \tag{7} \]

Example 2.
Let us continue Example 1. Since $\rho(A,D)=\rho'(A,D)$ and $\rho(B,C)=\rho'(B,C)$, the only nontrivial lower bounds in (6) are
\[ \delta_A+\delta_B \ge 1,\qquad \delta_A+\delta_C \ge 1,\qquad \delta_B+\delta_D \ge 1,\qquad \delta_C+\delta_D \ge 1 . \]
Adding the first and the last of them gives
\[ D_1(\rho,\rho') \ge \delta_A+\delta_B+\delta_C+\delta_D \ge 2 . \]
We already saw that we can realise this minimum, so $D_1(\rho,\rho')=2$. For the other two norms, the bound $\delta_A+\delta_B+\delta_C+\delta_D\ge 2$ forces $\|\delta\|_2\ge 1$ and $\|\delta\|_\infty\ge\frac12$ on the feasible set of (7), and the constant vector $\delta\equiv\frac12$ is feasible for (7) and attains both bounds. Hence the program (7) yields $D_2(\rho,\rho')=1$ and $D_\infty(\rho,\rho')=\frac12$.

It is very interesting that the upper bounds on the differences are not used in the calculation. In fact, we could not observe any situation where they had to be used to determine the minimum. This can be seen also from the numerical results in section 7, especially Figure 6. But we still lack a proof that we may omit these constraints safely. This leads us to the definition of further distances $\tilde D_i(\rho,\rho')$ as solution of
\[ \|\delta\|_i \to \min \quad\text{under}\quad
\begin{cases}
\delta_x \ge 0 & x\in X\\
\delta_x+\delta_y \ge |\rho(x,y)-\rho'(x,y)| & x\ne y\in X
\end{cases} \tag{8} \]
with the obvious extension to tree space.

Lemma 6. $\tilde D_i$ are metrics on $M(X)$ and $T(X)$, too.

Proof. Observe that, exactly as for problem (7), the minimum of (8) is attained.

Symmetry of the definition is clear. Further, $\tilde D_i(\rho,\rho')=0$ if and only if $\delta=0$ is feasible for problem (8). That means $\rho(x,y)=\rho'(x,y)$ for all $\{x,y\}\in\binom{X}{2}$ and $\rho=\rho'$.

For the proof of the triangle inequality choose optimal solutions $\delta^1\in\mathbb{R}^X_{\ge 0}$ of (8) and $\delta^2\in\mathbb{R}^X_{\ge 0}$ of the version of (8) for $\rho',\rho''$. We see for $\{x,y\}\in\binom{X}{2}$ that
\[ \delta^1_x+\delta^2_x+\delta^1_y+\delta^2_y \ge |\rho(x,y)-\rho'(x,y)|+|\rho'(x,y)-\rho''(x,y)| \ge |\rho(x,y)-\rho''(x,y)| \]
such that $\delta^1+\delta^2$ is feasible for the version of (8) for $\rho,\rho''$. We obtain
\[ \tilde D_i(\rho,\rho'') \le \|\delta^1+\delta^2\|_i \le \|\delta^1\|_i+\|\delta^2\|_i = \tilde D_i(\rho,\rho')+\tilde D_i(\rho',\rho''). \]
This completes the proof.
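The constraints of program (7) are easy to check mechanically. The following is a minimal sketch in Python (not the R implementation used for the numerical results), assuming the two unweighted quartet trees of Examples 1 and 2 with hypothetical inner vertices `u` and `v`; the helper names `leaf_distances`, `is_feasible` and `norms` are illustrative. It builds the induced semimetrics via (4), tests candidate vectors $\delta$ against (7) and evaluates the three norms. Every feasible $\delta$ gives an upper bound on the corresponding $D_i$.

```python
import itertools
import math


def leaf_distances(edges, leaves):
    """Induced semimetric (4): all-pairs shortest paths, restricted to the leaves."""
    nodes = sorted({v for e in edges for v in e[:2]})
    d = {(u, v): (0.0 if u == v else math.inf) for u in nodes for v in nodes}
    for u, v, w in edges:
        d[u, v] = d[v, u] = min(d[u, v], w)
    # Floyd-Warshall; product() varies its first entry slowest, so k is the outer loop
    for k, i, j in itertools.product(nodes, repeat=3):
        d[i, j] = min(d[i, j], d[i, k] + d[k, j])
    return {(x, y): d[x, y] for x in leaves for y in leaves}


def is_feasible(delta, rho, rho2, tol=1e-12):
    """Check all constraints of program (7) for a candidate vector delta."""
    if any(v < -tol for v in delta.values()):
        return False
    for x, y in itertools.combinations(sorted(delta), 2):
        if delta[x] + delta[y] < abs(rho[x, y] - rho2[x, y]) - tol:
            return False                      # lower bound in (6) violated
        if abs(delta[x] - delta[y]) > rho[x, y] + rho2[x, y] + tol:
            return False                      # upper bound in (6) violated
    return True


def norms(delta):
    """Return the l_1, l_2 and l_infinity norms of delta."""
    vals = list(delta.values())
    return sum(vals), math.sqrt(sum(v * v for v in vals)), max(vals)


LEAVES = "ABCD"
# unweighted quartets: cherries {A,B},{C,D} versus cherries {A,C},{B,D}
tau = [("A", "u", 1), ("B", "u", 1), ("u", "v", 1), ("C", "v", 1), ("D", "v", 1)]
tau2 = [("A", "u", 1), ("C", "u", 1), ("u", "v", 1), ("B", "v", 1), ("D", "v", 1)]
rho = leaf_distances(tau, LEAVES)
rho2 = leaf_distances(tau2, LEAVES)

candidates = {
    "Example 1": {"A": 0, "B": 1, "C": 1, "D": 0},
    "constant": {x: 0.5 for x in LEAVES},
    "zero": {x: 0 for x in LEAVES},
}
for name, delta in candidates.items():
    print(name, is_feasible(delta, rho, rho2), norms(delta))
```

A feasibility check of this kind only certifies upper bounds; the matching lower bounds still need an argument such as the one in Example 2 or an LP/QP solver.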
Remark 2. Interestingly, there is a striking similarity between the feasible set of (8) and the tight span of a distance matrix introduced in [14]. Yet, $|\rho-\rho'|$ is not a semimetric in general and we do not see a deeper connection at the moment.

First we compare our metrics to the pathwise difference metrics. Recall that those are defined by [33, 27]
\[ D^{PD}_i(\tau_1,\tau_2) = \Big\|\big(\rho_{\tau_1}(x,y)-\rho_{\tau_2}(x,y)\big)_{\{x,y\}\in\binom{X}{2}}\Big\|_i \]
Interestingly, it seems that $D^{PD}_\infty$ was not used before. Maybe we can immediately explain this. Again we abbreviate $n=\#X$.

Theorem 3. For $\tau_1,\tau_2\in T(X)$ it holds
\[ D_1(\tau_1,\tau_2) \ge D_2(\tau_1,\tau_2) \ge D_\infty(\tau_1,\tau_2) \ge \tfrac{1}{\sqrt n}D_2(\tau_1,\tau_2) \ge \tfrac1n D_1(\tau_1,\tau_2) \]
\[ \tilde D_1(\tau_1,\tau_2) \ge \tilde D_2(\tau_1,\tau_2) \ge \tilde D_\infty(\tau_1,\tau_2) \ge \tfrac{1}{\sqrt n}\tilde D_2(\tau_1,\tau_2) \ge \tfrac1n \tilde D_1(\tau_1,\tau_2) \]
\[ \tfrac n2 D^{PD}_1(\tau_1,\tau_2) \ge D_1(\tau_1,\tau_2) \ge \tilde D_1(\tau_1,\tau_2) \ge \tfrac{1}{n-1}D^{PD}_1(\tau_1,\tau_2) \]
\[ \tfrac{\sqrt n}{2} D^{PD}_2(\tau_1,\tau_2) \ge D_2(\tau_1,\tau_2) \ge \tilde D_2(\tau_1,\tau_2) \ge \sqrt{\tfrac{1}{2(n-1)}}\,D^{PD}_2(\tau_1,\tau_2) \]
\[ D_\infty(\tau_1,\tau_2) = \tilde D_\infty(\tau_1,\tau_2) = \tfrac12 D^{PD}_\infty(\tau_1,\tau_2) \]

Proof. The first relations are well-known for $\|\cdot\|_i$ and translate directly.

For the second relation we use the first inequality in (6), writing $\rho=\rho_{\tau_1}$ and $\rho'=\rho_{\tau_2}$. This gives us for all $x\ne y\in X$
\[ \delta_x+\delta_y \ge |\rho(x,y)-\rho'(x,y)| \]
\[ \delta_x^2+\delta_y^2 \ge \tfrac12(\delta_x+\delta_y)^2 \ge \tfrac12|\rho(x,y)-\rho'(x,y)|^2 \]
\[ \max\{\delta_x : x\in X\} \ge \tfrac12(\delta_x+\delta_y) \ge \tfrac12|\rho(x,y)-\rho'(x,y)| \]
Summing up the first or the second inequalities for all $\{x,y\}\in\binom{X}{2}$, where each $\delta_x$ occurs $n-1$ times, gives the lower estimates for $i=1,2$.

The $\ge$-estimate for $i=\infty$ follows by taking the maximum of the third inequality over all $\{x,y\}\in\binom{X}{2}$. On the other hand, setting
\[ \delta_z = \tfrac12\max\Big\{|\rho_{\tau_1}(x,y)-\rho_{\tau_2}(x,y)| : \{x,y\}\in\tbinom{X}{2}\Big\},\qquad z\in X, \]
(6) is clearly fulfilled and we obtain also the $\le$-estimate.

The first estimates yield the rest of the second estimates and complete the proof.

By the same arguments as in Lemma 2, both (7) and (8) possess minimal points $\delta^*\in\mathbb{R}^X_{\ge 0}$. As a corollary of the last theorem we find a useful upper bound for the elements of these vectors:

Lemma 7. In the minimisation problems (7) or (8), we may restrict minimisation to $\delta\in\mathbb{R}^X_{\ge 0}$ which fulfil additionally $\delta_x\le 2D_\infty(\rho,\rho')=D^{PD}_\infty(\rho,\rho')$. E.g., the minimisation problems
\[ \|\delta\|_i \to \min \quad\text{under}\quad
\begin{cases}
0\le\delta_x\le 2D_\infty(\rho,\rho') & x\in X\\
\delta_x+\delta_y \ge |\rho(x,y)-\rho'(x,y)| & x\ne y\in X\\
|\delta_x-\delta_y| \le \rho(x,y)+\rho'(x,y) & x\ne y\in X
\end{cases} \tag{9} \]
and
\[ \|\delta\|_i \to \min \quad\text{under}\quad
\begin{cases}
0\le\delta_x\le 2D_\infty(\rho,\rho') & x\in X\\
\delta_x+\delta_y \ge |\rho(x,y)-\rho'(x,y)| & x\ne y\in X
\end{cases} \tag{10} \]
yield again $D_i(\rho,\rho')$ and $\tilde D_i(\rho,\rho')$ as minimal values, respectively.

Proof. Define $\tilde\delta$ by $\tilde\delta_x=\min(\delta_x, 2D_\infty(\rho,\rho'))$. By the above relation, $\tilde\delta$ is again in the feasible set of (7) and (8), respectively. Further, $\|\tilde\delta\|_i\le\|\delta\|_i$ completes the proof.

Lemma 8. $M(X)$ and $T(X)$ are complete in each $D_i$, $i=1,2,\infty$.

Proof. Clearly, $M(X)$ is complete w.r.t. $D^{PD}_\infty$. Since all these metrics on $M(X)$ are equivalent by Theorem 3, the same is true for $D_i$. On $T(X)$ we have to observe additionally that $T(X)$ is closed since both sides of the four point conditions (5) depend continuously on the metric. Then Lemma 3 implies completeness of $T(X)$.

To show that the new metrics are biologically meaningful, we show that they do not change much under an NNI (nearest neighbour interchange) operation. Writing a tree by the split at its central edge, such an operation is given by
\[ AB\,|\,CD \ \to\ AC\,|\,BD \qquad\text{or}\qquad AB\,|\,CD \ \to\ AD\,|\,CB \]
where $A,B,C,D$ denote different subtrees. The minimal number of NNI operations to reach $\tau'\in T(X)$ from $\tau\in T(X)$ is the NNI-distance $D_{NNI}(\tau,\tau')$ [28].

Theorem 4.
Consider τ, τ ′ ∈ T ( X ) which are away by one NNI operation.Then D ( τ, τ ′ ) ≤ nD ( τ, τ ′ ) ≤ √ nD ∞ ( τ, τ ′ ) = 1 Especially, D NNI ( τ, τ ′ ) ≥ D ∞ ( τ, τ ′ ) ≥ √ n D ( τ, τ ′ ) ≥ n D ( τ, τ ′ ) . Proof.
Let be τ = AB • • CD and τ ′ = AC • • BD where A, B, C, D are the four subtrees of τ, τ ′ corresponding to a four-partition of X . 12hen we observe the following structure of the matrix ∆ ∈ R X ≥ , ∆ x,y =( | ρ τ ( x, y ) − ρ τ ′ ( x, y ) | ) x,y ∈ X : ∆ = or more precisely ∆ x,y = x ∈ A ∪ D, y ∈ B ∪ C y ∈ A ∪ D, x ∈ B ∪ C Remark 3.
Similar estimates could be done for the SPR-metrics. By [1]this has natural implications to the TBR-metrics, too. Further we see thatthe size of the − neighbourhood of a tree τ ∈ T ( X ) in the D ∞ − metric is atleast n − . How large are those bounds compared to the diameter of the space T ( X )?We have some crude estimates: Lemma 9.
For all τ , τ ∈ T ( X ) it holds D ( τ , τ ) ≤ n · n − D ( τ , τ ) ≤ √ n · n − D ∞ ( τ , τ ) ≤ n − Proof. D ∞ ( τ , τ ) ≤ n − follows immediately from D P D ∞ ( τ , τ ) ≤ n − τ , τ have at least one and at most ( n −
1) edges. The-orem 3 implies the other two inequalities and the estimate on the NNI-metricare immediate consequences of its definition.Now we want to show that there are trees such that the distance betweenthem is of the same order in n . Lemma 10.
Let us be given n = 4 m + 1 for some m ∈ N , m ≥ , X = { , . . . , m + 1 } . Suppose τ is the unrooted caterpillar tree with cherries { , } and { m, m + 1 } : τ = 12 • • • • · · · m − • m − • • m m + 113 nd τ ′ is obtained from τ by reversing the order of the even labels, i.e. i isinterchanged with m + 1 − i ) for i = 1 , . . . , m : τ ′ = 14 m • • m − • • · · · • m − • • m + 1 Then D ( τ, τ ′ ) ≥ ˜ D ( τ, τ ′ ) ≥ m − m + 2 D ( τ ′ , τ ′ ) ≥ ˜ D ( τ, τ ′ ) ≥ q m − m + m − D ∞ ( τ, τ ′ ) = 2 m − Proof.
It is easy to see that for 1 < i < j < n = 4 m + 1 ρ ( i, j ) = j i = 1 , , j ≤ m m i = 1 , , j = 4 m + 14 m + 1 − i ≤ i, j = 4 m, m + 1 j − i + 2 otherwiseFirst, the formula for D ∞ ( τ, τ ′ ) follows immediately from Theorem 3.Continuing, we obtain from (8) the following constraints δ + δ ≥ m − δ + δ ≥ m − δ + δ ≥ m − δ m − + δ m ≥ δ m − + δ m − ≥ m − δ m − + δ m ≥ m − D ( τ, τ ′ ).Now a + b ≥ ( a + b ) gives us δ + δ ≥ m − δ + δ ≥ m − δ + δ ≥ m − ... δ m − + δ m ≥ δ m − + δ m − ≥ m − δ m − + δ m ≥ m − D ( τ, τ ′ ). Remark 4.
Using the results from the next section and computations similar to the second next section we could derive the same order of magnitude of $D_i$ for general $n$.

From Lemma 4 we obtain for "small" semimetrics $\rho',\rho''\in M(X)$ immediately that
\[ D_i(\rho+\rho',\rho+\rho'') \le D_i(\rho',\rho''). \]
Notably, we can even sharpen this estimate:

Lemma 11. For all $\rho\in M_{>}(X)$ there is an $\varepsilon>0$ such that for $\rho',\rho''\in M(X)$ with $D_\infty(0,\rho'), D_\infty(0,\rho'')<\varepsilon$
\[ D_i(\rho+\rho',\rho+\rho'') = \tilde D_i(\rho',\rho''). \]

Remark 5. For $\tau\in T(X)$ the condition $\rho\in M_{>}(X)$ just means that the labeling is injective. Thus it is weaker than to say that $\tau$ is an inner point of some orthant of tree space as considered in [2], meaning the tree is binary and all edge lengths are positive.

Further, this result is another proof that the $\tilde D_i$ are really metrics, see Lemma 6.

Here and in the following, $0\in M(X)$ denotes the zero semimetric on $X$.

Proof of Lemma 11. By Lemma 7, we may add the constraints $\delta_x\le 2D_\infty(\rho+\rho',\rho+\rho'')=2D_\infty(\rho',\rho'')$ to (7) and (8) to get problems (9) and (10), respectively.

Now it is easy to derive that for
\[ \varepsilon = \tfrac12\min\Big\{\rho(x,y) : \{x,y\}\in\tbinom{X}{2}\Big\} \]
and $\rho',\rho''\in M(X)$ with $D_\infty(0,\rho'), D_\infty(0,\rho'')<\varepsilon$, the constraints
\[ |\delta_x-\delta_y| \le 2\rho(x,y)+\rho'(x,y)+\rho''(x,y) \]
are automatically fulfilled. Removing them yields problem (10), whose remaining constraints depend on $|\rho'-\rho''|$ only, so its value is $\tilde D_i(\rho',\rho'')$.

Example 3. So it is interesting to ask for $\tilde D_i(0,\tau^l_{A,B})$ for a very simple $\tau$. We choose for $\tau=\tau^l_{A,B}$ the tree with a single edge of length $l$ joining the leaf sets $A$ and $B$, where $A|B$ is a split of $X$ and $l$ is the length of this split.

We see that the constraints from (8) turn into
\[ \delta_x \ge 0,\quad x\in X, \qquad \delta_x+\delta_y \ge l,\quad x\in A,\ y\in B . \]
Now (8) is symmetric under permutations of $A$ and under permutations of $B$. Thus we may simply assume that
\[ \delta_x = \begin{cases} a & x\in A\\ b & x\in B \end{cases} \]
for some $a,b\in\mathbb{R}_{\ge 0}$ with $a+b\ge l$.

For computing $\tilde D_1$, we find
\[ \|\delta\|_1 = \#A\,a+\#B\,b \ge \#A\,a+\#B\,(l-a). \]
The latter function of $a$ has minimum $\tilde D_1(0,\tau^l_{A,B})=\min(\#A,\#B)\,l$.

Similarly we find for $\tilde D_2$
\[ \|\delta\|_2^2 = \#A\,a^2+\#B\,b^2 \ge \#A\,a^2+\#B\,(l-a)^2 . \]
Now the minimum is $\tilde D_2(0,\tau^l_{A,B})=\sqrt{\frac{\#A\,\#B}{n}}\;l$.

In summary, we observe that different splits of a tree get different weights.

Moreover, we see that the minimal points $\delta^*_i$ fulfil all constraints in (7). This shows $D_i=\tilde D_i$. Further, the same computations are valid if we compute $D_i(\tau^l_{A,B},\tau^{l'}_{A,B})$ with $|l-l'|$ replacing $l$:
\[ D_1(\tau^l_{A,B},\tau^{l'}_{A,B}) = \min(\#A,\#B)\,|l-l'| \]
\[ D_2(\tau^l_{A,B},\tau^{l'}_{A,B}) = \sqrt{\tfrac{\#A\,\#B}{n}}\;|l-l'| \]

Example 4.
We want to compute $\tilde D_i(0,\tau^{l,l'}_{A,B,C})$ for the path-shaped tree $\tau=\tau^{l,l'}_{A,B,C}$ in which the leaf sets $A$ and $B$ are joined by an edge of length $l$ and $B$ and $C$ by an edge of length $l'$. This tree is the essence of two trees with the same shape but differing in the lengths of two edges.

Again, symmetry allows us to consider only
\[ \delta_x = \begin{cases} a & x\in A\\ b & x\in B\\ c & x\in C \end{cases} \]
for some $a,b,c\in\mathbb{R}_{\ge 0}$ which now fulfil
\[ a+b \ge l, \qquad b+c \ge l', \qquad a+c \ge l+l' \tag{11} \]
which gives us a linear or quadratic program in $\mathbb{R}^3_{\ge 0}$.

For computing $\tilde D_1$, we want
\[ \#A\,a+\#B\,b+\#C\,c \to \min \]
on this set. We know that this minimum is achieved in a corner of the feasible set. But we see easily that not all inequalities in (11) can be equalities unless $b=0$. Thus at least one of $a,b,c$ must be zero and we obtain the minimal value as
\[ \min\{\#A\,l+\#C\,l',\ (\#B+\#C)\,l+\#C\,l',\ \#A\,l+(\#A+\#B)\,l'\} , \]
corresponding to $b=0$, $a=0$ and $c=0$, respectively. A distinction of cases whether $\#A\gtrless\#B+\#C$ and $\#C\gtrless\#A+\#B$ gives us in any case one of the values as minimum. Thus in any case, $\tilde D_1(0,\tau^{l,l'}_{A,B,C})$ is a linear combination of $l$ and $l'$, i.e. some weighted $\ell_1$-distance.

The computation of $\tilde D_2$ would mean solving the quadratic program
\[ \#A\,a^2+\#B\,b^2+\#C\,c^2 \to \min \]
For this problem, we only know that the solution is the projection of the null vector onto the affine subspace determined by some face of the feasible set. This projection is linear in $l$ and $l'$. This means that $\tilde D_2^2$ is the minimum of several quadratic functions in $l,l'$. Since the algebra is rather tedious we stop here with the indication that this minimum is just a single quadratic function, similar to the linear case before. A numerical test for several cardinalities and random lengths $l,l'$, provided in Figure 1, shows that the parallelogram equality is fulfilled in all considered situations. Thus the local geometry seems to be euclidean. This was our original expectation when we introduced $D_2$. But even if this were true in general, we are already assured by the previous example that we do not compute the geodesic metric from [2].
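The corner argument for $\tilde D_1$ in Example 4 can be replayed numerically. The following is a minimal sketch in Python, assuming the symmetry reduction to the three variables $a,b,c$ and the constraints (11); the brute-force corner enumeration and the function names `d1_three_blocks` and `one_zero_corners` are illustrative choices and not the method of the R implementation. It enumerates all corners of the feasible set with exact rational arithmetic and compares the result with the three corners in which one of $a,b,c$ vanishes.

```python
from fractions import Fraction
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def d1_three_blocks(nA, nB, nC, l, lp):
    """Minimise nA*a + nB*b + nC*c under (11) and a, b, c >= 0 by corner enumeration."""
    l, lp = Fraction(l), Fraction(lp)
    zero = Fraction(0)
    # every constraint reads  row . (a, b, c) >= rhs
    cons = [((1, 1, 0), l), ((0, 1, 1), lp), ((1, 0, 1), l + lp),
            ((1, 0, 0), zero), ((0, 1, 0), zero), ((0, 0, 1), zero)]
    best = None
    for active in combinations(cons, 3):
        rows = [r for r, _ in active]
        rhs = [b for _, b in active]
        det = det3(rows)
        if det == 0:
            continue                      # the three constraints do not define a point
        point = []
        for k in range(3):                # Cramer's rule, column k replaced by rhs
            mk = [list(r) for r in rows]
            for i in range(3):
                mk[i][k] = rhs[i]
            point.append(Fraction(det3(mk)) / det)
        if all(sum(r[j] * point[j] for j in range(3)) >= b for r, b in cons):
            cost = nA * point[0] + nB * point[1] + nC * point[2]
            best = cost if best is None else min(best, cost)
    return best


def one_zero_corners(nA, nB, nC, l, lp):
    """Costs of the corners with b = 0, a = 0 and c = 0, read off from (11)."""
    l, lp = Fraction(l), Fraction(lp)
    return (nA * l + nC * lp,             # b = 0: a = l,      c = l'
            nB * l + nC * (l + lp),       # a = 0: b = l,      c = l + l'
            nA * (l + lp) + nB * lp)      # c = 0: a = l + l', b = l'


print(d1_three_blocks(1, 2, 3, 1, 2), one_zero_corners(1, 2, 3, 1, 2))
```

Since the objective has positive coefficients and the feasible set lies in the nonnegative orthant, the minimum is attained at a corner, which is what makes the enumeration a valid check for small cases.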
Figure 1: Test of the parallelogram equality for random lengths $l,l'$ and $\#A=\#B=\#C=1$ (above left), $\#A=1,\#B=2,\#C=3$ (above right). On the $x$-axis $2\tilde D_2(0,\tau^{l_1,l_1'}_{A,B,C})^2+2\tilde D_2(0,\tau^{l_2,l_2'}_{A,B,C})^2$ is presented. On the $y$-axis $\tilde D_2(0,\tau^{l_1+l_2,\,l_1'+l_2'}_{A,B,C})^2+\tilde D_2(0,\tau^{l_1-l_2,\,l_1'-l_2'}_{A,B,C})^2$ is plotted. Below, the curves $l\mapsto\tilde D_2(0,\tau^{l,1}_{A,B,C})$ for different scenarios on $A,B,C$ are plotted.

6 Monotony
For any X -tree τ let τ | X denote the restriction to X ⊆ X . Observe thatfor τ ∈ T ( X ) in general τ | X / ∈ T ( X ). Lemma 12.
Let X ⊇ X and τ, τ ′ ∈ T ( X ) . Then for i = 1 , , ∞ D i ( τ, τ ′ ) ≥ D i ( τ | X , τ ′ | X )˜ D i ( τ, τ ′ ) ≥ ˜ D i ( τ | X , τ ′ | X ) Proof.
This follows immediately from the same inequalities for semimetricson X . Then, restricting d ∗ i ∈ E ( ρ, ρ ′ ) from Lemma 2 to X ∪ X ′ yields anelement of E ( ρ | X , ρ ′ | X ). Moreover, k ( δ x ) x ∈ X k i ≥ k ( δ x ) x ∈ X k i for δ ∈ R X ≥ completes the calculation. Remark 6.
This result naturally holds for many other phylogenetic metrics: for the pathwise difference, NNI-, SPR-, TBR- and maximum parsimony metrics, for example. For the tree rearrangement metrics it was shown in [1, Lemma 2.2].
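For the pathwise difference metrics mentioned in Remark 6, monotony under restriction of the taxon set is easy to observe directly, because restricting only removes entries from the vector of differences. The following is a minimal sketch in Python; the dictionary-based input format, the helper names and the two unweighted quartet trees (cherries $\{A,B\},\{C,D\}$ versus $\{A,C\},\{B,D\}$) are assumptions made for illustration.

```python
import itertools
import math


def path_difference(rho, rho2, taxa, i):
    """Pathwise difference metric: l_i norm of rho - rho2 over all pairs of taxa."""
    diffs = [abs(rho[x, y] - rho2[x, y])
             for x, y in itertools.combinations(sorted(taxa), 2)]
    if not diffs:
        return 0.0
    if i == 1:
        return float(sum(diffs))
    if i == 2:
        return math.sqrt(sum(d * d for d in diffs))
    return float(max(diffs))              # i == math.inf


def symmetric(table):
    """Extend a table given on pairs (x, y) to the pairs (y, x) as well."""
    out = dict(table)
    out.update({(y, x): v for (x, y), v in table.items()})
    return out


# leaf distances of the two unweighted quartet trees
rho = symmetric({("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 3,
                 ("B", "C"): 3, ("B", "D"): 3, ("C", "D"): 2})
rho2 = symmetric({("A", "B"): 3, ("A", "C"): 2, ("A", "D"): 3,
                  ("B", "C"): 3, ("B", "D"): 2, ("C", "D"): 3})

full = "ABCD"
violations = [
    (sub, i)
    for size in (2, 3)
    for sub in itertools.combinations(full, size)
    for i in (1, 2, math.inf)
    if path_difference(rho, rho2, sub, i) > path_difference(rho, rho2, full, i)
]
print(violations)   # an empty list means no restriction increased the distance
```

The same loop can serve as a regression test for any implementation of $D_i$ or $\tilde D_i$: by Lemma 12 the list of violations should stay empty.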
The different metrics were implemented by R [35] programs. For solving lin-ear and quadratic programs the glpkAPI library [38] and quadprog library[37] were used, respectively. The corresponding R -script can be downloadedfrom the website [40]. Some testing showed best performance in terms ofcomputing time for the dual simplex algorithm in the ℓ -case. The com-puting time for obtaining the distance between random trees of size 100 wasaround 0.3s which is quite reasonable, see Figure 2. It also compares with thecomputing time of the geodesic distance. The random trees were generatedby the function rtree of the R library phangorn [36].We also compared D i and ˜ D i with several other phylogenetic metrics,essentially the pathwise difference, the geodesic distance and the Robinson-Foulds metric, for n = 10 leaves. For the computation of the geodesic (BHV-)metric the R -package distory [39] was used. The results are presented inFigure 3. Numerially, we could observe D i = ˜ D i in all cases, seee Figure6 at the end of the paper. A remarkable correlation between the differentGromov-type and the pathwise difference metrics can be observed. There isnot much correlation to the geodesic distance. May be, the different weigthson the internal edges (see example 3) are responsible for that.19 p1 l d1 l d1 ( ) l~ l l~ BHV RF . . . . . . . . . . computing times of tree metrics, n = type t i m e Figure 2: Computing times different metrics (logarithmic scale) for randomtrees with n = 100 using the dual simplex algorithm. From left: D but withprimal simplex algorithm, D , D for n = 200, ˜ D , D , ˜ D , the geodesic andthe Robinson-Foulds metric.Similar pictures are found for unweighted trees, see Figure 4. Interest-ingly, D = ˜ D turns out to integer-valued now, see the same figure. 
That is rather surprising, since the matrix corresponding to the linear program (8) is not totally unimodular in the sense of [18], it contains the 3 × with determinant − D (over the sample) than from random trees. In comparison, the lower bound from Lemma 10 would be much smaller: n − n + 2 = 17.

Figure 3: Comparison of different metrics for random trees with n = 10. Above from upper left: $D_1$, $D_2$, $D_\infty$, $D_{PD1}$, $D_{PD2}$, the geodesic and the Robinson-Foulds metric. Below, the distributions are presented in boxplots.

Figure 4: Comparison of different metrics for random unweighted trees with n = 10. Above from upper left: $D_1$, $D_2$, $D_\infty$, $D_{PD1}$, $D_{PD2}$, the geodesic and the Robinson-Foulds metric. In the middle, the distributions are presented in boxplots. At the bottom, the frequency table of D is presented.

Figure 5: Comparison of different metrics for random caterpillars with n = 10. Above from upper left: $D_1$, $D_2$, $D_\infty$, $D_{PD1}$, $D_{PD2}$, the geodesic and the Robinson-Foulds metric. In the middle, the distributions are presented in boxplots. At the bottom, the frequency table of D is presented with the lower bound from Lemma 10 in red.

What have we achieved? We constructed at least two new biocomputable metrics for comparing unrooted, but possibly weighted, phylogenetic trees. We think this approach is valuable and could generalise well. One direction is the extension to rooted trees. We should then just measure the distance of the induced metrics on $X \cup \{\mathrm{root}\}$. Another generalisation could be phylogenetic networks. Outside phylogeny, there should be applications to other kinds of finite labeled metric spaces. At the moment, we are only aware of the papers of F. Mémoli, e.g. [23], which deal with $\ell_p$-type Gromov-Hausdorff metrics.

In general, we follow [31] in arguing that there is no universal metric for phylogenetic trees which suits all purposes perfectly. We think that every application has its own choice, and we added a further choice to this portfolio. Yet, we should discuss further properties of phylogenetic metrics to guide the users. Monotony as considered in Section 6 is a beginning in this direction, albeit a trivial one. Here we want to discuss only some important results of the present paper and possible extensions.

It would be interesting to extend the metric to tree shapes by allowing the labels to be permuted. That metric would still differ from the Gromov-Hausdorff metric, since we allow only matchings of the labels, in contrast to the weaker version in (2). For the Gromov-Hausdorff distance it is shown in [25] that it is again NP-hard to compute. We expect the same for the permutation approach.

One important topic which already came up in [3, 22, 11, 21] is the question of how to weight the edges of the trees. We obtained natural weights from our approach in Example 3. If those weights do not fit the intention of the user, it is easy to shorten or lengthen the edges of the trees and obtain other metric spaces which can be compared just as easily. There is also the possibility to weight the labels, for instance to account for uneven sampling. We could then adjust to this by weighting the $\|\cdot\|_i$ norms, which leads again to similar computations. Note that we already met such a weighted approach in the computations in Examples 3 and 4.
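As a rough illustration of what weighting the norms by label weights could look like, the following Python sketch computes a weighted $\ell_p$ path difference between two trees given by their leaf-to-leaf distance tables. It is not the R implementation from [40] and it does not compute $D_i$; the function name `weighted_path_difference`, the dictionary layout, and in particular the choice of the pair weight as the product of the two label weights are assumptions made only for this example.

```python
from itertools import combinations


def weighted_path_difference(d1, d2, weights, p=2):
    """Weighted l_p norm of the difference of two leaf-distance tables.

    d1, d2  : dict of dicts, d[x][y] = path length between leaves x and y
    weights : dict, weights[x] = nonnegative weight of label x (assumed input)
    p       : 1, 2, ... or float("inf")

    Assumption for illustration: the pair {x, y} is weighted by
    weights[x] * weights[y].
    """
    labels = sorted(weights)
    terms = []
    for x, y in combinations(labels, 2):
        diff = abs(d1[x][y] - d2[x][y])
        terms.append((weights[x] * weights[y], diff))
    if p == float("inf"):
        # The sup-norm only looks at pairs that carry positive weight.
        return max((diff for w, diff in terms if w > 0), default=0.0)
    return sum(w * diff ** p for w, diff in terms) ** (1.0 / p)


# Two small leaf-distance tables on the labels a, b, c.
d_first = {"a": {"b": 2, "c": 3}, "b": {"a": 2, "c": 3}, "c": {"a": 3, "b": 3}}
d_second = {"a": {"b": 3, "c": 3}, "b": {"a": 3, "c": 5}, "c": {"a": 3, "b": 5}}
uniform = {"a": 1, "b": 1, "c": 1}
oversampled_a = {"a": 2, "b": 1, "c": 1}
```

With uniform weights this reduces to the usual path difference norms; other weightings of the pairs would fit the same pattern.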
Further, a Kantorovich-Wasserstein approach similar to [23] might be feasible if the weights of the leaves differ between the trees. In summary, our approach is natural but can be well adjusted to the needs of applications.

We showed several properties of the new metrics, including compatibility with the NNI-metric, a number of estimates against the pathwise difference metrics, local properties related to the lower bound metrics $\tilde D_i$, and monotony. Of course, there are many more questions in this context. In particular, we would like to sharpen the estimates. We do not know much about the 1-neighbourhoods on $T(X)$, e.g. whether there are islands in the sense of [3]. There are also many connections with the quartet, SPR-, TBR-, maximum parsimony, weighted matching and BHV-metrics to explore. Numerical comparison was done for the R-implemented distances only.

We expect the maximal distance between two unweighted X-trees (the diameter) to be realised by caterpillar trees. The simulation result in Figure 5 points in this direction. A sharper estimate than those provided in Lemma 9 and Lemma 10 would be quite interesting, too. It is still not clear whether and why D or $\tilde D$ takes integer values only on $T(X)$.

The geometry induced by the Euclidean-type metrics $D_2$, $\tilde D_2$ should be explored further, too. It would be interesting to prove that it is locally Euclidean and to find out what the geodesics look like. Possibly, the geodesic distance with respect to $D_2$ is even another metric.

The question we find most interesting is whether $D_i = \tilde D_i$. Provable equality could save some computing time, at least. Until this problem is solved, we just know there are new animals in the zoo of phylogenetic distances . . . but not how many.

Acknowledgements
First of all, I have to thank Mareike Fischer for introducing me to the world of phylogenetic distances. She also helped a lot in arriving at a clear notation. Second, I'm very grateful to Jürgen Eichhorn, who unwittingly drew my attention to metrics between metric spaces. Third, I'd like to thank Michelle Kendall for her inspiring talk at the Portobello conference 2015 and additional discussion later. Fourth, I thank Mike Steel for many interesting discussions, useful hints, his kind hospitality during my stay in Christchurch in 2010, and for the organisation of the amazing 2015 workshop in Kaikoura with an inspiring and open atmosphere. Further, Andrew Francis, Alexander Gavryushkin, Stefan Grünewald, Marc Hellmuth and Giulio Dalla Riva gave useful hints and inspiration in many discussions.
References

[1] B.L. Allen and M. Steel, Subtree Transfer Operations and Their Induced Metrics on Evolutionary Trees, Ann. Comb. : 1–15, 2001.
[2] L.J. Billera, S.P. Holmes, and K. Vogtmann, Geometry of the Space of Phylogenetic Trees, Adv. Appl. Math. (4): 733–767, 2001.
[3] D. Bogdanowicz and K. Giaro, Matching Split Distance for Unrooted Binary Phylogenetic Trees, IEEE/ACM Transactions on Computational Biology and Bioinformatics (1): 150–160, 2012.
[4] D. Bogdanowicz, K. Giaro, and B. Wróbel, TreeCmp: Comparison of Trees in Polynomial Time, Evol. Bioinform. Online : 475–487, 2012.
[5] M.L. Bonet and K. St. John, On the complexity of uSPR distance, IEEE/ACM Trans. Comp. Biol. Bioinf. (3): 572–576, 2010.
[6] G.S. Brodal, R. Fagerberg, and C.N.S. Pedersen, Computing the quartet distance between evolutionary trees in time O(n log n), Proceedings of the 12th International Symposium on Algorithms and Computation (ISAAC), Springer Verlag, Lecture Notes in Computer Science, Vol. 2223, pp. 731–737, 2001.
[7] P. Buneman, The Recovery of Trees from Measures of Dissimilarity, in D.G. Kendall and P. Tautu (eds.), Mathematics in the Archaeological and Historical Sciences, pp. 387–395, Edinburgh University Press, 1971.
[8] P. Buneman, A note on the metric properties of trees, J. Comb. Th. (1): 48–50, 1974.
[9] J. Cristina, Gromov-Hausdorff convergence of metric spaces, preprint, Helsinki 2008.
[10] B. DasGupta, X. He, T. Jiang, M. Li, J. Tromp, and L. Zhang, On Distances between Phylogenetic Trees, Proc. Eighth ACM/SIAM Symp. Discrete Algorithms (SODA '97), pp. 427–436, 1997.
[11] A. Gavryushkin and A. Drummond, The space of ultrametric phylogenetic trees, preprint 2014, arXiv:1410.3544v1.
[12] W.H.E. Day, Optimal algorithms for comparing trees with labeled leaves, J. Class. 2(1): 7–28, 1985.
[13] E. Deza and M.M. Deza, Encyclopedia of Distances, Springer 2009.
[14] A.W.M. Dress, Trees, tight extensions of metric spaces, and the cohomological dimension of certain groups: A note on combinatorial properties of metric spaces, Adv. Math. (3): 321–402, 1984.
[15] G.F. Estabrook, F.R. McMorris, and C.A. Meacham, Comparison of Undirected Phylogenetic Trees Based on Subtrees of Four Evolutionary Units, Syst. Zool. (2): 193–200, 1985.
[16] M. Fischer and S. Kelk, On the Maximum Parsimony distance between phylogenetic trees, in press at Ann. Comb. (preliminary version: arXiv:1402.1553).
[17] A. Guénoche, B. Leclerc, and V. Makarenkov, On the extension of a partial metric to a tree metric, Discr. Math. : 229–248, 2004.
[18] A.J. Hoffman and J. Kruskal, Introduction to Integral Boundary Points of Convex Polyhedra, in M. Jünger et al. (eds.), 50 Years of Integer Programming, 1958-2008, Springer Berlin 2010, pp. 49–50.
[19] N. Karmarkar, A new polynomial-time algorithm for linear programming, Combinatorica (4): 373–395, 1984.
[20] S. Kelk and M. Fischer, On the complexity of computing MP distance between binary phylogenetic trees, arXiv:1412.4076.
[21] M. Kendall, A new metric for the comparison of phylogenetic trees, talk at the New Zealand phylogenetic conference, Portobello, 2015.
[22] Y. Lin, V. Rajan, and B.M.E. Moret, A metric for phylogenetic trees based on matching, IEEE/ACM Trans. Comp. Biol. Bioinf. (4): 1014–1022, 2012.
[23] F. Mémoli, On the Use of Gromov-Hausdorff Distances for Shape Comparison, Symposium on Point Based Graphics 2007, Prague, September 2007.
[24] M. Owen and J. Provan, A Fast Algorithm for Computing Geodesic Distances in Tree Space, IEEE/ACM Trans. Comp. Biol. Bioinf. (1): 2–13, 2011, arXiv:0907.3942.
[25] P.M. Pardalos and H. Wolkowicz (eds.), Quadratic assignment and related problems, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 16, American Mathematical Society, Providence, RI, 1994. Papers from the workshop held at Rutgers University, New Brunswick, New Jersey, May 20–21, 1993.
[26] N.D. Pattengale, E.J. Gottlieb, and B.M. Moret, Efficiently computing the Robinson-Foulds metric, J. Comput. Biol. (6): 724–735, 2007.
[27] D. Penny and M.D. Hendy, The Use of Tree Comparison Metrics, Syst. Biol. (1): 75–82, 1985.
[28] D.F. Robinson, Comparison of Labeled Trees with Valency Three, J. Comb. Th. : 105–119, 1971.
[29] D.F. Robinson and L.R. Foulds, Comparison of weighted labelled trees, in Combinatorial Mathematics VI, Lecture Notes in Mathematics 748, Springer, Berlin, 1979, pp. 119–126.
[30] D.F. Robinson and L.R. Foulds, Comparison of Phylogenetic Trees, Math. Biosciences : 131–147, 1981.
[31] M.A. Steel and D. Penny, Distributions of Tree Comparison Metrics – Some New Results, Syst. Biol. (2): 126–141, 1993.
[32] C. Whidden, R.G. Beiko, and N. Zeh, Fixed-Parameter and Approximation Algorithms for Maximum Agreement Forests of Multifurcating Trees, to appear in Algorithmica, 2015, arXiv:1305.0512.
[33] W.T. Williams and H.T. Clifford, On the comparison of two classifications of the same set of elements, Taxon : 519–522, 1971.
[34] K.A. Zaretskii, Constructing a tree on the basis of a set of distances between the hanging vertices (in Russian), Uspekhi Mat. Nauk (6): 90–92, 1965.
[35] R Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria. URL , 2015.
[36] K.P. Schliep, phangorn: Phylogenetic analysis in R, Bioinformatics (4): 592–593, 2011.
[37] B.A. Turlach (R port by Andreas Weingessel), quadprog: Functions to solve Quadratic Programming Problems, R package version 1.5-5, http://CRAN.R-project.org/package=quadprog , 2013.
[38] G. Gelius-Dietrich, glpkAPI: R Interface to C API of GLPK, R package version 1.3.0, http://CRAN.R-project.org/package=glpkAPI , 2015.
[39] J. Chakerian and S. Holmes, distory: Distance Between Phylogenetic Histories, R package version 1.4.2, http://CRAN.R-project.org/package=distory , 2013.
[40] V. Liebscher, R file for figures in the present paper, 2015.

A On metric extensions
Several times we met the problem whether a partial dissimilarity on $X$, i.e. a map $q: E \to \mathbb{R}_{\ge 0}$, $E \subseteq \binom{X}{2}$, has an extension to a metric on $X$. This seems to be a well-known problem; one folklore solution I found in [17]:

Theorem 5.
If the graph $G = (X, E)$ is simple and connected, then $q: E \to \mathbb{R}_{\ge 0}$ extends to a semimetric on $X$ if and only if $q(x, y) = d^q_G(x, y)$ for all $x, y \in X$ with $\{x, y\} \in E$. Here $d^q_G$ was introduced in (4).

Although this presents a complete solution of the extension problem, we want to sharpen this criterion for improved applicability. The next result should still be folklore, but I could not find it in the literature. If $p = x_0 x_1 \dots x_m$ is a cycle in a graph $(X, E)$, we call any pair $\{x_i, x_j\} \in E$ with $0 \le i, j \le m - 1$ and $2 \le |i - j| \le m - 2$ a chord of $p$. A cycle $p$ without a chord is called a minimal cycle.

Theorem 6.
If the graph $G = (X, E)$ is simple and connected, then $q: E \to \mathbb{R}_{\ge 0}$ extends to a metric on $X$ if and only if for all minimal cycles $p$ of $G$ and all edges $\{x, y\}$ in $p$

$$q(x, y) \le \operatorname{len}(p). \tag{12}$$

Proof.
We assume the opposite. Thus we find a (non-minimal) cycle $p = x_0 x_1 \dots x_m = x_0$ such that $e = \{x_0, x_1\}$ violates (12). We may assume w.l.o.g. that the length $m$ of $p$ is minimal. Non-minimality of $p$ implies that there is a chord $\{x_i, x_j\}$ of $p$. Since $m$ is minimal, we know

$$d(x_i, x_j) \le \sum_{k=i}^{j-1} d(x_k, x_{k+1})$$

and

$$d(x_0, x_1) \le \sum_{k=1}^{i-1} d(x_k, x_{k+1}) + d(x_i, x_j) + \sum_{k=j}^{m-1} d(x_k, x_{k+1}).$$

Substituting the first inequality into the right-hand side of the second one yields

$$d(x_0, x_1) \le \sum_{k=1}^{m-1} d(x_k, x_{k+1}).$$

This is (12). This contradiction completes the proof.

We can use this result for the
Proof of Theorem 2. We use Theorem 6 on $X \cup X'$ with $E = \binom{X}{2} \cup \binom{X'}{2} \cup \{\{x, x'\} : x \in X\}$. The minimal cycles in $X \cup X'$ are either triangles in $X$, triangles in $X'$, or rectangles $x, y, y', x'$. For the two former, (12) is equivalent to the triangle inequalities for $\rho, \rho'$. For the latter, (12) is the same as (6).

The following result was used in the proof of Theorem 1.

Lemma 13.
Suppose $X, Y, Z$ are disjoint sets and there are given $d_1 \in M(X \cup Y)$ and $d_2 \in M(Y \cup Z)$ such that $d_1|_{\binom{Y}{2}} = d_2|_{\binom{Y}{2}}$. Then there exists a $d \in M(X \cup Y \cup Z)$ such that $d|_{\binom{X \cup Y}{2}} = d_1$ and $d|_{\binom{Y \cup Z}{2}} = d_2$.

Proof. We apply Theorem 6 to the graph $\left(X \cup Y \cup Z, \binom{X \cup Y}{2} \cup \binom{Y \cup Z}{2}\right)$ with

$$w(u, v) = \begin{cases} d_1(u, v), & u, v \in X \cup Y, \\ d_2(u, v), & u, v \in Y \cup Z. \end{cases}$$

Since both $X \cup Y$ and $Y \cup Z$ are cliques in this graph, the only minimal cycles are triangles. For them (12) is fulfilled by definition of $w$.

Figure 6: Equality of $D_i$ with $\tilde D_i$, $i = 1, 2$ (panel labels: l1s, l2, l2s).
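The criterion of Theorem 5 can be checked mechanically on small examples. The Python sketch below assumes that $d^q_G$ is the shortest-path length in $G$ with edge weights $q$ (our reading of (4), which is not reproduced in this appendix) and computes it with the Floyd-Warshall algorithm; the function names `shortest_path_closure` and `extends_to_semimetric` and the tolerance `tol` are hypothetical choices for this illustration, not part of the R script [40].

```python
import math


def shortest_path_closure(points, q):
    """Shortest-path lengths on the graph whose weighted edges are given by q.

    points : list of vertex labels
    q      : dict mapping pairs (x, y) to nonnegative edge weights
    Returns a dict dist[(x, y)]; math.inf marks disconnected pairs.
    """
    dist = {(x, y): (0.0 if x == y else math.inf) for x in points for y in points}
    for (x, y), w in q.items():
        dist[x, y] = min(dist[x, y], w)
        dist[y, x] = min(dist[y, x], w)
    # Floyd-Warshall: allow k as an intermediate vertex, one k at a time.
    for k in points:
        for x in points:
            for y in points:
                via = dist[x, k] + dist[k, y]
                if via < dist[x, y]:
                    dist[x, y] = via
    return dist


def extends_to_semimetric(points, q, tol=1e-12):
    """Check the condition q(x, y) = d(x, y) on every edge {x, y} of the graph.

    Returns (ok, dist); if ok is True, dist is one possible extension of q.
    """
    dist = shortest_path_closure(points, q)
    ok = all(abs(dist[x, y] - w) <= tol for (x, y), w in q.items())
    return ok, dist


pts = ["a", "b", "c"]
ok_path, d_path = extends_to_semimetric(pts, {("a", "b"): 1, ("b", "c"): 2})
ok_bad, _ = extends_to_semimetric(pts, {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 3})
```

In the second example the edge $\{a, c\}$ is longer than the detour through $b$, so no extension exists; in the first example the shortest-path lengths themselves provide an extension.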