The expected value of the squared euclidean cophenetic metric under the Yule and the uniform models

Abstract

The cophenetic metrics $d_{\varphi,p}$, for $p\in\{0\}\cup[1,\infty[$, are a recent addition to the kit of available distances for the comparison of phylogenetic trees. Based on a fifty-year-old idea of Sokal and Rohlf, these metrics compare phylogenetic trees on the same set of taxa by encoding them by means of their vectors of cophenetic values of pairs of taxa and depths of single taxa, and then computing the $L^p$ norm of the difference of the corresponding vectors. In this paper we compute the expected value of the square of $d_{\varphi,2}$ on the space of fully resolved rooted phylogenetic trees with $n$ leaves, under the Yule and the uniform probability distributions.


Gabriel Cardona, Arnau Mir, Francesc Rosselló*

Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain

Keywords: Phylogenetic tree, Cophenetic metric, Uniform model, Yule model, Sackin index, Total cophenetic index

1. Introduction

The definition and study of metrics for the comparison of rooted phylogenetic trees on the same set of taxa is a classical problem in phylogenetics [10, Ch. 30], and many metrics have been introduced so far with this purpose. A recent addition to the set of metrics available in this context are the cophenetic metrics $d_{\varphi,p}$ introduced in [8]. Based on a fifty-year-old idea of Sokal and Rohlf, these metrics compare phylogenetic trees on the same set of taxa by first encoding the trees by means of their vectors of cophenetic values of pairs of taxa and depths of single taxa, and then computing the $L^p$ norm of the difference of the corresponding vectors.

Once the dissimilarity between two phylogenetic trees has been computed through a given metric, it is convenient in many situations to assess its significance. One possibility is to compare the value obtained with its expected, or mean, value: is it much larger, much smaller, similar? [28] This makes it necessary to study the distribution of the metric, or, at least, to have a formula for the expected value of the metric for any number $n$ of leaves. The distribution of several metrics has been studied so far: see, for instance, [5, 6, 16, 17, 28].

The expected value of a distance depends on the probability distribution on the space of phylogenetic trees under consideration. The most popular distribution on the space $\mathcal{T}_n$ of binary phylogenetic trees with $n$ leaves is the uniform distribution, under which all trees in $\mathcal{T}_n$ are equiprobable. But phylogeneticists also consider other probability distributions on $\mathcal{T}_n$, defined through stochastic models of evolution [10, Ch. 33]. The most popular is the so-called Yule model [14, 29], defined by an evolutionary process where, at each step, each currently extant species can give rise, with the same probability, to two new species. Under this model, different phylogenetic trees with the same number of leaves may have different probabilities, which depend on their shape.

In this paper we provide explicit formulas for the expected values under the uniform and the Yule models of the square of the euclidean cophenetic metric $d_{\varphi,2}$. The proofs of these formulas are based on long and tedious algebraic computations and thus, to ease the task of the reader interested only in the formulas and the path leading to them, but not in the details, we have moved these computations to an Appendix at the end of the paper.

Besides the aforementioned application of this value in the assessment of tree comparisons, the knowledge of formulas for the expected value of $d_{\varphi,2}^2$ under different models may allow the use of $d_{\varphi,2}$ to test stochastic models of tree growth, a popular line of research in recent years which so far has been mostly based on shape indices; see, for instance, [3, 19]. As a proof of concept, in §4 we report on a simple experiment of this type on TreeBASE.

* Corresponding author.

Preprint submitted to Elsevier, April 22, 2019.

2. Preliminaries

In this paper, by a phylogenetic tree on a set $S$ of taxa we mean a fully resolved, or binary, rooted tree with its leaves bijectively labeled in $S$. We understand such a rooted tree as a directed graph, with its arcs pointing away from the root. To simplify the language, we shall always identify a leaf of a phylogenetic tree with its label. We shall also use the term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on the set $\{1,\ldots,n\}$. We shall denote by $\mathcal{T}(S)$ the space of all phylogenetic trees on $S$ and by $\mathcal{T}_n$ the space of all phylogenetic trees with $n$ leaves.

Let $T$ be a phylogenetic tree. If there exists a directed path from $u$ to $v$ in $T$, we shall say that $v$ is a descendant of $u$ and also that $u$ is an ancestor of $v$. The lowest common ancestor $LCA_T(u,v)$ of a pair of nodes $u,v$ in $T$ is the unique common ancestor of them that is a descendant of every other common ancestor of them. The depth $\delta_T(v)$ of a node $v$ in $T$ is the distance (in number of arcs) from the root of $T$ to $v$. The cophenetic value $\varphi_T(i,j)$ of a pair of leaves $i,j$ in $T$ is the depth of their LCA. To simplify the notations, we shall often write $\varphi_T(i,i)$ to denote the depth $\delta_T(i)$ of a leaf $i$.

Given two phylogenetic trees $T,T'$ on disjoint sets of taxa $S,S'$, respectively, we shall denote by $T\,\widehat{\ }\,T'$ the phylogenetic tree on $S\cup S'$ obtained by connecting the roots of $T$ and $T'$ to a (new) common root. Every phylogenetic tree $T\in\mathcal{T}_n$ is obtained as $T_k\,\widehat{\ }\,T'_{n-k}$, for some $1\leq k\leq n-1$, some subset $S_k\subseteq\{1,\ldots,n\}$ with $k$ elements, some tree $T_k$ on $S_k$ and some tree $T'_{n-k}$ on $S_k^c=\{1,\ldots,n\}\setminus S_k$. Actually, every phylogenetic tree in $\mathcal{T}_n$ is obtained in this way twice.

The Yule, or Equal-Rate Markov, model of evolution [14, 29] is a stochastic model of phylogenetic trees' growth. It starts with a node, and at every step a leaf is chosen randomly and uniformly and it is split into two leaves. Finally, the labels are assigned randomly and uniformly to the leaves once the desired number of leaves is reached. This corresponds to a model of evolution where, at each step, each currently extant species can give rise, with the same probability, to two new species. Under this stochastic model, if $T\in\mathcal{T}_n$ is a phylogenetic tree with set of internal nodes $V_{int}(T)$, and if for every $v\in V_{int}(T)$ we denote by $\ell_T(v)$ the number of its descendant leaves, then the probability of $T$ is [4, 27]
$$P_Y(T)=\frac{2^{n-1}}{n!}\prod_{v\in V_{int}(T)}\frac{1}{\ell_T(v)-1}.$$

The uniform, or Proportional to Distinguishable Arrangements, model [22] is another stochastic model of phylogenetic trees' growth. Unlike the Yule model, its main feature is that all phylogenetic trees $T\in\mathcal{T}_n$ have the same probability:
$$P_U(T)=\frac{1}{(2n-3)!!},\qquad\text{where } (2n-3)!!=(2n-3)(2n-5)\cdots 3\cdot 1.$$
From the point of view of tree growth, this model is described as the process that starts with a node labeled 1 and then, at the $k$-th step, a new pendant arc, ending in the leaf labeled $k+1$, is added either to a new root (whose other child will be, then, the original root) or to some edge, with all possible locations of this new pendant arc being equiprobable [9, 26]. Although this is not an explicit model of evolution, only of tree growth, several interpretations of it in terms of evolutionary processes have been given in the literature: see [3, p. 686] and the references therein.
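The two probability formulas above can be evaluated mechanically. The following is a minimal Python sketch, not the authors' code: it assumes (purely for illustration) that a tree is encoded as nested 2-tuples with integer leaf labels, and the function names `leaf_count`, `yule_probability` and `uniform_probability` are hypothetical.

```python
from fractions import Fraction
from math import factorial

def leaf_count(tree):
    """Number of leaves of a tree encoded as nested 2-tuples (assumed encoding)."""
    if not isinstance(tree, tuple):
        return 1
    return leaf_count(tree[0]) + leaf_count(tree[1])

def yule_probability(tree):
    """P_Y(T) = 2^(n-1)/n! times the product of 1/(l_T(v) - 1) over internal nodes v."""
    n = leaf_count(tree)
    prob = Fraction(2 ** (n - 1), factorial(n))
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):           # internal node
            prob /= leaf_count(node) - 1      # l_T(v) - 1
            stack.extend(node)
    return prob

def uniform_probability(n):
    """P_U(T) = 1/(2n-3)!! for every tree with n leaves."""
    double_factorial = 1
    for odd in range(2 * n - 3, 0, -2):
        double_factorial *= odd
    return Fraction(1, double_factorial)

caterpillar = (((1, 2), 3), 4)
balanced = ((1, 2), (3, 4))
print(yule_probability(caterpillar))  # 1/18
print(yule_probability(balanced))     # 1/9
print(uniform_probability(4))         # 1/15
```

With 4 leaves there are 12 caterpillars and 3 balanced trees, so the Yule probabilities add up to 12/18 + 3/9 = 1, while every tree has probability 1/15 under the uniform model.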

3. Main results

Let $T\in\mathcal{T}_n$ be a phylogenetic tree with $n$ leaves. The cophenetic vector of $T$ is
$$\varphi(T)=\big(\varphi_T(i,j)\big)_{1\leq i\leq j\leq n}\in\mathbb{R}^{n(n+1)/2},$$
with its elements lexicographically ordered in $(i,j)$. It turns out [8] that the mapping $\varphi:\mathcal{T}_n\to\mathbb{R}^{n(n+1)/2}$ sending each $T\in\mathcal{T}_n$ to its cophenetic vector $\varphi(T)$ is injective up to isomorphism. As is well known, this allows one to induce metrics on $\mathcal{T}_n$ from metrics defined on powers of $\mathbb{R}$. In particular, in this paper we consider the cophenetic metric $d_{\varphi,2}$ on $\mathcal{T}_n$ induced by the euclidean distance:
$$d_{\varphi,2}(T_1,T_2)=\sqrt{\sum_{1\leq i\leq j\leq n}\big(\varphi_{T_1}(i,j)-\varphi_{T_2}(i,j)\big)^2}.$$
To distinguish it from other cophenetic metrics obtained through other $L^p$ norms, we shall call it the euclidean cophenetic metric.

Example 1. Consider the phylogenetic trees $T,T'\in\mathcal{T}_4$ depicted in Fig. 1. Their total cophenetic vectors are $\varphi(T)=(2,\ldots)$ and $\varphi(T')=(1,\ldots)$, so that $d_{\varphi,2}(T,T')^2=7$. As we shall see below, the expected values of the square of $d_{\varphi,2}$ on $\mathcal{T}_4$ under the uniform and the Yule models are, respectively, 10.56 and 9.41, and hence these two trees are considerably more similar than average with respect to the euclidean cophenetic metric under both models.

Figure 1: Two phylogenetic trees with 4 leaves.
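To make the definitions concrete, here is a small Python sketch that computes cophenetic vectors and the squared euclidean cophenetic distance. The nested-tuple encoding and the two trees used are illustrative assumptions; they are not claimed to be the trees of Fig. 1, although they also happen to lie at squared distance 7.

```python
def cophenetic_vector(tree):
    """Cophenetic vector (phi_T(i, j)) for i <= j, in lexicographic order.

    `tree` is encoded as nested 2-tuples with integer leaves (assumed encoding).
    """
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            phi[(node, node)] = depth        # phi_T(i, i) is the depth of leaf i
            return [node]
        left = visit(node[0], depth + 1)
        right = visit(node[1], depth + 1)
        for i in left:                       # `node` is the LCA of every pair (i, j)
            for j in right:
                phi[(min(i, j), max(i, j))] = depth
        return left + right

    visit(tree, 0)
    return [phi[pair] for pair in sorted(phi)]

def squared_distance(tree_a, tree_b):
    """Square of the euclidean cophenetic distance between two trees on the same taxa."""
    va, vb = cophenetic_vector(tree_a), cophenetic_vector(tree_b)
    return sum((a - b) ** 2 for a, b in zip(va, vb))

balanced = ((1, 2), (3, 4))       # hypothetical example tree
caterpillar = (1, (2, (3, 4)))    # hypothetical example tree
print(cophenetic_vector(balanced))              # [2, 1, 0, 0, 2, 0, 0, 2, 1, 2]
print(cophenetic_vector(caterpillar))           # [1, 0, 0, 0, 2, 1, 1, 3, 2, 3]
print(squared_distance(balanced, caterpillar))  # 7
```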

Let $D_n^2$ be the random variable that chooses a pair of trees $T,T'\in\mathcal{T}_n$ and computes $d_{\varphi,2}(T,T')^2$. Its expected values under the Yule and the uniform models are given by the following two theorems. Recall that the $n$-th harmonic number $H_n$ is defined as $H_n=\sum_{i=1}^n 1/i$.

Theorem 2. For every $n\geq 2$, the expected value of $D_n^2$ under the Yule model is
$$E_Y(D_n^2)=\frac{2n}{n-1}\Big(3n^2-10n-1+8(n+1)H_n-4(n+1)H_n^2\Big).$$

Theorem 3. For every $n\geq 2$, the expected value of $D_n^2$ under the uniform model is
$$E_U(D_n^2)=\frac{1}{3}\big(4n^3+18n^2-10n\big)-\frac{n(n+3)}{2}\cdot\frac{(2n-2)!!}{(2n-3)!!}-\frac{n(n+7)}{4}\left(\frac{(2n-2)!!}{(2n-3)!!}\right)^2.$$

Since $H_n\sim\ln(n)$ and $(2n-2)!!/(2n-3)!!\sim\sqrt{\pi n}$, these formulas imply that
$$E_Y(D_n^2)\sim 6n^2,\qquad E_U(D_n^2)\sim\Big(\frac{4}{3}-\frac{\pi}{4}\Big)n^3.$$

We shall prove the formulas in Theorems 2 and 3 by reducing the computation of the expected value of $D_n^2$ to that of the following random variables:

• $S_n$, the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes its Sackin index $S$ [23], defined by $S(T)=\sum_{i=1}^n\delta_T(i)$.

• $\Phi_n$, the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes its total cophenetic index $\Phi$ [18], defined by $\Phi(T)=\sum_{1\leq i<j\leq n}\varphi_T(i,j)$.

• $\Phi^{(2)}_n$, the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes $\Phi^{(2)}(T)=\sum_{1\leq i\leq j\leq n}\varphi_T(i,j)^2$.

In the next result, $E$ denotes the expected value with respect to a probability distribution $p:\mathcal{T}_n\to[0,1]$ on $\mathcal{T}_n$ invariant under relabelings. The probability distributions $p_Y$ and $p_U$ defined by the Yule and the uniform models, respectively, are invariant under relabelings, and therefore the expected values under these specific models, which will be denoted by $E_Y$ and $E_U$, respectively, are special cases of $E$.

Proposition 4.
$$E(D_n^2)=2E(\Phi^{(2)}_n)-\frac{2\,E(S_n)^2}{n}-\frac{4\,E(\Phi_n)^2}{n(n-1)}.$$

Proof. To simplify the notations, let

• $\varphi_n$ be the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes $\varphi_T(1,2)$;

• $\delta_n$ be the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes $\delta_T(1)$.

Let us compute now $E(D_n^2)$ from its very definition:
$$\begin{aligned}
E(D_n^2)&=\sum_{(T,T')\in\mathcal{T}_n^2}d_{\varphi,2}(T,T')^2\,p(T)p(T')\\
&=\sum_{(T,T')\in\mathcal{T}_n^2}\Big(\sum_{1\leq i\leq j\leq n}\big(\varphi_T(i,j)-\varphi_{T'}(i,j)\big)^2\Big)p(T)p(T')\\
&=\sum_{1\leq i\leq j\leq n}\ \sum_{(T,T')\in\mathcal{T}_n^2}\big(\varphi_T(i,j)^2+\varphi_{T'}(i,j)^2-2\varphi_T(i,j)\varphi_{T'}(i,j)\big)p(T)p(T')\\
&=\sum_{1\leq i\leq j\leq n}\Big(\sum_{T\in\mathcal{T}_n}\varphi_T(i,j)^2p(T)+\sum_{T'\in\mathcal{T}_n}\varphi_{T'}(i,j)^2p(T')-2\Big(\sum_{T\in\mathcal{T}_n}\varphi_T(i,j)p(T)\Big)\Big(\sum_{T'\in\mathcal{T}_n}\varphi_{T'}(i,j)p(T')\Big)\Big)\\
&=\sum_{1\leq i\leq j\leq n}\Big(2\sum_{T\in\mathcal{T}_n}\varphi_T(i,j)^2p(T)-2\Big(\sum_{T\in\mathcal{T}_n}\varphi_T(i,j)p(T)\Big)^2\Big)\\
&=2\sum_{T\in\mathcal{T}_n}\Big(\sum_{1\leq i\leq j\leq n}\varphi_T(i,j)^2\Big)p(T)-2\sum_{1\leq i\leq n}\Big(\sum_{T\in\mathcal{T}_n}\delta_T(i)p(T)\Big)^2-2\sum_{1\leq i<j\leq n}\Big(\sum_{T\in\mathcal{T}_n}\varphi_T(i,j)p(T)\Big)^2\\
&=2E(\Phi^{(2)}_n)-2n\,E(\delta_n)^2-n(n-1)\,E(\varphi_n)^2,
\end{aligned}$$
where the last equality uses that, $p$ being invariant under relabelings, the expected value of $\delta_T(i)$ is $E(\delta_n)$ for every leaf $i$ and the expected value of $\varphi_T(i,j)$ is $E(\varphi_n)$ for every pair $i<j$. For the same reason, $E(S_n)=n\,E(\delta_n)$ and $E(\Phi_n)=\binom{n}{2}E(\varphi_n)$, and replacing $E(\delta_n)=E(S_n)/n$ and $E(\varphi_n)=2E(\Phi_n)/(n(n-1))$ in the last expression yields the formula in the statement.

Proposition 5. For every $n\geq 2$,

(a) $E_Y(\Phi^{(2)}_n)=5n(n-1)-8n(H_n-1)$;

(b) $E_U(\Phi^{(2)}_n)=\dfrac{1}{6}n(4n^2+21n-7)-\dfrac{3n(n+3)}{4}\cdot\dfrac{(2n-2)!!}{(2n-3)!!}$.

The formulas in Theorems 2 and 3 are then obtained from Proposition 4 by replacing $E(S_n)$, $E(\Phi_n)$, and $E(\Phi^{(2)}_n)$ by their values under the Yule and the uniform models, respectively. We leave the last details to the reader.
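The "last details" can be checked with exact rational arithmetic. The sketch below plugs the values of Proposition 5 into Proposition 4 and compares the outcome with the closed formulas of Theorems 2 and 3. The expected values of the Sackin and total cophenetic indices used here, $E_Y(S_n)=2n(H_n-1)$, $E_Y(\Phi_n)=n(n+1)-2nH_n$, $E_U(S_n)=n\big(\tfrac{(2n-2)!!}{(2n-3)!!}-1\big)$ and $E_U(\Phi_n)=\tfrac12\binom n2\big(\tfrac{(2n-2)!!}{(2n-3)!!}-2\big)$, are not stated in the text above; they are assumed here as the known values from the literature on these indices. All function names are hypothetical.

```python
from fractions import Fraction

def harmonic(n):
    """n-th harmonic number H_n as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def double_factorial_ratio(n):
    """(2n-2)!! / (2n-3)!! as an exact fraction."""
    ratio = Fraction(1)
    for i in range(1, n):
        ratio *= Fraction(2 * i, 2 * i - 1)
    return ratio

def proposition_4(e_phi2, e_sackin, e_phi, n):
    """E(D_n^2) = 2E(Phi2) - 2E(S)^2/n - 4E(Phi)^2/(n(n-1))."""
    return 2 * e_phi2 - 2 * e_sackin ** 2 / n - 4 * e_phi ** 2 / (n * (n - 1))

def expected_yule(n):
    h = harmonic(n)
    e_sackin = 2 * n * (h - 1)                    # assumed known value of E_Y(S_n)
    e_phi = n * (n + 1) - 2 * n * h               # assumed known value of E_Y(Phi_n)
    e_phi2 = 5 * n * (n - 1) - 8 * n * (h - 1)    # Proposition 5(a)
    return proposition_4(e_phi2, e_sackin, e_phi, n)

def expected_uniform(n):
    r = double_factorial_ratio(n)
    e_sackin = n * (r - 1)                        # assumed known value of E_U(S_n)
    e_phi = Fraction(n * (n - 1), 4) * (r - 2)    # assumed known value of E_U(Phi_n)
    e_phi2 = (Fraction(n * (4 * n * n + 21 * n - 7), 6)
              - Fraction(3 * n * (n + 3), 4) * r)  # Proposition 5(b)
    return proposition_4(e_phi2, e_sackin, e_phi, n)

def theorem_2(n):
    h = harmonic(n)
    return Fraction(2 * n, n - 1) * (
        3 * n * n - 10 * n - 1 + 8 * (n + 1) * h - 4 * (n + 1) * h * h)

def theorem_3(n):
    r = double_factorial_ratio(n)
    return (Fraction(4 * n ** 3 + 18 * n * n - 10 * n, 3)
            - Fraction(n * (n + 3), 2) * r
            - Fraction(n * (n + 7), 4) * r * r)

for n in range(3, 8):
    assert expected_yule(n) == theorem_2(n)
    assert expected_uniform(n) == theorem_3(n)
    print(n, float(theorem_2(n)), float(theorem_3(n)))
```

Working with `Fraction` rather than floats makes the comparison an exact identity check for each tested $n$ instead of a numerical approximation.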

4. An experiment on TreeBASE

In this section we report on a very simple experiment to show how $d_{\varphi,2}$ can be used to test evolutionary hypotheses. In this experiment, we have compared the expected value of $d_{\varphi,2}^2$ on $\mathcal{T}_n$ under the uniform and the Yule models with its average value on the set $\mathrm{TreeBASE}_{bin,n}$ of binary phylogenetic trees with $n$ leaves contained in TreeBASE [20].

To perform this experiment, we have made some decisions. First, since there are only very few values $n>50$ such that $|\mathrm{TreeBASE}_{bin,n}|>10$, we have decided to consider only those binary trees contained in TreeBASE with $n\leq 50$ leaves. On the other hand, even for those $n$ such that $\mathrm{TreeBASE}_{bin,n}$ is relatively large, in most cases it does not contain many pairs of trees with the same taxa. So, instead of computing the average value of $d_{\varphi,2}^2$ on $\mathrm{TreeBASE}_{bin,n}$ by averaging the values $d_{\varphi,2}(T,T')^2$ for pairs of trees $T,T'$ with exactly the same $n$ taxa, we have made use of the formula given in Proposition 4, as if $\mathrm{TreeBASE}_{bin,n}$ were closed under relabelings: that is, we have taken into account only the shapes of the trees contained in it. This is consistent with the fact that our final goal is to test models of evolution that produce tree shapes.

So, we have computed the average values of $\Phi^{(2)}$, of the Sackin index $S$, and of the total cophenetic index $\Phi$ on $\mathrm{TreeBASE}_{bin,n}$, and we have taken as average value of $d_{\varphi,2}^2$ on this set the result of applying the formula in Proposition 4. The detailed results of these computations, as well as the Python and R scripts used to compute and analyze them, are available in the Supplementary Material web page http://bioinfo.uib.es/~recerca/phylotrees/expectedcophdist/ .

Fig. 2 plots the log of these average values as a function of $\log(n)$. We have added the curves of the log of the expected values of $D_n^2$ under the Yule distribution (lower, dotted curve) and under the uniform distribution (upper, dashed curve), again as a function of $\log(n)$. The graphic shows that the expected value of $d_{\varphi,2}^2$ on (the shapes of) the phylogenetic trees contained in TreeBASE is better explained by the uniform model than by the Yule model. This agrees with the results of similar experiments using other measures (see, for instance, [3, 18]).
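The shape-based averaging described above can be sketched as follows. This is not the script from the Supplementary Material; it is a hypothetical Python illustration that assumes trees encoded as nested 2-tuples and simply replaces each expected value in Proposition 4 by a sample mean over a collection of trees with the same number of leaves.

```python
from fractions import Fraction

def shape_indices(tree, depth=0):
    """Return (n, S, Phi, Phi2) for the subtree `tree` hanging at depth `depth`.

    S is the Sackin index, Phi the total cophenetic index, and Phi2 the sum of
    the squared entries of the cophenetic vector. Only the shape matters.
    """
    if not isinstance(tree, tuple):
        return 1, depth, 0, depth ** 2
    n1, s1, p1, q1 = shape_indices(tree[0], depth + 1)
    n2, s2, p2, q2 = shape_indices(tree[1], depth + 1)
    pairs = n1 * n2   # pairs of leaves whose LCA is the current node
    return (n1 + n2, s1 + s2,
            p1 + p2 + pairs * depth,
            q1 + q2 + pairs * depth ** 2)

def mean_squared_distance(trees):
    """Proposition 4 with every E(.) replaced by the sample mean over `trees`."""
    stats = [shape_indices(t) for t in trees]
    n = stats[0][0]
    if any(s[0] != n for s in stats):
        raise ValueError("all trees must have the same number of leaves")
    m = len(stats)
    mean_s = Fraction(sum(s[1] for s in stats), m)
    mean_phi = Fraction(sum(s[2] for s in stats), m)
    mean_phi2 = Fraction(sum(s[3] for s in stats), m)
    return 2 * mean_phi2 - 2 * mean_s ** 2 / n - 4 * mean_phi ** 2 / (n * (n - 1))

# Hypothetical two-tree sample (one balanced tree, one caterpillar).
sample = [((1, 2), (3, 4)), (((1, 2), 3), 4)]
print(mean_squared_distance(sample))  # 63/8
```

Because only the triple $(S,\Phi,\Phi^{(2)})$ of each tree enters the computation, the leaf labels of the sampled trees are irrelevant, which is the sense in which the collection is treated "as if" it were closed under relabelings.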

5. Conclusions and discussion

In this paper we have obtained formulas for the expected values under the Yule and the uniform models of the square of the euclidean cophenetic metric $d_{\varphi,2}$, defined by the euclidean distance between cophenetic vectors. These formulas are explicit and hold on spaces $\mathcal{T}_n$ of fully resolved phylogenetic trees with any number $n$ of leaves.

Figure 2: Log-log plots of the mean of $D_n^2$ for the binary trees in TreeBASE with a fixed number $n$ of leaves, of $E_Y(D_n^2)$ (dotted curve) and $E_U(D_n^2)$ (dashed curve). Horizontal axis: number of leaves; vertical axis: log of the expected value of the square of the distance.

These formulas have been obtained through long algebraic manipulations of sums of sequences. To double-check our results, we have computed the exact value of $E_Y(D_n^2)$ and $E_U(D_n^2)$ for $n=3,\ldots,7$, by generating all trees with up to 7 leaves. Moreover, we have computed numerical approximations to these values for $n=10,\ldots$ by means of simulations. Table 1 gives the exact values for $n=3,\ldots,7$. The results of the simulations for $n=10,\ldots$ are available at http://bioinfo.uib.es/~recerca/phylotrees/expectedcophdist/ .

n             3         4         5         6         7
E_Y(D_n^2)    2.66667   9.40741   21.1833   38.712    62.5562
E_U(D_n^2)    2.66667   10.56     26.2367   52.3023   91.4086

Table 1: Values of $E_Y(D_n^2)$ and $E_U(D_n^2)$ for $n=3,\ldots,7$. They agree with those given by our formulas.
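A brute-force check of this kind can be sketched in a few lines of Python. The code below is an independent illustration, not the authors' script: it assumes a nested-tuple encoding of trees, enumerates $\mathcal{T}_n$ by inserting the leaves one at a time (as in the growth process of the uniform model), and averages the squared distance over all ordered pairs of trees. It is only practical for small $n$.

```python
from fractions import Fraction
from math import factorial

def insert_leaf(tree, leaf):
    """Yield every tree obtained by hanging `leaf` above `tree` or on one of its edges."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)

def all_trees(n):
    """All (2n-3)!! phylogenetic trees on {1, ..., n}, as nested 2-tuples."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for old in trees for t in insert_leaf(old, leaf)]
    return trees

def cophenetic_vector(tree):
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            phi[(node, node)] = depth
            return [node]
        left = visit(node[0], depth + 1)
        right = visit(node[1], depth + 1)
        for i in left:
            for j in right:
                phi[(min(i, j), max(i, j))] = depth
        return left + right

    visit(tree, 0)
    return [phi[pair] for pair in sorted(phi)]

def yule_probability(tree):
    """P_Y(T) = 2^(n-1)/n! times the product of 1/(l_T(v) - 1) over internal nodes."""
    def visit(node):
        if not isinstance(node, tuple):
            return 1, Fraction(1)
        n1, p1 = visit(node[0])
        n2, p2 = visit(node[1])
        return n1 + n2, p1 * p2 / (n1 + n2 - 1)

    n, product = visit(tree)
    return Fraction(2 ** (n - 1), factorial(n)) * product

def expected_squared_distance(n, probability):
    """Exact E(D_n^2) by summing over all ordered pairs of trees."""
    trees = all_trees(n)
    vectors = [cophenetic_vector(t) for t in trees]
    probs = [probability(t) for t in trees]
    total = Fraction(0)
    for v, p in zip(vectors, probs):
        for w, q in zip(vectors, probs):
            total += p * q * sum((a - b) ** 2 for a, b in zip(v, w))
    return total

for n in range(3, 6):
    count = len(all_trees(n))
    e_yule = expected_squared_distance(n, yule_probability)
    e_unif = expected_squared_distance(n, lambda tree: Fraction(1, count))
    print(n, float(e_yule), float(e_unif))
```

For $n=3,4,5$ the printed values should reproduce the first three columns of Table 1 up to rounding.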

The formulas for $E_Y(D_n^2)$ and $E_U(D_n^2)$ grow at different orders: $E_Y(D_n^2)$ is in $\Theta(n^2)$, while $E_U(D_n^2)$ is in $\Theta(n^3)$. Therefore, they can be used to test the Yule and the uniform models as null stochastic models of evolution for collections of phylogenetic trees reconstructed by different methods. We have reported on a first experiment of this type, which reinforces the conclusion that "real world" phylogenetic trees (that is, those contained in TreeBASE) are not consistent with the Yule model of evolution. We plan to report in a future paper on more extensive tests on stochastic models of evolutionary processes, including Ford's $\alpha$-model [11] and Aldous' $\beta$-model [2].

Acknowledgements

The research reported in this paper has been partially supported by the Spanish government and the EU FEDER program, through projects MTM2009-07165 and TIN2011-15874-E.

References

[1] M. Abramowitz, I. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover (1964).
[2] D. Aldous, Probability distributions on cladograms. Random discrete structures, IMA Vol. Math. Appl. 76 (Springer, 1996), 1–18.
[3] M. G. B. Blum, O. François, Which random processes describe the Tree of Life? A large-scale study of phylogenetic tree imbalance. Syst. Biol. 55 (2006), 685–691.
[4] J. Brown, Probabilities of evolutionary trees. Syst. Biol. 43 (1994), 78–91.
[5] D. Bryant, M. Steel, Computing the distribution of a tree metric. IEEE/ACM Trans. Comp. Biol. Bioinf. 16 (2009), 420–426.
[6] G. Cardona, A. Mir, F. Rosselló, The expected value under the Yule model of the squared path-difference distance. Appl. Math. Let. 25 (2012), 2031–2036.
[7] G. Cardona, A. Mir, F. Rosselló, Exact formulas for the variance of several balance indices under the Yule model. To appear in J. Math. Bio. http://dx.doi.org/10.1007/s00285-012-0615-9
[8] G. Cardona, A. Mir, L. Rotger, F. Rosselló, D. Sánchez, Cophenetic metrics for phylogenetic trees, after Sokal and Rohlf. BMC Bioinformatics (2013) 14:3.
[9] L. L. Cavalli-Sforza, A. Edwards, Phylogenetic analysis. Models and estimation procedures. Am. J. Hum. Genet. 19 (1967), 233–257.
[10] J. Felsenstein, Inferring Phylogenies. Sinauer Associates Inc., 2004.
[11] D. Ford, Probabilities on cladograms: Introduction to the alpha model. arXiv:math/0511246 [math.PR] (2005).
[12] http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/03/07/02 .
[13] http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric2F1/03/03/01 .
[14] E. Harding, The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Prob. 3 (1971), 44–77.
[15] S. B. Heard, Patterns in Tree Balance among Cladistic, Phenetic, and Randomly Generated Phylogenetic Trees. Evolution 46 (1992), 1818–1826.
[16] M. Hendy, C. Little, D. Penny, Comparing Trees with Pendant Vertices Labelled. SIAM J. Applied Math. 44 (1984), 1054–1065.
[17] A. Mir, F. Rosselló, The mean value of the squared path-difference distance for rooted phylogenetic trees. J. Math. Anal. Appl. 371 (2010), 168–176.
[18] A. Mir, F. Rosselló, L. Rotger, A new balance index for phylogenetic trees. Math. Biosc. 241 (2013), 125–136.
[19] A. Mooers, S. B. Heard, Inferring evolutionary process from phylogenetic tree shape. Quart. Rev. Biol. 72 (1997), 31–54.
[20] V. Morell, TreeBASE: the roots of phylogeny. Science 273 (1996), 569–560.
[21] M. Petkovsek, H. Wilf, D. Zeilberger, A = B. AK Peters Ltd. (1996). Available online.
[22] D. E. Rosen, Vicariant Patterns and Historical Explanation in Biogeography. Syst. Biol. 27 (1978), 159–188.
[23] M. J. Sackin, "Good" and "bad" phenograms. Syst. Zool. 21 (1972), 225–226.
[24] R. Sokal, F. Rohlf, The Comparison of Dendrograms by Objective Methods. Taxon 11 (1962), 33–40.
[25] M. Steel, Distribution of the symmetric difference metric on phylogenetic trees. SIAM J. Discr. Math. 1 (1988), 541–551.
[26] M. Steel, A. McKenzie, Distributions of cherries for two models of trees. Math. Biosc. 164 (2000), 81–92.
[27] M. Steel, A. McKenzie, Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosc. 170 (2001), 91–112.
[28] M. A. Steel, D. Penny, Distributions of tree comparison metrics – some new results. Syst. Biol. 42 (2) (1993), 126–141.
[29] G. U. Yule, A mathematical theory of evolution based on the conclusions of Dr J. C. Willis. Phil. Trans. Royal Soc. (London) Series B 213 (1924), 21–87.

Appendix: Proof of Proposition 5

Proof of Proposition 5.(a)

For every $T\in\mathcal{T}_n$, let
$$\overline{\Phi}(T)=S(T)+\Phi(T)=\sum_{1\leq i\leq j\leq n}\varphi_T(i,j),$$
and let $\overline{\Phi}_n$ be the random variable that chooses a tree $T\in\mathcal{T}_n$ and computes $\overline{\Phi}(T)$. We have that
$$E_Y(\overline{\Phi}_n)=E_Y(S_n)+E_Y(\Phi_n)=n(n-1).$$
To compute $E_Y(\Phi^{(2)}_n)$, we shall use an argument similar to the one used in the proof of [6, Prop. 3]. Notice that
$$\begin{aligned}
E_Y(\Phi^{(2)}_n)&=\sum_{T\in\mathcal{T}_n}\Phi^{(2)}(T)\cdot p_Y(T)\\
&=\frac{1}{2}\sum_{k=1}^{n-1}\ \sum_{\substack{S_k\subsetneq\{1,\ldots,n\}\\ |S_k|=k}}\ \sum_{T_k\in\mathcal{T}(S_k)}\ \sum_{T'_{n-k}\in\mathcal{T}(S_k^c)}\Phi^{(2)}(T_k\,\widehat{\ }\,T'_{n-k})\cdot p_Y(T_k\,\widehat{\ }\,T'_{n-k}).
\end{aligned}$$
Now, on the one hand, we have the following easy lemma on $P_Y(T\,\widehat{\ }\,T')$: see [7, Lem. 1].

Lemma 6. Let $\emptyset\neq S_k\subsetneq\{1,\ldots,n\}$ with $|S_k|=k$, let $T_k\in\mathcal{T}(S_k)$ and $T'_{n-k}\in\mathcal{T}(S_k^c)$. Then,
$$P_Y(T_k\,\widehat{\ }\,T'_{n-k})=\frac{2}{(n-1)\binom{n}{k}}P_Y(T_k)P_Y(T'_{n-k}).$$

On the other hand, we have the following recursive expression for $\Phi^{(2)}(T\,\widehat{\ }\,T')$.

Lemma 7. Let $\emptyset\neq S_k\subsetneq\{1,\ldots,n\}$ with $|S_k|=k$, let $T_k\in\mathcal{T}(S_k)$ and $T'_{n-k}\in\mathcal{T}(S_k^c)$. Then
$$\Phi^{(2)}(T_k\,\widehat{\ }\,T'_{n-k})=\Phi^{(2)}(T_k)+\Phi^{(2)}(T'_{n-k})+2\overline{\Phi}(T_k)+2\overline{\Phi}(T'_{n-k})+\binom{k+1}{2}+\binom{n-k+1}{2}.$$

Proof. Let us assume, without any loss of generality, that $S_k=\{1,\ldots,k\}$ and $S_k^c=\{k+1,\ldots,n\}$. Then
$$\varphi_{T_k\,\widehat{\ }\,T'_{n-k}}(i,j)=\begin{cases}\varphi_{T_k}(i,j)+1&\text{if }1\leq i,j\leq k\\ \varphi_{T'_{n-k}}(i,j)+1&\text{if }k+1\leq i,j\leq n\\ 0&\text{otherwise}\end{cases}$$
and therefore
$$\begin{aligned}
\Phi^{(2)}(T_k\,\widehat{\ }\,T'_{n-k})&=\sum_{1\leq i\leq j\leq n}\varphi_{T_k\,\widehat{\ }\,T'_{n-k}}(i,j)^2\\
&=\sum_{1\leq i\leq j\leq k}\big(\varphi_{T_k}(i,j)+1\big)^2+\sum_{k+1\leq i\leq j\leq n}\big(\varphi_{T'_{n-k}}(i,j)+1\big)^2\\
&=\sum_{1\leq i\leq j\leq k}\big(\varphi_{T_k}(i,j)^2+2\varphi_{T_k}(i,j)+1\big)+\sum_{k+1\leq i\leq j\leq n}\big(\varphi_{T'_{n-k}}(i,j)^2+2\varphi_{T'_{n-k}}(i,j)+1\big)\\
&=\Phi^{(2)}(T_k)+2\overline{\Phi}(T_k)+\binom{k+1}{2}+\Phi^{(2)}(T'_{n-k})+2\overline{\Phi}(T'_{n-k})+\binom{n-k+1}{2}.
\end{aligned}$$

So, if we set $f(a,b)=\binom{a+1}{2}+\binom{b+1}{2}$, we have that
$$\begin{aligned}
E_Y(\Phi^{(2)}_n)&=\frac{1}{2}\sum_{k=1}^{n-1}\binom{n}{k}\sum_{T_k\in\mathcal{T}_k}\ \sum_{T'_{n-k}\in\mathcal{T}_{n-k}}\Big[\Phi^{(2)}(T_k)+\Phi^{(2)}(T'_{n-k})+2\big(\overline{\Phi}(T_k)+\overline{\Phi}(T'_{n-k})\big)+f(k,n-k)\Big]\frac{2}{(n-1)\binom{n}{k}}P_Y(T_k)P_Y(T'_{n-k})\\
&=\frac{1}{n-1}\sum_{k=1}^{n-1}\Big[\sum_{T_k}\sum_{T'_{n-k}}\Phi^{(2)}(T_k)P_Y(T_k)P_Y(T'_{n-k})+\sum_{T_k}\sum_{T'_{n-k}}\Phi^{(2)}(T'_{n-k})P_Y(T_k)P_Y(T'_{n-k})\\
&\qquad+2\sum_{T_k}\sum_{T'_{n-k}}\overline{\Phi}(T_k)P_Y(T_k)P_Y(T'_{n-k})+2\sum_{T_k}\sum_{T'_{n-k}}\overline{\Phi}(T'_{n-k})P_Y(T_k)P_Y(T'_{n-k})\\
&\qquad+\sum_{T_k}\sum_{T'_{n-k}}f(k,n-k)P_Y(T_k)P_Y(T'_{n-k})\Big]\\
&=\frac{1}{n-1}\sum_{k=1}^{n-1}\Big[\sum_{T_k}\Phi^{(2)}(T_k)P_Y(T_k)+\sum_{T'_{n-k}}\Phi^{(2)}(T'_{n-k})P_Y(T'_{n-k})+2\sum_{T_k}\overline{\Phi}(T_k)P_Y(T_k)+2\sum_{T'_{n-k}}\overline{\Phi}(T'_{n-k})P_Y(T'_{n-k})+f(k,n-k)\Big]\\
&=\frac{1}{n-1}\sum_{k=1}^{n-1}\Big[E_Y(\Phi^{(2)}_k)+E_Y(\Phi^{(2)}_{n-k})+2E_Y(\overline{\Phi}_k)+2E_Y(\overline{\Phi}_{n-k})+\binom{k+1}{2}+\binom{n-k+1}{2}\Big]\\
&=\frac{2}{n-1}\sum_{k=1}^{n-1}E_Y(\Phi^{(2)}_k)+\frac{4}{n-1}\sum_{k=1}^{n-1}E_Y(\overline{\Phi}_k)+\frac{1}{3}n(n+1).
\end{aligned}$$
In particular
$$E_Y(\Phi^{(2)}_{n-1})=\frac{2}{n-2}\sum_{k=1}^{n-2}E_Y(\Phi^{(2)}_k)+\frac{4}{n-2}\sum_{k=1}^{n-2}E_Y(\overline{\Phi}_k)+\frac{1}{3}n(n-1),$$
and therefore
$$\begin{aligned}
E_Y(\Phi^{(2)}_n)&=\frac{n-2}{n-1}\cdot\frac{2}{n-2}\sum_{k=1}^{n-2}E_Y(\Phi^{(2)}_k)+\frac{2}{n-1}E_Y(\Phi^{(2)}_{n-1})\\
&\qquad+\frac{n-2}{n-1}\cdot\frac{4}{n-2}\sum_{k=1}^{n-2}E_Y(\overline{\Phi}_k)+\frac{4}{n-1}E_Y(\overline{\Phi}_{n-1})+\frac{n-2}{n-1}\cdot\frac{1}{3}n(n-1)+n\\
&=\frac{n-2}{n-1}E_Y(\Phi^{(2)}_{n-1})+\frac{2}{n-1}E_Y(\Phi^{(2)}_{n-1})+\frac{4}{n-1}E_Y(\overline{\Phi}_{n-1})+n\\
&=\frac{n}{n-1}E_Y(\Phi^{(2)}_{n-1})+5n-8,
\end{aligned}$$
where the last step uses that $E_Y(\overline{\Phi}_{n-1})=(n-1)(n-2)$. Setting $x_n=E_Y(\Phi^{(2)}_n)/n$, this recurrence becomes
$$x_n=x_{n-1}+5-\frac{8}{n}$$
and the solution of this recursive equation with $x_1=E_Y(\Phi^{(2)}_1)=0$ is
$$x_n=\sum_{k=2}^n\Big(5-\frac{8}{k}\Big)=5(n-1)-8(H_n-1)=5n+3-8H_n,$$
from where we deduce that $E_Y(\Phi^{(2)}_n)=5n^2+3n-8nH_n$, as we claimed.

Proof of Proposition 5.(b)

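To compute $E_U(\Phi^{(2)}_n)$, we shall use an argument similar to the one used in [17]. For every $k=1,\ldots,n-1$, let
$$f_{k,n}=|\{T\in\mathcal{T}_n\mid\varphi_T(1,2)=k\}|=|\{T\in\mathcal{T}_n\mid\varphi_T(i,j)=k\}|\quad\text{for every }1\leq i<j\leq n,$$
$$d_{k,n}=|\{T\in\mathcal{T}_n\mid\delta_T(1)=k\}|=|\{T\in\mathcal{T}_n\mid\delta_T(i)=k\}|\quad\text{for every }1\leq i\leq n$$
(where $|X|$ denotes the cardinal of the set $X$).

Lemma 8. For every $n\geq 2$,
$$E_U(\Phi^{(2)}_n)=\frac{1}{(2n-3)!!}\Big(n\sum_{k=1}^{n-1}k^2\cdot d_{k,n}+\binom{n}{2}\sum_{k=1}^{n-2}k^2\cdot f_{k,n}\Big).$$

Proof. Under the uniform model,
$$E_U(\Phi^{(2)}_n)=\frac{\sum_{T\in\mathcal{T}_n}\Phi^{(2)}(T)}{(2n-3)!!},$$
where
$$\begin{aligned}
\sum_{T\in\mathcal{T}_n}\Phi^{(2)}(T)&=\sum_{T\in\mathcal{T}_n}\ \sum_{1\leq i\leq j\leq n}\varphi_T(i,j)^2=\sum_{1\leq i\leq j\leq n}\ \sum_{T\in\mathcal{T}_n}\varphi_T(i,j)^2\\
&=\sum_{1\leq i\leq n}\ \sum_{T\in\mathcal{T}_n}\delta_T(i)^2+\sum_{1\leq i<j\leq n}\ \sum_{T\in\mathcal{T}_n}\varphi_T(i,j)^2\\
&=n\sum_{k=1}^{n-1}k^2\cdot d_{k,n}+\binom{n}{2}\sum_{k=1}^{n-2}k^2\cdot f_{k,n}.
\end{aligned}$$

As for the numbers $d_{k,n}$, a tree $T\in\mathcal{T}_n$ with $\delta_T(1)=k$ is determined by the ordered $k$-forest on $\{2,\ldots,n\}$ formed by the subtrees hanging off the path from the root to the leaf 1, and therefore, by the count of ordered forests in [17, Lem. 1],
$$d_{k,n}=\frac{(2n-k-3)!\,k}{(n-k-1)!\,2^{n-k-1}}\qquad(1)$$
for every $k\geq 1$. The numbers $f_{k,n}$ are given by the following lemma.

Lemma 9.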

For every $n\geq 2$, $f_{0,n}=(2n-4)!!$ and
$$f_{k,n}=\frac{(2n-k-5)!\,k}{(2n-2k-4)!!}\cdot{}_3F_2\!\left(\begin{matrix}1,\ 2-n,\ k+2-n\\ \frac{k+5}{2}-n,\ \frac{k}{2}-n+3\end{matrix};1\right)$$
for every $k=1,\ldots,n-2$.

Proof. Let us start by proving $f_{0,n}=(2n-4)!!$ by induction on $n$. It is clear that $f_{0,2}=1=(2\cdot 2-4)!!$. Assume now that $f_{0,n-1}=(2(n-1)-4)!!$. Every tree $T$ with $n$ leaves such that $\varphi_T(1,2)=0$, that is, where $LCA_T(1,2)$ is the root, is obtained by taking a phylogenetic tree $T'$ with $n-1$ leaves such that $\varphi_{T'}(1,2)=0$ and adding a new pendant edge, ending in the leaf $n$, to any edge in $T'$. Then, since there are $f_{0,n-1}=(2n-6)!!$ trees $T'\in\mathcal{T}_{n-1}$ such that $\varphi_{T'}(1,2)=0$, and each one of them has $2(n-1)-2$ edges, we conclude that
$$f_{0,n}=(2n-6)!!\,(2n-4)=(2n-4)!!.$$

Now, to compute $f_{k,n}$ for $k\geq 1$, we shall study the structure of a tree $T\in\mathcal{T}_n$ such that $\varphi_T(1,2)=k$; to simplify the notations, let us denote by $x$ the node $LCA_T(1,2)$, which has depth $k$, and by $T_0$ the subtree of $T$ rooted at $x$. Then, on the one hand, $T_0$ is a phylogenetic tree on a subset $S_0\subseteq\{1,\ldots,n\}$ containing $1,2$, and since its root $x$ is the LCA of 1 and 2 in $T$, we have that $\varphi_{T_0}(1,2)=0$. On the other hand, there is a path $(r=v_1,v_2,v_3,\ldots,v_{k+1}=x)$ in $T$ from $r$ to $x$. For every $j=1,\ldots,k$, let $T_j$ be the subtree rooted at the child of $v_j$ other than $v_{j+1}$; see Fig. 3.

Figure 3: The structure of a tree $T$ with $\varphi_T(1,2)=k$.

So, the tree $T$ is determined by:

• A number $0\leq m\leq n-k-2$, so that $m+2$ will be the number of leaves of the phylogenetic tree $T_0$ rooted at $LCA_T(1,2)$.

• A subset $\{i_1,\ldots,i_m\}$ of $\{3,\ldots,n\}$. There are $\binom{n-2}{m}$ such subsets.

• A phylogenetic tree $T_0$ on $\{1,2,i_1,\ldots,i_m\}$ such that $\varphi_{T_0}(1,2)=0$. There are $f_{0,m+2}=(2m)!!$ such trees.

• An ordered $k$-forest, that is, an ordered sequence of phylogenetic trees $(T_1,T_2,\ldots,T_k)$ such that $\bigcup_{i=1}^kL(T_i)=\{1,\ldots,n\}-\{1,2,i_1,\ldots,i_m\}$. The number of such ordered $k$-forests is (see, for instance, [17, Lem. 1])
$$\frac{(2n-2m-k-5)!\,k}{(n-m-k-2)!\,2^{n-m-k-2}}.$$

Therefore, $f_{k,n}$ can be computed as
$$\begin{aligned}
f_{k,n}&=\sum_{m=0}^{n-k-2}(\text{number of ways of choosing }\{i_1,\ldots,i_m\})\cdot(\text{number of trees in }\mathcal{T}_{m+2}\text{ with }\varphi_T(1,2)=0)\cdot(\text{number of ordered }k\text{-forests on }n-m-2\text{ leaves})\\
&=\sum_{m=0}^{n-k-2}\binom{n-2}{m}\cdot(2m)!!\cdot\frac{(2n-2m-k-5)!\,k}{(n-m-k-2)!\,2^{n-m-k-2}}\\
&=k\sum_{m=0}^{n-k-2}\frac{(n-2)!\,m!\,2^m\,(2n-2m-k-5)!}{m!\,(n-m-2)!\,(n-m-k-2)!\,2^{n-m-k-2}}\\
&=\frac{(n-2)!\,k}{2^{n-k-2}}\sum_{m=0}^{n-k-2}\frac{2^{2m}(2n-2m-k-5)!}{(n-m-2)!\,(n-m-k-2)!}.
\end{aligned}$$
Now, if $(a)_m=a(a+1)\cdots(a+m-1)$ denotes the Pochhammer symbol, we have that
$$(1)_m=m!,\qquad(2-n)_m=(-1)^m\frac{(n-2)!}{(n-m-2)!},\qquad(k+2-n)_m=(-1)^m\frac{(n-k-2)!}{(n-k-m-2)!},$$
$$\Big(\frac{k+5}{2}-n\Big)_m=(-1)^m\frac{(2n-k-5)!!}{2^m(2n-k-2m-5)!!},\qquad\Big(\frac{k}{2}-n+3\Big)_m=(-1)^m\frac{(2n-k-6)!!}{2^m(2n-k-2m-6)!!},$$
and therefore
$$\begin{aligned}
{}_3F_2\!\left(\begin{matrix}1,\ 2-n,\ k+2-n\\ \frac{k+5}{2}-n,\ \frac{k}{2}-n+3\end{matrix};1\right)&=\sum_{m\geq 0}\frac{(1)_m\cdot(2-n)_m\cdot(k+2-n)_m}{(\frac{k+5}{2}-n)_m\cdot(\frac{k}{2}-n+3)_m\cdot m!}\\
&=\sum_{m\geq 0}\frac{m!\,(n-2)!\,(n-k-2)!\,2^m(2n-k-2m-5)!!\,2^m(2n-k-2m-6)!!}{(n-m-2)!\,(n-k-m-2)!\,(2n-k-5)!!\,(2n-k-6)!!\,m!}\\
&=\sum_{m=0}^{n-k-2}\frac{(n-2)!\,(n-k-2)!\,(2n-k-2m-5)!\,2^{2m}}{(n-m-2)!\,(n-k-m-2)!\,(2n-k-5)!}\\
&=\frac{(n-2)!\,(n-k-2)!}{(2n-k-5)!}\sum_{m=0}^{n-k-2}\frac{(2n-k-2m-5)!\,2^{2m}}{(n-m-2)!\,(n-k-m-2)!},
\end{aligned}$$
from where we obtain
$$\sum_{m=0}^{n-k-2}\frac{(2n-k-2m-5)!\,2^{2m}}{(n-m-2)!\,(n-k-m-2)!}=\frac{(2n-k-5)!}{(n-2)!\,(n-k-2)!}\cdot{}_3F_2\!\left(\begin{matrix}1,\ 2-n,\ k+2-n\\ \frac{k+5}{2}-n,\ \frac{k}{2}-n+3\end{matrix};1\right)$$
and hence
$$\begin{aligned}
f_{k,n}&=\frac{(n-2)!\,k}{2^{n-k-2}}\sum_{m=0}^{n-k-2}\frac{2^{2m}(2n-2m-k-5)!}{(n-m-2)!\,(n-m-k-2)!}\\
&=\frac{(n-2)!\,k}{2^{n-k-2}}\cdot\frac{(2n-k-5)!}{(n-2)!\,(n-k-2)!}\cdot{}_3F_2\!\left(\begin{matrix}1,\ 2-n,\ k+2-n\\ \frac{k+5}{2}-n,\ \frac{k}{2}-n+3\end{matrix};1\right)\\
&=\frac{(2n-k-5)!\,k}{(2n-2k-4)!!}\cdot{}_3F_2\!\left(\begin{matrix}1,\ 2-n,\ k+2-n\\ \frac{k+5}{2}-n,\ \frac{k}{2}-n+3\end{matrix};1\right),
\end{aligned}$$
as we claimed.

We must compute now the sums
$$\sum_{k=1}^{n-1}k^2\cdot d_{k,n},\qquad\sum_{k=1}^{n-2}k^2\cdot f_{k,n}.$$
To do that, we shall use the following auxiliary lemma.

Lemma 10. For every $n\geq 2$ and $m\geq 0$, let
$$U_{n,m}=\sum_{k=0}^{n-2}k^m\frac{(n+k-2)!}{k!\,2^k}.$$
Then,
$$\begin{aligned}
U_{n,0}&=(2n-4)!!\\
U_{n,1}&=(n-1)(2n-4)!!-(2n-3)!!\\
U_{n,2}&=(n^2-1)(2n-4)!!-(2n-1)(2n-3)!!\\
U_{n,3}&=(n^3+3n^2-3n-1)(2n-4)!!-(3n^2+n-1)(2n-3)!!
\end{aligned}$$

Proof.

The proof of these identities is standard, using well known equalities forhypergeometric functions and the lookup algorithm given in [21, p. 36]. Weshall prove in detail the identity for m = 2, and we leave the details of the restto the reader.Notice that U n, = n − (cid:88) k =0 k ( n + k − k !2 k = n − (cid:88) k =1 k ( n + k − k !2 k = n − (cid:88) k =0 ( k + 1) ( n + k − k + 1)!2 k +1 = ∞ (cid:88) k =0 ( k + 1) ( n + k − k + 1)!2 k +1 − ∞ (cid:88) k = n − ( k + 1) ( n + k − k + 1)!2 k +1 Set X n = ∞ (cid:88) k =0 ( k + 1) ( n + k − k + 1)!2 k +1 , Y n = ∞ (cid:88) k = n − ( k + 1) ( n + k − k + 1)!2 k +1 We compute now these two summands.17s to X n , X n = ( n − ∞ (cid:88) k =0 ( k + 1) ( n + k − n − k + 1)!2 k If we set t k = ( k + 1) ( n + k − n − k + 1)!2 k , we have that t k +1 t k = ( k + 2)( k + n )( k + 1) · lookup algorithm [21, p. 36], we have that X n = ( n − · F Å , n ã = ( n − · n · F Å n, −

11 ; − ã (using (15.3.4) in [1, p. 559])= ( n − n − (cid:88) k (cid:62) ( n ) k ( − k (1) k · ( − k k != ( n − n − (cid:16) ( n ) ( − (1) · ( −

0! + ( n ) ( − (1) · ( − (cid:17) = ( n − n − ( n + 1)As to Y n , Y n = ∞ (cid:88) k =0 ( k + n − (2 n + k − k + n − k + n − = ( n − (2 n − n − n − · ∞ (cid:88) k =0 ( k + n − (2 n + k − k + n − k · ( n − (2 n − n − If we take now t k = ( k + n − (2 n + k − k + n − k · ( n − (2 n − n − we have that t k +1 t k = ( n + k )(2 n + k − k + n − · lookup algorithm [21, p. 36], we have that Y n = ( n − (2 n − n − n − · F Å , n, n − n − , n − ã = ( n − (2 n − n − n − (cid:104) F Å n − , n − ã + 1 n − · F Å n − , n ; 12 ã (cid:105) (using [12]) . F Å n − , n − ã = 2 · F Å − n, n − − ã (using (15.3.4) in [1, p. 559])= 2 · n − Γ( n − n − (cid:104) Γ( n − n )Γ(1) + 2Γ( n − )Γ( ) (cid:105) (using [13])= 2 · n − ( n − n − (cid:104) ( n − · (2 n − n − (cid:105) = 2 n − ( n − n − F Å n − , n ; 12 ã = 2 · F Å , − nn ; − ã (using (15.3.4) in [1, p. 559])= 4 · Γ( n )2 − n ) Γ(2 n − (cid:16) Γ( n − )Γ( ) + Γ( n + )Γ( ) + 2Γ( n ) (cid:17) (using [13])= 2 n − ( n − n − (cid:16) (2 n − n − + (2 n − n − + 2 · ( n − (cid:17) = 2 n − ( n − n − n − n − n · ( n − Y n = ( n − (2 n − n − n − (cid:104) n − ( n − n − n − · n − ( n − n − n − n − n · ( n − (cid:105) = 2 n − ( n + 1)( n − n − U n, = X n − Y n = 2 n − ( n + 1)( n − − (2 n − n − n − − (2 n − n − Lemma 11.

For every $n\geq 2$,
$$\sum_{k=1}^{n-1}k^2d_{k,n}=(4n-1)(2n-3)!!-3(2n-2)!!.$$

Proof. By equation (1),
$$\begin{aligned}
\sum_{k=1}^{n-1}k^2d_{k,n}&=\sum_{k=1}^{n-1}k^3\frac{(2n-k-3)!}{(n-k-1)!\,2^{n-k-1}}=\sum_{k=0}^{n-2}(n-k-1)^3\frac{(n+k-2)!}{k!\,2^k}\\
&=(n-1)^3U_{n,0}-3(n-1)^2U_{n,1}+3(n-1)U_{n,2}-U_{n,3}\\
&=(n-1)^3(2n-4)!!-3(n-1)^2\big((n-1)(2n-4)!!-(2n-3)!!\big)\\
&\qquad+3(n-1)\big((n^2-1)(2n-4)!!-(2n-1)(2n-3)!!\big)\\
&\qquad-\big((n^3+3n^2-3n-1)(2n-4)!!-(3n^2+n-1)(2n-3)!!\big)\\
&=(4n-1)(2n-3)!!-6(n-1)(2n-4)!!,
\end{aligned}$$
which is the expression in the statement, because $6(n-1)(2n-4)!!=3(2n-2)!!$.

Lemma 12. For every $n\geq 2$,
$$\sum_{k=1}^{n-2}k^2f_{k,n}=\frac{1}{3}(4n+1)(2n-3)!!-\frac{3}{2}(2n-2)!!.$$

Proof.

To simplify the notations, set S n = n − (cid:80) k =1 k f k,n . As we have seen in theproof of Lemma 9, f k,n = ( n − k n − k − n − k − (cid:88) m =0 m (2 n − m − k − n − m − n − m − k − S n = ( n − n − n − (cid:88) k =1 k k n − k − (cid:88) m =0 m (2 n − k − m − n − k − n − k − m − n − n − n − (cid:88) k =1 k k n − k − (cid:88) m =0 n − k − − m ( k + 2 m − k + m )! m != ( n − n − n − (cid:88) k =1 k k (cid:32) k + n − k − (cid:88) m =1 m m Ç k + 2 m − k + m å (cid:33) = ( n − n − (cid:32) − n + 22 n − + n − (cid:88) k =1 k k n − k − (cid:88) m =1 m m Ç k + 2 m − k + m å (cid:33) Set now S (cid:48) n = n − (cid:88) k =1 k k n − k − (cid:88) m =1 m m Ç k + 2 m − k + m å = n − (cid:88) k =1 k k n − k − (cid:88) m =1 m m Ç k + 2 m − k + m å Since S (cid:48) = 0, we have that S (cid:48) n = n − (cid:88) p =3 ( S (cid:48) p +1 − S (cid:48) p )and S (cid:48) p +1 − S (cid:48) p = ( p − p + p − (cid:88) k =1 k k ( p − k − p − k − Ç p − k − p − å = ( p − p + 12 p − p − (cid:88) k =1 k (2 p − k − − k ( p − k − p − p − k − p − p + 12 p − ( p − p − (cid:88) k =1 k (2 p − k − − k ( p − k − p − p + 12 p − ( p − p − (cid:88) k =1 ( p − k − ( p + k − k − p +2 ( k + 1)!20 ( p − p + 12 p − ( p − p − (cid:88) k =2 ( p − k − ( p + k − k k != ( p − p + 12 p − ( p − (cid:104) p − (cid:88) k =0 ( p − k − ( p + k − k k ! − ( p − ( p − −

12 ( p − ( p − (cid:105) = − ( p − p − + 12 p − ( p − p − (cid:88) k =0 ( p − k − ( p + k − k k != − ( p − p − + 1(2 p − (cid:0) (4 p − p − − p − (cid:1) (by Lemma 11)= − ( p − p − + (4 p −

1) (2 p − p − − S (cid:48) n = n − (cid:88) p =3 (cid:16) (4 p −

1) (2 p − p − − ( p − p − − (cid:17) Now, applying

Gosper’s algorithm [21, p. 77] we have that n − (cid:88) p =3 (4 p −

1) (2 p − p − · n +1 (cid:16) n − n − Ç n − n − å − · n (cid:17) and then S (cid:48) n = 13 · n +1 (cid:16) n − n − Ç n − n − å − · n (cid:17) − · n − n + 2)2 n +1 − n − n + 22 n − − n + 1) + (4 n + 1)(2 n − n − . Finally, S n = ( n − n − Å − n + 22 n − + S (cid:48) n ã = − n − n − + (4 n + 1)(2 n − n + 1)(2 n − −

32 (2 n − . E U (Φ (2) n ) = 1(2 n − (cid:16) n n − (cid:88) k =1 k · d k,n + Ç n å n − (cid:88) k =1 k · f k,n (cid:17) = 1(2 n − (cid:104) n ((4 n − n − − n − Ç n å (cid:16)

13 (4 n + 1)(2 n − −

32 (2 n − (cid:17)(cid:105) = 16 n (4 n + 21 n − − n ( n + 3)4 · (2 n − n −−
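The closed formulas for $\sum_k k^2 d_{k,n}$ and $\sum_k k^2 f_{k,n}$ (namely $(4n-1)(2n-3)!!-3(2n-2)!!$ and $\frac13(4n+1)(2n-3)!!-\frac32(2n-2)!!$) can be sanity-checked for small $n$ by exhaustive enumeration. The following Python sketch is only an illustration under an assumed nested-tuple encoding of trees, with hypothetical function names; it counts, over all trees in $\mathcal{T}_n$, the squared depth of leaf 1 and the squared cophenetic value of the leaves 1 and 2.

```python
from fractions import Fraction

def insert_leaf(tree, leaf):
    """Yield every tree obtained by hanging `leaf` above `tree` or on one of its edges."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)

def all_trees(n):
    """All (2n-3)!! phylogenetic trees on {1, ..., n}, as nested 2-tuples."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for old in trees for t in insert_leaf(old, leaf)]
    return trees

def path_to(tree, leaf, prefix=()):
    """Child indices (0 or 1) leading from the root to `leaf`, or None if absent."""
    if not isinstance(tree, tuple):
        return prefix if tree == leaf else None
    for side in (0, 1):
        found = path_to(tree[side], leaf, prefix + (side,))
        if found is not None:
            return found
    return None

def common_prefix_length(a, b):
    """Depth of the LCA of two leaves, given their root-to-leaf paths."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length

def double_factorial(m):
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def sum_sq_depths(n):
    """Sum over k of k^2 * d_{k,n}: squared depth of leaf 1 over all trees."""
    return sum(len(path_to(t, 1)) ** 2 for t in all_trees(n))

def sum_sq_cophenetic(n):
    """Sum over k of k^2 * f_{k,n}: squared cophenetic value of leaves 1, 2."""
    return sum(common_prefix_length(path_to(t, 1), path_to(t, 2)) ** 2
               for t in all_trees(n))

for n in range(3, 6):
    odd, even = double_factorial(2 * n - 3), double_factorial(2 * n - 2)
    assert sum_sq_depths(n) == (4 * n - 1) * odd - 3 * even
    assert sum_sq_cophenetic(n) == Fraction(4 * n + 1, 3) * odd - Fraction(3, 2) * even
    print(n, sum_sq_depths(n), sum_sq_cophenetic(n))
```

Dividing $n$ times the first sum plus $\binom n2$ times the second by $(2n-3)!!$, as in Lemma 8, gives the value of $E_U(\Phi^{(2)}_n)$ in Proposition 5.(b); for instance, for $n=3$ one gets $(3\cdot 9+3\cdot 1)/3=10$.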