Isomorphism and Symmetries in Random Phylogenetic Trees

Abstract

The probability that two randomly selected phylogenetic trees of the same size are isomorphic is found to be asymptotic to a decreasing exponential modulated by a polynomial factor. The number of symmetrical nodes in a random phylogenetic tree of large size obeys a limiting Gaussian distribution, in the sense of both central and local limits. The probability that two random phylogenetic trees have the same number of symmetries asymptotically obeys an inverse square-root law. Precise estimates for these problems are obtained by methods of analytic combinatorics, involving bivariate generating functions, singularity analysis, and quasi-powers approximations.



MIKLÓS BÓNA AND PHILIPPE FLAJOLET

1. Introduction

Every high school student of every civilized part of the world is cognizant of the tree of species, also known as the "tree of life", in relation to Darwin's theory of evolution (Figure 1). We observe $n$ different species, and form a group with the closest pair (under some suitable proximity criterion), then repeat the process with the $n-1$ resulting groups, and so on. In the end, a phylogenetic tree, also known as a "cladogram", is obtained: such a tree has the $n$ species at its external nodes, also called "leaves"; it has $n-1$ internal binary nodes; its leaves carry distinct labels, which may be taken to be the integers of $[1, n]$. In classical combinatorial terms, the set of phylogenetic trees thus corresponds to the set $\mathcal{B}$ of rooted non-plane binary trees, which are labeled at their leaves.

Figure 1. Left: the representation of a phylogenetic tree in Darwin's own handwriting. Right: an illustration of the Tree of Life by Haeckel in The Evolution of Man, published in 1879. (Source: Entry "Tree of life", Wikipedia.)

Date: January 6, 2009.

We let $\mathcal{B}_n$ be the subset of $\mathcal{B}$ corresponding to trees of size $n$ (those with $n$ leaves) and denote by $b_n := |\mathcal{B}_n|$ the corresponding cardinality. Considering the listing of all unlabeled trees of sizes 1, 2, 3, 4,

(1) [Display: the unlabeled trees of sizes 1, 2, 3, 4. The two trees of size 4 are marked (L) and (R); L is the balanced tree, whose root has two two-leaf subtrees, and R is the tree whose root has a single leaf as one of its subtrees.]

the reader is invited to verify that $b_1 = 1$, $b_2 = 1$, $b_3 = 3$, and that $b_4 = 15$ is obtained by counting all possible labelings (3 and 12, respectively) of the two trees $L$, $R$ shown on the right of (1).

A general formula for the numbers $b_n$ is well known and straightforward to prove. Indeed, if we introduce the exponential generating function
\[ B(z) := \sum_{n \ge 1} b_n \frac{z^n}{n!}, \]
then the fact that each element of $\mathcal{B}_n$ is built up from its two subtrees implies that
\[ B(z) = z + \frac{1}{2} B(z)^2. \tag{2} \]
See the books by Stanley [20, pp. 13–15] or Flajolet–Sedgewick [9]. Thus $B(z)$ is the solution of the quadratic equation (2) that is a generating function. That is,
\[ B(z) = 1 - \sqrt{1 - 2z}. \]
This leads to the following exact formula for the numbers $b_n$.

Proposition 1.

The number of phylogenetic trees with $n$ labeled leaves is
\[ b_n = 1 \cdot 3 \cdot 5 \cdots (2n-3) \equiv (2n-3)!!. \]

There is a natural way to associate an unlabeled rooted binary non-plane tree to each element $t \in \mathcal{B}_n$, by simply removing all the labels of $t$. We will say that two elements $t, t' \in \mathcal{B}_n$ are isomorphic if removing their labels associates them to the same unlabeled tree. This leads to the following intriguing question.

Question. What is the probability $p_n$ that two phylogenetic trees, selected uniformly at random in $\mathcal{B}_n$, are isomorphic?

Note that, in our running example, the case of $n = 4$, we have
\[ p_4 = \left(\frac{1}{5}\right)^2 + \left(\frac{4}{5}\right)^2 = \frac{17}{25}. \]
Indeed, if we select two elements of $\mathcal{B}_4$ at random, there is a $(3/15)^2 = (1/5)^2$ chance that they both belong to the isomorphism class of $L$, and a $(12/15)^2 = (4/5)^2$ chance that they both belong to the isomorphism class of $R$, where $L$ and $R$ are the two trees of (1).
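Proposition 1 and the value $p_4 = 17/25$ can be checked by brute force for small sizes. The following Python sketch is only an illustration: the helper names `trees`, `shape`, and `iso_probability` are ours (they do not come from the paper), and the encoding of trees as nested pairs is one convenient choice among many. It lists every phylogenetic tree on a small label set, strips the labels to obtain a canonical unlabeled shape, and tallies the isomorphism classes.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def trees(labels):
    """List all phylogenetic trees on a tuple of distinct labels.

    A leaf is its label; an internal node is a pair (left, right).
    The first label always goes to the left subtree, so that each
    non-plane tree is produced exactly once.
    """
    if len(labels) == 1:
        return [labels[0]]
    first, rest = labels[0], labels[1:]
    result = []
    for k in range(len(rest)):  # k labels join `first`; at least one stays on the right
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(x for x in rest if x not in extra)
            for a in trees(left):
                for b in trees(right):
                    result.append((a, b))
    return result


def shape(tree):
    """Canonical unlabeled shape: () for a leaf, a sorted pair of shapes otherwise."""
    if not isinstance(tree, tuple):
        return ()
    return tuple(sorted((shape(tree[0]), shape(tree[1]))))


def iso_probability(n):
    """Probability that two uniform random phylogenetic trees of size n are isomorphic."""
    all_trees = trees(tuple(range(1, n + 1)))
    classes = Counter(shape(t) for t in all_trees)
    total = len(all_trees)
    return sum(Fraction(c, total) ** 2 for c in classes.values())


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, len(trees(tuple(range(1, n + 1)))), iso_probability(n))
```

The enumeration grows like $(2n-3)!!$, so this approach is only meant as a sanity check for very small $n$; the generating function argument of the next section is what makes larger sizes tractable.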

In this paper, we will use a multivariate generating function argument (Section 2) in conjunction with an analysis of singularities in the complex plane (Section 3) to answer the isomorphism question in Theorem 1. In Section 4, we will extend our analysis to distributional estimates of the number of symmetrical nodes in phylogenetic trees and in their unlabeled counterparts, known as Otter trees: see Theorems 2 and 3 for central and local limit laws, respectively. Such results in particular quantify the distribution of the log-size of the automorphism group of the random trees under consideration. In Section 5, we will work out an explicit estimate of the probability that two random trees have the same number of symmetries.

2. Isomorphism: a Generating Function Argument

2.1. Unlabeled Trees. Let $\mathcal{U}_n$ be the set of all unlabeled rooted binary non-plane trees with $n$ leaves, and let $u_n = |\mathcal{U}_n|$ be the corresponding count, with ordinary generating function
\[ U(z) := \sum_{n \ge 1} u_n z^n. \]
Such trees are often called Otter trees, since Otter was the first to study their enumeration [17]. We can build a generic element of $\mathcal{U}_n$ by taking a tree $t' \in \mathcal{U}_k$ and a tree $t'' \in \mathcal{U}_{n-k}$, and joining their roots to a new root. As the order of $t'$ and $t''$ is not significant, we get each tree $t \in \mathcal{U}_n$ twice this way, except that, if the two subtrees of $t$ are identical, we get $t$ only once. This leads to the functional equation [9, 12, 17, 18]:
\[ U(z) = z + \frac{1}{2}\left( U(z)^2 + U(z^2) \right). \tag{3} \]
The numbers $u_n$ are listed as sequence A001190 (the "Wedderburn–Etherington numbers") in the On-line Encyclopedia of Integer Sequences by Neil Sloane [19] and are the answers to various combinatorial enumeration problems. The first few values of the sequence $\{u_n\}_{n \ge 1}$ are 1, 1, 1, 2, 3, 6, 11, 23, 46, 98.

2.2. A multivariate generating function.

Let $t_1 \in \mathcal{B}_n$, and let $t_2 \in \mathcal{B}_n$. By Proposition 1, there are $((2n-3)!!)^2$ possibilities for the ordered pair $(t_1, t_2)$, where $t_1$ and $t_2$ do not have to be distinct. Our goal is to count such ordered pairs in which $t_1$ and $t_2$ are isomorphic. This number, divided by $((2n-3)!!)^2$, will then provide the probability $p_n$ that two randomly selected elements of $\mathcal{B}_n$ are isomorphic.

Let $t \in \mathcal{U}_n$. Then the number of different labelings of the leaves of $t$ is
\[ w(t) = \frac{n!}{2^{\operatorname{sym}(t)}}, \tag{4} \]
where $\operatorname{sym}(t)$ is the number of non-leaf nodes $v$ of $t$ such that the two subtrees stemming from $v$ are identical. For example, if $n = 4$, and $t$ is the tree $L$ of (1), then we have $\operatorname{sym}(t) = 3$, and indeed, $t$ has $n!/2^3 = 24/8 = 3$ labelings. If $t$ is the tree $R$ of (1), then we have $\operatorname{sym}(t) = 1$, and $t$ has $24/2 = 12$ labelings. The isomorphism classes of $\mathcal{B}_n$ thus correspond to the elements of $\mathcal{U}_n$. Set
\[ W_n = \sum_{t \in \mathcal{U}_n} \frac{1}{2^{\operatorname{sym}(t)}}. \tag{5} \]
As we have mentioned above, $n!/2^{\operatorname{sym}(t)}$ is the number of labeled trees in the isomorphism class corresponding to $t$. Summing this number over all isomorphism classes, we obtain the total number of trees in $\mathcal{B}_n$. That is,
\[ n!\, W_n = 1 \cdot 3 \cdots (2n-3). \]
For instance, $W_4 = \frac{1}{8} + \frac{1}{2} = \frac{5}{8}$, and $4! \cdot \frac{5}{8} = 15 = 5!!$.

Let
\[ F(z, u) = \sum_{t \in \mathcal{U}} u^{\operatorname{sym}(t)} z^{|t|} \tag{6} \]
be the bivariate generating function of Otter trees, with $z$ marking the number of leaves, and $u$ marking non-leaf nodes with two identical subtrees. In particular, $F(z, u) = z + uz^2 + uz^3 + (u + u^3) z^4 + {}$higher degree terms. The crucial observation about $F(z, u)$ is the following.

Lemma 1. The bivariate generating function $F(z, u)$ that enumerates Otter trees with respect to the number of symmetrical nodes satisfies the functional equation
\[ F(z, u) = z + \frac{1}{2} F(z, u)^2 + \left( u - \frac{1}{2} \right) F(z^2, u^2). \tag{7} \]

Proof. If a tree consists of more than one node, then it is built up from its two subtrees. As the order of the two subtrees is not significant, we will get each tree twice this way, except the trees whose two subtrees are identical. If $t_1$ and $t_2$ are the two subtrees of $t$ whose roots are the two children of the root of $t$, then
\[ \operatorname{sym}(t) = \begin{cases} \operatorname{sym}(t_1) + \operatorname{sym}(t_2), & \text{if } t_1 \text{ and } t_2 \text{ are not identical}, \\ \operatorname{sym}(t_1) + \operatorname{sym}(t_2) + 1, & \text{if } t_1 \text{ and } t_2 \text{ are identical}. \end{cases} \]
The first term of the right-hand side of (7) represents the tree on one node, the second term represents all other trees as explained in the preceding paragraph, and the third term is the correction term for trees in which the two subtrees of the root are identical: such a tree, built on a subtree $t_1$, has weight $u \cdot u^{2\operatorname{sym}(t_1)} z^{2|t_1|}$, of which $\frac{1}{2} F(z, u)^2$ accounts for only $\frac{1}{2} u^{2\operatorname{sym}(t_1)} z^{2|t_1|}$. □
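The case distinction used in the proof lends itself to a direct check on small trees. The sketch below is illustrative and makes its own choices: Otter trees are encoded as nested sorted tuples, and the names `otter_trees`, `sym`, `sym_polynomial`, and `double_factorial_odd` are hypothetical helpers, not notation from the paper. It enumerates $\mathcal{U}_n$ explicitly, computes $\operatorname{sym}(t)$ by the two-case recursion, and compares the outcome with the expansion of $F(z,u)$ and with the identity $n!\,W_n = (2n-3)!!$.

```python
from collections import Counter
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def otter_trees(n):
    """All unlabeled rooted binary non-plane trees with n leaves.

    A leaf is (); an internal node is a sorted pair of subtrees, so that
    both orders of the two subtrees give the same canonical object.
    """
    if n == 1:
        return ((),)
    found = set()
    for k in range(1, n // 2 + 1):
        for a in otter_trees(k):
            for b in otter_trees(n - k):
                found.add(tuple(sorted((a, b))))
    return tuple(sorted(found))


def sym(tree):
    """Number of internal nodes whose two subtrees are identical."""
    if tree == ():
        return 0
    a, b = tree
    return sym(a) + sym(b) + (1 if a == b else 0)


def sym_polynomial(n):
    """Coefficient of z^n in F(z, u), as a map {exponent of u: number of trees}."""
    return dict(Counter(sym(t) for t in otter_trees(n)))


def double_factorial_odd(n):
    """(2n-3)!! = 1*3*5*...*(2n-3), the empty product being 1."""
    result = 1
    for odd in range(1, 2 * n - 2, 2):
        result *= odd
    return result


if __name__ == "__main__":
    for n in range(1, 9):
        labeled = sum(factorial(n) // 2 ** sym(t) for t in otter_trees(n))
        print(n, sym_polynomial(n), labeled, double_factorial_odd(n))
```

For $n = 4$ the tally is one tree with one symmetrical node and one tree with three, in agreement with the term $(u + u^3) z^4$.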

Note that various specializations of $F(z, u)$ have a known combinatorial meaning. Indeed,

(i) If $u = 1$, then $F(z, 1) = \sum_{t \in \mathcal{U}} z^{|t|}$ is simply the ordinary generating function $U(z)$ of Otter trees with respect to their number of leaves. We have discussed this generating function in Subsection 2.1, and mentioned that its coefficients $u_n$ are the Wedderburn–Etherington numbers, which form sequence A001190 in [19].

(ii) If $u = 2$, then $F(z, 2) = \sum_{t \in \mathcal{U}} z^{|t|} 2^{\operatorname{sym}(t)}$ is the ordinary generating function of the total number of automorphisms in all Otter trees. The coefficients constitute sequence A003609 in [19]. Interested readers may consult McKeon's studies [14, 15] for details. The first few elements of the sequence are 1, 2, 2, 10, 14, 42, 90, 354.

(iii) If $u = 1/2$, then
\[ F\left(z, \frac{1}{2}\right) = \sum_{t \in \mathcal{U}} z^{|t|} 2^{-\operatorname{sym}(t)} = \sum_{n} W_n z^n = \sum_{n} (2n-3)!! \frac{z^n}{n!} \]
is the exponential generating function $B(z)$ of labeled trees in disguise. We have discussed this generating function in the Introduction. The numbers $(2n-3)!!$ are those of Proposition 1.

The specialization needed for the isomorphism problem is $u = 1/4$, as the next lemma shows. Let $[z^n] g(z)$ denote the coefficient of $z^n$ in the power series $g(z)$.
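Read coefficient by coefficient, Lemma 1 gives a recurrence for the polynomials $[z^n] F(z,u)$, and the three specializations above can then be checked numerically. The following sketch is illustrative only: the function names are hypothetical, polynomials in $u$ are stored as plain coefficient lists, and (7) is rewritten in an integer-only form that separates unordered pairs of distinct subtrees from pairs of identical ones.

```python
from fractions import Fraction
from math import factorial, prod


def poly_add(p, q):
    """Sum of two polynomials given as coefficient lists."""
    size = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(size)]


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    product = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            product[i + j] += a * b
    return product


def sym_counts(max_size):
    """counts[n][k] = number of Otter trees with n leaves and k symmetrical nodes."""
    counts = [[0], [1]]  # index 0 is unused; the single leaf has sym = 0
    for n in range(2, max_size + 1):
        total = [0]
        for k in range(1, (n + 1) // 2):  # subtree sizes k < n - k
            total = poly_add(total, poly_mul(counts[k], counts[n - k]))
        if n % 2 == 0:  # both subtrees have size n / 2
            half = counts[n // 2]
            square = poly_mul(half, half)
            doubled = [0] * len(square)  # the polynomial half(u^2)
            for i, a in enumerate(half):
                doubled[2 * i] = a
            distinct = [(s - d) // 2 for s, d in zip(square, doubled)]
            identical = [0] + doubled  # u * half(u^2): one extra symmetrical node
            total = poly_add(total, poly_add(distinct, identical))
        counts.append(total)
    return counts


def evaluate(poly, u):
    """Value of the polynomial at u (works for ints and Fractions)."""
    return sum(c * u ** k for k, c in enumerate(poly))


if __name__ == "__main__":
    counts = sym_counts(10)
    print([evaluate(counts[n], 1) for n in range(1, 11)])  # specialization u = 1
    print([evaluate(counts[n], 2) for n in range(1, 9)])   # specialization u = 2
    for n in range(1, 11):                                  # specialization u = 1/2
        lhs = factorial(n) * evaluate(counts[n], Fraction(1, 2))
        print(n, lhs, prod(range(1, 2 * n - 2, 2)))
```

Working with integer coefficient lists keeps every intermediate quantity exact; rational values of $u$ such as $1/2$ or $1/4$ are only substituted at the end.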

Lemma 2. For all positive integers $n \ge 1$, the probability $p_n$ that two phylogenetic trees of size $n$ are isomorphic satisfies
\[ p_n = \left( \frac{n!}{(2n-3)!!} \right)^2 \cdot [z^n] F\left(z, \frac{1}{4}\right). \]

Proof. Consider the sample space whose elements are the elements of $\mathcal{U}_n$, and in which the probability of $t \in \mathcal{U}_n$ is
\[ \kappa(t) := \frac{n!}{2^{\operatorname{sym}(t)}} \cdot \frac{1}{(2n-3)!!} \equiv \frac{w(t)}{(2n-3)!!}. \tag{8} \]
(For probabilists, $\kappa$ is the image on $\mathcal{U}_n$ of the uniform distribution of $\mathcal{B}_n$.) For instance, if $n = 4$, then this space has two elements (the two trees $L$, $R$ of (1)); one has probability $1/5$, and the other has probability $4/5$. If we select two elements of this space at random, the probability that they coincide is
\[ p_n = \sum_{t \in \mathcal{U}_n} \kappa(t)^2 = \frac{1}{((2n-3)!!)^2} \sum_{t \in \mathcal{U}_n} w(t)^2 = \frac{n!^2}{((2n-3)!!)^2} \sum_{t \in \mathcal{U}_n} \left( \frac{1}{4} \right)^{\operatorname{sym}(t)}. \]
Our claim now follows since $\sum_{t \in \mathcal{U}_n} \left( \frac{1}{4} \right)^{\operatorname{sym}(t)}$ is indeed the coefficient of $z^n$ in $F(z, 1/4)$. □

3. Isomorphism: Singularity Analysis

By Lemma 2, our goal is now to find the coefficient of $z^n$ in the one-variable generating function
\[ f(z) := F(z, 1/4). \]
Lemma 1 shows that the formal power series $F(z, u)$ is the solution of the quadratic equation (7) that satisfies $F(0, 0) = 0$. That is,
\[ F(z, u) = 1 - \sqrt{1 - 2z - (2u - 1) F(z^2, u^2)}. \tag{9} \]
Iterated applications of (9), starting with $u = 1/4$, show that
\[ f(z) \equiv F(z, 1/4) = 1 - \sqrt{1 - 2z + \frac{1}{2} F\left(z^2, \frac{1}{16}\right)} = 1 - \sqrt{\frac{3}{2} - 2z - \frac{1}{2} \sqrt{1 - 2z^2 + \frac{7}{8} F\left(z^4, \frac{1}{256}\right)}} = \cdots. \]
In the limit, there results that $f(z)$ admits a "continued square-root" expansion
\[ f(z) = 1 - \sqrt{\frac{3}{2} - 2z - \frac{1}{2} \sqrt{\frac{15}{8} - 2z^2 - \frac{7}{8} \sqrt{\frac{255}{128} - 2z^4 - \frac{127}{128} \sqrt{\cdots}}}}, \]
out of which initial elements of the sequence $(p_n)_{n \ge 1}$ are easily determined:
\[ 1, \ 1, \ 1, \ \frac{17}{25}, \ \cdots. \]
In order to compute the growth rate of the coefficients of $f(z)$, we will analyze the dominant singularity (or singularities) of this power series. The interested reader is invited to consult the book Analytic Combinatorics by Flajolet and Sedgewick [9] for more information on the notions and techniques that we are going to use. Part of the difficulty of the problem is that the functional relation (9) has the character of an inclusion–exclusion formula: $F(z, u)$ does not depend positively on $F(z^2, u^2)$, as soon as $u \le 1/2$, which requires suitably crafted arguments, in contrast to the (simpler) asymptotic analysis of $u_n = [z^n] F(z, 1)$. We examine in turn the location, type, and number of the dominant singularities of $f(z)$, that is, singularities that have smallest absolute value (modulus).

3.1. Location.

First, it is essential for our analytic arguments to establish that $f(z)$ has a radius of convergence strictly less than 1. Our starting point parallels Lemmas 1–2 of McKeon [15], but we need a specific argument for the upper bound.

Lemma 3.

Let $\rho$ be the largest real number such that $f(z)$ is analytic in the interior of a disc centered at the origin that has radius $\rho$. The following inequalities hold:
\[ 0.4 < \rho < 0.625. \]

Proof. (i) Lower bound. Note that $f(z)$ is convergent in some disc of radius at least 0.4. Indeed, the coefficients of $f(z) = F(z, 1/4)$ are at most as large as the coefficients of $F(z, 1)$, the generating function $U(z)$ of Otter trees, and the latter is known to be convergent in a disc of radius $0.4026\cdots$: see Otter's original paper [17] and Finch's book [6]; recall that $F(z, 1) = U(z)$.

(ii) Upper bound. For fixed $n$, let $a_1, a_2, \ldots, a_{u_n}$ be the numbers of our labeled trees whose underlying unlabeled tree is the first, second, ..., last Otter tree of size $n$. Then the relation
\[ p_n \equiv \frac{a_1^2 + a_2^2 + \cdots + a_{u_n}^2}{(a_1 + a_2 + \cdots + a_{u_n})^2} \ge \frac{1}{u_n} \tag{10} \]
results from the Cauchy–Schwarz inequality. (In words: the probability of coincidence of two elements from a finite probability space is smallest when the distribution is the uniform one.)

As we mentioned, it is proved in [17] that the generating function $\sum_n u_n x^n$ converges in a disc of radius at least 0.4. Therefore, the series $\sum_n \frac{1}{u_n} x^n$ converges in a disc of radius at most $1/0.4 = 2.5$, and by (10), this implies that $\sum_n p_n x^n$ converges in a disc of radius less than 2.5. Now Lemma 2 shows that $[z^n] F(z, 1/4) = \left( (2n-3)!!/n! \right)^2 p_n$; since $(2n-3)!!/n!$ grows like $2^n$ up to a polynomial factor, the coefficients of $F(z, 1/4)$ are, up to polynomial factors, $4^n$ times larger than the coefficients of $\sum_n p_n x^n$. It follows that $\rho < 2.5/4 = 0.625$. □

A well-known theorem of Pringsheim states that if a function $g(z)$ is representable around the origin by a series expansion that has non-negative coefficients and radius of convergence $R$, then the real number $R$ is actually a singularity of $g(z)$. Applying this theorem to $f(z)$, we see that the positive real number $\rho$ must be a singularity of $f(z)$.

3.2. Type.

Recall that a function $g(z)$ analytic in a domain $\Omega$ is said to have a square-root singularity at a boundary point $\alpha$ if, for some function $H$ analytic at 0, the representation $g(z) = H(\sqrt{z - \alpha})$ holds in the intersection of $\Omega$ and a neighborhood of $\alpha$. (In particular, if $g(z) = \sqrt{\gamma(z)}$ with $\gamma$ analytic at $\alpha$, then $g(z)$ has a square-root singularity at $\alpha$ whenever $\gamma(\alpha) = 0$ and $\gamma'(\alpha) \ne 0$.)

Lemma 4. All dominant singularities (of modulus $\rho$) of $f(z)$ are isolated and are of the square-root type.

Proof. In order to see this, note that $\rho < 1$ implies $\rho < \sqrt{\rho}$. Therefore, the power series $F(z^2, 1/4)$ (that has radius of convergence $\sqrt{\rho}$) is analytic in the interior of the disc of radius $\sqrt{\rho}$, and so is the power series $F(z^2, 1/16)$ since its coefficients are smaller than the corresponding coefficients of $F(z^2, 1/4)$. Consequently, the dominant singularities of
\[ f(z) = F\left(z, \frac{1}{4}\right) = 1 - \sqrt{1 - 2z + \frac{1}{2} F\left(z^2, \frac{1}{16}\right)} \tag{11} \]
are of the square-root type: they are to be found amongst the roots of the expression under the square-root sign in (9), that is, amongst the zeros of $1 - 2z + \frac{1}{2} F(z^2, 1/16)$ that have modulus $\rho$. As $1 - 2z + \frac{1}{2} F(z^2, 1/16)$ is analytic in the disc centered at the origin with radius at least $\sqrt{\rho} > \rho$, it has isolated roots. Hence $f(z)$ has only a finite number of singularities on the circle $|z| = \rho$, and each is of square-root type. □

The argument of the proof (see (11)) also shows that $\rho$ is determined as the smallest positive root of the equation
\[ 1 - 2\rho + \frac{1}{2} F\left(\rho^2, \frac{1}{16}\right) = 0. \tag{12} \]

3.3. Number.

In order to complete our characterization of the dominant singular structure of $f(z)$, we need the following statement.

Lemma 5.

The point $\rho$ is the only singularity of smallest modulus of $f(z)$.

Proof. The argument is somewhat indirect and it proceeds in two stages.

First we show that, as a power series, $f(z)$ converges for each $z$ with $|z| = \rho$. To this purpose, we need to recall briefly some principles of singularity analysis, as expounded in [9, Ch. VI]. Let $g(z)$ be a function analytic in $|z| < R$ with finitely many singularities at the set $\{\alpha_j\}$ on the circle $|z| = R$; assume in addition that $g(z)$ has a square-root singularity at each $\alpha_j$ in the sense of Subsection 3.2. Then, one has $[z^n] g(z) = O\left( R^{-n} n^{-3/2} \right)$. (This corresponds to the $O$-transfer theorem of [9, Th. VI.3, p. 390], with amendments for the case of multiple singularities to be found in [9, §VI.5]; see also (14) below.) It follows from this general estimate and Lemma 4 that
\[ [z^n] f(z) = O\left( \rho^{-n} n^{-3/2} \right). \]
Therefore, the series expansion of $f(z)$ converges absolutely as long as $|z| \le \rho$, and, in particular, it converges for all $z$ with modulus $\rho$.

Now, we are in a position to prove that $f(z)$ has no singularity other than $\rho$ on the circle $|z| = \rho$. Let us assume the contrary; that is, there is a complex number $z_0 \ne \rho$ such that $|z_0| = \rho$ and $z_0$ is a singularity of $f(z) \equiv F(z, 1/4)$. Then we must have $f(z_0) \equiv F(z_0, 1/4) = 1$, since the expression under the square-root sign in (9) is equal to 0, corresponding to a singularity of square-root type; for the same reason, $f(\rho) = 1$. On the other hand, one has a priori $|f(z_0)| \le f(\rho)$, as a consequence of the triangle inequality and the fact, proved above, that $f(z)$ converges on $|z| = \rho$. Now it follows from the strong triangle inequality that the equality $f(z_0) = f(\rho)$ is only possible if all the terms $f_n z_0^n$ that compose the (convergent) series expansion of $f(z_0)$ are positive real. (Here $f_m = [z^m] f(z)$.) However, since, in particular, $f_1 = 1$ is nonzero, this implies that $z_0 = \rho$, and a contradiction has been reached. (This part of the argument is also closely related to the Daffodil Lemma of [9, p. 266].) □
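The characterization (12) of $\rho$ can be explored numerically. The sketch below is a minimal illustration, not the authors' procedure: it evaluates $F(z,u)$ in floating point by unrolling the square-root form (9) a fixed number of times (the truncation depth and the function names `F`, `g`, and `smallest_root` are our assumptions), and then brackets the root of (12) by bisection on the interval supplied by Lemma 3. Since $F \ge 0$ on the relevant range, the left-hand side of (12) is positive for every $x \le 1/2$, so a sign change inside the bracket locates the smallest positive root provided it is the only one there.

```python
from math import sqrt


def F(z, u, depth=12):
    """Approximate F(z, u) by unrolling the square-root form (9) `depth` times.

    Intended for 0 <= z < 1/2 and 0 <= u <= 1/2. After `depth` squarings the
    remaining argument is tiny, and F(z, u) = z + O(z^2) is replaced by z.
    """
    if depth == 0:
        return z
    inner = F(z * z, u * u, depth - 1)
    return 1.0 - sqrt(1.0 - 2.0 * z - (2.0 * u - 1.0) * inner)


def g(x):
    """Left-hand side of equation (12)."""
    return 1.0 - 2.0 * x + 0.5 * F(x * x, 1.0 / 16.0)


def smallest_root(lo=0.4, hi=0.625, steps=60):
    """Bisection on the bracket of Lemma 3, where g(lo) > 0 > g(hi)."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    rho = smallest_root()
    print("rho   =", rho)
    print("4*rho =", 4.0 * rho)
```

Because the argument $z$ is squared at each level, a small fixed depth already exhausts double precision; under these assumptions one should expect a root close to 0.59, inside the bounds of Lemma 3.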

3.4. The asymptotics of $p_n$. As a result of Lemmas 3–5, the function $f(z)$ has only one dominant singularity, and that singularity $\rho$ is of the square-root type. One then has, for a family of constants $h_k$, the local singular expansion:
\[ f(z) = 1 + \sum_{k=0}^{\infty} h_k (1 - z/\rho)^{k + 1/2}, \tag{13} \]
which is valid for $z$ near $\rho$. The conditions of the singularity analysis process as summarized in [9, §VI.4] are then satisfied. Consequently, each singular element of (13) relative to $f(z)$ can be translated into a matching asymptotic term relative to $[z^n] f(z)$, according to the rule
\[ \sigma(z) = (1 - z/\rho)^{\theta} \quad \longrightarrow \quad [z^n] \sigma(z) = \rho^{-n} \binom{n - \theta - 1}{n} \sim \rho^{-n} \frac{n^{-\theta - 1}}{\Gamma(-\theta)}. \tag{14} \]
In particular, we have $[z^n] f(z) \sim C \cdot \rho^{-n} n^{-3/2}$, for some $C$.

Hence Lemma 2, combined with Lemmas 4–5 and the routine asymptotics of $n!/(2n-3)!!$, yields the following.

Theorem 1. The probability that two phylogenetic trees of size $n$ are isomorphic admits a complete asymptotic expansion
\[ p_n \sim a \cdot b^{-n} \cdot n^{3/2} \left( 1 + \sum_{k \ge 1} \frac{c_k}{n^k} \right), \tag{15} \]
where $a$, $b = 4\rho$, and the $c_k$ are computable constants, with values $a = 3.\cdots$, $b = 2.\cdots$, and $c_1$ approximately equal to $-\cdots$.

The function $F(z, u)$ can be determined numerically to great accuracy (by means of the recursion corresponding to the functional equation (9)). So, the value $\rho = 0.\cdots$ is obtained as the smallest positive root of (12); the constant $a$ then similarly results from an evaluation of $F'\left(\rho^2, \frac{1}{16}\right)$; the constant $c_1$, which could in principle be computed in the same manner, was, in our experiments, simply estimated from the values of $p_n$ for small $n$. The formula (15), truncated after its $c_1/n$ term, then appears to approximate $p_n$ with a relative accuracy better than $10^{-\cdots}$ for $n \ge \cdots$, $10^{-\cdots}$ for $n \ge 38$, and $10^{-\cdots}$ for $n \ge \cdots$.

4. Symmetrical Nodes and Automorphisms

In the course of our investigations on analytic properties of the bivariate generating function $F(z, u)$, we came up with a few additional estimates, which improve on those of McKeon [15]. In essence, what is at stake is a perturbative analysis of $F(z, u)$ and its associated singular expansions, for various values of $u$, in a way that refines the developments of the previous section. We offer here a succinct account: details can be easily supplemented by referring to Chapter IX of the book Analytic Combinatorics [9].

Theorem 2. (i) Let $X_n$ be the random variable representing the number of symmetrical nodes in a random Otter tree of $\mathcal{U}_n$. Then, $X_n$ satisfies a limit law of Gaussian type,
\[ \forall x \in \mathbb{R}: \quad \lim_{n \to \infty} \mathbb{P}\left( X_n \le \mu n + \sigma x \sqrt{n} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-w^2/2} \, dw, \]
for some positive constants $\mu$ and $\sigma$. Numerically, $\mu = 0.\cdots$.

(ii) Let $Y_n$ be the random variable representing the number of symmetrical nodes in a random phylogenetic tree of $\mathcal{B}_n$. Then, $Y_n$ satisfies a limit law of Gaussian type,
\[ \forall x \in \mathbb{R}: \quad \lim_{n \to \infty} \mathbb{P}\left( Y_n \le \widehat{\mu} n + \widehat{\sigma} x \sqrt{n} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-w^2/2} \, dw, \]
for some positive constants $\widehat{\mu}$ and $\widehat{\sigma}$. Numerically, $\widehat{\mu} = 0.\cdots$.

Figure 2. Histograms of the distribution of the number of symmetrical nodes in trees of size 100, compared to a matching Gaussian. Left: Otter trees of $\mathcal{U}_{100}$. Right: phylogenetic trees of $\mathcal{B}_{100}$.

Proof (Sketch). (i) The case of Otter trees ($X_n$, $\mathcal{U}_n$). In accordance with general principles [9, Ch. IX], we need to estimate the generating polynomial
\[ \varphi_n(u) := [z^n] F(z, u), \tag{16} \]
when $u$ is close to 1, with $F(z, u)$ as specified by (6) and (7). For $u$ in a small enough complex neighborhood $\Omega$ of 1, the radius of convergence of $F(z^2, u^2)$ is larger than some $\rho_1 > \rho$, where $\rho \approx 0.4026$ now denotes the radius of convergence of $F(z, 1) = U(z)$. There is then a solution $\rho(u)$ to the analytic equation
\[ 1 - 2\rho(u) - (2u - 1) F\left( \rho(u)^2, u^2 \right) = 0 \tag{17} \]
(compare with (12)), such that $\rho(1) = \rho$ is the dominant singularity of the generating function $F(z, 1)$ of Otter trees. By the analytic version of the implicit function theorem (equivalently, by the Weierstrass Preparation Theorem), this function $\rho(u)$ depends analytically on $u$, for $u$ near 1.

In addition, by (9), the function $F(z, u)$ has a singularity of the square-root type at $\rho(u)$. Also, for $u \in \Omega$ and $\Omega$ taken small enough, the triangle inequality combined with the previously established properties of $F(z, 1)$ may be used to verify that there are no other singularities of $z \mapsto F(z, u)$ on $|z| = |\rho(u)|$. There results, from singularity analysis and the uniformity of the process [9, p. 668], the asymptotic estimate
\[ \varphi_n(u) = c(u) \rho(u)^{-n} n^{-3/2} (1 + o(1)), \qquad n \to +\infty, \tag{18} \]
uniformly with respect to $u \in \Omega$, for some $c(u)$ that is analytic at $u = 1$. Then, the probability generating function of $X_n$, which equals $\varphi_n(u)/\varphi_n(1)$, satisfies what is known as a "quasi-powers approximation". That is, it resembles (analytically) the probability generating function of a sum of independent random variables,
\[ \frac{\varphi_n(u)}{\varphi_n(1)} = \frac{c(u)}{c(1)} \left( \frac{\rho(1)}{\rho(u)} \right)^n [1 + \varepsilon_n(u)], \tag{19} \]
where $\sup_{u \in \Omega} |\varepsilon_n(u)|$ tends to 0 as $n \to \infty$. The Quasi-powers Theorem (see [9, §IX.5] and [13]) precisely applies to such approximations by quasi-powers and implies that the distribution of $X_n$ is asymptotically normal.

(ii) The case of phylogenetic trees ($Y_n$, $\mathcal{B}_n$). The starting point is a simple combinatorial property of $\varphi_n(u)$, as defined in (16):
\[ \varphi_n(u/2) = \frac{1}{n!} \sum_{t \in \mathcal{U}_n} \frac{n!}{2^{\operatorname{sym}(t)}} u^{\operatorname{sym}(t)} = \frac{1}{n!} \sum_{t \in \mathcal{B}_n} u^{\operatorname{sym}(t)}. \tag{20} \]
(The first form results from the definition (6) of $F(z, u)$; the second form relies on the expression (4) of the number of different labellings of an Otter tree that give rise to a phylogenetic tree.) Thus, $\varphi_n$ taken with an argument near $1/2$ describes the distribution of the number of symmetrical nodes in $\mathcal{B}_n$.

From this point on, the analysis of symmetries in phylogenetic trees is entirely similar to that of Otter trees. For $u$ in a small complex neighborhood $\widehat{\Omega}$ of $1/2$, the function $z \mapsto F(z, u)$ has a dominant singularity $\widehat{\rho}(u)$ that is an analytic solution of (17) and is such that $\widehat{\rho}(1/2) = 1/2$, the radius of convergence of $B(z) \equiv F(z, 1/2)$. An estimate of the form (18) then holds, with $u \in \widehat{\Omega}$ now near $1/2$. In particular,
\[ \frac{\varphi_n(u)}{\varphi_n(1/2)} = \frac{\widehat{c}(u)}{\widehat{c}(1/2)} \left( \frac{\widehat{\rho}(1/2)}{\widehat{\rho}(u)} \right)^n [1 + \widehat{\varepsilon}_n(u)], \tag{21} \]
where $\widehat{\varepsilon}_n(u) \to 0$ uniformly. By the Quasi-powers Theorem (applied with $u := v/2$, with $v$ near 1), the distribution of $Y_n$ is asymptotically normal. □

Figure 2 shows that the fit with a Gaussian is quite good, even for comparatively low sizes ($n = 100$). Phrased differently, the statement of Theorem 2 means that the logarithm of the order $2^{\operatorname{sym}(t)}$ of the automorphism group of a random tree $t$ (either in $\mathcal{U}_n$ or in $\mathcal{B}_n$) is normally distributed. (The situation is loosely evocative of the fact, known as the Erdős–Turán Theorem, that the logarithm of the order of a random permutation of size $n$ is normally distributed; see, e.g., [5, 11, 16].) In the case of $\mathcal{U}_n$, the expectation of the cardinality of this group has been determined by McKeon [15] to grow roughly as $1.\cdots^n$. In the case of phylogenetic trees ($\mathcal{B}_n$), we find an expected growth of the rough form $1.\cdots^n$, where the exponential rate $1.\cdots$ is exactly $1/(2\rho)$, with $\rho$, still, the radius of convergence of $U(z) \equiv F(z, 1)$. (Indeed, by (4), each isomorphism class $t$ contributes $w(t) \, 2^{\operatorname{sym}(t)} = n!$ to the sum of $2^{\operatorname{sym}}$ over $\mathcal{B}_n$.)

As a matter of fact, the histograms of Figure 2 suggest that a convergence stronger than a plain convergence in law (corresponding to convergence of the distribution function) holds.

Definition 1. Let $(\xi_n)$ be a family of random variables with expectation $\mu_n = \mathbb{E}(\xi_n)$ and variance $\sigma_n^2 = \mathbb{V}(\xi_n)$. It is said to satisfy a local limit law with density $g(x)$ if one has
\[ \lim_{n \to \infty} \sup_{x \in \mathbb{R}} \left| \sigma_n \mathbb{P}\left( \xi_n = \lfloor \mu_n + x \sigma_n \rfloor \right) - g(x) \right| = 0. \tag{22} \]

In other terms, we expect the probability of $\xi_n$ being at $x$ standard deviations away from its mean to be well approximated by $g(x)/\sigma_n$. This concept is discussed in the case of sums of random variables by Gnedenko and Kolmogorov in [10, Ch. 9] and, in a broader combinatorial context, by Bender [1] and Flajolet–Sedgewick [9, §IX.9].
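Definition 1 can be illustrated on the exact distribution of the number of symmetrical nodes for moderate sizes. The sketch below is an assumption-laden illustration rather than part of the paper: the names `sym_counts`, `distribution`, and `local_limit_gap` are hypothetical, the coefficient recurrence is the one implied by (7), and the supremum in (22) is only sampled at the lattice points $x_k = (k - \mu_n)/\sigma_n$. For phylogenetic trees, a shape $t$ is weighted by $2^{-\operatorname{sym}(t)}$, in accordance with (4).

```python
from fractions import Fraction
from math import exp, pi, sqrt


def sym_counts(max_size):
    """counts[n][k] = number of Otter trees with n leaves and k symmetrical nodes."""
    counts = [[0], [1]]
    for n in range(2, max_size + 1):
        total = [0] * n                      # sym(t) <= n - 1 for a tree of size n
        for k in range(1, n):                # ordered pairs of subtree sizes (k, n - k)
            for i, a in enumerate(counts[k]):
                for j, b in enumerate(counts[n - k]):
                    total[i + j] += a * b
        if n % 2 == 0:
            for i, a in enumerate(counts[n // 2]):
                total[2 * i] -= a            # remove ordered pairs of identical subtrees
        total = [c // 2 for c in total]      # unordered pairs of distinct subtrees
        if n % 2 == 0:
            for i, a in enumerate(counts[n // 2]):
                total[2 * i + 1] += a        # identical subtrees: one more symmetrical node
        counts.append(total)
    return counts


def distribution(n, labeled=False):
    """Exact law of the number of symmetrical nodes in U_n (or B_n if labeled)."""
    counts = sym_counts(n)[n]
    if labeled:
        weights = [Fraction(c, 2 ** k) for k, c in enumerate(counts)]
    else:
        weights = [Fraction(c) for c in counts]
    total = sum(weights)
    return [float(w / total) for w in weights]


def local_limit_gap(probs):
    """Mean, standard deviation, and largest gap to the Gaussian density on the lattice.

    Requires a non-degenerate distribution (size n >= 4).
    """
    mean = sum(k * p for k, p in enumerate(probs))
    sigma = sqrt(sum((k - mean) ** 2 * p for k, p in enumerate(probs)))
    gap = max(abs(sigma * p - exp(-((k - mean) / sigma) ** 2 / 2) / sqrt(2 * pi))
              for k, p in enumerate(probs))
    return mean, sigma, gap


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, local_limit_gap(distribution(n)), local_limit_gap(distribution(n, labeled=True)))
```

If the local limit law holds as stated, the reported gap should shrink as $n$ grows; the sketch makes no claim about the rate.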

Theorem 3.

The number of symmetrical nodes in either an unlabeled tree ($X_n$ on $\mathcal{U}_n$) or a phylogenetic tree ($Y_n$ on $\mathcal{B}_n$) satisfies a local limit law of the Gaussian type. That is, in the sense of Definition 1, a local limit law holds, with density
\[ g(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. \]

Proof. (i) The unlabeled case ($X_n$, $\mathcal{U}_n$). The proof essentially boils down to establishing that
\[ f_n(u) = [z^n] F(z, u) \]
is small compared to $[z^n] F(z, 1)$, when $u$ satisfies $|u| = 1$ and stays away from 1; then, Theorem IX.14, p. 696, from [9] does the rest. The arguments are variations of the ones previously used.

Since a tree of size $n$ has less than $n$ symmetrical nodes, we have $|f_n(u)| \le |u|^n f_n(1)$ for any $|u| \ge 1$. There results that the convergence of the series expansion of $F(z, u)$ is dominated by that of $F(|zu|, 1)$ when $|u| \ge 1$. Apply the fact explained in the previous sentence, with $z^2$ and $u^2$ instead of $z$ and $u$, to get that the coefficients of $F(z^2, u^2)$ are less than the coefficients of $F(|z^2 u^2|, 1)$, which converges when $|z^2 u^2| < 0.\cdots$, hence when $|zu| < 0.\cdots75$, say. Now choose $\eta$ so that $(1 + \eta)(\rho + \eta) < 0.\cdots75$, where $\rho$ is the radius of convergence of Otter trees ($\rho \equiv \rho(1) \approx 0.4026$). Then $F(z^2, u^2)$ is bivariate analytic whenever $|z| < \rho + \eta$ and $|u| < 1 + \eta$. In accordance with previously developed arguments, this implies that, for any fixed $u$ satisfying $|u| \le 1 + \eta$, the function $z \mapsto F(z, u)$ has only finitely many singularities, each of the square-root type, in $|z| \le \rho + \eta$.

For $u$ in a small complex neighborhood of 1, we already know that $z \mapsto F(z, u)$ has only one dominant singularity at some $\rho(u)$, which is a root of
\[ 1 - 2\rho(u) - (2u - 1) F\left( \rho(u)^2, u^2 \right) = 0. \]
(This property lies at the basis of the central limit law of the previous theorem.)

Consider now a $u$ such that $|u| = 1$, but $u \notin \Omega$. We argue that $z \mapsto F(z, u)$ is analytic at all points $z$ such that $|z| = \rho$. Indeed for such values of $u$ and $z$, we have, by the strong triangle inequality,
\[ |F(z, u)| < F(\rho, 1), \tag{23} \]
the reason being that, in the expansion $F(z, u) = z + uz^2 + uz^3 + \cdots$, the values of the monomials $u^k z^n$ cannot be all collinear, unless $u = 1$. The inequality (23) combined with the fact that $F(\rho, 1) = 1$ implies that $z \mapsto F(z, u)$ cannot be singular (since, as we know, the only possibility for a singularity would be that it is of the square-root type and $F(z, u) = 1$).

Thus, for $|u| = 1$ and $u \notin \Omega$, the function $z \mapsto F(z, u)$ is analytic at all points of $|z| = \rho$. Hence, it is analytic in $|z| \le \rho + \delta$, for some $\delta > 0$. By usual exponential bounds, there results that, for some $K > 0$, one has
\[ |f_n(u)| < K (\rho + \delta/2)^{-n}, \qquad |u| = 1, \quad u \notin \Omega. \tag{24} \]
As expressed by Theorem IX.14 of [9], the existence of a quasi-powers approximation (when $u$ is near 1), as in (18) and (19), and of the exponentially small bound (when $u \notin \Omega$ is away from 1), as provided by (24), suffices to ensure the existence of a local limit law.

(ii) The labeled case ($Y_n$, $\mathcal{B}_n$). In accordance with (20), the function $F(z, u/2)$ is the bivariate exponential generating function of phylogenetic trees, with $z$ marking size and $u$ marking the number of symmetrical nodes. Consider once more $|u| = 1$ and distinguish the two cases $u \in \widehat{\Omega}$ (for which the proof of Theorem 2 provides a quasi-powers approximation) and $u \notin \widehat{\Omega}$. In the latter case, arguments that entirely parallel those applied to unlabeled trees give us that $z \mapsto F(z, u/2)$ has no singularity on $|z| = 1/2$. This implies, for $u \notin \widehat{\Omega}$, the exponential smallness of $\varphi_n(u/2)$ compared to $\varphi_n(1/2)$, and the local limit law follows as in the unlabeled case. □

5. Coincidence of the Number of Symmetries

From a statistician’s point of view, it may be of interest to determine the prob-ability for two trees to be “ similar ” (rather than plainly isomorphic), given somestructural similarity distance between non-plane trees—see, for instance, the workof Ycart and Van Cutsem [21] for a study conducted under probabilistic assump-tions that diﬀer from ours. Combinatorial generating functions can still be usefulin this broad range of problems, as we now show by considering the following ques-tion: determine the probability that two randomly chosen trees τ, τ (cid:48) of the same sizehave the same number of symmetrical nodes . This probability a priori lies in theinterval [ n , n have the same number of cycles is asymptoticto (2 √ π log n ) − ; B´ona and Knopfmacher [2] examine combinatorially and asymp-totically the probability that various types of integer compositions have the samenumber of parts, and several other coincidence probabilities are studied in [7]. Thefollowing basic lemma trivializes the asymptotic side of several such questions. Lemma 6.

Let $\mathcal{C}$ be a combinatorial class equipped with an integer-valued parameter $\chi$. Assume that the random variable corresponding to $\chi$ restricted to $\mathcal{C}_n$ (under the uniform distribution over $\mathcal{C}_n$) satisfies a local limit law with density $g(x)$, in the sense of Definition 1. Let the variance of $\chi$ on $\mathcal{C}_n$ be $\sigma_n^2$ and assume that $g(x)$ is continuously differentiable. Then, the probability that two objects $c, c' \in \mathcal{C}_n$ admit the same value of $\chi$ satisfies the asymptotic estimate

$$\mathbb{P}\left[\chi(c) = \chi(c'),\ c, c' \in \mathcal{C}_n\right] \sim \frac{K}{\sigma_n}, \qquad \text{where } K := \int_{-\infty}^{\infty} g(x)^2 \, dx. \tag{25}$$

Note that, for $g(x)$ the standard Gaussian density, one has $K = 1/(2\sqrt{\pi})$.

Proof (sketch).

Let $\varpi_n$ be the probability of coincidence, that is, the left-hand side of (25). Observe that, by hypothesis, we must have $\sigma_n \to \infty$. The baseline is that

$$\varpi_n = \sum_k \mathbb{P}_{\mathcal{C}_n}[\chi(c) = k]^2 \sim \frac{1}{\sigma_n^2} \sum_{x \in \mathcal{E}_n} g(x)^2 \sim \frac{1}{\sigma_n} \int_{-\infty}^{\infty} g(x)^2 \, dx,$$

with

$$\mathcal{E}_n := \frac{1}{\sigma_n}\left(\mathbb{Z}_{\ge 0} - \{\mu_n\}\right), \qquad \mu_n := \mathbb{E}_{\mathcal{C}_n}[\chi].$$

To justify this chain rigorously, first restrict attention to values of $x$ in a finite interval $[-A, +B]$, so that the tails $\left(\int_{-\infty}^{-A} + \int_{B}^{\infty}\right) g$ are less than some small $\epsilon$. Then, with $x \in [-A, +B]$, make use of the approximation (22) provided by the assumption of a local limit law. Next, approximate the sum of $g(x)^2$ taken at regularly spaced sampling points (a Riemann sum) by the corresponding integral; the points of $\mathcal{E}_n$ are spaced $1/\sigma_n$ apart, which accounts for the factor $\sigma_n$ gained in the last step of the chain. Finally, add back the tails. $\square$

Footnote. The reasoning corresponding to that theorem is simple: start from

$$[u^k] f_n(u) = \frac{1}{2i\pi} \int_{|u|=1} f_n(u) \, \frac{du}{u^{k+1}}.$$

Use (24) to neglect the contribution corresponding to $u \notin \Omega$; appeal to the saddle point method applied to the quasi-powers approximation to estimate the central part $u \in \Omega$, and conclude.

Given the local limit law expressed by Theorem 3, an immediate consequence of Lemma 6 is the following.
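As a numerical sanity check of Lemma 6, the estimate (25) can be tested on a toy model that is not taken from this paper: the parameter $\chi$ is assumed to follow a Binomial$(n, 1/2)$ law, which satisfies a Gaussian local limit law by the de Moivre-Laplace theorem. The sketch below compares the exact coincidence probability $\sum_k p_k^2$ with $K/\sigma_n$ for $K = 1/(2\sqrt{\pi})$. All function names are hypothetical and chosen for illustration only.

```python
import math


def binomial_pmf(n):
    """Exact probabilities of Binomial(n, 1/2); index k holds P[X = k].

    Toy stand-in for the distribution of chi on C_n (an assumption,
    not one of the tree models of the paper).
    """
    total = 2 ** n
    return [math.comb(n, k) / total for k in range(n + 1)]


def coincidence_probability(pmf):
    """P[X = Y] for two independent draws from pmf, i.e. sum of p_k^2."""
    return sum(p * p for p in pmf)


def std_dev(pmf):
    """Standard deviation sigma_n of the distribution given by pmf."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return math.sqrt(var)


# K = integral of g(x)^2 for the standard Gaussian density g (Lemma 6).
K_GAUSS = 1.0 / (2.0 * math.sqrt(math.pi))


def lemma_ratio(n):
    """Exact coincidence probability divided by the estimate K / sigma_n."""
    pmf = binomial_pmf(n)
    return coincidence_probability(pmf) / (K_GAUSS / std_dev(pmf))


if __name__ == "__main__":
    for n in (25, 100, 400):
        # The ratio should approach 1 as n grows.
        print(n, lemma_ratio(n))
```

In this toy case the sum of squares has the closed form $\binom{2n}{n}/4^n$, so the ratio can also be checked by hand. Returning to trees, the consequence of Lemma 6 announced above reads as follows.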

Theorem 4.

For Otter trees ($\mathcal{U}_n$) and phylogenetic trees ($\mathcal{B}_n$), the asymptotic probabilities that two trees of size $n$ have the same number of symmetries admit the forms

$$\mathcal{U}_n : \ \frac{1}{2\sigma\sqrt{\pi n}}, \qquad \mathcal{B}_n : \ \frac{1}{2\widehat{\sigma}\sqrt{\pi n}},$$

where $\sigma, \widehat{\sigma}$ are the two “variance constants” of Theorem 2.

In summary, as we see in several particular cases here, qualitatively similar phenomena are expected in trees, whether plane or non-plane trees, labelled or unlabelled, whereas, quantitatively, the structure constants (for instance, $\mu$ and $\widehat{\mu}$ in Theorem 2; $\sigma$ and $\widehat{\sigma}$ in Theorem 4) tend to be model-specific. Yet another instance of such universality phenomena is the height of Otter trees, analysed in [3], which is to be compared to the height of plane binary trees [8]: both scale to $\sqrt{n}$ and lead to the same elliptic-theta distribution, albeit with different scaling factors.

Acknowledgements.

The work of M. Bóna was partially supported by the National Science Foundation and the National Security Agency. The work of P. Flajolet was partly supported by the French ANR Project SADA (“Structures Discrètes et Algorithmes”).

References

[1] Bender, E. A. Central and local limit theorems applied to asymptotic enumeration. Journal of Combinatorial Theory 15 (1973), 91–111.
[2] Bóna, M., and Knopfmacher, A. On the probability that certain compositions have the same number of parts. Annals of Combinatorics (2008). To appear, 19pp.
[3] Broutin, N., and Flajolet, P. The height of random binary unlabelled trees. In Proceedings of Fifth Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities (Blaubeuren, 2008), U. Rösler, Ed., vol. AI, pp. 121–134.
[4] Diestel, R. Graph Theory. No. 173 in Graduate Texts in Mathematics. Springer Verlag, 2000.
[5] Erdős, P., and Turán, P. On some problems of a statistical group theory III. Acta Math. Acad. Sci. Hungar. 18 (1967), 309–320.
[6] Finch, S. Mathematical Constants. Cambridge University Press, 2003.
[7] Flajolet, P., Fusy, E., Gourdon, X., Panario, D., and Pouyanne, N. A hybrid of Darboux’s method and singularity analysis in combinatorial asymptotics. Electronic Journal of Combinatorics 13, 1:R103 (2006), 1–35.
[8] Flajolet, P., and Odlyzko, A. M. The average height of binary trees and other simple trees. Journal of Computer and System Sciences 25 (1982), 171–213.
[9] Flajolet, P., and Sedgewick, R. Analytic Combinatorics. Cambridge University Press, 2008. In press; 825 pages (ISBN-13: 9780521898065); also available electronically from the authors’ home pages.
[10] Gnedenko, B. V., and Kolmogorov, A. N. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, 1968. Translated from the Russian original (1949).
[11] Goh, W. M. Y., and Schmutz, E. The expected order of a random permutation. Bulletin of the London Mathematical Society 23, 1 (1991), 34–42.
[12] Harary, F., and Palmer, E. M. Graphical Enumeration. Academic Press, 1973.
[13] Hwang, H.-K. On convergence rates in the central limit theorems for combinatorial structures. European Journal of Combinatorics 19, 3 (1998), 329–343.
[14] McKeon, K. A. The expected number of symmetries in locally-restricted trees I. In Graph Theory, Combinatorics, and Applications, Y. Alavi, Ed. Wiley, 1991, pp. 849–860.
[15] McKeon, K. A. The expected number of symmetries in locally restricted trees II. Discrete Applied Mathematics 66, 3 (1996), 245–253.
[16] Nicolas, J.-L. Distribution statistique de l’ordre d’un élément du groupe symétrique. Acta Math. Hung. 45, 1–2 (1985), 69–84.
[17] Otter, R. The number of trees. Annals of Mathematics 49, 3 (1948), 583–599.
[18] Pólya, G., and Read, R. C. Combinatorial Enumeration of Groups, Graphs and Chemical Compounds. Springer Verlag, 1987.
[19] Sloane, N. J. A. The On-Line Encyclopedia of Integer Sequences. 2008. Published electronically.
[20] Stanley, R. P. Enumerative Combinatorics, vol. II. Cambridge University Press, 1999.
[21] Van Cutsem, B., and Ycart, B. Indexed dendrograms on random dissimilarities. Journal of Classification 15, 1 (1998), 93–127.
[22] Wilf, H. S. The variance of the Stirling cycle numbers. Tech. rep., ArXiv, 2005.

M. Bóna, Department of Mathematics, University of Florida, 358 Little Hall, PO Box 118105, Gainesville, FL 32611–8105 (USA)

P. Flajolet.