Isomorphism and Symmetries in Random Phylogenetic Trees
MIKLÓS BÓNA AND PHILIPPE FLAJOLET
Abstract.
The probability that two randomly selected phylogenetic trees of the same size are isomorphic is found to be asymptotic to a decreasing exponential modulated by a polynomial factor. The number of symmetrical nodes in a random phylogenetic tree of large size obeys a limiting Gaussian distribution, in the sense of both central and local limits. The probability that two random phylogenetic trees have the same number of symmetries asymptotically obeys an inverse square-root law. Precise estimates for these problems are obtained by methods of analytic combinatorics, involving bivariate generating functions, singularity analysis, and quasi-powers approximations.

1. Introduction
Every high school student of every civilized part of the world is cognizant of the tree of species, also known as the “tree of life”, in relation to Darwin’s theory of evolution (Figure 1). We observe $n$ different species, and form a group with the closest pair (under some suitable proximity criterion), then repeat the process with the $n -$ […] phylogenetic tree, also known as “cladogram”, is obtained: such a tree has the $n$ species at its external nodes, also called “leaves”; it has $n -$ […]. In classical combinatorial terms, the set of phylogenetic trees thus corresponds to the set $\mathcal{B}$ of rooted non-plane binary trees, which are labeled at their leaves. We let $\mathcal{B}_n$ be the subset of $\mathcal{B}$ corresponding to trees of size $n$ (those with $n$ leaves) and denote by $b_n := |\mathcal{B}_n|$ the corresponding cardinality. Considering the listing of all unlabeled trees of sizes 1, 2, 3, 4,

(1) [Diagram: the unlabeled trees of sizes 1 to 4; the two trees of size 4 are marked (L) and (R).]

the reader is invited to verify that $b_1 = 1$, $b_2 = 1$, $b_3 = 3$, and that $b_4 = 15$ is obtained by counting all possible labelings (3 and 12, respectively) of the two trees L, R shown on the right of (1).

Figure 1. Left: the representation of a phylogenetic tree in Darwin’s own handwriting. Right: an illustration of the Tree of Life by Haeckel in The Evolution of Man, published in 1879. (Source: Entry “Tree of life”, Wikipedia.)

Date: January 6, 2009. arXiv [math.PR].

A general formula for the numbers $b_n$ is well known and straightforward to prove. Indeed, if we introduce the exponential generating function
\[ B(z) := \sum_{n \ge 1} b_n \frac{z^n}{n!}, \]
then the fact that each element of $\mathcal{B}_n$ is built up from its two subtrees implies that
\[ B(z) = z + \frac{1}{2} B(z)^2. \tag{2} \]
See the books by Stanley [20, pp. 13–15] or Flajolet–Sedgewick [9, §…] […] $B(z)$ is the solution of the quadratic equation (2) that is a generating function. That is,
\[ B(z) = 1 - \sqrt{1 - 2z}. \]
This leads to the following exact formula for the numbers $b_n$.

Proposition 1.
The number of phylogenetic trees with $n$ labeled leaves is
\[ b_n = 1 \cdot 3 \cdot 5 \cdots (2n-3) \equiv (2n-3)!!. \]

There is a natural way to associate an unlabeled rooted binary non-plane tree to each element $t \in \mathcal{B}_n$, by simply removing all the labels of $t$. We will say that two elements $t, t' \in \mathcal{B}_n$ are isomorphic if removing their labels will associate them to the same unlabeled tree. This leads to the following intriguing question.

Question. What is the probability $p_n$ that two phylogenetic trees, selected uniformly at random in $\mathcal{B}_n$, are isomorphic?

Note that, in our running example, the case of $n = 4$, we have
\[ p_4 = \left(\frac{1}{5}\right)^2 + \left(\frac{4}{5}\right)^2 = \frac{17}{25}. \]
Indeed, if we selected two elements of $\mathcal{B}_4$ at random, there is a $(3/15)^2 = (1/5)^2$ chance that they will both belong to the isomorphism class of L, and $(12/15)^2 = (4/5)^2$ that they both belong to the isomorphism class of R, where L and R are the two trees of (1).
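The running example can be checked by brute force. The following Python sketch enumerates all phylogenetic trees on a small leaf set, groups them by unlabeled shape, and evaluates $p_n$ as the sum of squared class frequencies. The tree representation (a `frozenset` of two subtrees for each internal node) and the function names are illustrative assumptions, not notation from the paper; the enumeration grows like $(2n-3)!!$, so it is only meant for very small $n$.

```python
from collections import Counter
from fractions import Fraction


def phylo_trees(labels):
    """All rooted non-plane binary trees whose leaves carry the given labels.

    A leaf is its label; an internal node is a frozenset of its two subtrees.
    """
    labels = tuple(labels)
    if len(labels) == 1:
        return [labels[0]]
    first, rest = labels[0], labels[1:]
    trees = []
    # Keep `first` on one fixed side of the root so each tree is built once.
    for mask in range(2 ** len(rest)):
        left = [first] + [x for i, x in enumerate(rest) if (mask >> i) & 1]
        right = [x for i, x in enumerate(rest) if not (mask >> i) & 1]
        if not right:
            continue
        for a in phylo_trees(left):
            for b in phylo_trees(right):
                trees.append(frozenset((a, b)))
    return trees


def shape(tree):
    """Unlabeled shape of a tree, as a canonical nested tuple."""
    if not isinstance(tree, frozenset):
        return ()  # a leaf
    return tuple(sorted(shape(sub) for sub in tree))


def isomorphism_probability(n):
    """Exact p_n by enumeration: sum over shapes of (class size / b_n)^2."""
    trees = phylo_trees(range(1, n + 1))
    classes = Counter(shape(t) for t in trees)
    total = len(trees)
    return Fraction(sum(c * c for c in classes.values()), total * total)


if __name__ == "__main__":
    print(len(phylo_trees(range(1, 5))))   # number of trees of size 4
    print(isomorphism_probability(4))      # the running example
```

For $n = 4$ the two shape classes found by the enumeration have sizes 3 and 12, in agreement with the count $b_4 = 15$ above.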
In this paper, we will use a multivariate generating function argument (Section 2) in conjunction with an analysis of singularities in the complex plane (Section 3) to answer the isomorphism question in Theorem 1. In Section 4, we will extend our analysis to distributional estimates of the number of symmetrical nodes in phylogenetic trees and in their unlabeled counterparts, known as Otter trees: see Theorems 2 and 3 for central and local limit laws, respectively. Such results in particular quantify the distribution of the log-size of the automorphism group of the random trees under consideration. In Section 5, we will work out an explicit estimate of the probability that two random trees have the same number of symmetries.

2. Isomorphism: a Generating Function Argument
2.1. Unlabeled Trees. Let $\mathcal{U}_n$ be the set of all unlabeled rooted binary non-plane trees with $n$ leaves, and let $u_n = |\mathcal{U}_n|$ be the corresponding count, with ordinary generating function
\[ U(z) := \sum_{n \ge 1} u_n z^n. \]
Such trees are often called Otter trees, since Otter was the first to study their enumeration [17]. We can build a generic element of $\mathcal{U}_n$ by taking a tree $t' \in \mathcal{U}_k$ and a tree $t'' \in \mathcal{U}_{n-k}$, and joining their roots to a new root. As the order of $t'$ and $t''$ is not significant, we get each tree $t \in \mathcal{U}_n$ twice this way, except that, if the two subtrees of $t$ are identical, we get $t$ only once. This leads to the functional equation [9, 12, 17, 18]:
\[ U(z) = z + \frac{1}{2}\left(U(z)^2 + U(z^2)\right). \tag{3} \]
The numbers $u_n$ are listed as sequence A001190 (the “Wedderburn–Etherington numbers”) in the On-line Encyclopedia of Integer Sequences by Neil Sloane [19] and are the answers to various combinatorial enumeration problems. The first few values of the sequence $\{u_n\}_{n \ge 1}$ are 1, 1, 1, 2, 3, 6, 11, 23, 46, 98.

2.2. A multivariate generating function.
Let $t_1 \in \mathcal{B}_n$, and let $t_2 \in \mathcal{B}_n$. By Proposition 1, there are $((2n-3)!!)^2$ possibilities for the ordered pair $(t_1, t_2)$, where $t_1$ and $t_2$ do not have to be distinct. Our goal is to count such ordered pairs in which $t_1$ and $t_2$ are isomorphic. This number, divided by $((2n-3)!!)^2$, will then provide the probability $p_n$ that two randomly selected elements of $\mathcal{B}_n$ are isomorphic.

Let $t \in \mathcal{U}_n$. Then the number of different labelings of the leaves of $t$ is
\[ w(t) = \frac{n!}{2^{\mathrm{sym}(t)}}, \tag{4} \]
where $\mathrm{sym}(t)$ is the number of non-leaf nodes $v$ of $t$ such that the two subtrees stemming from $v$ are identical. For example, if $n = 4$, and $t$ is the tree L of (1), then we have $\mathrm{sym}(t) = 3$, and indeed, $t$ has $n!/2^3 = 24/8 = 3$ labelings. If $t$ is the tree R of (1), then we have $\mathrm{sym}(t) = 1$, and $t$ has $24/2 = 12$ labelings. The isomorphism classes of $\mathcal{B}_n$ correspond to elements of $\mathcal{U}_n$. Set
\[ W_n = \sum_{t \in \mathcal{U}_n} \frac{1}{2^{\mathrm{sym}(t)}}. \tag{5} \]
As we have mentioned above, $n!/2^{\mathrm{sym}(t)}$ is the number of labeled trees in the isomorphism class corresponding to $t$. Summing this number over all isomorphism classes, we obtain the total number of trees in $\mathcal{B}_n$. That is,
\[ n!\, W_n = 1 \cdot 3 \cdot 5 \cdots (2n-3). \]
For instance, $W_4 = \frac{1}{8} + \frac{1}{2} = \frac{5}{8}$, and $4! \cdot \frac{5}{8} = 15 = 5!!$.

Let
\[ F(z,u) = \sum_{t \in \mathcal{U}} u^{\mathrm{sym}(t)} z^{|t|} \tag{6} \]
be the bivariate generating function of Otter trees, with $z$ marking the number of leaves, and $u$ marking non-leaf nodes with two identical subtrees. In particular, $F(z,u) = z + uz^2 + uz^3 + (u + u^3) z^4 + {}$higher degree terms. The crucial observation about $F(z,u)$ is the following.

Lemma 1.
The bivariate generating function $F(z,u)$ that enumerates Otter trees with respect to the number of symmetrical nodes satisfies the functional equation
\[ F(z,u) = z + \frac{1}{2} F(z,u)^2 + \left(u - \frac{1}{2}\right) F(z^2, u^2). \tag{7} \]

Proof. If a tree consists of more than one node, then it is built up from its two subtrees. As the order of the two subtrees is not significant, we will get each tree twice this way, except the trees whose two subtrees are identical. If $t_1$ and $t_2$ are the two subtrees of $t$ whose roots are the two children of the root of $t$, then
\[ \mathrm{sym}(t) = \begin{cases} \mathrm{sym}(t_1) + \mathrm{sym}(t_2), & \text{if $t_1$ and $t_2$ are not identical,} \\ \mathrm{sym}(t_1) + \mathrm{sym}(t_2) + 1, & \text{if $t_1$ and $t_2$ are identical.} \end{cases} \]
The first term of the right-hand side of (7) represents the tree on one node, the second term represents all other trees as explained in the preceding paragraph, and the third term is the correction term for trees in which the two subtrees of the root are identical. □
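The definition (6) can be tested against an explicit enumeration for small sizes. The sketch below is an illustration with hypothetical function names: it builds every Otter tree as a canonical nested tuple, counts its symmetrical nodes with the case distinction used in the proof, and tallies the coefficient of $z^n$ in $F(z,u)$ as a polynomial in $u$.

```python
from collections import Counter
from functools import lru_cache


@lru_cache(maxsize=None)
def otter_trees(n):
    """All unlabeled rooted non-plane binary trees with n leaves.

    A leaf is the empty tuple; an internal node is a sorted pair of subtrees,
    so that each non-plane tree has exactly one representation.
    """
    if n == 1:
        return ((),)
    found = set()
    for k in range(1, n // 2 + 1):
        for a in otter_trees(k):
            for b in otter_trees(n - k):
                found.add(tuple(sorted((a, b))))
    return tuple(sorted(found))


def sym(tree):
    """Number of internal nodes whose two subtrees are identical."""
    if tree == ():
        return 0
    a, b = tree
    return sym(a) + sym(b) + (1 if a == b else 0)


def sym_polynomial(n):
    """Coefficient of z^n in F(z, u), as a map {power of u: number of trees}."""
    return dict(Counter(sym(t) for t in otter_trees(n)))


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, len(otter_trees(n)), sym_polynomial(n))
```

For $n = 4$ the tally contains one tree with one symmetrical node and one tree with three, which is the term $(u + u^3) z^4$ of the expansion given before Lemma 1.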
Note that various specializations of $F(z,u)$ have a known combinatorial meaning. Indeed,
(i) If $u = 1$, then $F(z,1) = \sum_{t \in \mathcal{U}} z^{|t|}$ is simply the ordinary generating function $U(z)$ of Otter trees with respect to their number of leaves. We have discussed this generating function in Subsection 2.1, and mentioned that its coefficients $u_n$ are the Wedderburn–Etherington numbers, which form sequence A001190 in [19].
(ii) If $u = 2$, then $F(z,2) = \sum_{t \in \mathcal{U}} z^{|t|}\, 2^{\mathrm{sym}(t)}$ is the ordinary generating function of the total number of automorphisms in all Otter trees. The coefficients constitute sequence A003609 in [19]. Interested readers may consult McKeon’s studies [14, 15] for details. The first few elements of the sequence are 1, 2, 2, 10, 14, 42, 90, 354.
(iii) If $u = 1/2$, then
\[ F\left(z, \frac{1}{2}\right) = \sum_{t \in \mathcal{U}} z^{|t|}\, 2^{-\mathrm{sym}(t)} = \sum_{n} W_n z^n = \sum_{n} (2n-3)!!\, \frac{z^n}{n!} \]
is the exponential generating function $B(z)$ of labeled trees in disguise. We have discussed this generating function in the Introduction. The numbers $(2n-3)!!$ […] $u = 1/$[…]

Let $[z^n]\, g(z)$ denote the coefficient of $z^n$ in the power series $g(z)$.
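The functional equation (7) doubles as a recurrence for the coefficients of $F(z,u)$, which makes the three specializations above easy to check numerically. The sketch below is one possible arrangement, with hypothetical function names: it rewrites (7) in an integer-only form (pairs of subtrees of different sizes, distinct pairs of equal size, identical pairs) and stores each $[z^n]F(z,u)$ as a list of coefficients in $u$.

```python
from fractions import Fraction
from math import factorial


def poly_add(p, q):
    """Sum of two polynomials given as coefficient lists."""
    size = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(size)]


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def sym_polynomials(max_size):
    """phi[n][k] = number of Otter trees with n leaves and k symmetrical nodes."""
    phi = [None, [1]]  # a single leaf has no symmetrical node
    for n in range(2, max_size + 1):
        acc = [0]
        # Root subtrees of different sizes k < n - k.
        for k in range(1, (n + 1) // 2):
            acc = poly_add(acc, poly_mul(phi[k], phi[n - k]))
        if n % 2 == 0:
            half = phi[n // 2]
            square = poly_mul(half, half)
            doubled = [0] * (2 * len(half) - 1)  # half(u^2): identical pairs
            for i, a in enumerate(half):
                doubled[2 * i] = a
            # Distinct unordered pairs of equal size: (half^2 - half(u^2)) / 2.
            acc = poly_add(acc, [(s - d) // 2 for s, d in zip(square, doubled)])
            # Identical pairs create one more symmetrical node: u * half(u^2).
            acc = poly_add(acc, [0] + doubled)
        phi.append(acc)
    return phi


def evaluate(poly, u):
    """Value of a coefficient list at u."""
    return sum(c * u ** k for k, c in enumerate(poly))


def odd_double_factorial(n):
    """1 * 3 * 5 * ... * (2n - 3), with the empty product for n = 1."""
    out = 1
    for k in range(1, 2 * n - 2, 2):
        out *= k
    return out


if __name__ == "__main__":
    phi = sym_polynomials(10)
    print([evaluate(p, 1) for p in phi[1:]])                 # case (i)
    print([evaluate(p, 2) for p in phi[1:9]])                # case (ii)
    print([factorial(n) * evaluate(phi[n], Fraction(1, 2))   # case (iii)
           for n in range(1, 11)])
```

The integer form is equivalent to (7): the full convolution counts each pair of distinct sizes twice, and the term $(u - \tfrac12)F(z^2,u^2)$ both removes the half-counted identical pairs and re-inserts them with an extra factor $u$.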
Lemma 2. For all positive integers $n \ge 1$, the probability $p_n$ that two phylogenetic trees of size $n$ are isomorphic satisfies
\[ p_n = \left(\frac{n!}{(2n-3)!!}\right)^2 \cdot [z^n]\, F\left(z, \frac{1}{4}\right). \]

Proof. Consider the sample space whose elements are the elements of $\mathcal{U}_n$, and in which the probability of $t \in \mathcal{U}_n$ is
\[ \kappa(t) := \frac{n!}{2^{\mathrm{sym}(t)}} \cdot \frac{1}{(2n-3)!!} \equiv \frac{w(t)}{(2n-3)!!}. \tag{8} \]
(For probabilists, $\kappa$ is the image on $\mathcal{U}_n$ of the uniform distribution of $\mathcal{B}_n$.) For instance, if $n = 4$, then this space has two elements (the two trees L, R of (1)), one has probability 1/5, and the other has probability 4/5. If we select two elements of this space at random, the probability that they coincide is
\[ p_n = \sum_{t \in \mathcal{U}_n} \kappa(t)^2 = \frac{1}{((2n-3)!!)^2} \sum_{t \in \mathcal{U}_n} w(t)^2 = \frac{(n!)^2}{((2n-3)!!)^2} \sum_{t \in \mathcal{U}_n} \left(\frac{1}{4}\right)^{\mathrm{sym}(t)}. \]
Our claim now follows since $\sum_{t \in \mathcal{U}_n} \left(\frac{1}{4}\right)^{\mathrm{sym}(t)}$ is indeed the coefficient of $z^n$ in $F(z, 1/4)$. □

3. Isomorphism: Singularity Analysis
By Lemma 2, our goal is now to find the coefficient of $z^n$ in the one-variable generating function
\[ f(z) := F(z, 1/4). \]
Lemma 1 shows that the formal power series $F(z,u)$ is the solution of the quadratic equation (7) that satisfies $F(0,0) = 0$. That is,
\[ F(z,u) = 1 - \sqrt{1 - 2z - (2u-1)\, F(z^2, u^2)}. \tag{9} \]
Iterated applications of (9), starting with $u = 1/4$, show that
\[ f(z) \equiv F(z, 1/4) = 1 - \sqrt{1 - 2z + \frac{1}{2} F\left(z^2, \frac{1}{16}\right)} = 1 - \sqrt{\frac{3}{2} - 2z - \frac{1}{2}\sqrt{1 - 2z^2 + \frac{7}{8} F\left(z^4, \frac{1}{256}\right)}} = \cdots. \]
In the limit, there results that $f(z)$ admits a “continued square-root” expansion
\[ f(z) = 1 - \sqrt{\frac{3}{2} - 2z - \frac{1}{2}\sqrt{\frac{15}{8} - 2z^2 - \frac{7}{8}\sqrt{\frac{255}{128} - 2z^4 - \frac{127}{128}\sqrt{\cdots}}}}, \]
out of which initial elements of the sequence $(p_n)$ are easily determined: 1, […].

In order to compute the growth rate of the coefficients of $f(z)$, we will analyze the dominant singularity (or singularities) of this power series. The interested reader is invited to consult the book Analytic Combinatorics by Flajolet and Sedgewick [9] for more information on the notions and techniques that we are going to use. Part of the difficulty of the problem is that the functional relation (9) has the character of an inclusion–exclusion formula: $F(z,u)$ does not depend positively on $F(z^2, u^2)$, as soon as $u \le 1/2$, which requires suitably crafted arguments, in contrast to the (simpler) asymptotic analysis of $u_n = [z^n] F(z,1)$. […] location, type, and number of the dominant singularities of $f(z)$, that is, singularities that have smallest absolute value (modulus).

3.1. Location.
First, it is essential for our analytic arguments to establish that $f(z)$ has a radius of convergence strictly less than 1. Our starting point parallels Lemmas 1–2 of McKeon [15], but we need a specific argument for the upper bound.

Lemma 3. Let $\rho$ be the largest real number such that $f(z)$ is analytic in the interior of a disc centered at the origin that has radius $\rho$. The following inequalities hold:
\[ 0.4 < \rho < 0.625. \]

Proof. (i) Lower bound. Note that $f(z)$ is convergent in some disc of radius at least 0.4, since the coefficients of $f(z) = F(z, 1/4)$ are at most as large as the coefficients of $F(z,1)$, the generating function $U(z)$ of Otter trees, and the latter is known to be convergent in a disc of radius $0.\cdots$: see Otter’s original paper [17] and Finch’s book [6, §…]. […] $F(z,1) = U(z)$.

(ii) Upper bound. For fixed $n$, let $a_1, a_2, \ldots, a_{u_n}$ be the numbers of our labeled trees whose underlying unlabeled tree is the first, second, . . . , last Otter tree of size $n$. Then the relation
\[ p_n \equiv \frac{a_1^2 + a_2^2 + \cdots + a_{u_n}^2}{(a_1 + a_2 + \cdots + a_{u_n})^2} \ge \frac{1}{u_n} \tag{10} \]
results from the Cauchy–Schwarz inequality. (In words: the probability of coincidence of two elements from a finite probability space is smallest when the distribution is the uniform one.)

As we mentioned, it is proved in [17] that the generating function $\sum_n u_n x^n$ converges in a disc of radius at least 0.4. Therefore, the series $\sum_n \frac{1}{u_n} x^n$ converges in a disc of radius at most $1/0.4 = 2.5$, and by (10), this implies that $\sum_n p_n x^n$ converges in a disc of radius at most 2.5. Now Lemma 2 shows that $F(z, 1/4)$ converges in a disc of radius at most $2.5/4 = 0.625$, since the coefficients of $F(z, 1/4)$ are, up to polynomial factors, $4^n$ times larger than the coefficients of $\sum_n p_n x^n$. It follows that $\rho < 0.625$. □

A well-known theorem of Pringsheim states that if a function $g(z)$ is representable around the origin by a series expansion that has non-negative coefficients and radius of convergence $R$, then the real number $R$ is actually a singularity of $g(z)$. Applying this theorem to $f(z)$, we see that the positive real number $\rho$ must be a singularity of $f(z)$.

3.2. Type.
Recall that a function $g(z)$ analytic in a domain $\Omega$ is said to have a square-root singularity at a boundary point $\alpha$ if, for some function $H$ analytic at 0, the representation $g(z) = H(\sqrt{z - \alpha})$ holds in the intersection of $\Omega$ and a neighborhood of $\alpha$. (In particular, if $g(z) = \sqrt{\gamma(z)}$ with $\gamma$ analytic at $\alpha$, then $g(z)$ has a square-root singularity at $\alpha$ whenever $\gamma(\alpha) = 0$ and $\gamma'(\alpha) \ne 0$.)

Lemma 4. All dominant singularities (of modulus $\rho$) of $f(z)$ are isolated and are of the square-root type.

Proof. In order to see this, note that $\rho < 1$ implies $\rho < \sqrt{\rho}$. Therefore, the power series $F(z^2, 1/4)$ (that has radius of convergence $\sqrt{\rho}$) is analytic in the interior of the disc of radius $\rho$, and so is the power series $F(z^2, 1/16)$ since its coefficients are smaller than the corresponding coefficients of $F(z^2, 1/4)$. Thus, the dominant singularities of
\[ f(z) = F\left(z, \frac{1}{4}\right) = 1 - \sqrt{1 - 2z + \frac{1}{2} F\left(z^2, \frac{1}{16}\right)} \tag{11} \]
are of the square-root type: they are to be found amongst the roots of the expression under the square-root sign in (9), that is, amongst the zeros of $1 - 2z + \frac{1}{2} F(z^2, 1/16)$ whose modulus is $\rho$. As $1 - 2z + \frac{1}{2} F(z^2, 1/16)$ is analytic in the disc centered at the origin with radius at least $\sqrt{\rho} > \rho$, it has isolated roots. Hence $f(z)$ has only a finite number of singularities on the circle $|z| = \rho$, and each is of square-root type. □

The argument of the proof (see (11)) also shows that $\rho$ is determined as the smallest positive root of the equation
\[ 1 - 2\rho + \frac{1}{2} F\left(\rho^2, \frac{1}{16}\right) = 0. \tag{12} \]

3.3. Number.
In order to complete our characterization of the dominant singular structure of $f(z)$, we need the following statement.

Lemma 5. The point $\rho$ is the only singularity of smallest modulus of $f(z)$.

Proof. The argument is somewhat indirect and it proceeds in two stages.

First we show that, as a power series, $f(z)$ converges for each $z$ with $|z| = \rho$. To this purpose, we need to recall briefly some principles of singularity analysis, as expounded in [9, Ch. VI]. Let $g(z)$ be a function analytic in $|z| < R$ with finitely many singularities at the set $\{\alpha_j\}$ on the circle $|z| = R$; assume in addition that $g(z)$ has a square-root singularity at each $\alpha_j$ in the sense of Subsection 3.2. Then, one has $[z^n] g(z) = O\left(R^{-n} n^{-3/2}\right)$. (This corresponds to the $O$-transfer theorem of [9, Th. VI.3, p. 390], with amendments for the case of multiple singularities to be found in [9, §VI.5]; see also (14) below.) It follows from this general estimate and Lemma 4 that
\[ [z^n] f(z) = O\left(\rho^{-n} n^{-3/2}\right). \]
Therefore, the series expansion of $f(z)$ converges absolutely as long as $|z| \le \rho$, and, in particular, it converges for all $z$ with modulus $\rho$.

Now, we are in a position to prove that $f(z)$ has no singularity other than $\rho$ on the circle $|z| = \rho$. Let us assume the contrary; that is, there is a complex number $z_0 \ne \rho$ such that $|z_0| = \rho$ and $z_0$ is a singularity of $f(z) \equiv F(z, 1/4)$. Then $f(z_0) \equiv F(z_0, 1/4) = 1$, since the expression under the square-root sign in (9) is equal to 0, corresponding to a singularity of square-root type; for the same reason, $f(\rho) = 1$. On the other hand, one has a priori $|f(z_0)| \le f(\rho)$, as a consequence of the triangle inequality and the fact, proved above, that $f(z)$ converges on $|z| = \rho$. Now it follows from the strong triangle inequality that the equality $f(z_0) = f(\rho)$ is only possible if all the terms $f_n z_0^n$ that compose the (convergent) series expansion of $f(z_0)$ are positive real. (Here $f_m = [z^m] f(z)$.) However, since, in particular, $f_1 = 1$ is nonzero, this implies that $z_0 = \rho$, and a contradiction has been reached. (This part of the argument is also closely related to the Daffodil Lemma of [9, p. 266].) □
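Before turning to asymptotics, the exact values of $p_n$ can be tabulated from Lemma 2 together with the recursion implicit in (7). The sketch below is a minimal illustration with hypothetical function names: since $F(z,u)$ involves $F(z^2,u^2)$, the coefficients of $F(z,u)$ up to order $N$ are obtained from those of $F(z,u^2)$ up to order $N/2$, and so on recursively. Exact rational arithmetic is used throughout. The ratio of consecutive coefficients of $f(z)$ gives a rough numerical feel for the singularity $\rho$; it is a heuristic indicator only, not the method used in the paper.

```python
from fractions import Fraction
from math import factorial


def coefficients(max_order, u):
    """[z^n] F(z, u) for 0 <= n <= max_order, with u an exact Fraction.

    Reads the recurrence off F = z + F^2/2 + (u - 1/2) F(z^2, u^2).
    """
    coeffs = [Fraction(0)] * (max_order + 1)
    if max_order == 0:
        return coeffs
    # Coefficients of F(z, u^2); they enter at the even powers of z.
    inner = coefficients(max_order // 2, u * u)
    coeffs[1] = Fraction(1)
    for n in range(2, max_order + 1):
        conv = sum(coeffs[k] * coeffs[n - k] for k in range(1, n))
        coeffs[n] = conv / 2
        if n % 2 == 0:
            coeffs[n] += (u - Fraction(1, 2)) * inner[n // 2]
    return coeffs


def odd_double_factorial(n):
    """1 * 3 * 5 * ... * (2n - 3), with the empty product for n = 1."""
    out = 1
    for k in range(1, 2 * n - 2, 2):
        out *= k
    return out


def isomorphism_probabilities(max_size):
    """List p with p[n] = (n! / (2n-3)!!)^2 [z^n] F(z, 1/4); p[0] is unused."""
    f = coefficients(max_size, Fraction(1, 4))
    probs = [None]
    for n in range(1, max_size + 1):
        scale = Fraction(factorial(n), odd_double_factorial(n)) ** 2
        probs.append(scale * f[n])
    return probs


if __name__ == "__main__":
    p = isomorphism_probabilities(12)
    for n in range(1, 13):
        print(n, p[n])
    f = coefficients(40, Fraction(1, 4))
    print(float(f[39] / f[40]))  # crude estimate of the dominant singularity
```

As a consistency check, the value returned for $n = 4$ is the $17/25$ of the running example in the Introduction.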
3.4. The asymptotics of $p_n$. As a result of Lemmas 3–5, the function $f(z)$ has only one dominant singularity, and that singularity $\rho$ is of the square-root type. One then has, for a family of constants $h_k$, the local singular expansion:
\[ f(z) = 1 + \sum_{k=0}^{\infty} h_k (1 - z/\rho)^{k + 1/2}, \tag{13} \]
which is valid for $z$ near $\rho$. The conditions of the singularity analysis process as summarized in [9, §VI.4] are then satisfied. Consequently, each singular element of (13) relative to $f(z)$ can be translated into a matching asymptotic term relative to $[z^n] f(z)$, according to the rule
\[ \sigma(z) = (1 - z/\rho)^{\theta} \quad\longrightarrow\quad [z^n]\, \sigma(z) = \rho^{-n} \binom{n - \theta - 1}{n} \sim \rho^{-n} \frac{n^{-\theta - 1}}{\Gamma(-\theta)}. \tag{14} \]
In particular, we have $[z^n] f(z) \sim C \cdot \rho^{-n} n^{-3/2}$, for some $C$.

Hence Lemma 2, combined with Lemmas 4–5 and the routine asymptotics of $n!/(2n-3)!!$, […]

Theorem 1. The probability that two phylogenetic trees of size $n$ are isomorphic admits a complete asymptotic expansion
\[ p_n \sim a \cdot b^{-n} \cdot n^{3/2} \left(1 + \sum_{k \ge 1} \frac{c_k}{n^k}\right), \tag{15} \]
where $a$, $b = 4\rho$, and the $c_k$ are computable constants, with values $a = 3.\cdots$, $b = 2.\cdots$, and $c_1$ approximately equal to $-\cdots$.

The function $F(z,u)$ can be determined numerically to great accuracy (by means of the recursion corresponding to the functional equation (9)). So, the value $\rho = 0.\cdots$ is obtained as the smallest positive root of (12); the constant $a$ then similarly results from an evaluation of $F'\left(\rho^2, \frac{1}{16}\right)$; the constant $c_1$, which could in principle be computed in the same manner, was, in our experiments, simply estimated from the values of $p_n$ for small $n$. The formula (15), truncated after its $c_1/n$ term, then appears to approximate $p_n$ with a relative accuracy better than $10^{-\cdots}$ for $n \ge \cdots$, $10^{-\cdots}$ for $n \ge 38$, and $10^{-\cdots}$ for $n \ge \cdots$.

4. Symmetrical Nodes and Automorphisms
In the course of our investigations on analytic properties of the bivariate generating function $F(z,u)$, we came up with a few additional estimates, which improve on those of McKeon [15]. In essence, what is at stake is a perturbative analysis of $F(z,u)$ and its associated singular expansions, for various values of $u$, in a way that refines the developments of the previous section. We offer here a succinct account: details can be easily supplemented by referring to Chapter IX of the book Analytic Combinatorics [9].
Theorem 2. (i) Let $X_n$ be the random variable representing the number of symmetrical nodes in a random Otter tree of $\mathcal{U}_n$. Then, $X_n$ satisfies a limit law of Gaussian type,
\[ \forall x \in \mathbb{R}: \quad \lim_{n \to \infty} \mathbb{P}\left(X_n \le \mu n + \sigma x \sqrt{n}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-w^2/2}\, dw, \]
for some positive constants $\mu$ and $\sigma$. Numerically, $\mu = 0.\cdots$.
(ii) Let $Y_n$ be the random variable representing the number of symmetrical nodes in a random phylogenetic tree of $\mathcal{B}_n$. Then, $Y_n$ satisfies a limit law of Gaussian type,
\[ \forall x \in \mathbb{R}: \quad \lim_{n \to \infty} \mathbb{P}\left(Y_n \le \widehat{\mu} n + \widehat{\sigma} x \sqrt{n}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-w^2/2}\, dw, \]
for some positive constants $\widehat{\mu}$ and $\widehat{\sigma}$. Numerically, $\widehat{\mu} = 0.\cdots$.

Figure 2. Histograms of the distribution of the number of symmetrical nodes in trees of size 100, compared to a matching Gaussian. Left: Otter trees of $\mathcal{U}_{100}$. Right: phylogenetic trees of $\mathcal{B}_{100}$.

Proof (Sketch). (i) The case of Otter trees ($X_n$, $\mathcal{U}_n$). In accordance with general principles [9, Ch. IX], we need to estimate the generating polynomial
\[ \varphi_n(u) := [z^n]\, F(z,u), \tag{16} \]
when $u$ is close to 1, with $F(z,u)$ as specified by (6) and (7). For $u$ in a small enough complex neighborhood $\Omega$ of 1, the radius of convergence of $F(z^2, u^2)$ is larger than some $\rho_2 > \rho_1$, where $\rho_1 \approx 0.\cdots$ […] $\rho(u)$ to the analytic equation
\[ 1 - 2\rho(u) - (2u - 1)\, F(\rho(u)^2, u^2) = 0 \tag{17} \]
(compare with (12)), such that $\rho(1) = \rho_1$ is the dominant singularity of the generating function $F(z,1)$ of Otter trees. By the analytic version of the implicit function theorem (equivalently, by the Weierstrass Preparation Theorem), this function $\rho(u)$ depends analytically on $u$, for $u$ near 1.

In addition, by (9), the function $F(z,u)$ has a singularity of the square-root type at $\rho(u)$. Also, for $u \in \Omega$ and $\Omega$ taken small enough, the triangle inequality combined with the previously established properties of $F(z,1)$ may be used to verify that there are no other singularities of $z \mapsto F(z,u)$ on $|z| = |\rho(u)|$. There results, from singularity analysis and the uniformity of the process [9, p. 668], the asymptotic estimate
\[ \varphi_n(u) = c(u)\, \rho(u)^{-n}\, n^{-3/2}\, (1 + o(1)), \qquad n \to +\infty, \tag{18} \]
uniformly with respect to $u \in \Omega$, for some $c(u)$ that is analytic at $u = 1$. Then, the probability generating function of $X_n$, which equals $\varphi_n(u)/\varphi_n(1)$, satisfies what is known as a “quasi-powers approximation”. That is, it resembles (analytically) the probability generating function of a sum of independent random variables,
\[ \frac{\varphi_n(u)}{\varphi_n(1)} = \frac{c(u)}{c(1)} \left(\frac{\rho(1)}{\rho(u)}\right)^n \left[1 + \varepsilon_n(u)\right], \tag{19} \]
where $\sup_{u \in \Omega} |\varepsilon_n(u)|$ tends to 0 as $n \to \infty$. The Quasi-powers Theorem (see [9, §IX.5] and [13]) precisely applies to such approximations by quasi-powers and implies that the distribution of $X_n$ is asymptotically normal.

(ii) The case of phylogenetic trees ($Y_n$, $\mathcal{B}_n$). The starting point is a simple combinatorial property of $\varphi_n(u)$, as defined in (16):
\[ \varphi_n(u/2) = \frac{1}{n!} \sum_{t \in \mathcal{U}_n} \frac{n!}{2^{\mathrm{sym}(t)}}\, u^{\mathrm{sym}(t)} = \frac{1}{n!} \sum_{t \in \mathcal{B}_n} u^{\mathrm{sym}(t)}. \tag{20} \]
(The first form results from the definition (6) of $F(z,u)$; the second form relies on the expression (4) of the number of different labellings of an Otter tree that give rise to a phylogenetic tree.) Thus, $\varphi_n$ taken with an argument near 1/2 […] $\mathcal{B}_n$.

From this point on, the analysis of symmetries in phylogenetic trees is entirely similar to that of Otter trees. For $u$ in a small complex neighborhood $\widehat{\Omega}$ of 1/2 […] $z \mapsto F(z,u)$ has a dominant singularity $\rho(u)$ that is an analytic solution of (17) and is such that $\rho(1/2) = 1/2$, the radius of convergence of $B(z) \equiv F(z, 1/2)$. […] $u \in \widehat{\Omega}$ now near 1/2. In particular,
\[ \frac{\varphi_n(u)}{\varphi_n(1/2)} = \frac{\widehat{c}(u)}{\widehat{c}(1/2)} \left(\frac{\widehat{\rho}(1/2)}{\widehat{\rho}(u)}\right)^n \left[1 + \widehat{\varepsilon}_n(u)\right], \tag{21} \]
where $\widehat{\varepsilon}_n(u) \to 0$ […] $u := v/2$, with $v$ near 1), the distribution of $Y_n$ is asymptotically normal. □

Figure 2 shows that the fit with a Gaussian is quite good, even for comparatively low sizes ($n = 100$). Phrased differently, the statement of Theorem 2 means that the logarithm of the order $2^{\mathrm{sym}(t)}$ of the automorphism group of a random tree $t$ (either in $\mathcal{U}_n$ or in $\mathcal{B}_n$) is normally distributed.¹ In the case of $\mathcal{U}_n$, the expectation of the cardinality of this group has been determined by McKeon [15] to grow roughly as $1.\cdots^n$. In the case of phylogenetic trees ($\mathcal{B}_n$), we find an expected growth of the rough form $1.\cdots^n$, where the exponential rate $1.\cdots$ is exactly $1/(2\rho_1)$, with $\rho_1$, still, the radius of convergence of $U(z) \equiv F(z,1)$. […] $\mathcal{B}_n$.)

As a matter of fact, the histograms of Figure 2 suggest that a convergence stronger than a plain convergence in law (corresponding to convergence of the distribution function) holds.

¹ The situation is loosely evocative of the fact (Erdős–Turán Theorem) that the logarithm of the order of a random permutation of size $n$ is normally distributed; see, e.g., [5, 11, 16].
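The exact distributions pictured in Figure 2 can be recomputed for moderate sizes. The sketch below, with hypothetical function names, counts Otter trees by size and number of symmetrical nodes using an integer form of the functional equation (7), then forms the law of $X_n$ (uniform on $\mathcal{U}_n$) and the law of $Y_n$ (each Otter tree weighted by $2^{-\mathrm{sym}(t)}$, in line with (4) and (20)), and returns their exact means and variances. It makes no claim about the numerical constants of Theorem 2, which are asymptotic quantities.

```python
from fractions import Fraction


def sym_counts(max_size):
    """counts[n][k] = number of Otter trees with n leaves and k symmetrical nodes."""
    counts = [None, {0: 1}]
    for n in range(2, max_size + 1):
        cur = {}
        # Root subtrees of different sizes k < n - k.
        for k in range(1, (n + 1) // 2):
            for i, a in counts[k].items():
                for j, b in counts[n - k].items():
                    cur[i + j] = cur.get(i + j, 0) + a * b
        if n % 2 == 0:
            half = counts[n // 2]
            pairs = {}
            for i, a in half.items():
                for j, b in half.items():
                    pairs[i + j] = pairs.get(i + j, 0) + a * b
            for i, a in half.items():
                pairs[2 * i] -= a                             # drop identical pairs
                cur[2 * i + 1] = cur.get(2 * i + 1, 0) + a    # one extra symmetry
            for s, c in pairs.items():
                cur[s] = cur.get(s, 0) + c // 2               # distinct pairs, counted twice
        counts.append({s: c for s, c in cur.items() if c})
    return counts


def law(counts_n, weight):
    """Probability law proportional to count * weight(k)."""
    total = sum(c * weight(k) for k, c in counts_n.items())
    return {k: c * weight(k) / total for k, c in counts_n.items()}


def otter_law(counts_n):
    """Law of X_n: uniform distribution on Otter trees."""
    return law(counts_n, lambda k: Fraction(1))


def phylo_law(counts_n):
    """Law of Y_n: an Otter tree t carries n!/2^sym(t) labelings."""
    return law(counts_n, lambda k: Fraction(1, 2 ** k))


def mean_and_variance(dist):
    mean = sum(k * p for k, p in dist.items())
    var = sum((k - mean) ** 2 * p for k, p in dist.items())
    return mean, var


if __name__ == "__main__":
    counts = sym_counts(60)
    for name, dist in (("Otter", otter_law(counts[60])),
                       ("phylogenetic", phylo_law(counts[60]))):
        mean, var = mean_and_variance(dist)
        print(name, float(mean) / 60, float(var) / 60)
```

Dividing the mean and variance by $n$ gives finite-size stand-ins for the constants of Theorem 2; they are expected to drift with $n$ and should not be read as the limiting values.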
Definition 1. Let $(\xi_n)$ be a family of random variables with expectation $\mu_n = \mathbb{E}(\xi_n)$ and variance $\sigma_n^2 = \mathbb{V}(\xi_n)$. It is said to satisfy a local limit law with density $g(x)$ if one has
\[ \lim_{n \to \infty}\, \sup_{x \in \mathbb{R}} \left| \sigma_n\, \mathbb{P}\left(\xi_n = \lfloor \mu_n + x \sigma_n \rfloor\right) - g(x) \right| = 0. \tag{22} \]
In other terms, we expect the probability of $\xi_n$ being at $x$ standard deviations away from its mean to be well approximated by $g(x)/\sigma_n$. This concept is discussed in the case of sums of random variables by Gnedenko and Kolmogorov in [10, Ch. 9] and, in a broader combinatorial context, by Bender [1] and Flajolet–Sedgewick [9, §IX.9].
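Definition 1 can be illustrated on a textbook case that is not taken from the paper: a Binomial$(n, 1/2)$ variable, for which $\mu_n = n/2$ and $\sigma_n = \sqrt{n}/2$. The sketch below evaluates the quantity inside the supremum of (22) at the lattice points $x = (k - \mu_n)/\sigma_n$ only, which is a simplification of the full supremum over real $x$, and compares with the standard Gaussian density. The function name is hypothetical.

```python
from math import comb, exp, pi, sqrt


def local_limit_gap(n):
    """max over k of | sigma_n * P(S_n = k) - g((k - mu_n) / sigma_n) |
    for S_n ~ Binomial(n, 1/2) and g the standard Gaussian density."""
    mu = n / 2
    sigma = sqrt(n) / 2
    gap = 0.0
    for k in range(n + 1):
        prob = comb(n, k) / 2 ** n          # exact integers, one float division
        x = (k - mu) / sigma
        density = exp(-x * x / 2) / sqrt(2 * pi)
        gap = max(gap, abs(sigma * prob - density))
    return gap


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, local_limit_gap(n))
```

The printed gaps shrink as $n$ grows, which is the behaviour that (22) formalizes.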
Theorem 3. The number of symmetrical nodes in either an unlabeled tree ($X_n$ on $\mathcal{U}_n$) or a phylogenetic tree ($Y_n$ on $\mathcal{B}_n$) satisfies a local limit law of the Gaussian type. That is, in the sense of Definition 1, a local limit law holds, with density
\[ g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]

Proof. (i) The unlabeled case ($X_n$, $\mathcal{U}_n$). The proof essentially boils down to establishing that
\[ f_n(u) = [z^n]\, F(z,u) \]
is small compared to $[z^n] F(z,1)$ when $u$ satisfies $|u| = 1$ and stays away from 1; then, Theorem IX.14, p. 696, from [9] does the rest. The arguments are variations of the ones previously used.

Since a tree of size $n$ has less than $n$ symmetrical nodes, we have $|f_n(u)| \le |u|^n f_n(1)$ for any $|u| \ge 1$. There results that the convergence of the series expansion of $F(z,u)$ is dominated by that of $F(|zu|, 1)$ for $|u| \ge 1$. Apply the fact explained in the previous sentence, with $z^2$ and $u^2$ instead of $z$ and $u$, to get that the coefficients of $F(z^2, u^2)$ are less than the coefficients of $F(|z^2 u^2|, 1)$ […] $|z^2 u^2| < 0.$[…] $|zu| < $ […]75, say. Now choose $\eta$ so that $(1 + \eta)(\rho_1 + \eta) < $ […]75, where $\rho_1$ is the radius of convergence of Otter trees ($\rho_1 \equiv \rho(1) \approx 0.\cdots$). […] $F(z^2, u^2)$ is bivariate analytic whenever $|z| < \rho_1 + \eta$ and $|u| < 1 + \eta$. In accordance with previously developed arguments, this implies that, for any fixed $u$ satisfying $|u| \le 1 + \eta$, the function $z \mapsto F(z,u)$ has only finitely many singularities, each of the square-root type, in $|z| \le \rho_1 + \eta$.

For $u$ in a small complex neighborhood of 1, we already know that $z \mapsto F(z,u)$ has only one dominant singularity at some $\rho(u)$, which is a root of
\[ 1 - 2\rho(u) - (2u - 1)\, F(\rho(u)^2, u^2) = 0. \]
(This property lies at the basis of the central limit law of the previous theorem.) Consider now a $u$ such that $|u| = 1$, but $u \notin \Omega$. We argue that $z \mapsto F(z,u)$ is analytic at all points $z$ such that $|z| = \rho_1$. Indeed for such values of $u$ and $z$, we have, by the strong triangle inequality,
\[ |F(z,u)| < F(\rho_1, 1), \tag{23} \]
the reason being that, in the expansion $F(z,u) = z + uz^2 + uz^3 + \cdots$, the values of the monomials $u^k z^n$ cannot be all collinear, unless $u = 1$. The inequality (23) combined with the fact that $F(\rho_1, 1) = 1$ implies that $z \mapsto F(z,u)$ cannot be singular (since, as we know, the only possibility for a singularity would be that it is of the square-root type and $F(z,u) = 1$).

Thus, for $|u| = 1$ and $u \notin \Omega$, the function $z \mapsto F(z,u)$ is analytic at all points of $|z| = \rho_1$. Hence, it is analytic in $|z| \le \rho_1 + \delta$, for some $\delta > 0$. By usual exponential bounds, there results that, for some $K > 0$, one has
\[ |f_n(u)| < K\, (\rho_1 + \delta/2)^{-n}, \qquad |u| = 1,\ u \notin \Omega. \tag{24} \]
As expressed by Theorem IX.14 of [9], the existence of a quasi-powers approximation (when $u$ is near 1), as in (18) and (19), and of the exponentially small bound (when $u \notin \Omega$ is away from 1), as provided by (24), suffices to ensure the existence of a local limit law.

(ii) The labeled case ($Y_n$, $\mathcal{B}_n$). In accordance with (20), the function $F(z, u/2)$ is the bivariate exponential generating function of phylogenetic trees, with $z$ marking size and $u$ marking the number of symmetrical nodes. Consider once more $|u| = 1$ and distinguish the two cases $u \in \widehat{\Omega}$ (for which the proof of Theorem 2 provides a quasi-powers approximation) and $u \notin \widehat{\Omega}$. In the latter case, arguments that entirely parallel those applied to unlabeled trees give us that $z \mapsto F(z, u/2)$ has no singularity on $|z| = 1/2$. This implies, for $u \notin \widehat{\Omega}$, the exponential smallness of $\widehat{\varphi}_n(u/$[…] □

5. Coincidence of the Number of Symmetries
From a statistician’s point of view, it may be of interest to determine the prob-ability for two trees to be “ similar ” (rather than plainly isomorphic), given somestructural similarity distance between non-plane trees—see, for instance, the workof Ycart and Van Cutsem [21] for a study conducted under probabilistic assump-tions that differ from ours. Combinatorial generating functions can still be usefulin this broad range of problems, as we now show by considering the following ques-tion: determine the probability that two randomly chosen trees τ, τ (cid:48) of the same sizehave the same number of symmetrical nodes . This probability a priori lies in theinterval [ n , n have the same number of cycles is asymptoticto (2 √ π log n ) − ; B´ona and Knopfmacher [2] examine combinatorially and asymp-totically the probability that various types of integer compositions have the samenumber of parts, and several other coincidence probabilities are studied in [7]. Thefollowing basic lemma trivializes the asymptotic side of several such questions. Lemma 6.
Let C be a combinatorial class equipped with an integer-valued para-meter χ . Assume that the random variable corresponding to χ restricted to C n (under the uniform distribution over C n ) satisfies a local limit law with density g ( x ) ,in the sense of Definition 1. Let the variance of χ on C n be σ n and assume that g ( x ) is continuously differentiable. Then, the probability that two objects c, c (cid:48) ∈ C n The reasoning corresponding to that theorem is simple: start from[ u k ] f n ( u ) = 12 iπ Z | u | =1 f n ( u ) duu k +1 . Use (24) to neglect the contribution corresponding to u (cid:54)∈ Ω; appeal to the saddle point methodapplied to the quasi-powers approximation to estimate the central part u ∈ Ω, and conclude.
ANDOM PHYLOGENETIC TREES 13 admit the same value of χ satisfies the asymptotic estimate (25) P (cid:20) χ ( c ) = χ ( c (cid:48) ) , c, c (cid:48) ∈ C n (cid:21) ∼ Kσ n , where K := (cid:90) ∞−∞ g ( x ) dx. Note that, for g ( x ) the standard Gaussian density, one has K = 1 / (2 √ π ). Proof (sketch).
Let (cid:36) n be the probability of coincidence; that is, the left hand-sideof (25). Observe that, by hypothesis, we must have σ n → ∞ . The baseline is that (cid:36) n = (cid:88) k P C n [ χ ( c ) = k ] ∼ σ n (cid:88) x ∈E n g ( x ) , with E n := 1 σ n ( Z ≥ − { µ n } ) , µ n := E C n [ χ ] ∼ σ n (cid:90) ∞−∞ g ( x ) dx. To justify this chain rigorously, first restrict attention to values of x in a finiteinterval [ − A, + B ], so that the tails ( (cid:82) B ) g are less than some small (cid:15) . Then,with x ∈ [ − A, + B ], make use of the approximation (22) provided by the assumptionof a local limit law. Next, approximate the sum of g ( x ) taken at regularly spacedsampling points (a Riemann sum) by the corresponding integral. Finally, completeback the tails. (cid:3) Given the local limit law expressed by Theorem 3, an immediate consequence ofLemma 6 is the following.
Theorem 4.
For Otter trees ($\mathcal{U}_n$) and phylogenetic trees ($\mathcal{B}_n$), the asymptotic probabilities that two trees of size $n$ have the same number of symmetries admit the forms
$$\mathcal{U}_n:\ \frac{1}{2\sigma\sqrt{\pi n}}, \qquad \mathcal{B}_n:\ \frac{1}{2\widehat{\sigma}\sqrt{\pi n}},$$
where $\sigma, \widehat{\sigma}$ are the two "variance constants" of Theorem 2.

In summary, as we see in several particular cases here, qualitatively similar phenomena are expected in trees, whether plane or non-plane trees, labelled or unlabelled, whereas, quantitatively, the structure constants (for instance, $\mu$ and $\widehat{\mu}$ in Theorem 2; $\sigma$ and $\widehat{\sigma}$ in Theorem 4) tend to be model-specific. Yet another instance of such universality phenomena is the height of Otter trees, analysed in [3], which is to be compared to the height of plane binary trees [8]: both scale to $\sqrt{n}$ and lead to the same elliptic-theta distribution, albeit with different scaling factors.

Acknowledgements.
The work of M. Bóna was partially supported by the National Science Foundation and the National Security Agency. The work of P. Flajolet was partly supported by the French ANR Project SADA ("Structures Discrètes et Algorithmes").
References

[1] Bender, E. A. Central and local limit theorems applied to asymptotic enumeration. Journal of Combinatorial Theory 15 (1973), 91–111.
[2] Bóna, M., and Knopfmacher, A. On the probability that certain compositions have the same number of parts. Annals of Combinatorics (2008). To appear, 19pp.
[3] Broutin, N., and Flajolet, P. The height of random binary unlabelled trees. In Proceedings of Fifth Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities (Blaubeuren, 2008), U. Rösler, Ed., vol. AI, pp. 121–134.
[4] Diestel, R. Graph Theory. No. 173 in Graduate Texts in Mathematics. Springer Verlag, 2000.
[5] Erdős, P., and Turán, P. On some problems of a statistical group theory III. Acta Math. Acad. Sci. Hungar. 18 (1967), 309–320.
[6] Finch, S. Mathematical Constants. Cambridge University Press, 2003.
[7] Flajolet, P., Fusy, E., Gourdon, X., Panario, D., and Pouyanne, N. A hybrid of Darboux's method and singularity analysis in combinatorial asymptotics. Electronic Journal of Combinatorics 13, 1:R103 (2006), 1–35.
[8] Flajolet, P., and Odlyzko, A. M. The average height of binary trees and other simple trees. Journal of Computer and System Sciences 25 (1982), 171–213.
[9] Flajolet, P., and Sedgewick, R. Analytic Combinatorics. Cambridge University Press, 2008. In press; 825 pages (ISBN-13: 9780521898065); also available electronically from the authors' home pages.
[10] Gnedenko, B. V., and Kolmogorov, A. N. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, 1968. Translated from the Russian original (1949).
[11] Goh, W. M. Y., and Schmutz, E. The expected order of a random permutation. Bulletin of the London Mathematical Society 23, 1 (1991), 34–42.
[12] Harary, F., and Palmer, E. M. Graphical Enumeration. Academic Press, 1973.
[13] Hwang, H.-K. On convergence rates in the central limit theorems for combinatorial structures. European Journal of Combinatorics 19, 3 (1998), 329–343.
[14] McKeon, K. A. The expected number of symmetries in locally-restricted trees I. In Graph Theory, Combinatorics, and Applications, Y. Alavi, Ed. Wiley, 1991, pp. 849–860.
[15] McKeon, K. A. The expected number of symmetries in locally restricted trees II. Discrete Applied Mathematics 66, 3 (1996), 245–253.
[16] Nicolas, J.-L. Distribution statistique de l'ordre d'un élément du groupe symétrique. Acta Math. Hung. 45, 1–2 (1985), 69–84.
[17] Otter, R. The number of trees. Annals of Mathematics 49, 3 (1948), 583–599.
[18] Pólya, G., and Read, R. C. Combinatorial Enumeration of Groups, Graphs and Chemical Compounds. Springer Verlag, 1987.
[19] Sloane, N. J. A. The On-Line Encyclopedia of Integer Sequences. 2008. Published electronically at .
[20] Stanley, R. P. Enumerative Combinatorics, vol. II. Cambridge University Press, 1999.
[21] Van Cutsem, B., and Ycart, B. Indexed dendrograms on random dissimilarities. Journal of Classification 15, 1 (1998), 93–127.
[22] Wilf, H. S. The variance of the Stirling cycle numbers. Tech. rep., ArXiv, 2005.

M. Bóna, Department of Mathematics, University of Florida, 358 Little Hall, PO Box 118105, Gainesville, FL 32611–8105 (USA)
P. Flajolet.