On the distribution of interspecies correlation for Markov models of character evolution on Yule trees
OOn the distribution of interspecies correlationfor Markov models of character evolution on Yule trees
Willem H. Mulder a , Forrest W. Crawford b a Department of Chemistry, The University of the West Indies, Mona Campus, Kingston 7, Jamaica, West Indies b Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
Abstract
Efforts to reconstruct phylogenetic trees and understand evolutionary processes depend fundamentally onstochastic models of speciation and mutation. The simplest continuous-time model for speciation in phylo-genetic trees is the Yule process, in which new species are “born” from existing lineages at a constant rate.Recent work has illuminated some of the structural properties of Yule trees, but it remains mostly unknownhow these properties affect sequence and trait patterns observed at the tips of the phylogenetic tree. Un-derstanding the interplay between speciation and mutation under simple models of evolution is essential forderiving valid phylogenetic inference methods and gives insight into the optimal design of phylogenetic stud-ies. In this work, we derive the probability distribution of interspecies covariance under Brownian motion andOrnstein-Uhlenbeck models of phenotypic change on a Yule tree. We compute the probability distributionof the number of mutations shared between two randomly chosen taxa in a Yule tree under discrete Markovmutation models. Our results suggest summary measures of phylogenetic information content, illuminatethe correlation between site patterns in sequences or traits of related organisms, and provide heuristics forexperimental design and reconstruction of phylogenetic trees.
Keywords: infinite sites, interspecies correlation, mutation model, phylogenetics, Yule process
1. Introduction
Simple stochastic models of speciation and trait evolution have proven useful for reconstruction of phylo-genetic trees describing the ancestral relationship between sets of taxa. The simplest continuous-time modelof speciation is the Yule process, in which each extant lineage gives birth at constant rate λ . A Yule tree is aphylogenetic tree in which the branching times of the tree are drawn from the Yule distribution. Despite theapparent simplicity of the Yule process, Yule trees have complex structural properties (Steel and McKenzie,2002; Rosenberg, 2006; Gernhard et al, 2008; Steel and Mooers, 2010; Mulder, 2011; Crawford and Suchard,2013). The Yule process is usually employed as a prior or null distribution on the space of phylogenetictrees within a broader scheme of phylogenetic reconstruction (Nee et al, 1994; Rannala and Yang, 1996;Nee, 2006). Researchers impose a model for the evolution of a character (trait, DNA, RNA, or amino acid Email addresses: [email protected] (Willem H. Mulder), [email protected] (Forrest W.Crawford)
Preprint submitted to Elsevier May 11, 2019 a r X i v : . [ q - b i o . P E ] A ug equence) on the branches of this phylogenetic tree. By jointly estimating the phylogenetic tree topology,branch lengths, and the parameters underlying the evolutionary model, researchers hope to understand theevolutionary history and process that gave rise to the observed data.Research on the interaction of tree topology, branch lengths, and evolutionary processes generally falls intoone of two categories. The first is the search for better measures of phylogenetic information for prospective experimental design. Most of these studies examine the probability of correctly reconstructing a simple treeor optimal design of phylogenetic studies (Yang, 1998; Sullivan et al, 1999; Shpak and Churchill, 2000; Zwickland Hillis, 2002; Susko et al, 2002). Several authors have attempted to determine whether it is better to addmore taxa or additional characters to maximize the chance of reconstructing the correct tree (Graybeal, 1998;Zwickl and Hillis, 2002). Steel and Penny (2000) analyze basic models of evolution to understand the theo-retical properties of stochastic models on phylogenetic trees. Fischer and Steel (2009) consider asymptoticsequence length bounds for correct reconstruction under maximum parsimony. Townsend (2007) introduces“phylogenetic informativeness”, the probability of observing site patterns allowing correct reconstructionof a four-taxon tree. Susko (2011) and Susko and Roger (2012) find expressions for correct reconstructionprobability for small internal edges on four-taxon trees. Real-world phylogenetic studies often involve largenumbers of taxa, and it remains controversial whether properties of mutation models on four-taxon treesgeneralize to trees with larger numbers of taxa (see e.g. Townsend, 2007; Klopfstein et al, 2010; Townsendand Leuenberger, 2011).The second class of approaches focuses on retrospective inferences about evolutionary parameters andthe derivation of estimators and confidence intervals. Following the work of Stadler (2009), who describessampling properties of birth-death trees and the distribution of the age of the most recent common ancestor(MRCA) of subsets of randomly chosen taxa, Bartoszek and Sagitov (2012) and Bartoszek (2013) findexpressions for the expectation of the interspecies correlation under models of continuous trait evolutionvia diffusion and Ornstein-Uhlenbeck processes. Bartoszek and Sagitov (2012) derive asymptotic confidenceintervals for ancestral trait values under these models. Crawford and Suchard (2013) give an estimator forthe evolutionary variance under Brownian motion for an unobserved Yule tree.In this paper we study the distribution of character values observed at the tips of a phylogenetic treegenerated by the Yule process. We first state two theorems that describe the distribution of the time ofshared ancestry between two randomly chosen taxa in a Yule tree of age τ with n taxa and speciation rate λ . Next we extend results presented by Bartoszek and Sagitov (2012) and Bartoszek (2013) to find the exactprobability distribution and covariance between pairs of randomly chosen tip values under Brownian motionand Ornstein-Uhlenbeck evolution of a continuous trait. These results give insight into the finite-time, finite- n dynamics of interspecies correlation. Next we examine discrete character evolution on Yule trees underPoisson and reversible Poisson mutation models. We suggest a new measure of phylogenetic information andgive a method for deciding whether it is better to add taxa or sites to a phylogenetic analysis.2 th nodek−1 nodes n−k−1 nodes l Time N u m be r o f T a x a t k Figure 1: The most recent common ancestor (MRCA) of two taxa at time x in a Yule tree and the corresponding Yule countingprocess. The Yule tree in the top panel has Y (0) = 2 and Y ( τ ) = n . The lineages connecting two randomly chosen taxa to theirMRCA at the k th node are shown as thick lines. The MRCA is shown as a circle. The bottom panel shows the correspondingYule process counting the number of lineages over time. The time of the birth of the k th lineage x is shown with a dashedvertical line.
2. Background
A Yule process Y ( t ) is a continuous-time Markov chain on the positive integers in which a jump fromstate n to n + 1 occurs with rate nλ . Define P Ymn ( t ) = Pr( Y ( t ) = n | Y (0) = m ) to be the transitionprobability from state m to n in time t . The Yule process obeys the forward Kolmogorov equationsd P Ymn ( t )d t = ( n − λP Ym,n − ( t ) − nλP Ymn ( t ) (1)for m ≥ n ≥ m . The transition probabilities are P Ymn ( t ) = (cid:18) n − m − (cid:19) e − λmt (1 − e − λt ) n − m (2)(Bailey, 1964). A Yule tree is a binary tree in which the number of extant lineages at time t is given by theYule process Y ( t ). If there are n extant lineages and a “birth” event occurs, one of the n lineages is chosenuniformly at random and split into two. In this paper, we assume that at the MRCA of all n taxa existedat time 0. We model t = 0 as the time of the first split, so Y (0) = 2, and both tree size (number of taxa) n and age τ are given. In what follows, we limit our attention to the ( n − n -forest, since our conclusions readily carry over to the n !( n − / n − leaf-labelled,ranked Yule trees of phylogenetic interest (Gernhard et al, 2008; Mulder, 2011).We now consider pairs of tips on a Yule tree whose MRCA is the k th birth event. We call these events“nodes” in the tree. The k th node is preceded chronologically by k − x since the first split. The k th node corresponds to the “crown age” of the sub-tree or clade below the node.3igure 1 shows an example in which the k th birth event, preceded by k − x . In continuous time each n -tree pattern of this type, with tree age τ and with the k th node appearing attime x , has the same probabilistic weight, and hence these trees can be dealt with on equal footing usingpurely combinatorial arguments. The following result gives the probability of two randomly chosen tips in aphylogenetic tree having their MRCA at the k th node. It was first derived by Stadler (2009). Theorem 1.
The probability of randomly choosing two tips in a tree of size n whose MRCA is the k th nodeis P ( n, k ) = 2( n + 1)( n − k + 1)( k + 2) (3) for n ≥ k + 1 (Stadler, 2009). Appendix A gives a simple alternative proof of this fact using recurrence relations.We now consider the time of shared ancestry of two randomly chosen taxa, the age of their MRCA.Theorem 1 provides the probability of choosing two tips whose MRCA is the k th node; here we seek thedistribution of the age x of this node. Lemma 1.
The probability density of the time x of the k th node of a Yule tree of age τ and size n is p ( x | k, n, τ, λ ) = δ ( x ) k = 1 λ ( n − (cid:18) n − k − (cid:19) e − ( k − λ ( τ − x ) (1 − e − λx ) k − (1 − e − λ ( τ − x ) ) n − k − (1 − e − λτ ) n − k ≥ for ≤ x ≤ τ , where δ ( x ) is the Dirac delta function. Appendix B provides a derivation.Now we study the age of the MRCA of two randomly chosen taxa without conditioning on the MRCAbeing the k th node in the tree. Finding the marginal distribution of x by summing P ( n, k ) over k withrespect to (4), we arrive at p ( x | n, τ, λ ) = n − (cid:88) k =1 P ( n, k ) p ( x | k, n, τ, λ ) . (5)where P ( n, k ) is given by Theorem 1 and p ( x | k, n, τ, λ ) is given by Lemma 1. The following Theorem givesa closed-form expression for this probability. Theorem 2.
The probability density of the age x of the MRCA of two randomly chosen taxa in a tree ofsize n and age τ with branching rate λ is p ( x | n, τ, λ ) = n + 13( n − δ ( x ) + 2 λ ( e λ ( τ − x ) − (1 − e − λτ ) n ( n − (1 − e − λx ) (1 − e − λ ( τ − x ) ) (cid:34) ( n ( n −
3) + 2) (cid:18) − e − λx e λ ( τ − x ) − (cid:19) − n −
2) 1 − e − λx e λ ( τ − x ) − − (cid:18) ( n + 1) 1 − e − λx e λ ( τ − x ) − (cid:19) (cid:18) − e − λ ( τ − x ) − e − λτ (cid:19) n − + 6 (cid:35) , (6)4 x t a V a l ue t − . . . X ( t ) X ( t ) b V a l ue t . . . . X ( t ) X ( t ) c t X ( y ) X ( y ) d Time V a l ue t X ( t ) X ( t ) e Time0 x t X ( t ) X ( t ) f Figure 2: Examples of Markov models on a two-taxon tree of age τ = 2 and splitting time x = 1 .
3. Panel (a) shows a Yule treefor two taxa with a splitting time at x ; (b) shows a Brownian motion model with µ = and σ = 10 − ; (c) Ornstein-Uhlenbeckwith θ = 1, σ = 10 − , and α = 0 . α = 4, (e) reversible Poisson with α = 4 and β = 2, and (f) binarywith α = 2 and β = 2. The correlation between the character values X ( τ ) and X ( τ ) is a function of their time of sharedancestry x . where δ ( x ) is the Dirac delta function. Appendix C gives a proof. To our knowledge, this is a new result; Stadler (2009) gives a similar derivationfor the distribution of the MRCA age of two randomly picked taxa in a Yule tree, but without conditioningon τ .
3. Markov models of evolutionary change
Now consider a Markov process X ( t ) on the branches of a Yule tree of age τ . Figure 2 shows examplesof Markov models X ( t ) that we will consider. In the panel (a), a Yule tree of age τ and size n = 2 taxais shown with a speciation event at time x . The process begins on the ancestral lineage and evolves overthe interval (0 , x ). Following the speciation event at time x , the character evolves independently in the twodaughter taxa, conditional on their shared ancestral trait value X ( x ). Next are the examples of the Markovevolutionary models X ( t ) that we study in this paper: Brownian motion (b), Ornstein-Uhlenbeck process (c),Poisson (d), Reversible Poisson (e), and Binary (f). Parameters used to generate these simulated trajectoriesare given in the caption of Figure 2.In each of the following sections, we will consider a statistic M , which is a function of the value of X i ( τ )and X j ( τ ) at two randomly chosen tips i and j , with i (cid:54) = j . For example, M could be the number ofshared mutations at the tips. Then the probability density of M (or mass function if M is discrete-valued)5 ( m | n, τ, λ ) is obtained by marginalizing over the time of shared ancestry, p ( m | n, τ, λ ) = n − (cid:88) k =1 P ( n, k ) (cid:90) τ p ( x | k, n, τ, λ ) p ( m | x, τ ) d x, (7)where p ( m | x, τ ) is the density (or mass function) of M , conditional on the time x of shared ancestry. Thefollowing sections describe specific continuous-time Markov models X ( t ) and statistics M that are of interestin phylogenetic reconstruction and evolutionary inference. Consider a continuous-valued Markov process X ( t ) with mean zero and variance σ obeying the stochasticdifferential equation d X ( t ) = σ d B ( t ), where B ( t ) is Brownian noise. The process evolves on the branches ofa Yule tree. On a single branch, X ( t ) has normally distributed increments X ( t ) − X ( s ) ∼ Normal (cid:0) , σ ( t − s ) (cid:1) for 0 ≤ s ≤ t . Given a tree topology T and branch lengths t = ( t , . . . , t n − ), the trait values at the tipsof the tree are distributed according to the multivariate random variable X ∼ Normal (cid:0) , σ C ( T , t ) (cid:1) wherethe evolutionary covariance matrix C ( T , t ) is an n × n matrix whose diagonal elements are equal to τ . Theoff-diagonal element C ij where i (cid:54) = j is the time of shared ancestry of taxa i and j , and so the matrix issymmetric. The covariance of the tip values in taxa i and j is proportional to their time of shared ancestry.This fact gives a natural way of seeing (6) as the marginal distribution of a randomly-chosen off-diagonalelement of C , as the following Corollary makes clear. Corollary 1.
For a Yule tree of size n and age τ , a randomly chosen off-diagonal element of the covariancematrix C has probability density given by (6) in Theorem 2. Sagitov and Bartoszek (2012) study the “interspecies correlation coefficient” ρ n = 1 (cid:0) n (cid:1) Var (cid:0) X ( τ ) (cid:1) (cid:88) j
Suppose Poisson mutations with rate α fall on a Yule tree of size n , age τ , and branching rate λ . Let M be the number of mutation events on the tree. Then M ∼ Poisson (cid:18) α (cid:90) τ Y ( x ) d x (cid:19) (19) and the probability mass function of M is given by Pr( M = m ) = α m λ n − m !( n − P Y n ( τ ) n (cid:88) j =2 (cid:18) n − j − (cid:19) ( j − − j n − (cid:88) (cid:96) =0 (cid:0) n − (cid:96) (cid:1) ( − jτ ) n − (cid:96) − ( α + λ ) m + (cid:96) × (cid:2) γ (cid:0) m + (cid:96) + 1 , nτ ( α + λ ) (cid:1) − γ (cid:0) m + (cid:96) + 1 , jτ ( α + λ ) (cid:1)(cid:3) (20) where γ ( · , · ) is the lower incomplete gamma function and P Y n ( τ ) is the Yule transition probability. A proof is given in Appendix D. The quantity M is also known as the “number of segregating sites” on thetree.Now let M be the number of Poisson mutations shared by two randomly chosen taxa in a Yule tree ofage τ and size n . The probability of the two taxa sharing m mutations is the probability that m mutationsaccumulated during the time of their shared ancestry. If the time of shared ancestry is x , then the numberof mutations occurring during this time is Poisson with rate αx . Then marginalizing over x , we recover thedistribution of the number of shared mutations between two randomly chosen taxa. When M = 0,Pr( M = 0 | n, τ, λ, α ) = n + 13( n −
1) + 2( n + 1)( n − n − − e − λτ ) n − n − (cid:88) k =2 (cid:18) n − k − (cid:19) e − ( k − λτ ( k + 1)( k + 2) × k − (cid:88) j =0 (cid:18) k − j (cid:19) n − k − (cid:88) i =0 (cid:18) n − k − i (cid:19) ( − i + j e − iλτ γ (cid:0) m + 1 , ( α + λ ( j − i − k + 1)) τ (cid:1) ( α + λ ( j − i − k + 1)) m +1 (21)9nd when M ≥
1, we havePr( M = m | n, τ, λ, α ) = 2 λ ( n + 1)( n − α m m !( n − − e − λτ ) n − n − (cid:88) k =2 (cid:18) n − k − (cid:19) e − ( k − λτ ( k + 1)( k + 2) × k − (cid:88) j =0 (cid:18) k − j (cid:19) (cid:88) i =1 (cid:18) n − k − i (cid:19) ( − i + j e − iλτ γ (cid:0) m + 1 , ( α + λ ( j − i − k + 1)) τ (cid:1) ( α + λ ( j − i − k + 1)) m +1 (22)Derivations of (21) and (22) are given in Appendix E. In this model, unique distinguishable mutations occur as a Poisson process with constant rate α , but thechanges are reversible with rate β per mutation. When there are j mutations in a lineage, the rate of lossis jβ . This corresponds to reversion of disadvantageous changes in an evolutionary context. On a singlebranch, the number of mutations added and removed is modeled by the M/M/ ∞ queue, also known as theimmigration-death process. The process has forward equationd P ij ( t )d t = αP i,j − ( t ) + ( j + 1) βP i,j +1 ( t ) − ( α + jβ ) P ij ( t ) . (23)Let X ( t ) be the number of mutations at time x , given that there were none at time 0. Then X ( t ) has Poissondistribution with mean ( α/β )(1 − e − βt ).Suppose now that k mutations exist at time 0 and we monitor only on the loss of those original j mutations. We are interested in the probability of losing j − m mutations, so that m remain at time t . Theforward equations become d P jm ( t )d t = ( m + 1) βP j,m +1 ( t ) − mβP jm ( t ) (24)for m ≤ j . On a single branch of length t , the number of surviving mutations m is Binomially distributed withprobability e − βt . The number of mutations M shared between two randomly chosen taxa has probabilitymass functionPr( M = m | x, τ, α, β ) = exp (cid:104) − αβ (1 − e − βx ) e − β ( τ − x ) (cid:105) (cid:104) αβ (1 − e − βx ) e − β ( τ − x ) (cid:105) m m ! . (25)Therefore when the time of shared ancestry is x and tree age is τ , M has Poisson distribution. A proof of(25) is given in Appendix F. Then the distribution of M is given byPr( M = m | n, τ, λ, α, β ) = n − (cid:88) k =1 P ( n, k ) (cid:90) τ p ( x | k, n, τ, λ ) Pr( M = m | x, τ, α, β ) d x (26)which does not seem to have a simple closed-form expression. It can be easily evaluated by numerical10ntegration. We now study a two-state process X ( t ) occurring independently at N independent sites in a DNAsequence evolution model. This model is similar to the Reversible Poisson model, but for a finite number ofsites. Call the two states in the process 0 and 1. Transitions from 0 to 1 occur with rate α and from 1 to0 with rate β . The meaning of α and β here is different than in Section 3.4. In this context, α and β areper-site mutation rates. The transition rate matrix is Q = − α αβ − β . (27)The system evolves according to the matrix differential equation d P B d t = P B ( t ) Q . With initial condition P B (0) = I , the transition probability matrix is given by P B ( t ) = e Qt , where the elements of P B ( t ) are P B ( t ) = β + αe − ( α + β ) t α + β P B ( t ) = αα + β (cid:16) − e − ( α + β ) t (cid:17) P B ( t ) = βα + β (cid:16) − e − ( α + β ) t (cid:17) P B ( t ) = α + βe − ( α + β ) t α + β . (28)We are interested in the probability that K of N sites share the same state, conditional on the statebeing inherited from the MRCA. That is, we wish to exclude situations in which the taxa show a match-ing site pattern, but the matching states are not directly descended from the same ancestral state. Sitepatterns having this property are called “identical by descent”. To illustrate, consider the statistic M k = { X i,k ( τ ) = X j,k ( τ ) , ibd } which is 1 when site k in taxon i and site k in taxon j share the same ancestralallele identically by descent, and zero otherwise. ThenPr( M k = 1 | x, ibd) = P B ( x ) e − α ( τ − x ) + P B ( x ) e − β ( τ − x ) (29)where P B ( x ) and P B ( x ) are given in (28).Letting M = (cid:80) Nk =1 M k , the probability that K of N sites match IBD is binomial,Pr( M = K | x, ibd) = (cid:18) NK (cid:19) Pr( M k = 1 | x, ibd) K Pr( M k = 0 | x, ibd) N − K . (30)Marginalizing over the time x of shared ancestry,Pr( M = K | ibd) = n − (cid:88) k =1 P ( n, k ) (cid:90) τ p ( x | τ, n, λ ) Pr( M = K | x, ibd) d x. (31)This model will be useful in the next section, where we consider phylogenetic experimental design.11 Taxa n S i t e s N a =0.01
10 20 30 40 50
Taxa n S i t e s N a =0.03
10 20 30 40 50
Taxa n S i t e s N a =0.05
10 20 30 40 50
Taxa n S i t e s N a =0.1 Figure 4: Comparison of (cid:102)
P I as a function of the number of taxa n , number of sites N , and mutation rate α . Whether it isbetter to add taxa or sites to a phylogenetic analysis depends on the mutation rate α , and on the number of taxa and sitesalready present. Greater incremental improvements in (cid:102) P I per additional taxon or site can be obtained when the mutation rateis larger.
4. Applications
Townsend (2007) defines phylogenetic informativeness (PI) as the probability that a character evolvingaccording to a Poisson process mutates at least once on a short internal branch of an unrooted four-taxon tree,but does not change again on the terminal branches (See Townsend, 2007; Townsend and Leuenberger, 2011,for details). The evolutionary rate that maximizes this probability is argued to be optimal for reconstructingthe four-taxon tree. We suggest an extension of this concept to rooted trees with n tips and fixed age τ asfollows. Suppose we choose two taxa i and j , i (cid:54) = j at random from the tips of the tree.Let M be the number of sites (from N total sites) having at least one mutation shared identically bydescent by the chosen taxa under the binary character model in Section 3.5. This condition corresponds tothe event that one or more mutations accumulate on the ancestral branch and none on the two branchesleading to randomly chosen taxa i and j . We are interested in the probability of finding at least one sitewith at least one change. From (30) we have the probability that at least one of N sites has a mutationshared identically by descent,Pr( M > | x, α, β, ibd) = 1 − (cid:16) − P B ( x ) e − α ( τ − x ) − P B ( x ) e − β ( τ − x ) (cid:17) N . (32)12ow define a measure of phylogenetic informativeness as (cid:102) PI = n − (cid:88) k =1 P ( n, k ) (cid:90) τ p ( x | k, n, τ, λ ) Pr( M > | x, α, β, ibd) d x. (33)An important issue in phylogenetic experimental design is whether to add more taxa or more independentsites to an analysis in order to maximize the chance of accurate tree reconstruction (Sullivan et al, 1999;Shpak and Churchill, 2000; Zwickl and Hillis, 2002; Susko et al, 2002). Our binary mutation model gives oneway of answering this question: we seek changes that occur on the ancestral branch and persist identicallyby descent on both daughter branches. Figure 4 shows (cid:102) PI, given by (33), as a function of the number of sites N and the number of taxa n , for different values of the mutation rate α = β . We set τ = 1 and λ = 1 inevery case. It is clear from these plots that there is no single answer to the question of whether to add onemore site or one more taxon to achieve a unit increase in (cid:102) P I . When the number of taxa n is small, addingmore taxa is best. When the number of sites N is small, adding more sites is best. For larger values of themutation rate α , a greater increase in (cid:102) P I is gained for each taxon or site added. In cases where there is asignificant cost (in money or researcher effort) to obtain an additional sequence sites or taxa, calculation of (cid:102)
P I can help researchers decide whether the additional cost is worth the gain in informativeness.
It is possible to invert mutation probability expressions to uncover properties of the MRCA of tworandomly chosen taxa, conditional on the value of their mutation statistic M = m . In the reversible Poissonmodel, the probability that the MRCA of two randomly chosen taxa is the k th node in the tree is p ( k | τ, n, m, λ, α, β ) = P ( n, k ) (cid:90) τ p ( x | k, n, τ, λ ) Pr( M = m | x, τ, α, β ) d x n − (cid:88) j =1 P ( n, j ) (cid:90) τ p ( x | j, n, τ, λ ) Pr( M = m | x, τ, α, β ) d x . (34)In a similar way, we can find the conditional distribution of the MRCA age x , p ( x | τ, n, m, λ, α, β ) = n − (cid:88) k =1 P ( n, k ) p ( x | k, n, τ, λ ) Pr( M = m | x, τ, α, β ) n − (cid:88) k =1 P ( n, k ) (cid:90) τ p ( u | k, n, τ, λ ) Pr( M = m | u, τ, α, β ) d u . (35)Given the value of a statistic M which is a function of the trait or character values of two taxa, a roughestimate of the age of their MRCA can be obtained by maximizing (35) with respect to x . While this isnot a method for tree reconstruction, it may prove useful in settings where only a subset of tip values areobserved. If this subset can be regarded as randomly selected, then measures of evolutionary correlation canstill be computed, even in the absence of a full phylogenetic tree.13 . Discussion In this work, we have derived probability distributions for several quantities that give insight into thedynamics of evolutionary processes on unobserved phylogenetic trees. This is achieved via comparisonof evolutionary outcomes for pairs of species. For continuous trait evolution, we study the evolutionarycorrelation under Brownian motion and OU processes. Equation (9) provides a natural generalization of theinterspecies correlation coefficient ρ n introduced by Sagitov and Bartoszek (2012). Poisson mutation modelswith and without reversals provide the distribution of a convenient summary statistic in discrete models ofcharacter evolution: the number of changes on an ancestral branch. The Poisson models also give a naturaldistribution for the number of segregating sites in an infinite sites model on a tree of age τ and size n . Thedistribution of the number of Poisson mutations on the whole tree presented in Corollary 2 can also be usedto place a posterior distribution on the age τ of the whole tree. That is, we can estimate τ , given M observeddifferences at the tips.In the applications presented in Section 4, we propose contributions to both prospective and retrospectiveanalysis of evolutionary processes on trees. First, we extend the notion of phylogenetic informativeness (PI)to trees with n tips and suggest that the choice of whether to add taxa or sites depends on both the number oftaxa and mutation rate. Second, we show that the conditional probability of the MRCA age and location inthe tree can be expressed conditional on an observable statistic describing the number of pairwise differencesobserved.A great deal of information about speciation and evolutionary process can be gleaned from pairwisecomparisons of taxa. The simple Yule process provides a parsimonious description of speciation, and makesexplicit the correlation induced by the tree in evolutionary outcomes at the tips. We hope that the resultspresented in this paper will aid in understanding this correlation and development of inferential techniquesfor comparative evolutionary analysis. Acknowledgements
We thank Krzysztof Bartoszek for insightful comments and suggestions, and Jeffrey Townsend for helpfulconversations about phylogenetic informativeness.
Appendix A. Proof of Theorem 1
This result is due to Stadler (2009). Here we present a simple alternative proof based on a recurrencerelation. We provide this derivation because the techniques it employs may be useful in novel derivations ofrelated combinatorial properties of Yule trees, such as those studied by Mulder (2011).Let ν n ( k ) denote the number of pairs of tips in an n -forest that have the k th node of an n -tree as theirMRCA. We imagine the n -forest to be generated from the ( n − n − n -tree formed in this manner so that all of them are different.Suppose n ≥ k + 1. Then ν n ( k ) follows from ν n − ( k ) via induction, based on two observations. First,any pair of the type considered becomes one in the n -forest each time a new tip is attached to one of the n − ν n ( k ) = ( n − ν n − ( k ) + 4 ν n − ( k ) = ( n + 1) ν n − ( k ) . (A.1)The smallest value of n for which a pair with its MRCA being the k th node can occur is k + 1, and a ( k + 1)-forest contains ν k +1 ( k ) = k ! such pairs (cherries), i.e. exactly one on each tree in the forest. With (A.1), weobtain ν k +2 ( k ) = ( k + 3) k !, ν k +3 ( k ) = ( k + 4)( k + 3) k !, etc. In general, ν n ( k ) = ( n + 1)! / ( k + 1)( k + 2). Everychoice of picking any pair of tips is equally likely among all n -trees whose k th node appears between x and x + d x , and since the total number of pairs in the n -forest equals n !( n − /
2, the probability of choosingone with the desired property is P ( n, k ) = 2( n + 1)( n − k + 1)( k + 2) (A.2)for n ≥ k + 1, as claimed. Appendix B. Proof of Lemma 1
By definition, the first node in the tree has age zero: p ( x | k = 1 , τ, n, λ ) = δ ( x ), where δ ( · ) is the Diracdelta function. For k ≥
2, the k th node appears at the moment k lineages become k + 1. Suppose the k thnode appears at time x + d x . We must first construct a tree beginning with 2 lineages at time 0 having k lineages at time x , which happens with probability P Y k ( x ) = ( k − e − λx (1 − e − λx ) k − . Then at time x + d x , the k th node is appears with density kλe − kλ d x . Now there are k + 1 lineages, and the tree musthave n lineages at time τ , which happens with probability P Yk +1 ,n ( τ − x ) = (cid:18) n − k (cid:19) e − ( k +1) λ ( τ − x − d x ) (1 − e − λ ( τ − x − d x ) ) n − k − (B.1)Finally, since we are conditioning on the full tree having n lineages at time τ , the distribution must benormalised by P Y n ( τ ) = ( n − e − λτ (1 − e − λτ ) n − . Putting these expressions together and sending d x tozero, we find that p ( x | k, τ, n, λ ) = P Y k ( x ) kλ P Yk +1 ,n ( τ − x ) /P Y n ( τ )= λ ( n − (cid:0) n − k − (cid:1) e − ( k − λ ( τ − x ) (1 − e − λx ) k − (1 − e − λ ( τ − x ) ) n − k − (1 − e − λτ ) n − (B.2)15or k ≥
2, as claimed.
Appendix C. Proof of Theorem 2
With the abbreviation a = (1 − e − λx ) / ( e λ ( τ − x ) − p ( x | n, τ, λ ) = n + 13( n − δ ( x )+ λ n + 1)( n − e λ ( τ − x ) − n − − e − λx ) (cid:18) − e − λ ( τ − x ) − e − λτ (cid:19) n − n − (cid:88) k =2 (cid:18) n − k − (cid:19) a k ( k + 2)( k + 1) . (C.1)The sum can be evaluated via elementary manipulations to yield n − (cid:88) k =2 (cid:18) n − k − (cid:19) a k ( k + 2)( k + 1) = (cid:0) ( n ( n −
3) + 2) a − n − a + 6 (cid:1) (1 + a ) n − − (cid:0) ( n + 1) a + 3 (cid:1) a ( n + 1) n ( n − n − . (C.2)Substitution for a yields (6), as claimed. Appendix D. Proof of Corollary 2
Let R τ = (cid:82) τ Y ( x ) d s be the integral of the Yule process with rate λ , beginning with Y (0) = 2 and endingwith Y ( τ ) = n . Then R τ has density function p ( x ) = λ n − e − λx ( n − P Y n ( τ ) n (cid:88) j =2 (cid:18) n − j − (cid:19) ( j − − j − ( x − jτ ) n − H ( x − jτ ) (D.1)where P Y n ( τ ) is the Yule transition probability (2) and H ( x ) is the Heaviside step function (Crawford andSuchard, 2013). The minimum branch length that can accrue on the interval (0 , τ ) is 2 τ and the maximumis nτ . If the Yule tree has total branch length x , then the number of mutations that occur with rate α on thetree has Poisson distribution with rate αx ; its mass function is Pr( M ( x ) = m | α ) = ( αx ) m e − αx /m !. Thenintegrating this function with respect to p ( x ), we havePr( M ( τ ) = m | n, τ, λ, α ) = (cid:90) ∞ Pr( M ( x ) = m | α ) p ( x ) d x = (cid:90) nτ τ ( αx ) m e − αx m ! λ n − e − λx ( n − n (cid:88) j =1 (cid:18) n − j − (cid:19) ( − j − ( x − jτ ) n − H ( x − jτ ) d x = α m λ n − m !( n − n (cid:88) j =1 (cid:18) n − j − (cid:19) n − (cid:88) k =0 (cid:18) n − k (cid:19) ( − n − k + j − ( jτ ) n − k − (cid:90) nτjτ x k + m e − ( α + λ ) x d x (D.2)Re-writing the integral as the difference of lower incomplete Gamma functions delivers the desired expression.16 ppendix E. Proofs of (21) and (22)We seek an expression for the probability that m mutations accumulate on the ancestral branch of tworandomly chosen taxa,Pr( M = m | n, τ, λ, α ) = n − (cid:88) k =2 P ( n, k ) (cid:90) τ p ( x | k, n, τ, λ ) ( αx ) m e − αx m ! d x. (E.1)Suppose for now that M ≥
1. Then the Poisson mutation probability is zero when the MRCA node k = 1,and the first term in the sum drops out. Marginalizing over x and k , we findPr( M = m | n, τ, λ, α ) = 2 λ ( n + 1)( n − α m m !( n − − e − λτ ) n − n − (cid:88) k =2 (cid:18) n − k − (cid:19) e − ( k − λτ ( k + 1)( k + 2) × (cid:90) τ e ( λ ( k − − α ) x x m (1 − e − λx ) k − (1 − e − λ ( τ − x ) ) n − k − d x = 2 λ ( n + 1)( n − α m m !( n − − e − λτ ) n − n − (cid:88) k =2 (cid:18) n − k − (cid:19) e − ( k − λτ ( k + 1)( k + 2) × (cid:90) τ e ( λ ( k − − α ) x x m k − (cid:88) j =0 (cid:18) k − j (cid:19) ( − j e − λjx (cid:34) n − k − (cid:88) i =0 (cid:18) n − k − i (cid:19) ( − i e − λi ( τ − x ) (cid:35) d x = 2 λ ( n + 1)( n − α m m !( n − − e − λτ ) n − n − (cid:88) k =2 (cid:18) n − k − (cid:19) e − λ ( k − τ ( k + 1)( k + 2) × k − (cid:88) j =0 (cid:18) k − j (cid:19) n − k − (cid:88) i =0 (cid:18) n − k − i (cid:19) ( − i + j e − iλτ (cid:90) τ x m exp [ − x ( α + λ ( j − i − k + 1))] d x. (E.2)Writing the integral as an incomplete gamma function gives the result for M ≥
1. Now consider the casethat M = 0. When the MRCA of the two chosen taxa is k = 1, the time of shared ancestry is x = 0, and nomutations can accumulate. ThereforePr( M = 0 | n, τ, λ, α ) = n + 13( n −
1) + n − (cid:88) k =2 P ( n, k ) (cid:90) τ p ( x | k, n, τ, λ ) e − αx d x = n + 13( n −