The Geometry of the space of Discrete Coalescent Trees
Lena Collienne, Kieran Elmes, Mareike Fischer, David Bryant, Alex Gavryushkin
aa r X i v : . [ q - b i o . P E ] J a n THE GEOMETRY OF THE SPACE OF DISCRETE COALESCENT TREES
LENA COLLIENNE , KIERAN ELMES , MAREIKE FISCHER , DAVID BRYANT ,AND ALEX GAVRYUSHKIN (cid:0) Abstract.
Computational inference of dated evolutionary histories relies upon various hypothesesabout RNA, DNA, and protein sequence mutation rates. Using mutation rates to infer these datedhistories is referred to as molecular clock assumption. Coalescent theory is a popular class ofevolutionary models that implements the molecular clock hypothesis to facilitate computationalinference of dated phylogenies. Cancer and virus evolution are two areas where these methods areparticularly important.Methodologically, phylogenetic inference methods require a tree space over which the inferenceis performed, and geometry of this space plays an important role in statistical and computationalaspects of tree inference algorithms. It has recently been shown that molecular clock, and hencecoalescent, trees possess a unique geometry, different from that of classical phylogenetic tree spaceswhich do not model mutation rates.Here we introduce and study a space of discrete coalescent trees, that is, we assume that timeis discrete, which is inevitable in many computational formalisations. We establish several geo-metrical properties of the space and show how these properties impact various algorithms used inphylogenetic analyses. Our tree space is a discretisation of a known time tree space, called t -space,and hence our results can be used to approximate solutions to various open problems in t -space.Our tree space is also a generalisation of another known trees space, called the ranked nearestneighbour interchange space, hence our advances in this paper imply new and generalise existingresults about ranked trees. Introduction
A commonly used hypothesis in various applications in evolutionary biology is the molecularclock. For example, a strict molecular clock is the assumption that the mutation rate of a gene isapproximately constant over time. After this phenomenon had first been observed by Zuckerkandland Pauling [26], the molecular clock became a popular hypothesis, and various relaxations weredeveloped [16]. A popular framework for reconstructing and analysing timed evolutionary (species)trees [12] that uses the molecular clock assumption on gene trees is coalescent theory. For exam-ple, coalescent is widely employed for inferring relationships of a sample of genes [10, 15], or foranalysing population dynamics [7, 14]. A recent striking application of coalescent theory is can-cer phylogenetics [20, 21], where accurate estimates of divergence times are essential for targetedtreatment strategies. Under a coalescent model evolution is considered backwards in time, and twolineages coalesce after a waiting time, which is to be estimated.In phylogenetic trees, which display evolutionary relationships, internal nodes can hence beequipped with times, when assuming a molecular clock. Software packages for reconstructing thosetrees from data such as RNA, DNA, or protein sequences rely on a parameterisation of trees whereinternal nodes are equipped with times. Popular tree inference software used for this purposeare based on Maximum Likelihood [13, 19, 25] or Bayesian methods [2, 22, 24]. They rely on tree Department of Computer Science, University of Otago, New Zealand Institute of Mathematics and Computer Science, University of Greifswald, Germany Department of Mathematics and Statistics, University of Otago, New Zealand
E-mail addresses : [email protected], [email protected], [email protected],[email protected], (cid:0) [email protected] . Date : January 11, 2021. search algorithms, where in every step a new tree is proposed and accepted if the proposed tree fulfilscertain requirements. For tree proposals under the molecular clock assumption a parameterisationof trees taking the times of internal nodes into account is required. Furthermore, a similaritymeasure for these trees is necessary, to propose trees that are measurably similar to the runningtree.Tree spaces that take branch lengths of trees into account exist in the literature. For example,the BHV-space [1] models trees as points in a cubical complex. However, this parameterisationis not suitable for coalescent trees because changing the times of an evolutionary event in thetree implies that all preceding events change their times as well. Hence two trees can be closeto each other in this space even though the timing of many internal nodes is different in the twotrees. Examples of more suitable tree spaces where internal nodes of trees are equipped with timesare t -space and τ -space [8]. It has been observed, however, that in the τ -space, similarly to theBHV-space, shortest paths between trees often contain the star tree [8], a property that can beproblematic in applications. Although the t -space is free from these properties, no algorithm forcomputing distances or shortest paths between trees in this space is known yet, so applications arelimited.Enabling statistical analysis over the space of phylogenetic trees was an important motivation forBillera, Holmes, and Vogtmann [1] to introduce the BHV-space and study its geometric properties.Tree space geometry has also played an important role in studies of rogue taxa in a tree [5] and alsosummary trees [18]. Here, driven by the same motivation, we propose to study coalescent trees.In this paper we introduce the space DCT m of discrete coalescent trees, where internal nodesare assigned unique discrete times. This tree space is a discrete version of the t -space. DCT m isalso a generalisation of the ranked nearest neighbour interchange (RNNI) space [4]. Here we showthat the space DCT m as well as RNNI have the desired properties mentioned above, includingefficiently computable shortest paths that preserve biological information shared between trees.After introducing notations used throughout this paper (Section 2), we discuss how the algorithm FindPath [4] can be generalise from RNNI to be applied to discrete coalescent trees, computingshortest paths in polynomial time (Section 3). We then analyse some geometrical properties of bothtree spaces DCT m and RNNI (Section 4) – first, we discuss the cluster property in Section 4.1 andthen consider a subset of trees (caterpillar trees) for which we are able to compute RNNI distancesmore efficiently than with FindPath (Section 4.2). Following that, we establish the diameter ofDCT m and RNNI and briefly discuss the radius for each space. We finish this paper with a sectionproviding a connection between the RNNI space and partition lattices, and propose directions forfurther research (Section 5). 2. Technical Introduction A rooted binary phylogenetic tree is a binary tree with n leaves uniquely labelled by elements of aset { a , . . . , a n } . The main object of study in this paper are discrete coalescent trees , binary rootedphylogenetic trees with a positive integer-valued time assigned to each node. More specifically, all n leaves a , . . . , a n are assigned time 0, and every internal node is assigned a unique time less orequal to an integer m , such that it always has time greater than its children. Note that this implies m ≥ n −
1. We denote the time of an internal node v by time( v ). If not stated otherwise, we referto discrete coalescent trees simply as trees . We furthermore call two trees (not necessarily binary) identical if there is a graph isomorphism between them preserving leaf labels and times.As a special case of discrete coalescent trees we consider ranked trees with root time n −
1. Inthese trees internal nodes have distinct times ranging from 1 to n −
1. This definition of rankedtrees coincides with the one of Collienne and Gavryushkin [4]. In the case of ranked trees we say rank of a node v to mean its time (rank( v ) = time( v )) to be consistent with notations used in[4]. There are ( n − n !2 n − ranked trees [23]. Every ranked tree gives (cid:0) mn − (cid:1) discrete coalescent trees, as EOMETRY OF RANKED TREE SPACES 3 a a a a a T Figure 1.
Discrete coalescent tree with n = 5 leaves and root height m = 6. Thehighlighted node with time three can be referred to as ( a ) T , ( { a , a } ) T , and thecluster induced by this node is ( T ) .every ( n − { , . . . , m } can be the set of times assigned to the internal nodes ofa ranked tree. Hence there are, contrary to the claim in [9], ( n − n !2 n − (cid:0) mn − (cid:1) discrete coalescent trees.Every internal node v of a tree T can be referred to by the set C of leaves that are descendingfrom this node. We call such a set C cluster and say that the cluster C is induced by v . A listof clusters [ C , . . . , C n − ] determines at most one ranked tree [4], where cluster C i is induced bythe internal node with rank i for i ∈ { , . . . , n − } . For discrete coalescent trees however, timesof nodes also need to be provided to uniquely identify a tree. For a subset S ⊆ { a , . . . , a n } we call the internal node of a tree T with lowest time among those ancestral to all elements of S the most recent common ancestor of S and denote it by ( S ) T . We furthermore denote theparent of a leaf a i in T by ( a i ) T , and the cluster induced by the node with time i in T by ( T ) i .The node highlighted in Figure 1 for example can be referred to as ( a ) T , the parent of a , or( { a , a } ) T , the most common ancestor of { a , a } , or ( T ) , the node with time three in T . Notethat we will simply write rank( a i ) T or time( a i ) T to mean rank(( a i ) T ) or time(( a i ) T ), respectively.Although differing from traditional notations, our notation with brackets referring to internal nodesis intuitive, shortens nested formulas, and is consistent with notations used in [4]. A type of treesthat will be of importance throughout the whole paper are caterpillar trees , which are trees whereevery internal nodes has at least one child that is a leaf.We are now ready to introduce the central object of study of this paper, the graph (or space) ofdiscrete coalescent trees. This graph is called DCT m for a fixed positive integer m . The vertex setof DCT m is the set of trees with root time less or equal to m . Note that a second parameter ofDCT m is the number of leaves n of the trees in the graph, which we assume to be fixed throughoutthis paper. Trees T and R are connected by an edge ( T and R are neighbours ) in this graph ifperforming one of the following (reversible) operations on T results in R (Figure 2):(1) An NNI move connects trees T and R if there is an edge e in T and an edge f in R , both of lengthone, such that shrinking e and f to nodes results in identical trees.(2) A rank move on T exchanges the times of two internal nodes with time difference one.(3) A length move on T changes the time of an internal node by one.A length move can only change the time of a node to become t if there is no node with time t already. Furthermore, the time of the root of a tree in DCT m cannot be changed by a length moveto become greater than m in DCT m . Note that our definition of DCT m differs from the definitionof the space on discrete time-trees of Gavryushkin, Whidden, and Matsen [9]. In contrast to theirdefinition, length moves in DCT m do not change the height of a tree, unless it is performed on theroot, which makes our definition appropriate for coalescent trees.The definition of DCT m leads to a natural definition of the distance between two trees T and R in this graph as the length of a shortest paths between these trees, denoted by d ( T, R ). We also
GEOMETRY OF RANKED TREE SPACES a a a a a a a a a a a a a a a a a a a a Figure 2.
The three possible moves on a discrete coalescent tree: a length movechanging the time of the highlighted node on the left, a rank move swapping theranks of the highlighted nodes in the middle and an NNI move on the dotted edgeon the right.consider the ranked nearest neighbour interchange (RNNI) graph of Collienne and Gavryushkin [4],which is the graph DCT m for m = n −
1, and hence a graph of ranked trees. In this graph lengthmoves are not possible, so we use the notion RNNI move to mean either a rank move or an NNImove in order to distinguish these moves from length moves.3.
Computing Shortest Paths in
DCT m Shortest paths, and therefore distances, between trees in RNNI can be computed with the algo-rithm
FindPath , which was introduced by Collienne and Gavryushkin [4] and has running timequadratic in the number of leaves n . As RNNI is a special case of DCT m for m = n −
1, the ques-tion arises whether a modification of this algorithm can also be used to compute shortest paths inDCT m . In this section we present a generalisation of FindPath that computes distances betweentrees in DCT m . Before introducing the version of FindPath for DCT m , we introduce a way toconvert trees in DCT m on n leaves into ranked trees on m + 2 leaves, such that the RNNI distancebetween those ranked trees equals their distance in DCT m (Theorem 1).A tree T in DCT m on n leaves can be converted into a ranked tree in RNNI with m + 2 leavesin the following way (Algorithm 1). First add a new root with time m + 1 that becomes parentof the root of T . The other child of this new root becomes the root of a caterpillar tree T rc onleaf set { a n +1 , a n +2 , . . . , a m +2 } , such that time( a n +1 ) T rc = time( a n +2 ) T rc < time( a n +3 ) T rc < . . . < time( a m +2 ) T rc < m + 1. An example of this extension of a tree T to a ranked tree T r is depictedin Figure 3.Throughout this paper we denote this extended ranked version of a tree T by T r . Moreover,we denote the subtree of T r that is identical to T by T dr ( d for discrete coalescent tree) and thecaterpillar subtree on leaf set { a n +1 , . . . , a m +2 } by T cr . EOMETRY OF RANKED TREE SPACES 5
Algorithm 1
RankedTree( T , m ) S := { ≤ i ≤ m | no internal node in T has time i } [ i , . . . , i m − n +1 ] = sort ( S ) T dr = copy of T T cr = tree consisting of just one internal node v with rank i and children a n +1 , a n +2 for k = 2 , . . . , m − n + 1 do Add internal node v k with with time i k and children v k − and a n +1+ k to T cr T r = tree with root with time m + 1 and children of root are roots of T dr and T cr . return T r a a a a a a a a a a a a a T T r
134 6 a a a a a a a a
25 71345 a a a a a Figure 3.
Extending a tree T on n leaves in DCT (top left) to a ranked tree with m + 2 = 8 leaves (top right) by adding a caterpillar subtree with three leaves. Thetrees on the bottom result from T and T r by performing a length move (left) or rankmove (right), respectively.In the following we distinguish two different types of rank moves. Rank moves between one nodeof T cr and one node of T dr induce length moves on the subtree T dr in DCT m (Figure 3). Therefore,we will refer to such rank moves as rank moves corresponding to length moves . All remaining rankmoves will still be called rank moves. Note that the correspondence of rank moves between T cr and T dr to length moves in T shows that any path between T and R in DCT m can be interpreted as apath between T r and R r in RNNI.After extending both trees T and R in DCT m to ranked trees T r and R r on m + 2 leaves, re-spectively, we can compute shortest paths between T r and R r in RNNI, using FindPath . A pathcomputed by
FindPath preserves clusters [4], hence there are no NNI moves in the newly addedcaterpillar subtree on the leaf set { a n +1 , . . . , a m +2 } on such a path. The only moves involving in-ternal nodes of this caterpillar subtree are rank moves corresponding to length moves, as described GEOMETRY OF RANKED TREE SPACES above. Hence the path FP( T r , R r ) provides a path between T and R in DCT m , when only consid-ering the subtrees induced by { a , . . . , a n } in all trees on FP( T r , R r ), interpreting some rank movesbetween T r and R r as length moves. We denote this DCT m path, which results from FP( T r , R r ),by FP( T, R ). In Theorem 1 we establish that FP(
T, R ) is indeed a shortest path in DCT m . Notethat for any given pair of trees T and R , we always assume m to be the maximum root time ofthese trees and consider a shortest path between them in DCT m . Theorem 1.
The path
FP(
T, R ) between two discrete coalescent trees T and R is a shortest pathin DCT m , where m is the maximum root time of T and R .Proof. Let T and R be discrete coalescent trees and T r and R r their extended ranked versionscomputed with Algorithm 1, respectively. Any path in DCT m from T to R gives a path of equallength between T r and R r in the RNNI space on m + 2 leaves. This is due to the fact that theonly moves needed in the subtree T cr to transform it to R cr are rank moves corresponding to lengthmoves, and no other RNNI moves. If there was a path between T and R shorter than FP( T, R ),the corresponding path between T r and R r in RNNI would be shorter than the one computed by FindPath in this space. Since this contradicts the fact that
FindPath computes shortest pathsin RNNI [4, Theorem 1], it follows that FP(
T, R ) is a shortest path in DCT m . (cid:3) Theorem 1 shows that
FindPath computes a shortest path between two trees in DCT m inpolynomial time, more specifically in O ( mn ). More details on the running time are discussedin Section 4.3 following Theorem 6. It is not even necessary to convert a given pair of discretecoalescent trees to ranked trees to apply FindPath to them. Instead, we modify
FindPath for trees in DCT m (Algorithm 2). Iterations of FindPath that consider clusters in the addedcaterpillar trees are replaced by length moves increasing the time of internal nodes as describedin the for loop in Line 11 of Algorithm 2. The benefit of this modified version of the algorithm,compared to using
FindPath on the extended ranked versions of the trees, is a reduced use ofmemory, which is especially of practical relevance for m ≫ n , which is typical in applications.Note that we do not need the parameter m in practice, as the distance between any two trees inDCT m ′ is the same as their distance in DCT m for any m > m ′ . Therefore, if the distance betweentwo trees is to be computed, we can simply choose m to be the maximum root height of the giventrees and compute their distance in DCT m . Algorithm 2
FindPath ( T, R ) T := T , p := [ T ] for k = 1 , . . . , m do if R has a node with time k then C := ( R ) k while time(( C ) T ) > k do T is T with the time of ( C ) T decreased by an RNNI move T = T p = p + T else if T has a node with time k then i := min { l | l > k and no node in T has time l } for j = i − , . . . , k do T is T where the time of ( T ) j is increased by one (length move) T = T p = p + T return p EOMETRY OF RANKED TREE SPACES 7 Geometrical Properties of
DCT m Cluster Property.
A tree space has the cluster property , if all trees on every shortest pathbetween two trees sharing a cluster C also contain C . This is a desirable property in evolutionarybiology applications as trees sharing a cluster or subtree are expected to be closer to each otherthan to a tree not sharing a cluster with them. This property is also desirable in centroid-basedtree sample summary methods. For a given sample of trees containing a common subtree, it isexpected that their summary tree also contains this subtree. It is therefore desirable to have a treespace that has the cluster property.A mathematical motivation for investigating the cluster property in RNNI is its importance ina similar tree space, the nearest neighbour interchange graph (NNI). In the NNI graph, trees haveno times and NNI moves are allowed on every edge. Computing distances in NNI is NP -hard [6],and the proof relies on the fact that this tree space does not have the cluster property [17]. In theRNNI graph, however, distances can be computed in polynomial time using FindPath [4], whichpreserves common clusters. The question whether RNNI has the cluster property is hence natural,and will be settled in Theorem 2.
Theorem 2.
The
RNNI graph has the cluster property.Proof.
We assume to the contrary that there are two ranked trees T and R sharing a cluster C anda shortest path p between these trees where C is not present in every tree. We furthermore assumethat there is no pair of trees with a shorter path not containing a shared cluster and distance lessthan d ( T, R ), meaning that T and R give a minimum counterexample. Because of this minimalityassumption on the length of p , the first tree T ′ following T on p does not contain C . Since C mustbe the only cluster changed by the NNI move between T and T ′ , all nodes with rank below ( C ) T induce the same clusters in T and T ′ (Figure 4). We now compare distances d ( T, R ) and d ( T ′ , R )by using properties of FindPath . A A A A A A ( C ) T T T ′ ( C ) T ′ Figure 4.
Trees T and NNI neighbour T ′ , such that the cluster C = A ∪ A isnot present in T ′ , but in T .First we compare FP( R, T ) and FP(
R, T ′ ). All trees on these two paths coincide up to iteration i = rank(( C ) T ), in which the cluster considered on FP( R, T ) is C . Let R ′ denote the tree at thispoint of the path, meaning that FP( R, T ) and FP(
R, T ′ ) coincide up to this tree R ′ . It follows d ( T, R ) = d ( R, R ′ ) + d ( R ′ , T ) and d ( T ′ , R ) = d ( R, R ′ ) + d ( R ′ , T ′ ).Now consider FP( T ′ , R ′ ) to evaluate d ( R ′ , T ′ ). As FindPath preserves clusters, C is present inevery tree on FP( T, R ) up to and including R ′ . The first iteration of FindPath applied to the pairof trees ( T ′ , R ′ ) hence considers the cluster C , as all cluster induced by nodes below ( C ) T ′ coincidein R ′ and T ′ . To construct the cluster C in T ′ , there is just one NNI move needed, which resultsin the tree T , as T and T ′ are NNI neighbours such that T contains C and T ′ does not (Figure 4).We can therefore conclude that d ( T, R ) = d ( T ′ , R ) −
1, which contradicts the assumption that T ′ is the first tree on a shortest path from T to R . There is hence no shortest path between T and R that does not preserve C , which proves the cluster property for RNNI. (cid:3) GEOMETRY OF RANKED TREE SPACES
The fact that the slightly modified version of
FindPath computes shortest paths in DCT m already suggests that shortest paths in RNNI and DCT m have similar properties. Indeed, thecluster property in DCT m follows from Theorem 2. Corollary 1.
The graph
DCT m has the cluster property.Proof. Assume that there is a shortest path between two trees T and R in DCT m that does notpreserve a common cluster. This path corresponds to a path between T r and R r , the extendedranked versions of T and R in RNNI, as already discussed in Theorem 1. Since this path has thesame length as the one between T r and R r , it is be a shortest path in RNNI as well, which leadsto a contradiction to Theorem 2. (cid:3) Caterpillar Trees.
In this subsection we focus on the set of caterpillar trees and establishsome properties of shortest paths between those trees in both RNNI and DCT m . In Theorem 3 wewill see that, in both DCT m and RNNI, any two caterpillar trees are connected by a shortest pathconsisting only of caterpillar trees. We say that a set of trees is convex in a tree space, if there is ashortest path between any two trees in this set that stays within the set. The set of caterpillar treesis hence convex in RNNI. The NNI space of unranked trees however does not have this property[9]. Based on the convexity of the set of caterpillar trees in RNNI we introduce a way to computedistances between caterpillar trees in this space in time O ( n √ log n ) in Corollary 2, and hence withbetter worst-case time complexity than FindPath . Whether this complexity can be achieved inDCT m is an open question. Theorem 3.
The set of caterpillar trees is convex in
DCT m .Proof. Let T and R be two caterpillar trees in DCT m . We prove the theorem by showing thatthere is a caterpillar tree T ′ that is a neighbour of T and closer to R than T . The existence of ashortest path consisting only of caterpillar trees between T and R follows inductively. Throughoutthis proof we consider the extended ranked versions T r and R r of T and R .Let a k := argmax a ,...,a n { rank( a i ) R r | rank( a i ) R r = rank( a i ) T r } be the leaf with parent withmaximum rank in R r among those whose parents do not have equal rank in T r and R r . Letfurthermore a j ∈ { a , . . . , a m +2 } be a leaf with rank( a j ) T r = rank( a k ) T r + 1. We define T ′ r to bethe caterpillar tree resulting from T r by an NNI move or rank move exchanging the ranks of ( a k ) T r and ( a j ) T r . An NNI move is necessary if these two nodes are connected by an edge, otherwise arank move corresponding to a length move is performed on T r to obtain T ′ r (Figure 5). In bothcases T ′ rd is a caterpillar tree. We will use properties of shortest paths computed by FindPath toshow that | FP( R r , T ′ r ) | = | FP( R r , T r ) | − T r and T ′ r induced by nodes of rank less than rank( a k ) T r coincide, the pathsFP( R r , T r ) and FP( R r , T ′ r ) coincide up to a ranked tree R ′ r , which contains all these clusters.We now compare the lengths of FP( R ′ r , T r ) and FP( R ′ r , T ′ r ). We note at first that rank( a j ) R r < rank( a k ) R r . If it otherwise was rank( a k ) R r ≤ rank( a j ) R r , it would follow rank( a j ) R r = rank( a j ) T r ,by the definition of a k , and therefore rank( a k ) R r ≤ rank( a j ) R r = rank( a j ) T r = rank( a k ) T r +1. rank( a k ) R r ≤ rank( a k ) T r + 1 however contradicts the definition of a k , hence rank( a j ) R r < rank( a k ) R r . It follows rank( a j ) R ′ r < rank( a k ) R ′ r , as a j and a k are not in any of the clustersconsidered by FindPath before R ′ r , which means that their parents do not exchange ranks before R ′ r .By our assumptions on T r , the cluster considered on FP( R r , T r ) in iteration l = rank( a k ) T r ,which is the iteration following R ′ r , is S ∪ { a k } , where S is a cluster that is present in all threetrees T r , T ′ r , and R r . In the following iteration l + 1 = rank( a j ) T r , S ′ ∪ { a j } is considered for acluster S ′ , where S either equals S ∪ { a k } , if T r and T ′ r are connected by an NNI move (bottom ofFigure 5), or is a cluster present in T rc , T ′ rc , and R ′ rc , if T r and T ′ r are connected by a rank move(top of Figure 5). Decreasing the rank of ( S ∪ { a k } ) R ′ r takes rank( S ∪ { a k } ) R ′ r − l RNNI moves.
EOMETRY OF RANKED TREE SPACES 9
S a k a j S ′ S a k a j S ′ S a k a j S a j a k T r T ′ r S ′ l + 1 l + 1 ll S a k a j S ′ S a j a k R ′ r Figure 5.
The two possible versions of trees T r (left), T ′ r (middle), and R ′ r asdescribed in the proof of Theorem 3. Between T r and T ′ r only the ranks of theparents of a j and a k are exchanged, the rest of the trees coincide. At the bottomthe case that ( a j ) T is parent of ( a k ) T and S ′ = S ∪ { a k } is displayed. S ′ is a clusterin all three trees at the bottom. At the top ( a j ) T and ( a k ) T are in the two differentsubtrees T dr and T cr (the same in T ′ r and R ′ r ), which is also true for the disjoint sets S and S ′ , which are present as clusters in all three trees. Dotted lines representremaining parts of trees, which are equal in T r and T ′ r , but different to R ′ r . Notethat the rank difference of ( a k ) R ′ r and ( a j ) R ′ r does not need to be one, which it is in T r and T ′ r .Because the rank of ( S ∪ { a j } ) R ′ r increases by one when the parents of a k and a j swap ranks in thisiteration, the following iteration for S ′ ∪ { a j } needs rank( S ′ ∪ { a j } ) R ′ r + 1 − ( l + 1) RNNI moves.On FP( R r , T ′ r ) however, first rank( S ′ ∪ { a j } ) R ′ r − l RNNI moves decrease the rank of ( S ′ ∪ { a j } ) R ′ ,and then rank( S ∪ { a k } ) R ′ r − ( l + 1) are needed for S ∪ { a k } . In total, these two iterations combinedresult in at least one extra move on FP( R r , T r ) comparing to FP( R r , T ′ r ).The only difference in the trees after iteration l + 1 on the two different paths is the orderof ranks of the parents of a j and a k . Since the rest of T r and T ′ r coincide, the remaining partsof FP( R r , T r ) and FP( R r , T ′ r ) consist of the same moves. With our previous observation we canconclude d ( R r , T r ) = d ( R r , T ′ r ) + 1, and hence T ′ r is on a shortest path from T r to R r . (cid:3) Note that it follows that the set of caterpillar trees is convex in RNNI. This convexity prop-erty implies that the distance between caterpillar trees can be computed more efficiently than by
FindPath . We prove this in the rest of this section. To do so, we first establish that the problem ofcomputing a shortest path consisting only of caterpillar trees can be interpreted in a few differentways.One problem analogous to the shortest path problem for caterpillar trees in RNNI is the
TokenSwapping Problem [11] on a special class of graphs, so-called lollipop graphs. An instance of thetoken swapping problem is a simple graph where every vertex is assigned a token. Two tokens areallowed to swap positions if they are on vertices that are connected by an edge. Each token is assigned a unique goal vertex, and the aim is to find the minimum number of token swaps for alltokens to reach their goal vertex.The problems of computing distances between caterpillar trees can be seen as an instance of thetoken swapping problem on lollipop graphs. A lollipop graph is a graph consisting of a completegraph that is connected to a path by one edge. An instance of the token swapping problem thatcorresponding to the distance problem for caterpillar trees is described in the following. An exampleis illustrated in Figure 6. Let T and R be caterpillar trees withrank( a ) R = rank( a ) R < rank( a ) R < . . . < rank( a n ) R andrank( b ) T = rank( b ) T < rank( b ) T < . . . < rank( b n ) T such that [ b , . . . , b n ] is a permutation of [ a , . . . , a n ]. The corresponding instance of the tokenswapping problem consists of a lollipop graph consisting of a complete graph on three leaves,connected to a path of length n − a , the other ones in the complete graphare labelled by a and a . The vertices on the paths are then labelled inductively, starting at theneighbour of a , such that the neighbour of the last already labelled node with label a i − is labelledby a i . The token on vertex a i has b i as goal vertex. Since the only moves between two caterpillartrees in RNNI are NNI moves, which simply swap two leaves, they correspond to swapping twotokens in the above described instance of the token swapping problem. a a a a a a a a a a a a a a a a a a a a T R
Figure 6.
Two caterpillar trees T and R and the corresponding instance of thetoken swapping problem. Vertex labels are in circles and token goal vertices inrectangles.Therefore, the algorithm described by Kawahara, Saitoh, and Yoshinaka [11] to solve the tokenswapping problem on lollipop graphs can be used for computing distances between caterpillar trees.It however has worst-case time complexity O ( n ), the same as FindPath .In the following we present an algorithm for computing distances between caterpillar trees withbetter worst-case time complexity, O ( n √ log n ), for RNNI (Corollary 2). To do so, we first establisha formula to express distances between two caterpillar trees in RNNI (Theorem 4). This algorithmcan also be used to solve the token swapping problem on lollipop graphs, improving the worst-caserunning time of the known algorithm [11]. EOMETRY OF RANKED TREE SPACES 11
Theorem 4.
Let T and R be caterpillar trees in RNNI such that a ) R = rank( a ) R < rank( a ) R < . . . < rank( a n ) R = n − and P T := { ( a i , a j ) | rank( a i ) T < rank( a j ) T and rank( a i ) R > rank( a j ) R } ,M T := { a i | ( ∀ l with rank( a l ) T ≤ rank( a i ) T : rank( a l ) R > rank( a i ) R ) }∩ { a i | rank( a i ) T < min { rank( a ) T , rank( a ) T }} Then for m T = | M T | and p T = | P T | : d ( T, R ) = p T − m T . We refer to pairs ( a i , a j ) ∈ P T , as defined in Theorem 4, as transpositions. The reason forthis is that caterpillar trees can be seen as permutations of the set { a , . . . , a n } , ordered by theranks of their parents. The tree R in the theorem then corresponds to the identity permutation( a , a , a , . . . , a n ). Note that there is no one-to-one correspondence between permutations andcaterpillar trees. For example the permutation ( a , a , a , . . . , a n ) corresponds to the tree R aswell. Therefore, the two pairs of leaves sharing their parent in T and R , respectively, are not in theset P T . Proof.
Let T and R be caterpillar trees as described above and b d ( T, R ) := p T − m T . For proving b d ( T, R ) = d ( T, R ), it is sufficient to show that for all caterpillar trees T ′ that are neighbour of T itis b d ( T ′ , R ) ≥ b d ( T, R ) − . (1)The fact that inequality (1) implies b d ( T, R ) = d ( T, R ) can be established by induction as in [4,Theorem 1].For proving inequality (1) we first establish p T ′ ≥ p T − m T ′ ≤ m T + 1, assumingthat T ′ is a caterpillar tree that is an RNNI neighbour of T . We then show that p T ′ = p T − m T ′ = m T + 1 cannot both be true simultaneously, which proves inequality (1).At most one transposition of T can be resolved in T ′ because the only move possible betweencaterpillar trees T and T ′ is an NNI move exchanging two leaves. Hence p T ′ ≥ p T −
1. Let a k and a j be the leaves that exchange their position between T and T ′ , such that rank( a k ) T < rank( a j ) T .Since these are the only leaves that change positions between T and T ′ , they are the only elementsthat could be in M T ′ \ M T . It remains to show ( M T ′ \ M T ) = { a k , a j } , from which we can concludethat m T ′ ≤ m T −
1. We prove this by showing that if a k ∈ ( M T ′ \ M T ), it follows a j / ∈ M T ′ .We assume that a k ∈ ( M T ′ \ M T ), implying a k / ∈ M T , so at least one of the following conditionsmust be violated for i = k : ∀ l with rank( a l ) T ≤ rank( a i ) T : rank( a l ) R > rank( a i ) R (C1) rank( a i ) T < min { rank( a ) T , rank( a ) T } . (C2)At first we consider the case that (C1) is violated for a k in T . Then there is an l such thatrank( a l ) T ≤ rank ( a k ) T and rank( a k ) R > rank( a l ) R . It immediately follows that the same conditionis violated for a k in T ′ , because the NNI move exchanging a k and a j preserves the relationship of a k and a l . It hence is a k / ∈ M T ′ , contradicting our assumption a k ∈ ( M T ′ \ M T ).We can therefore assume that (C2) is violated for a k . It follows rank( a k ) T ≥ min { rank( a ) T , rank( a ) T } . As only a k and a j exchange between T and T ′ and a k ∈ M T ′ , itfollows a j ∈ { a , a } . This however results in a violation of (C2) for a j in T ′ and hence a j / ∈ M T ′ .We can conclude ( M T ′ \ M T ) = { a k , a j } , and hence m T ′ ≤ m T + 1. It remains to show that p T ′ = p T − m T ′ = m T + 1 cannot be true at the same time. Weassume p T ′ = p T −
1, hence ( a k , a j ) is a transposition in T , meaning that rank( a k ) T < rank( a j ) T and rank( a k ) R > rank( a j ) R . As a k and a j are the only leaves that could be in M T ′ \ M T , it sufficesto show that neither of them actually is in M T ′ \ M T , resulting in m T ′ < m T + 1.That a k is not in M T ′ follows from the violation of (C1) by rank( a j ) T ′ < rank( a k ) T ′ andrank( a j ) R < rank( a k ) R . It hence is a k / ∈ M T ′ \ M T . Moreover, if a j ∈ M T ′ , it follows a j ∈ M T asexplained in the following. If a j ∈ M T ′ , both conditions (C1) and (C2) are met by a j in T ′ . Withrank( a k ) T < rank( a j ) T and rank( a k ) R > rank( a j ) R it immediately follows that these conditionsare also met in T , and hence a j ∈ M T , and therefore a j / ∈ M T ′ \ M T .Summarising, it is M T ′ \ M T = ∅ , and we can conclude that if p T ′ = p T −
1, it is m T ′ < m T + 1,which concludes this proof. (cid:3) Corollary 2.
The distance between two caterpillar trees can be computed in O ( n √ log n ) in RNNI .Proof.
By Theorem 4 the distance between two caterpillar trees in RNNI is the number of trans-positions between two sequences of length n minus m T as defined in Theorem 4. The value m T can be computed in time linear in n for any caterpillar tree T by considering the leaves of the treeordered according to increasing rank of their parents. The number of transpositions of a sequenceof length n (Kendall-tau distance) can be computed in time O ( n √ log n ) [3]. This number is equal‘to p T , as defined in Theorem 4, when ignoring transpositions for the pairs of leaves sharing a parentin T and R , respectively. The worst-case running time for computing the RNNI distance betweencaterpillar trees is therefore O ( n √ log n ). (cid:3) Diameter and Radius.
In this section we investigate the diameter of RNNI and DCT m ,which is the greatest distance between any pair of trees in each of these graphs, respectively,i.e. max trees T,R d ( T, R ). We first establish the exact diameter of RNNI, improving the upper bound n − n − given by Gavryushkin, Whidden, and Matsen [9]. Afterwards, we generalise this resultto DCT m . Theorem 5.
The diameter of
RNNI is ( n − n − .Proof. For proving this theorem we use the fact that
FindPath computes shortest paths in RNNI.Each iteration i of FindPath , applied to two ranked trees T and R , decreases the rank of the mostrecent common ancestor of a cluster C , induced by the node of rank i in R , in the currently lasttree T ′ on the already computed path (starting wth T ′ = T ). The maximum rank of ( C ) T ′ at thebeginning of iteration i is n −
1, the rank of the root. As every move decreases the rank of ( C ) T ′ byone, there are at most n − − i moves in iteration i . The maximum length of a shortest path in RNNIis hence n − P i =1 i = ( n − n − . Note that the caterpillar trees [ { a , a } , { a , a , a } , . . . , { a , . . . , a n } ]and [ { a n , a n − } , { a n , a n − , a n − } , . . . , { a n , . . . , a } ] provide an example of trees that have distance ( n − n − , as already pointed out in [4, Corollary 1], proving that this upper bound for the lengthof a shortest path is tight. (cid:3) Theorem 6.
The diameter of
DCT m is ( n − n − + ( m − n + 1)( n − .Proof. In order to prove the diameter of DCT m , we consider the longest path that FindPath cancompute between any two trees T and R . That is, we consider the maximum number of movesthat FindPath can perform on the extended ranked versions T r and R r of any two trees T and R .Therefore, we distinguish RNNI moves in the subtrees on the leaf set { a , . . . , a n } from the rankmoves corresponding to length moves, i.e. rank moves between one node of each of the subtrees onleaf subsets { a , . . . , a n } and { a n +1 , . . . a m +2 } .The maximum number of RNNI moves (excluding rank moves corresponding to length moves)on FP( T r , R r ) follows from Theorem 5 and is ( n − n − . The maximum number of rank moves EOMETRY OF RANKED TREE SPACES 13 corresponding to length moves on a shortest path between T r and R r is reached when every internalnode of the subtree T cr of T r swaps rank with every internal node of the subtree T dr . The maximumnumber of such rank swaps corresponding to length moves is hence ( m − n + 1)( n − ( n − n − + ( m − n +1)( n − m we give an example oftrees T and R (Figure 7) for which the path computed by FindPath has length ( n − n − + ( m − n + 1)( n − T and R are caterpillar trees defined as follows. m − n − a ) T = rank( a ) T < rank( a ) T < . . . < rank( a n ) T = m and 1 = rank( a ) R = rank( a n ) R < rank( a n − ) R < . . . < rank( a ) R = n − . (cid:3) n − m − n − m − a a a a a n − a n − a n a n m ... ... T R
Figure 7.
Trees T and R with distance ( n − n − + ( m − n + 1)( n −
1) as describedin the proof of Theorem 6.Note that the worst-case running time of
FindPath is O ( n ) in RNNI and O ( nm ) in DCT m anddepends on the diameter of the corresponding tree spaces. For computing a shortest path, thereis no algorithm with better worst-case running time than this, as the running time for algorithmscomputing shortest paths is bounded from below by the diameter of the corresponding space. Therecan however be more efficient algorithms for computing distances.The radius of a graph is defined as he minimum distance of any vertex in the graph to the vertexwith maximum distance from it, that is, min T max R d ( T, R ), where d is the distance measure in thecorresponding graph. In the following we see that the radius of RNNI equals its diameter, which isnot true for DCT m , as we will see afterwards. Theorem 7.
The radius of
RNNI equals its diameter ( n − n − .Proof. We prove this theorem by showing that every ranked tree T in RNNI has a caterpillar tree R with distance ( n − n − to T , using induction on the number of leaves n .The base case n = 3 is trivial, as all three trees in this space are caterpillar trees with distanceone from each other. For the induction step we consider an arbitrary tree T with n + 1 leaves. Let x and y be the leaves of T that share the internal node of rank one as parent in T , and let T n bethe tree on n leaves resulting from deleting one of these leaves, say x , from T , and suppressing theresulting degree-2 vertex. By the induction hypothesis there is a caterpillar tree R n with distance ( n − n − to T n . Now consider the tree R resulting from adding x at the top of R n such that theroot of R has x and R n as children.We now consider FP( R, T ). In the first iteration of
FindPath , ( { x, y } ) R moves down until itreaches rank one. Therefore, first ( x ) R moves down by NNI moves until it reaches rank rank( y ) R +1. Then a further NNI move creates an internal node with children x and y , before this node is moveddown by rank swaps to reach rank one as depicted in Figure 8. Altogether, there are n − x decreases by one within everymove, starting at the root with rank n and ending at the internal node of rank one. The treeat the end of this first iteration on FP( R, T ) is identical to R n when removing the leaf x andsuppressing its parent (the node of rank one). Since the cluster { x, y } is not considered again in FindPath , the remaining part of FP(
R, T ) contains the same moves as FP( R n , T n ), and hence | FP(
R, T ) | = | FP( R n , R n ) | + n −
1. Therefore it is d ( T, R ) = ( n − n − + n − n ( n − , whichproves the lemma. (cid:3) xy x ya a a n − a a a n − x ya a a n − R Figure 8.
Initial n − R, T ) as described in the proof ofTheorem 7. Removing the leaf x and suppressing the non-root node of degree twofrom the tree on the right results in R n as described in the theorem.Unlike in RNNI, the radius of DCT m does not equal its diameter. A counterexample is given bythe tree depicted in Figure 9 on three leaves in DCT . There is no tree in DCT that has distance ( n − n − + ( m − n + 1)( n −
1) (diameter of DCT m ) from this tree. a a a Figure 9.
Tree in DCT on three leaves for which there is not tree with distance5 = ( n − n − + ( m − n + 1)( n −
1) (diameter) from it5.
Conclusion and future research questions
In this paper we introduced and analysed properties of the space of discrete coalescent treesDCT m . An important tool for establishing these characteristics of the tree space is the algo-rithm FindPath , which has been introduced by Collienne and Gavryushkin [4] for RNNI. Wegeneralised this algorithm and showed in Theorem 1 that it solves the shortest path problem inDCT m as well. Afterwards, we established properties of DCT m and RNNI such as the clusterproperty (Section 4.1), the convexity of the set of caterpillar trees (Section 4.2), diameter, andradius (Section 4.3). With the convexity of the set of caterpillar trees in RNNI we also found amore efficient way of computing distances between such trees, using the correspondence betweencaterpillar trees and permutations.The worst-case time complexity of FindPath for computing a shortest path is O ( mn ) in DCT m .In Section 4.3 we have seen that there is no algorithm with better worst-case running time forcomputing shortest paths. However, it might be possible to compute distances more efficiently. EFERENCES 15
In fact, we established in Section 4.2 a way for computing distances between caterpillar trees in O ( n √ log n ). This raises the question whether there is an algorithm that computes the distancebetween any two trees in DCT m with better running time than FindPath .Throughout this paper we consider DCT m as a generalisation of RNNI by allowing internal nodesto have integer-valued time differences. We therefore introduced the parameter m to bound theheight of a tree in the space of discrete coalescent trees in order to get a finite space. A differentparameter ρ has previously been introduced in Collienne and Gavryushkin [4] for generalising RNNIto a space RNNI( ρ ) of ranked trees, where rank and NNI moves have weights ρ and one, respectively.Combining these two approaches of generalising RNNI gives a space of discrete coalescent treeswhere different moves have different weights. This tree space is relevant for practical applications,where for example some knowledge about the tree topology exists, but the uncertainty of the timingof internal nodes remains high. Investigating such a tree space could therefore be a next step forfurther studies.Another tree space, of which DCT m is a discretisation, is the t -space [8], where internal nodes areassigned real-valued times. For this space on time-trees no algorithm for computing shortest pathsor distances is known yet. Our results for DCT m , however, can be transferred to this space, asdiscrete coalescent trees can be interpreted as discrete time-trees. Distances in DCT m can thereforebe used to approximate those in t -space. For this it is important to notice that the parameter m isnot relevant in applications, as distances between two trees in DCT m ′ coincide with those in DCT m if m ′ < m . Since choice of m , and therefore the choice on how to discretise time-trees, drives thecomplexity of computing shortest paths (Section 4.3), finding a way to discretise time-trees to useour results on DCT m can be subject of further research.In Section 4.2 we established a connection between the shortest path problem for caterpillartrees in RNNI and the token swapping problem on lollipop graphs. We can furthermore providea connection between the RNNI graph and a well-known algebraic structure, the partition lattice.We provide a detailed description of this relation in the supplementary material. Acknowledgements
We thank Charles Semple for useful discussions about the Cluster Property, and Mike Steel forhis useful comments that improved this paper.We acknowledge support from the Royal Society Te Ap¯arangi through a Rutherford DiscoveryFellowship (RDF-UOO1702). This work was partially supported by Ministry of Business, Innova-tion, and Employment of New Zealand through an Endeavour Smart Ideas grant (UOOX1912), aData Science Programmes grant (UOAX1932).MF thanks the joint research project
DIG-IT! supported by the European Social Fund (ESF),reference: ESF/14-BM-A55-0017/19, and the Ministry of Education, Science, and Culture ofMecklenburg-Vorpommern, Germany, for funding parts of this work.
References [1] LJ Billera, SP Holmes, and K Vogtmann. “Geometry of the Space of Phylogenetic Trees”.
Adv. Appl. Math.
PLoSComput. Biol.
Proceedings of the twenty-first annual ACM-SIAM symposium on DiscreteAlgorithms . SIAM, 2010, pp. 161–173.[4] L Collienne and A Gavryushkin. “Computing nearest neighbour interchange distances be-tween ranked phylogenetic trees” (July 2020). arXiv: . [5] MA Cueto and FA Matsen. “Polyhedral geometry of phylogenetic rogue taxa”. Bull. Math.Biol.
Discrete Mathe-matical Problems with Medical Applications: DIMACS Workshop Discrete Mathematical Prob-lems with Medical Applications, December 8-10, 1999, DIMACS Center . Vol. 55. AmericanMathematical Soc., 2000, p. 19.[7] AJ Drummond et al. “Bayesian coalescent inference of past population dynamics from molec-ular sequences”.
Mol. Biol. Evol.
J. Theor.Biol.
403 (Aug. 2016), pp. 197–208.[9] A Gavryushkin, C Whidden, and FA Matsen 4th. “The combinatorics of discrete time-trees:theory and open problems”.
J. Math. Biol.
Oxford surveys in evolu-tionary biology
WALCOM: Algorithms and Computation . Springer In-ternational Publishing, 2017, pp. 448–459.[12] JFC Kingman. “The coalescent”.
Stochastic Process. Appl.
Bioinformatics
Genetics
Trends Ecol.Evol.
Mol.Biol. Evol.
Computing and Combinatorics . Lecture Notes in Computer Science. Springer Berlin Heidel-berg, June 1996, pp. 343–351.[18] E Miller, M Owen, and JS Provan. “Polyhedral computational geometry for averaging metricphylogenetic trees”.
Adv. Appl. Math.
68 (July 2015), pp. 51–91.[19] LT Nguyen et al. “IQ-TREE: a fast and effective stochastic algorithm for estimatingmaximum-likelihood phylogenies”.
Mol. Biol. Evol.
Theor. Popul. Biol. (2017).[21] D Posada. “CellCoal: Coalescent Simulation of Single-Cell Sequencing Samples”.
Mol. Biol.Evol.
Bioinformatics
Phylogenetics . Oxford University Press, 2003.[24] MA Suchard et al. “Bayesian phylogenetic and phylodynamic data integration using BEAST1.10”.
Virus Evol
Mol. Biol. Evol.
EFERENCES 17 Supplement
In the following we discuss a connection of the RNNI graph to a well-known algebraic structure,the partition lattice. Following that, we discuss further open problems providing new ideas forfuture research.6.1.
Partition Lattice.
The connection of RNNI to partition lattices provides a new direction forfurther research and translates results and open problems from the language of phylogenetics tothe language of lattice theory.The partition lattice on { , . . . , n } is the lattice given by the partially ordered set (Π n , ≤ ), whereΠ n is the power set of { , . . . , n } and X ≤ Y if partition X refines Y , that is, X ≤ Y ⇔ ( ∀ x ∈ X )( ∃ y ∈ Y ) x ⊆ y . A chain in a lattice is a set { X , . . . , X k } with X ≤ X ≤ . . . ≤ X k . The length of the chain { X , . . . , X k } is k , the number of its elements minus one, and such a chainis called maximal chain of a lattice if there is no chain with length greater than k in the lattice.For simplification we will denote the partition lattice on n elements by Π n . Π is illustrated inFigure 10. We assume that a partition X in the partition lattice Π n has rank k if the number ofelements in X is n − k . The algebraic structure of Π n is related to the RNNI graph on trees on n leaves in the following way. Theorem 8.
The
RNNI graph on n leaves is isomorphic to the graph of maximal chains of thepartition lattice Π n where two maximal chains are connected by an edge if and only if they differby exactly one partition. The corresponding metric spaces are isometric. This theorem implies that the algorithms developed for trees, such as
FindPath , can be usedon lattices, and also complexity results from RNNI can be transferred.
Proof.
There is a one-to-one relation between ranked trees and maximum chains in a partitionlattice. We can define a bijective mapping from the set of ranked trees to the set of maximumchains in Π n as follows. A ranked tree T maps onto a maximum chain C T if the set in the partitionof rank i in C T that is the union of two sets of the partition of rank i − C T is the cluster inducedby the internal node of rank i in T .Note that this bijection is an isomorphism between the RNNI graph and the graph of chainsas in the theorem. Indeed, two chains are different by exactly one partition if and only if thecorresponding trees are connected by an RNNI move. (cid:3) Figure 10 is an illustration of the proof of Theorem 8. The four chains indicated in boldcorrespond to the following RNNI path. The leftmost chain corresponds to the caterpillar tree[ { , } , { , , } , { , , , } ] (in the cluster representation). First, the partition { , , }{ } is re-placed with { , }{ , } and we get the chain corresponding to the tree [ { , } , { , } , { , , , } ],which is one RNNI move away from the caterpillar tree. Second, the partition { , }{ }{ } is replaced with { }{ }{ , } , which corresponds to the rank swap on the previous tree.Third, the partition { , }{ , } is replaced with { }{ , , } and we reach the caterpillar tree[ { , } , { , , } , { , , , } ]. { , , , }{ , }{ }{ } { }{ }{ }{ }{ , }{ }{ } { , }{ }{ } { }{ , }{ } { }{ , }{ } { }{ }{ , }{ , , }{ } { , }{ , } { , , }{ } { , }{ , } { , , }{ } { , }{ , } { , , }{ } Figure 10.
The partition lattice Π on { , , , }}