Computing nearest neighbour interchange distances between ranked phylogenetic trees

Abstract

Many popular algorithms for searching the space of leaf-labelled trees are based on tree rearrangement operations. Under any such operation, the problem is reduced to searching a graph where vertices are trees and (undirected) edges are given by pairs of trees connected by one rearrangement operation (sometimes called a move). Most popular are the classical nearest neighbour interchange, subtree prune and regraft, and tree bisection and reconnection moves. The problem of computing distances, however, is NP-hard in each of these graphs, making tree inference and comparison algorithms challenging to design in practice. Although ranked phylogenetic trees are one of the central objects of interest in applications such as cancer research, immunology, and epidemiology, the computational complexity of the shortest path problem for these trees remained unsolved for decades. In this paper, we settle this problem for the ranked nearest neighbour interchange operation by establishing that the complexity depends on the weight difference between the two types of tree rearrangements (rank moves and edge moves), and varies from quadratic, which is the lowest possible complexity for this problem, to NP-hard, which is the highest. In particular, our result provides the first example of a phylogenetic tree rearrangement operation for which shortest paths, and hence the distance, can be computed efficiently. Specifically, our algorithm scales to trees with thousands of leaves (and likely hundreds of thousands if implemented efficiently). We also connect the problem of computing distances in our graph of ranked trees with the well-known version of this problem on unranked trees by introducing a parameter for the weight difference between move types. We propose to study a family of shortest path problems indexed by this parameter with computational complexity varying from quadratic to NP-hard.


arXiv [cs.DS]

LENA COLLIENNE AND ALEX GAVRYUSHKIN

Department of Computer Science, University of Otago, New Zealand

Date: July 27, 2020.

We thank Alexei Drummond, David Bryant, and Kieran Elmes for useful discussions about the weight difference between RNNI moves, complexity, scalability, and applied aspects of our results. Their comments improved our paper.

We acknowledge support from the Royal Society Te Apārangi through a Rutherford Discovery Fellowship (RDF-UOO1702) awarded to AG. This work was partially supported by Ministry of Business, Innovation, and Employment of New Zealand through an Endeavour Smart Ideas grant (CONT-61378-ENDSI-UOO) and a Data Science Programmes grant (UOAX1932).

The problem of reconstructing evolutionary histories from sequence data is central for many popular methods in computational biology. Most commonly trees are inferred from sequences via maximum likelihood [21, 10], MCMC [18, 22, 4], distance-, or parsimony-based approaches [23]. All these methods rely on various tree rearrangement operations [20], the most popular of which are nearest neighbour interchange (NNI), subtree prune and regraft (SPR), and tree bisection and reconnection (TBR). Under any such operation, the tree inference problem can be formulated as a graph search, where vertices are trees and edges are given by tree rearrangement operations. For search algorithms to be efficient, it is important to understand the geometry of these graphs. For example, basic geometric properties of the NNI graph have been successfully leveraged to speed up the maximum likelihood method [16]. The most basic geometric characteristic that frequently arises in applications is the minimum number of rearrangements necessary to transform one tree into another [20]. The problem then amounts to computing the length of a shortest path between trees in the three graphs. This can also be seen as computing the distance between trees in the corresponding metric space.

Classical results in mathematical phylogenetics imply that these distances are NP-hard to compute for all three rearrangement operations NNI, SPR, and TBR [5, 3, 12, 1]. Intuitively, the difference between the three operations is how much change can be done to a tree by a single operation, with NNI being the most local type of rearrangement and TBR the most global one. Remarkably, it took over 25 years and a number of published erroneous attempts, as discussed in detail by DasGupta et al. [5], to prove that computing distances is NP-hard in NNI [5]. Similarly, incorrect proofs for SPR have been discussed in the literature [11, 1], before Bordewich and Semple [3] proved the NP-hardness result for rooted trees and Hickey et al. [12] utilised this proof to establish the result for unrooted trees. To facilitate practical applications, fixed parameter tractable algorithms [7] for computing the SPR distance have been developed over the years [24, 3, 25]. Computing the NNI distance is also known to be fixed parameter tractable [6].
Although important, these algorithms remain impractical for large distances and are only applied to trees with a moderate number of leaves or those with small distances [25].

Another area where algorithms for computing shortest paths and distances between trees play a central role is calculating consensus or summary trees [15, 2, 27]. A popular tree distance measure used in such methods is the Robinson-Foulds distance [17], as it can be computed in linear time. Lack of biological motivation, however, is a downside of this approach, which often results in poor summaries and hence is not used for summarising samples of trees obtained in a full Bayesian tree inference approach [8]. In general, distance measures that are easy to compute typically have this problem, whereas measures that are biologically relevant, including rearrangement-based distances, are often hard to compute [25].

In this paper, we establish that using a generalisation of the NNI operation introduced by Gavryushkin, Whidden, and Matsen [9] called RNNI (for Ranked Nearest Neighbour Interchange), the shortest path problem is computable in $O(n^2)$, where $n$ is the number of tree leaves. This makes RNNI the first tree rearrangement operation under which shortest paths and distances between trees are polynomial-time computable. Our proof of this result (Theorem 1) is constructive: we provide an algorithm called FindPath that computes shortest paths in the RNNI graph in $O(n^2)$ time. Our algorithm is optimal as shortest paths often have length quadratic in the number of leaves $n$. The algorithm is practical as it takes seconds on a laptop to compute the distance between trees with thousands of leaves, while in the closely related NNI graph the tractable number of leaves is well below twenty [13, 26].

Because NNI can be seen as a special case of RNNI, we investigate whether there exists a threshold at which the complexity shifts from NP-hard to polynomial.
Specifically, we introduce an edge weight parameter $\rho$ in the RNNI graph and consider a parametrised graph RNNI($\rho$). We show that the shortest path problem is NP-hard in RNNI(0) and quadratic in RNNI(1), so the complexity changes with $\rho$. We hence propose to characterise the complexity classes of the problem RNNI($\rho$) for values of $\rho \ge 0$. This problem is similar in spirit to the beyond worst-case analysis (including parametrised complexity [7]) framework [19]. Just like in our result, a $\gamma$-perturbation stable instance of the maximum cut problem is known [19] to be NP-hard for small values of $\gamma$ and polynomial for larger values of $\gamma$. Since the problem of identifying the value of $\gamma$ where the complexity switches has largely been resolved [14], we hope that the approaches reviewed by Roughgarden [19] will be helpful for our proposed study as well.

1. Definitions and background results

Unless stated otherwise, by a tree in this paper we mean a ranked phylogenetic tree, which is a binary tree where leaves are uniquely labelled by elements of the set $\{a_1, \ldots, a_n\}$ for a fixed integer $n$, and all internal (non-leaf) nodes are uniquely ranked by elements of the set $\{1, \ldots, n-1\}$ so that each child has a strictly smaller rank than its parent. All leaves are assumed to have rank 0 but we only refer to the ranks of internal nodes throughout. In total there are $\frac{(n-1)!\,n!}{2^{n-1}}$ such trees on $n$ leaves [9]. Two trees are considered to be identical if there exists an isomorphism between them which preserves edges, leaf labels, and node rankings. For example, trees in Figure 1 are all different.

Figure 1. Trees in the RNNI graph with three NNI moves on the left and a rank move on the right.

Because internal nodes of a tree $T$ are ranked uniquely, we can address the node of rank $t \in \{1, \ldots, n-1\}$, and we write $(T)_t$ to denote this node. An interval $[(T)_t, (T)_{t+1}]$ is defined by two nodes of consecutive ranks. A cluster $C \subseteq \{a_1, \ldots, a_n\}$ in a tree $T$ is a subset of leaves that contains all leaves descending from one internal node of $T$. We then say that this internal node induces the cluster $C$, and that the subtree rooted at this node is induced by $C$. Trees can uniquely be specified using the cluster representation, that is, a list of all clusters induced by internal nodes of that tree ordered according to the ranks of internal nodes. For example, the cluster representation of tree $T$ in Figure 1 consists of four clusters with 2, 3, 2, and 5 leaves, listed in order of increasing rank. For a set $S \subseteq \{a_1, \ldots, a_n\}$ and tree $T$ we denote the most recent common ancestor of $S$ in $T$, that is, the node of the lowest rank in $T$ that induces a cluster containing all elements of $S$, by $(S)_T$. Note that $(C)_T = (T)_t$ if the cluster $C$ is induced by the node of rank $t$ in $T$.

Our main object of study is the following class of graphs RNNI($\rho$) indexed by a real-valued parameter $\rho \ge 0$. Vertices of the RNNI($\rho$) graph are trees as defined above. Two trees are connected by an edge (also called an RNNI move) if one results from the other by performing one of the following two types of tree rearrangement operation (see Figure 1):

(i) A rank move on a tree $T$ exchanges the ranks of two internal nodes $(T)_t$ and $(T)_{t+1}$ with consecutive ranks, provided the two nodes are not connected by an edge in $T$.

(ii) Trees $T$ and $R$ are connected by an NNI move if there are edges $e$ in $T$ and $f$ in $R$, both connecting nodes of consecutive ranks in the corresponding trees, such that the (non-binary) trees obtained by shrinking $e$ and $f$ into internal nodes are identical.

The parameter $\rho \ge 0$ is the weight of a rank move, while every NNI move has weight 1. The weight of a path in RNNI($\rho$) is the sum of the weights of all moves along the path. The distance between two trees in RNNI($\rho$) is the weight of a path with the minimal weight, which we will call a shortest path. When $\rho = 1$ we assume that the graph is unweighted.

We consider the following class of problems parametrised by a real number $\rho \ge 0$.

RNNI($\rho$)-SP
INSTANCE: A pair of trees $T$ and $R$ on $n$ leaves
FIND: A path of minimal weight between $T$ and $R$ in RNNI($\rho$)

Since RNNI($\rho$) is a connected graph, there always exists a solution to RNNI($\rho$)-SP. Furthermore, the size of every solution to an instance of RNNI($\rho$)-SP is bounded by a polynomial in $n$, despite the search space being super-exponential. This is because the diameter of the RNNI(1) graph is bounded from above [9] by a quadratic function of $n$ [...] NP-hard. To be consistent with notations used in the literature [9], we will denote the graph RNNI(1) by RNNI.

2. FindPath algorithm

In this section we introduce an algorithm called FindPath that computes paths between trees and is quadratic in the number of leaves.

An input of the FindPath algorithm is two trees $T$ and $R$ in their cluster representation. We denote the representation of $R$ by $[C_1, \ldots, C_{n-1}]$. The algorithm considers the clusters $C_1, \ldots, C_{n-2}$ iteratively in their order and produces a sequence $p$ of trees which becomes a shortest path from $T$ to $R$ after the algorithm terminates. During each iteration $k = 1, \ldots, n-2$, new trees are added to $p$ if necessary, and we will refer to the last added tree as $T_1$. In iteration $k$, the rank of $(C_k)_{T_1}$ is decreased by RNNI moves until $C_k$ is induced by the node of rank $k$ in $T_1$. In Proposition 1 we show that FindPath is a deterministic algorithm with running time quadratic in the number of leaves $n$. In particular, there always exists a unique move that decreases the rank of $(C_k)_{T_1}$ as described above.

Algorithm 1 FindPath($T$, $R$)
 1: $T_1 := T$, $p := [T_1]$, $[C_1, \ldots, C_{n-1}] := R$
 2: for $k = 1, \ldots, n-2$ do
 3:     while $\mathrm{rank}((C_k)_{T_1}) > k$ do
 4:         if $(C_k)_{T_1}$ and node $u$ preceding $(C_k)_{T_1}$ in $T_1$ are connected by an edge then
 5:             $T_2$ is $T_1$ with the rank of $(C_k)_{T_1}$ decreased by an NNI move
 6:         else
 7:             $T_2$ is $T_1$ with ranks of $u$ and $(C_k)_{T_1}$ swapped
 8:         $T_1 = T_2$
 9:         $p = p + T_2$
10: return $p$

Proposition 1. FindPath is a correct deterministic algorithm that runs in $O(n^2)$ time.

Proof. To show that FindPath is a deterministic algorithm (see the pseudocode above), we have to prove that the tree $T_2$ constructed in the while loop (line 3) of the algorithm always exists and is uniquely defined. If $T_2$ is obtained in line 7 from $T_1$ by a rank move, the tree exists and is unique because there always exists exactly one rank move on any particular interval that is not an edge. It remains to show that an NNI move that decreases the rank of $(C_k)_{T_1}$ always exists and is unique. To prove this we consider cases $k = 1$ and $k > 1$.

Case $k = 1$. In this case $C_k$ consists of two leaves $\{x, y\}$. Since we assumed that the while condition is satisfied, the node $v = (\{x, y\})_{T_1}$ has rank $r > 1$. Consider the node $u$ preceding $v$ in $T_1$, so the rank of $u$ is $r - 1$. Assume without loss of generality that $x$ is in the cluster induced by $u$, so $y$ has to be outside this cluster. Consider the following three disjoint subtrees of $T_1$: first, the subtree induced by a child of $u$ and containing $x$; second, the subtree induced by the other child of $u$; third, the subtree induced by a child of $v$ and containing $y$. Now observe that out of the two NNI moves possible on the edge $[u, v]$ in $T_1$, only the one that swaps the second and the third of these subtrees decreases the rank of the most recent common ancestor of $\{x, y\}$. Hence $T_2$ exists and is unique in this case.

Case $k > 1$. In this case $C_k = C_i \cup C_j$ for $i, j < k$. The subtree of $T_1$ induced by $(C_i)_{T_1}$ is identical to the subtree of $R$ induced by $(C_i)_R$, and the same is true for $(C_j)_{T_1}$ and $(C_j)_R$. Hence, we can reduce this case to $k = 1$ by suppressing $C_i$ and $C_j$ in both $T_1$ and $R$ to new leaves $c_i$ and $c_j$ (of rank zero) respectively. As in Case $k = 1$, exactly one of two possible NNI moves decreases the rank of the most recent common ancestor of $c_i, c_j$ in $T_1$, so the same is true for the most recent common ancestor $(C_k)_{T_1}$, and $T_2$ is unambiguously defined.

Thus, FindPath is a deterministic algorithm.

To prove correctness, note that the algorithm starts by adding $T$ to the output path, and every new tree added to the output path is an RNNI neighbour of the previously added one (see lines 5 and 7). To see that the output path terminates in $R$, observe that after $k$ iterations of the for loop (line 2) of the algorithm, the first $k$ clusters of $T_1$ and $R$ must coincide, and so after $n-2$ iterations a path between $T$ and $R$ is constructed.

The worst-case time complexity of FindPath is quadratic in the number of leaves, as there can be at most $n-2$ iterations of the for loop (line 2) and in every iteration of the for loop at most $n-2$ while loops (line 3) are executed. Here and throughout the paper we assume that the output of FindPath is encoded by a list of RNNI moves rather than an actual list of trees. This is because writing out a tree on $n$ leaves takes time linear in $n$, which would make the complexity of FindPath cubic. □

3. FindPath computes shortest paths in optimal time

In this section we prove the main result of this paper, that RNNI(1)-SP is polynomial. Specifically, we prove that paths returned by FindPath are always shortest. We also show that FindPath is an optimal algorithm, that is, no sub-quadratic algorithm can solve RNNI(1)-SP.

The main ingredient of our proof is to show that a local property (see (1) in the proof) of the FindPath algorithm is enough to establish that the output paths are shortest. The property can intuitively be understood as FindPath always choosing the best tree possible to go to. Importantly, this result can be used for an arbitrary vertex proposal algorithm in an arbitrary graph to establish that the algorithm always follows a shortest path between vertices in the graph, hence our proof technique is of general interest.
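To make the object of this section concrete, the following Python sketch illustrates FindPath on cluster representations. It is an illustration under stated assumptions rather than a reference implementation: a tree is assumed to be a list of `frozenset` clusters ordered by rank (index `i` holds the node of rank `i + 1`), and the names `mrca_index`, `child_clusters`, and `findpath` are hypothetical. The sketch returns the visited trees and rescans the cluster list at every step, so it does not attain the quadratic bound, which assumes that the output is a list of moves.

```python
from typing import FrozenSet, List, Tuple

# Assumed encoding: clusters ordered by rank; index i holds the node of rank i + 1.
Tree = List[FrozenSet[str]]


def mrca_index(tree: Tree, leaves: FrozenSet[str]) -> int:
    """Index of the lowest-ranked node whose cluster contains `leaves`."""
    return next(i for i, cluster in enumerate(tree) if leaves <= cluster)


def child_clusters(tree: Tree, i: int) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """Clusters induced by the two children of the node at index i."""
    cluster = tree[i]
    for j in range(i - 1, -1, -1):
        if tree[j] < cluster:  # the highest-ranked proper subcluster is a child
            return tree[j], cluster - tree[j]
    first, second = sorted(cluster)  # no subcluster: both children are leaves
    return frozenset([first]), frozenset([second])


def findpath(start: Tree, target: Tree) -> List[Tree]:
    """Trees visited by FindPath from `start` to `target`, both included."""
    current = list(start)
    path = [list(current)]
    for k in range(len(target) - 1):  # the root cluster never has to move
        c_k = target[k]
        while (i := mrca_index(current, c_k)) > k:
            u, v = current[i - 1], current[i]
            if u < v:
                # u is a child of v: NNI move that keeps the child of u meeting C_k
                left, right = child_clusters(current, i - 1)
                keep = left if left & c_k else right
                current[i - 1] = keep | (v - u)
            else:
                # u and v are not connected by an edge: rank move
                current[i - 1], current[i] = v, u
            path.append(list(current))
    return path


if __name__ == "__main__":
    T = [frozenset("ab"), frozenset("abc"), frozenset("abcd"), frozenset("abcde")]
    R = [frozenset("ae"), frozenset("ade"), frozenset("acde"), frozenset("abcde")]
    print(len(findpath(T, R)) - 1)  # number of RNNI moves on the path
```

For the two caterpillar trees on five leaves in the usage example, the path consists of 6 moves.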

Theorem 1. The worst-case time complexity of the shortest path problem in the RNNI graph on trees with $n$ leaves is $O(n^2)$. Hence RNNI(1)-SP is polynomial time solvable.

Proof. We prove this theorem by showing that for every pair of trees $T$ and $R$, the path computed by the FindPath algorithm is a shortest RNNI path. We denote this path by $\mathrm{FP}(T, R)$ and its length by $|\mathrm{FP}(T, R)|$. By $d(T, R)$ we denote the length of a shortest path between $T$ and $R$, that is, the RNNI distance between trees. We hence want to show that $|\mathrm{FP}(T, R)| = d(T, R)$ for all trees.

Assume to the contrary that $T$ and $R$ are two trees with a minimum distance $d(T, R)$ such that $d(T, R) \ne |\mathrm{FP}(T, R)|$, that is, $d(T, R) < |\mathrm{FP}(T, R)|$. Let $T'$ be the first tree on a shortest RNNI path from $T$ to $R$. Then $d(T', R) = d(T, R) - 1$, implying that the distance between $T'$ and $R$ is strictly smaller than that between $T$ and $R$. Hence $d(T', R) = |\mathrm{FP}(T', R)| < |\mathrm{FP}(T, R)| - 1$. We finish the proof by showing that no trees satisfy this inequality.

Specifically, we will show that for all trees $T$, $R$, and $T'$ such that $T'$ is one RNNI move away from $T$,

$$|\mathrm{FP}(T', R)| \ge |\mathrm{FP}(T, R)| - 1. \tag{1}$$
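The reduction to inequality (1) can be spelled out as a single chain. The block below only restates the steps of the preceding paragraph in LaTeX, using the symbols $d$, $\mathrm{FP}$, $T$, $T'$, and $R$ already defined; the annotations on the right are a reading of the argument, not additional claims.

```latex
\begin{align*}
|\mathrm{FP}(T', R)|
  &= d(T', R)
  && \text{since $d(T', R) < d(T, R)$ and $(T, R)$ is a counterexample of minimal distance} \\
  &= d(T, R) - 1
  && \text{since $T'$ is the first tree on a shortest path from $T$ to $R$} \\
  &< |\mathrm{FP}(T, R)| - 1
  && \text{since $d(T, R) < |\mathrm{FP}(T, R)|$ by assumption.}
\end{align*}
```

Because $T'$ is one RNNI move away from $T$, this strict inequality is exactly the negation of inequality (1) for the triple $T$, $R$, $T'$, so establishing (1) for all such triples rules out any counterexample.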

Figure 2. Trees $T$, $T'$, and $R$ as in inequality (1). Paths $\mathrm{FP}(T, R) = [T, T_1, T_2, \ldots, R]$ and $\mathrm{FP}(T', R) = [T', T'_1, T'_2, \ldots, R]$ are indicated by arrows.

Assume to the contrary that $T$ and $R$ are trees for which there exists $T'$ violating inequality (1). Out of all such pairs $T, R$ choose one with the minimal $|\mathrm{FP}(T, R)|$. Denote $\mathrm{FP}(T, R) = [T, T_1, T_2, \ldots, R]$ and $\mathrm{FP}(T', R) = [T', T'_1, T'_2, \ldots, R]$, and let $[(T)_t, (T)_{t+1}]$ be the interval in $T$ on which the RNNI move connecting $T$ and $T'$ is performed. Let $C_k$ be the cluster of $R$ such that the node $(C_k)_T$ is moved down by the first move on $\mathrm{FP}(T, R)$. If the rank of $(C_k)_T$ is not in $\{t, t+1\}$ then $(C_k)_T$ and $(C_k)_{T'}$ induce the same cluster, so FindPath would make the same rearrangement in both trees $T$ and $T'$ in the first move along $\mathrm{FP}(T, R)$ and $\mathrm{FP}(T', R)$, resulting in trees $T_1$ and $T'_1$ which are RNNI neighbours, as in Figure 2. In this case, paths $\mathrm{FP}(T_1, R)$ and $\mathrm{FP}(T'_1, R)$ violate inequality (1) but $\mathrm{FP}(T_1, R)$ is strictly shorter than $\mathrm{FP}(T, R)$, contradicting our minimality assumption. Hence, the first move on $\mathrm{FP}(T, R)$ has to involve an interval incident to at least one of the nodes $(T)_t$, $(T)_{t+1}$.

Moreover, because $C_k$ is the first cluster satisfying the while condition of FindPath applied to $T$ and $R$, all clusters $C_j$ with $j < k$ have to be present in $T$. And since the first move on $\mathrm{FP}(T, R)$, which decreases the rank of $(C_k)_T$, involves nodes with ranks not higher than $t+2$, the most recent common ancestor of $C_k$ has rank not higher than $t+1$ after this move. Hence $k \le t+1$. Furthermore, clusters $C_j$ for all $j \le k-2$ are present in $T'$ as well as $T$, because all clusters induced by nodes of rank $t-1$ or lower coincide in $T$ and $T'$. The cluster $C_{k-1}$, however, might not be induced by a node in $T'$ if $k-1 = t$. Therefore, the first move on $\mathrm{FP}(T', R)$ can decrease the rank of the most recent common ancestor of either $C_{k-1}$ or $C_k$.

We will distinguish two cases depending on whether $T$ and $T'$ are connected by an NNI or a rank move. For each of these we will further distinguish all possible moves between $T$ and $T_1$. Note that in all figures illustrating possible moves on $\mathrm{FP}(T, R)$ and $\mathrm{FP}(T', R)$ below, the position of the tree root is irrelevant, so we have positioned roots to simplify our figures.

Figure 3. NNI move between $T$ and $T'$ on the edge $[(T)_t, (T)_{t+1}]$ indicated in bold, and the third RNNI neighbour resulting from a move on this edge.

Case 1. $T$ and $T'$ are connected by an NNI move. So $[(T)_t, (T)_{t+1}]$ is an edge in $T$ (see Figure 3). Denote the clusters induced by the children of $(T)_t$ by $A$ and $B$ and the cluster induced by the child of $(T)_{t+1}$ that is not $(T)_t$ by $C$, and assume that the NNI move between $T$ and $T'$ exchanges the subtrees induced by clusters $B$ and $C$. Additionally, denote the cluster induced by the child of $(T)_{t+2}$ that is not $(T)_{t+1}$ by $D$ (see Figure 3). Note that $[(T)_{t+1}, (T)_{t+2}]$ does not have to be an edge in tree $T$ (see Case 1.4).

We now consider all possible moves FindPath can perform to go from $T$ to $T_1$ that involve a node of rank $t$ or $t+1$, that is, we will consider three intervals in total.

1.1 RNNI move (either type) on interval $[(T)_t, (T)_{t+1}]$. This move has to be the NNI move that is different from the NNI move connecting $T$ and $T'$. In this case, the cluster $B \cup C$ is built in $T_1$, as depicted in the bottom of Figure 3. Hence the first cluster $C_k$ that satisfies the while condition of FindPath must contain elements from both $B$ and $C$ but not from $A$, and the rank of $(C_k)_R$ has to be at most $t$. But then FindPath applied to $T'$ and $R$ has to decrease the rank of $(C_k)_{T'}$ in its first step, implying that $T'_1 = T_1$, so $|\mathrm{FP}(T', R)| = |\mathrm{FP}(T, R)|$. This contradicts our assumption that $|\mathrm{FP}(T', R)| < |\mathrm{FP}(T, R)| - 1$.

1.2 NNI move on (edge) interval $[(T)_{t+1}, (T)_{t+2}]$ that swaps the subtrees induced by clusters $C$ and $D$. This move is shown in Figure 4A by an arrow from $T$ to the leftmost tree in the middle row. In this case, the first cluster $C_k$ that satisfies the while condition of FindPath computing $\mathrm{FP}(T, R)$ must intersect $D$ but not $C$. Additionally, $C_k$ must intersect $A$, or $B$, or both of them. Hence, we will consider each of these three cases individually, and demonstrate them in Figure 4.

Figure 4. Comparison of paths $\mathrm{FP}(T, R)$ and $\mathrm{FP}(T', R)$ if $T$ and $T'$ are connected by an NNI move on edge $[(T)_t, (T)_{t+1}]$ in $T$. (A) Possible initial segments of $\mathrm{FP}(T, R)$. (B) Possible initial segments of $\mathrm{FP}(T', R)$. The bottom row displays all possibilities for $T_2$ and $T'_2$, depending on the position of cluster $C_k$ that satisfies the while condition of FindPath: case $C_k$ intersects $B$ and $D$ is on the left, $C_k$ intersects $A$ and $D$ is in the middle, and $C_k$ intersects $C$ and $D$ is on the right.

1.2.1 $C_k$ intersects $A$, $B$, and $D$ but not $C$. In this case, since we assumed $[(T)_t, (T)_{t+1}]$ to be an edge in the tree, no move on $T_1$ can decrease the rank of $(C_k)_{T_1}$. It follows from the proof of Proposition 1 that this can happen only when the subtrees induced by $(C_k)_{T_1}$ and $(C_k)_R$ in the corresponding trees coincide. That is, the while condition of FindPath must be false after this first move for all $j \le k$. This implies that $t = k-1$ and $C_{k-1} = A \cup B$. But since the rank of $(C_{k-1})_{T'}$ is $t+1 > k-1$, $C_{k-1}$ has to be the first cluster for which the while condition of FindPath applied to $T'$ and $R$ is met. Hence the first move on $\mathrm{FP}(T', R)$ must decrease the rank of $(C_{k-1})_{T'}$ by building the cluster $A \cup B$, in which case $T'_1 = T$. This however contradicts $|\mathrm{FP}(T', R)| < |\mathrm{FP}(T, R)| - 1$.

1.2.2 $C_k$ intersects $A$ and $D$ but not $B$ or $C$. Starting from $T$, FindPath first exchanges the subtrees induced by clusters $C$ and $D$ and then those induced by $B$ and $D$. This results in trees $T_1$ and $T_2$ (see the path leading to the tree in the middle of the bottom row in Figure 4A). This implies that the rank of $(C_{k-1})_R$ is lower than $t$, so the first cluster that satisfies the while condition of FindPath applied to $T'$ and $R$ is $C_k$. Hence, starting from $T'$, FindPath first exchanges the subtrees induced by $B$ and $D$ and then those induced by $C$ and $D$. This results in trees $T'_1$ and $T'_2$ (see the path leading to the tree in the middle of the bottom row in Figure 4B). It follows that $T_2$ and $T'_2$ are connected by an RNNI move on the interval $[(T_2)_{t+1}, (T_2)_{t+2}]$ (indicated by dotted edges in the corresponding trees in Figure 4). This together with the facts that $|\mathrm{FP}(T_2, R)| = |\mathrm{FP}(T, R)| - 2$ and $|\mathrm{FP}(T'_2, R)| = |\mathrm{FP}(T', R)| - 2$ contradicts the assumption that $\mathrm{FP}(T, R)$ is of minimal length violating inequality (1).

1.2.3 $C_k$ intersects $B$ and $D$ but not $A$ or $C$. This case is analogous to the previous one. The two initial segments of $\mathrm{FP}(T, R)$ and $\mathrm{FP}(T', R)$ are the paths leading to the leftmost trees in the bottom row of Figures 4A and 4B, respectively. Note that the rank swap leading from $T'_1$ to $T'_2$ is required because the rank of $(C_k)_R$ is at most $t$, as implied by the move leading from $T_1$ to $T_2$. The corresponding trees $T_2$ and $T'_2$ are again RNNI neighbours.

1.3 NNI move on (edge) interval $[(T)_{t+1}, (T)_{t+2}]$ that builds a cluster $C \cup D$ in $T_1$. This move is shown in Figure 4A by an arrow from $T$ to the second leftmost tree in the middle row. In this case, $C_k$ intersects $C$ and $D$ but not $A$ or $B$, and we have the following two possibilities to consider.

1.3.1 The ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide. In this case, the previous cluster $C_{k-1}$ of $R$ has to be $A \cup B$. Since $A \cup B$ is not a cluster in $T'$, the first RNNI move on $\mathrm{FP}(T', R)$ builds the cluster $A \cup B$ by swapping the subtrees induced by clusters $B$ and $C$. This move results in $T'_1 = T$, contradicting $|\mathrm{FP}(T', R)| < |\mathrm{FP}(T, R)| - 1$.

1.3.2 The rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$. In this case, FindPath decreases the rank of $(C_k)_{T_1}$ in the second step. This results in the path from $T$ to the rightmost tree in Figure 4A. Hence, $\mathrm{FP}(T', R)$ also has to begin with two moves that decrease the rank of $(C_k)_{T'}$ twice, resulting in the rightmost path in Figure 4B. Similarly to case 1.2.2, we arrive at a contradiction that trees $T_2$, $T'_2$, and $R$ violate inequality (1) and $|\mathrm{FP}(T_2, R)| < |\mathrm{FP}(T, R)|$.

1.4 Rank move on interval $[(T)_{t+1}, (T)_{t+2}]$. This case is analogous to case 1.3 (see Figure 5). If the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide then $C_{k-1} = A \cup B$, and applying FindPath to $T'$, $R$ we get $T'_1 = T$. If the rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$ then FindPath decreases the rank of $(C_k)_{T_1}$ in the second step. Recall that the interval between nodes of rank $t$ and $t+1$ is an edge in both $T$ and $T'$. Hence, the first two moves on $\mathrm{FP}(T', R)$ decrease the rank of $(C_k)_{T'}$ twice, resulting in $T'_2$ which is an RNNI neighbour of $T_2$ as depicted in Figure 5. As before, this contradicts our minimality assumption.

1.5 RNNI move (either type) on interval $[(T)_{t-1}, (T)_t]$. In this case $C_k \subseteq A \cup B$ and the rank of $(C_k)_R$ is at most $t-1$. This implies that $C_k$ is the first cluster to satisfy the while condition for $T'$ and the first move on $\mathrm{FP}(T', R)$ decreases the rank of $(C_k)_{T'}$ by exchanging the subtrees induced by $B$ and $C$. This results in $T'_1 = T$.

Case 2. $T$ and $T'$ are connected by a rank move. We assume that the rank move is performed on the interval $[(T)_t, (T)_{t+1}]$. Denote the cluster induced by $(T)_t$ by $A$, the clusters induced by the children of $(T)_t$ by $A_1$ and $A_2$, the cluster induced by $(T)_{t+1}$ by $B$, and the clusters induced by the children of $(T)_{t+1}$ by $B_1$ and $B_2$ (see Figure 6).

Figure 5. Comparison of paths $\mathrm{FP}(T, R)$ and $\mathrm{FP}(T', R)$ if there is an NNI move between $T$ and $T'$ and a rank move on the interval above this edge follows on $\mathrm{FP}(T, R)$.

Figure 6. Rank move between $T$ and $T'$ and possible initial segments of $\mathrm{FP}(T, R)$ and $\mathrm{FP}(T', R)$ when $[(T)_{t+1}, (T)_{t+2}]$ is an edge. We use notations $A = A_1 \cup A_2$ and $B = B_1 \cup B_2$.

We again consider all possible moves FindPath can perform to go from $T$ to $T_1$ that involve a node of rank $t$ or $t+1$.

2.1 Rank move on $[(T)_t, (T)_{t+1}]$. This move results in $T_1 = T'$.

2.2 NNI move on (edge) interval $[(T)_{t+1}, (T)_{t+2}]$. The following two sub-cases are analogous to case 1.3.

2.2.1 $(T)_{t+2}$ is a parent of $(T)_t$. The first move on $\mathrm{FP}(T, R)$ builds a cluster $A \cup B_1$ or $A \cup B_2$, and we assume without loss of generality that it is the former, as in Figure 6. This implies that $C_k$ intersects $A$ and $B_1$ but not $B_2$. If the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide then the previous cluster $C_{k-1}$ of $R$ has to be $A$. Therefore, the first move on $\mathrm{FP}(T', R)$ decreases the rank of $(A)_{T'}$, which results in $T'_1 = T$. If the rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$ then FindPath decreases the rank of $(C_k)_{T_1}$ in the second step. Due to the symmetry we can assume that $C_k \subseteq A_1 \cup B_1$, which implies that the move between $T_1$ and $T_2$ exchanges the subtrees induced by $A_2$ and $B_1$, as depicted on the left of Figure 6. $C_k \subseteq A_1 \cup B_1$ implies that the first two moves on $\mathrm{FP}(T', R)$ result in a tree $T'_2$ that is an RNNI neighbour of $T_2$ (see Figure 6). This is a contradiction to the minimality assumption on $|\mathrm{FP}(T, R)|$.

2.2.2 $(T)_{t+2}$ is not a parent of $(T)_t$. In this case, there exists a cluster $C$ induced by the child of $(T)_{t+2}$ which is different from the one that induces $B$ (see Figure 7). We can assume without loss of generality that $C_k \subseteq C \cup B_1$ and the first move on $\mathrm{FP}(T, R)$ builds a new cluster $C \cup B_1$. If the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide then $C_{k-1} = A$, which implies that $A$ is induced by the node of rank $t$ in both $T$ and $R$. So $T'_1 = T$. If the rank of $(C_k)_{T_1}$ is strictly higher than that of $(C_k)_R$ then FindPath decreases the rank of $(C_k)_{T_1}$ in the second step (see Figure 7). The corresponding first moves on $\mathrm{FP}(T', R)$ are shown on the right in Figure 7, and we again get that $T_2$ and $T'_2$ are RNNI neighbours.

Figure 7. Comparison of paths $\mathrm{FP}(T, R)$ and $\mathrm{FP}(T', R)$ if there is a rank move between $T$ and $T'$ and an NNI move on the edge below the corresponding (rank) interval follows on $\mathrm{FP}(T, R)$.

2.3 Rank move on interval $[(T)_{t+1}, (T)_{t+2}]$. Again, depending on whether or not the ranks of $(C_k)_{T_1}$ and $(C_k)_R$ coincide, we arrive at the conclusion that either $T'_1 = T$ or $T_2$ and $T'_2$ are RNNI neighbours, similarly to case 1.4.

2.4 RNNI move (either type) on interval $[(T)_{t-1}, (T)_t]$. In this case $C_k \subseteq A$ and the first move on $\mathrm{FP}(T', R)$ must be a rank swap resulting in $T'_1 = T$.

Since all possible cases result in a contradiction, we conclude that inequality (1) is true for all trees, which completes the proof of the theorem. □

We finish this section by showing that no algorithm has strictly lower worst-case time complexity than FindPath. We again assume here that the output of an algorithm for solving RNNI(1)-SP is a list of RNNI moves. Requiring the output to be a list of trees would result in cubic complexity while maintaining the optimality of FindPath.

Corollary 1. The time complexity of the shortest path problem RNNI(1)-SP is $\Omega(n^2)$.

Proof. We prove this by establishing a lower bound on the output size of the problem, that is, the length of a shortest path. Consider two “caterpillar” trees $T = [\{a_1, a_2\}, \{a_1, a_2, a_3\}, \ldots, \{a_1, a_2, \ldots, a_n\}]$ and $R = [\{a_1, a_n\}, \{a_1, a_n, a_{n-1}\}, \ldots, \{a_1, a_n, \ldots, a_2\}]$. Applied to these trees, FindPath executes an NNI move in each of the $n - k - 1$ iterations of the while loop (line 3) in every iteration $k$ of the for loop (line 2). Hence the length of the output path of FindPath is

$$\sum_{k=1}^{n-2} k = \frac{(n-1)(n-2)}{2}$$

and therefore quadratic in $n$. Theorem 1 then implies that this path is a shortest path. It follows that the worst-case size of the output to RNNI(1)-SP is quadratic. □

For what $\rho$ is RNNI($\rho$)-SP polynomial?

As we have seen in Section 2, the shortest path problem RNNI(1)-SP is solvable in polynomial time. In this section, we will show that a classical result in mathematical phylogenetics implies that RNNI(0)-SP is NP-hard. We will also discuss RNNI($\rho$)-SP for other values of $\rho$.

Theorem 2 (DasGupta et al. [5]). RNNI(0)-SP is NP-hard.

Proof. Note that the length of the path required in an instance of RNNI(0)-SP is equal to the minimum number of NNI moves necessary to convert one tree into another tree where the rankings of internal nodes are ignored and NNI moves are allowed on every edge. This minimum is called the NNI distance and the corresponding problem is known to be NP-hard [5]. □

In the light of Theorem 1 and Theorem 2 the following problem is natural.

Problem 1. Characterise the complexity of RNNI($\rho$)-SP in terms of $\rho$.

This problem is also of applied value. For example, trees might come from an inference method with higher certainty of their branching structure and lower certainty of their node order. A comparison method for such trees should have a higher penalty for NNI changes and a lower penalty for rank changes, which in our notation requires $\rho < 1$.

The FindPath algorithm substantially relies on the fact that the rank move and the NNI move have the same weight in the RNNI graph. This suggests that a non-trivial algorithmic insight is necessary to extend our polynomial complexity result to other values of $\rho$.

Proposition 2.

FindPath does not compute shortest paths in RNNI($\rho$) for $\rho \neq 1$.

Proof. For $\rho > 1$ consider trees $T = [\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$ and $R = [\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$. Applied to these trees FindPath proceeds from $T$ to $[\{a, a\}, \{a, a\}, \{a, a, a, a\}]$, then to $[\{a, a\}, \{a, a\}, \{a, a, a, a\}]$, and then to $R$. This path consists of two NNI moves with one rank move in between them and therefore has weight $2 + \rho$. However, the path from $T$ to $[\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$ to $[\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$ to $R$ consists of three NNI moves and is hence shorter.
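The comparison in the case $\rho > 1$ is plain arithmetic on move weights. As a minimal sketch, assuming (as the weights $2 + \rho$ and 3 quoted in the proof indicate) that an NNI move costs 1 and a rank move costs $\rho$, the hypothetical helper `path_weight` below evaluates both routes. The move labels are illustrative strings, not the paper's notation.

```python
from typing import Sequence

NNI_WEIGHT = 1.0  # assumed unit weight of an NNI move


def path_weight(moves: Sequence[str], rho: float) -> float:
    """Weight of a path: each NNI move costs 1, each rank move costs rho."""
    return sum(rho if move == "rank" else NNI_WEIGHT for move in moves)


# The two routes compared in the proof for rho > 1.
FINDPATH_ROUTE = ("NNI", "rank", "NNI")  # weight 2 + rho
ALL_NNI_ROUTE = ("NNI", "NNI", "NNI")    # weight 3


def findpath_route_not_longer(rho: float) -> bool:
    """True if the FindPath route is no heavier than the all-NNI route."""
    return path_weight(FINDPATH_ROUTE, rho) <= path_weight(ALL_NNI_ROUTE, rho)
```

For $\rho = 2$ the FindPath route weighs 4 against 3 for the all-NNI route, so it is not a shortest path. At $\rho = 1$ both routes weigh 3, in line with Theorem 1.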

Figure 8. Path computed by FindPath (top) and a shorter path (bottom) for $\rho > 1$.

Figure 9. Path computed by FindPath (top) and a shorter path (bottom) for $\rho < 1$.

For $\rho < 1$ consider trees $T = [\{a, a\}, \{a, a\}, \{a, a, a, a\}]$ and $R = [\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$. Applied to these trees FindPath proceeds from $T$ to $[\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$, then to $[\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$, and then to $R$. This path consists of three NNI moves and therefore has weight 3. However, the path from $T$ to $[\{a, a\}, \{a, a\}, \{a, a, a, a\}]$ to $[\{a, a\}, \{a, a, a\}, \{a, a, a, a\}]$ to $R$ consists of one rank move followed by two NNI moves and is hence shorter. □

Additional open problems

The idea utilised by DasGupta et al. [5] to prove that computing distances in NNI is NP-hard stems from a result that shortest paths in NNI do not preserve clusters [13], that is, sometimes a cluster shared by two trees $T$ and $R$ is shared by no other tree on any shortest path between $T$ and $R$. This counter-intuitive property eventually led to the computational hardness result in NNI. Moreover, this property makes little sense biologically as trees clustering the same set of sequences into a subtree should be closer to each other than to a tree that does not have that subtree. Indeed, a shared cluster means that both trees support the hypothesis that this cluster has evolved along a subtree. In light of this biological argument, the NP-hardness result can be interpreted as RNNI($\rho$)-SP being hard only when the graph RNNI($\rho$) is biologically irrelevant. Although in sharp contrast with the common belief in the field of computational phylogenetics [25], this interpretation resonates with the idea suggested in the beyond worst case analysis framework [19] that some problems are only computationally hard when their instances are practically irrelevant. The following question is hence natural.

(1) For which values of $\rho$ does RNNI($\rho$) have the cluster property? How do those compare to the values of $\rho$ for which RNNI($\rho$)-SP is efficient?

Other natural questions that arise in the context of our results are the following.

(2) The questions we have considered for ranked NNI can be studied in other rearrangement-based graphs on leaf-labelled trees, such as the ranked SPR graph and the ranked TBR graph [20]. What is the complexity of the shortest path problem there?

(3) Can our results be used to establish whether the problem of computing geodesics between trees with real-valued node heights is polynomial-time solvable? This geodesic metric space is called t-space and an efficient algorithm for computing geodesics in t-space would be of importance for applications [8].
References

[1] BL Allen and M Steel. “Subtree Transfer Operations and Their Induced Metrics on Evolutionary Trees”. Ann. Comb.
Algorithms Mol. Biol.
Ann. Comb.
PLoS Comput. Biol.
Discrete Mathematical Problems with Medical Applications: DIMACS Workshop Discrete Mathematical Problems with Medical Applications, December 8-10, 1999, DIMACS Center 55 (2000), p. 19.
[6] B DasGupta et al. “On the Linear-Cost Subtree-Transfer Distance between Phylogenetic Trees”. Algorithmica
Fundamentals of Parameterized Complexity. Springer, London, 2013.
[8] A Gavryushkin and AJ Drummond. “The space of ultrametric phylogenetic trees”. J. Theor. Biol. 403 (Aug. 2016), pp. 197–208.
[9] A Gavryushkin, C Whidden, and FA Matsen 4th. “The combinatorics of discrete time-trees: theory and open problems”. J. Math. Biol.
Syst. Biol.
Discrete Appl. Math.
Evol. Bioinform. Online
Computing and Combinatorics. Lecture Notes in Computer Science (June 1996), pp. 343–351.
[14] K Makarychev, Y Makarychev, and A Vijayaraghavan. “Bilu–Linial stable instances of max cut and minimum multiway cut”. Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms (2014), pp. 890–906.
[15] FR McMorris and MA Steel. “The complexity of the median procedure for binary trees”. New Approaches in Classification and Data Analysis. Springer Berlin Heidelberg, 1994, pp. 136–140.
[16] LT Nguyen et al. “IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies”. Mol. Biol. Evol.
Math. Biosci.
[18] F Ronquist and JP Huelsenbeck. “MrBayes 3: Bayesian phylogenetic inference under mixed models”. Bioinformatics
Commun. ACM
Phylogenetics. Oxford University Press, 2003.
[21] A Stamatakis. “RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models”. Bioinformatics
Virus Evol
Mol. Biol. Evol.
Experimental Algorithms. Lecture Notes in Computer Science (2010), pp. 141–153.
[25] C Whidden and F Matsen. “Calculating the Unrooted Subtree Prune-and-Regraft Distance”. IEEE/ACM Trans. Comput. Biol. Bioinform. (Feb. 2018).
[26] C Whidden and FA Matsen IV. “Ricci-Ollivier Curvature of the Rooted Phylogenetic Subtree-Prune-Regraft Graph”. Proceedings of the Thirteenth Workshop on Analytic Algorithmics and Combinatorics (ANALCO16) (2016), pp. 106–120.
[27] C Whidden, N Zeh, and RG Beiko. “Supertrees Based on the Subtree Prune-and-Regraft Distance”.