A simulation approach for change-points on phylogenetic trees
Adam Persing, Ajay Jasra, Alexandros Beskos, David Balding, Maria De Iorio
Department of Statistical Science, University College London, London, WC1E 7HB, UK. E-mail: [email protected], [email protected], [email protected]
Department of Statistics & Applied Probability, National University of Singapore, Singapore, 117546, SG. E-mail: [email protected]
Genetics Institute, University College London, London, WC1E 6BT, UK. E-mail: [email protected]
Abstract
We observe n sequences at each of m sites, and assume that they have evolved from an ancestral sequence that forms the root of a binary tree of known topology and branch lengths, but the sequence states at internal nodes are unknown. The topology of the tree and branch lengths are the same for all sites, but the parameters of the evolutionary model can vary over sites. We assume a piecewise constant model for these parameters, with an unknown number of change-points and hence a trans-dimensional parameter space over which we seek to perform Bayesian inference. We propose two novel ideas to deal with the computational challenges of such inference. Firstly, we approximate the model based on the time machine principle: the top nodes of the binary tree (near the root) are replaced by an approximation of the true distribution; as more nodes are removed from the top of the tree, the cost of computing the likelihood is reduced linearly in n. The approach introduces a bias, which we investigate empirically. Secondly, we develop a particle marginal Metropolis-Hastings (PMMH) algorithm that employs a sequential Monte Carlo (SMC) sampler and can use the first idea. Our time-machine PMMH algorithm copes well with one of the bottlenecks of standard computational algorithms: the trans-dimensional nature of the posterior distribution. The algorithm is implemented on simulated and real data examples, and we empirically demonstrate its potential to outperform competing methods based on approximate Bayesian computation (ABC) techniques.

Keywords: binary trees, change-point models, particle marginal Metropolis-Hastings, sequential Monte Carlo samplers, time machine, approximate Bayesian computation.

Author Summary
A phylogeny (or evolutionary tree) can explain the ancestral relationships among species based on similarities in their genetic sequences (e.g., DNA). A phylogenetic model will typically be parametrized by rates which can correspond to genetic mutations that occur within populations as they evolve over time. In many applications, it is reasonable to assume that genetic sequences share a common phylogenetic structure across all of their sites. However, to allow for greater modelling flexibility, it is often desirable to allow the evolutionary rate parameters to change across the length of the sequences. The main focus of this paper is estimating that rate variation across the sites. We consider a model that allows neighbouring blocks of sites to be parametrized by different evolutionary rates (i.e., a change-point model), and we propose a novel computational scheme that enables a practitioner to fit this model to genetic sequences when the true number of change-points is unknown. Thus, our contribution is a methodology that infers both the number of distinct blocks along the length of the sequence and the values of the rates themselves. We empirically demonstrate the potential of our algorithm to outperform competing computational methods.
Introduction

A phylogeny (or evolutionary tree) is the most common structure employed to explain the evolutionary relationships among species ('taxa') based on similarities in their physical or (more usually) genetic characteristics. The branching pattern of the tree is usually referred to as its topology, and describes shared and independent periods of evolution of different taxa. The leaves of the tree correspond to observations on the taxa. In a rooted phylogenetic tree (Figure 1 of Appendix A), each internal node corresponds to a speciation event and represents the most recent common ancestor of all the taxa descended from that node. The length of the edges connecting the nodes ('branches') can be interpreted as the time between speciation events.

The evolutionary analysis of molecular sequence variation is statistically challenging. Parsimony methods were among the first approaches for inferring phylogenies, but in recent years, great research effort has been devoted to likelihood-based methods, both in the frequentist [1] and Bayesian frameworks.

DNA sequences occupy one of four states (A, C, G, T) at each site, and so specifying the likelihood function requires a model for how these change over time at each site. The simplest such model is the Jukes-Cantor, in which each state is substituted by any other state at the points of a homogeneous Poisson process. The Kimura model has a rate for transitions (A ↔ G or C ↔ T) that can differ from the rate for transversions (all other substitutions); see [2, Chapter 13]. Objects of inference can include the topology of the phylogenetic tree (here regarded as known), the relative branch lengths on the tree, and the substitution rates.

Likelihood-based approaches usually assume that substitution rates are the same at all sites, so that the likelihood is obtained as a product across sites. However, variation in substitution rates along DNA sequences is well established [3].
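As a concrete illustration of the simplest substitution model mentioned above, the Jukes-Cantor transition probabilities along a branch have a well-known closed form. The sketch below is ours, not the paper's code; the function name and the per-pair rate parametrisation are assumptions made for illustration:

```python
import math

def jc69_transition(t, alpha=1.0, p=4):
    """Jukes-Cantor transition probabilities along a branch of length t.

    The generator substitutes each state by each *other* state at rate
    alpha, so the closed form below follows from the single non-zero
    eigenvalue -p*alpha of the rate matrix (p = 4 for DNA).  Returns
    (P[same state after time t], P[one particular different state]).
    """
    decay = math.exp(-p * alpha * t)
    same = 1.0 / p + (1.0 - 1.0 / p) * decay
    diff = (1.0 - decay) / p
    return same, diff
```

Rows of the implied transition matrix sum to one (same + (p − 1)·diff = 1), and the Chapman-Kolmogorov property holds: composing the matrix over t with itself equals the matrix over 2t. Under the Kimura model one would simply carry two such rates, one for transitions and one for transversions.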
This variation can be explained by variation in functional constraint across the genes encoded in the sequences. If the DNA sequence is from a coding region, natural selection may constrain variability at some sites more than others, and therefore sites might exhibit different rates of evolution. Therefore, it is important to accommodate rate variation across sites in phylogenetic inference [4, 5]. One possibility is to estimate a different rate for each site [6], but this is computationally demanding because of the large number of parameters, and the limited information per parameter leads to poor inferences. A better alternative is to assume that the rates at different sites are independent draws from a distribution, typically either a Gamma [7, 8] or a Log-Normal distribution [9]. A more realistic model would assume that the rates are auto-correlated along the sequence. One possible solution is offered by phylogenetic hidden Markov (phylo-HMM) models, which allow for correlated rates between nearby sites [10, 11]: the rate of evolution is modelled as a Markov process operating along the sequence and site-specific rates are drawn from a finite set of values. The discrete number of 'rate categories' represents one limitation of the phylo-HMM approach [12, 13], while another is the small number of taxa that can be accommodated with reasonable computational resources [14]. Alternatively, [15] have developed a Bayesian multiple change-point model of rate variation along the DNA sequence, which assumes that sites are grouped into an unknown number of contiguous segments, each with possibly a different tree topology, as well as different substitution rates and branch lengths. Several recent proposals involve finite mixtures of distributions to model heterogeneity across sites. In this case, the distribution of each site on the sequence is a mixture of multiple processes, each of which may have its own tree topology, branch lengths and substitution rates (e.g. [16, 3, 17]).
[18] extend these ideas to infinite mixtures assuming a Dirichlet process prior.

The main focus of this paper is estimating evolutionary rate variation across sites, assuming that the tree topology and branch lengths are known and the same at every site under analysis. The latter assumption is not very restrictive in most applications, which involve taxa that are separated by enough time that within-taxon coalescent variation is unimportant. Although substitution rates can vary along the sequence, they are assumed to be the same across all taxa at each site. Our proposed time-machine PMMH model is able to account for quantitative differences in rates of substitution (e.g. sites with high rates versus sites with low rates), and can also allow different rates for different types of substitution (such as transitions and transversions).

Recently there has been a revival of interest in models which allow for variation in evolutionary rates, due to an explosion in the availability of comparative sequence data and consequent interest in comparative methods for the detection of functional elements (e.g., [19, 20, 21]). The model proposed in this paper is similar in spirit to early work on spatial variation of evolutionary rates (e.g. [10, 11]), which maintains a single consistent topology along the sequence but allows changes in evolutionary rates. In this framework, given the rate at each site, each site is then assumed to evolve independently along the true phylogeny with that rate, and the correlation between sites arises from the clustering of high and low rates at adjacent sites. However, most of these models allow for only a small discrete number of 'rate categories' into which sections of the sequences are sorted ([12, 22]), and many methods are limited to two-species comparisons as they become increasingly computationally expensive when more species are included.
Our proposed model overcomes both these difficulties, as the model for evolutionary rates, based on a multiple change-point model, is structurally simple and flexible, so that the rates are not restricted to a finite set but estimated on-line. Moreover, the use of the "time machine" significantly speeds up computations.
Several negative mathematical results exist in the literature (e.g. [23]) for Markov chain Monte Carlo (MCMC) inference when the tree topology (and branch lengths) is unknown, and these have spurred the development of highly sophisticated Monte Carlo-based algorithms [24]. Here, the tree topology and branch lengths are assumed to be known, but the position and number of change-points for the rates are unknown. In addition, as we will explain later, the cost of evaluating the likelihood will be an O(mp²n) operation (p is the number of states at each site, m the number of sites and n the number of sequences). It follows that parameter inference requires expectations w.r.t. a probability on a trans-dimensional state-space. Constructing efficient MCMC algorithms on trans-dimensional spaces is a notoriously challenging problem, and the standard approach is to use reversible jump MCMC (RJMCMC) [25]. Typically, and especially for our model, it is difficult to develop moves on the trans-dimensional state-space that are likely to be accepted, which is important here because likelihood computations are expensive. To deal with some of these inferential and computational issues, we propose:

• To reduce the cost of computing the likelihood and assist the mixing of MCMC, through a likelihood approximation based on the time-machine principle [26].

• To improve mixing compared to standard RJMCMC, by adapting an idea in [27], developing a particle marginal Metropolis-Hastings (PMMH) algorithm [28] based on the sequential Monte Carlo (SMC) samplers method in [29]. This approach can benefit from the time-machine approach.

In the time-machine approach, the unobserved sequence at the root, and possibly also other top-most nodes of the tree, are replaced with the stationary distribution of the substitution process. This can reduce the cost of computing the likelihood by a linear factor in n; this can allow larger datasets than would otherwise be manageable.
The resulting estimates are biased, but in the examples below we find the bias to be smaller than for competing methods. Indeed, we conjecture (and this is supported by empirical results) that our approach is competitive with other approximate methods, in particular approximate Bayesian computation (ABC); this latter method is often not appropriate for model selection problems, as we describe in Section 3.3. An important point here is that the time machine performs a 'principled' approximation of the mathematical model. This is based on the general understanding that most of the information in the data is at the lower part of the tree, thus contrasting with an often ad-hoc selection of summary statistics in ABC approaches.

Our PMMH algorithm extends the idea in [27], both with regard to the methodology and the context of phylogenetic trees with change-points. The MCMC method will often generate (as we will explain in Section 3.2) trans-dimensional proposals which are more likely to be accepted than in standard RJMCMC algorithms. This is further aided by using the time machine, which results in a less complex posterior with faster likelihood evaluations. The combination of the above factors can lead to reliable, but biased, inference from moderately sized data-sets. As mentioned above, we expect the bias to be minimal relative to ABC methods.

This article is structured as follows. In Section 3 the model and methods are described; this includes our mathematical result on the bias. In Section 4 our empirical results are given. In Section 5 we conclude the article with a discussion. The appendix provides further details of the methods.

Methods

We first describe our change-point model and the associated Bayesian inference problem, then the time machine approximation, and the PMMH algorithm.
The end of this section then briefly discusses some competing ABC methods that can also be used to perform Bayesian inference, but we make a case against using such algorithms in this context. Throughout the article, given a vector (x_1, ..., x_n), we define x_{k:l} := (x_k, x_{k+1}, ..., x_l), k ≤ l ≤ n; also, we use the notation [k] = {1, 2, ..., k}.

We observe n sequences of length m, such that each observation is x_{ij} ∈ {1, ..., p} with 1 ≤ i ≤ n, 1 ≤ j ≤ m. As in [30], it is assumed that the data originate from a rooted binary tree [2, Chapter 1] of known topology and branch lengths, with the n leaves being the observed sequences. The sequences at the other n − 1 (internal) nodes are unobserved; the nodes are numbered from the leaves (1 to n) up to the root (2n − 1). Let ν : [2n − 2] → {n + 1, ..., 2n − 1} map nodes other than the root onto their parent node. It is assumed that we are given a Markov model on the tree describing the evolution of states over time on each branch of the tree and at each site; each site evolves independently given the branch lengths. Treating the sequence states at internal nodes as missing data, we can write the full-data likelihood as:

p(x_{1:2n−1, 1:m} | θ) = ∏_{j=1}^{m} μ_θ(x_{(2n−1)j}) ∏_{i=1}^{2n−2} f_θ(x_{ij} | x_{ν(i)j})    (1)

where θ ∈ Θ ⊆ R^d is an unknown parameter, f_θ a Markov transition probability, and μ_θ a probability distribution on the state space at a site. Note that x_{n+1:2n−1, 1:m} are unobserved, whereas x_{1:n, 1:m} are observed. Note also that f_θ(x_{ij} | x_{ν(i)j}) can depend on the (known) length of the branch connecting node i with node ν(i). For convenience, in subsequent formulas we will write f_θ(x_{(2n−1)j} | x_{ν(2n−1)j}) in place of μ_θ(x_{(2n−1)j}), even though ν(2n − 1) is undefined.

The observed-data likelihood can be written as a sum over the missing data:

p(x_{1:n, 1:m} | θ) = ∏_{j=1}^{m} ∑_{x_{n+1:2n−1, j} ∈ [p]^{n−1}} ∏_{i=1}^{2n−1} f_θ(x_{ij} | x_{ν(i)j}).    (2)

Using belief propagation [31] (also called the sum-product algorithm), the cost of computing (2) is O(mp²n).

Our model generalises (1) to allow θ to vary along the sequence at a set of change-points 1 = s_0 < s_1 < ⋯ < s_k, with the convention s_{k+1} = m + 1. Then the full-data likelihood for the change-point model is:

p(x_{1:2n−1, 1:m} | k, s_{1:k}, θ_{1:k+1}) = ∏_{j=1}^{k+1} ∏_{l=s_{j−1}}^{s_j − 1} ∏_{i=1}^{2n−1} f_{θ_j}(x_{il} | x_{ν(i)l}),

and, as in (2), one can sum over x_{n+1:2n−1, 1:m} to obtain an observed-data likelihood:

p(x_{1:n, 1:m} | k, s_{1:k}, θ_{1:k+1}) = ∏_{j=1}^{k+1} ∏_{l=s_{j−1}}^{s_j − 1} p(x_{1:n, l} | θ_j)

where

p(x_{1:n, l} | θ_j) = ∑_{x_{n+1:2n−1, l} ∈ [p]^{n−1}} { ∏_{i=1}^{2n−1} f_{θ_j}(x_{il} | x_{ν(i)l}) }.
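The sum-product recursion behind this cost can be made concrete on a small tree. The toy code below is our own illustration (the node layout, unit branch lengths and function names are assumptions, not the paper's data structures): it evaluates the single-site likelihood by pruning and checks it against the brute-force sum over internal states in (2). Replacing the root term μ by the substitution model's stationary distribution at lower 'boundary' nodes is the essence of the time-machine idea described next.

```python
import itertools
import math

P = 4                                   # states per site (A, C, G, T)
# A 4-leaf rooted binary tree: leaves 1-4, internal nodes 5-6, root 7,
# with parent[] playing the role of nu(.) in the text.
parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7}
ROOT = 7

def f(child, par, t=1.0, alpha=0.5):
    """Jukes-Cantor transition probability f(x_child | x_parent)."""
    decay = math.exp(-P * alpha * t)
    return 1 / P + (1 - 1 / P) * decay if child == par else (1 - decay) / P

def pruning(leaves):
    """Likelihood of one site's leaf states via the sum-product recursion."""
    # L[v][s]: probability of the observed leaves below v given state s at v
    L = {v: [float(s == x) for s in range(P)] for v, x in leaves.items()}
    for v in (5, 6, ROOT):              # visit internal nodes children-first
        kids = [c for c, pa in parent.items() if pa == v]
        L[v] = [math.prod(sum(f(sc, s) * L[c][sc] for sc in range(P))
                          for c in kids)
                for s in range(P)]
    return sum(L[ROOT][s] / P for s in range(P))   # uniform root law mu

def brute_force(leaves):
    """Direct sum over all p^(n-1) internal configurations, as in (2)."""
    total = 0.0
    for states in itertools.product(range(P), repeat=3):
        x = dict(leaves)
        x.update(dict(zip((5, 6, ROOT), states)))
        term = 1.0 / P                  # mu(root state), uniform here
        for c, pa in parent.items():
            term *= f(x[c], x[pa])
        total += term
    return total
```

The recursion touches each branch once with a p × p sum, hence the stated linear-in-n, quadratic-in-p cost, while the brute force grows as p^(n−1); here the uniform root distribution coincides with the Jukes-Cantor stationary law.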
For 0 ≤ k < m, let S_k = {s_{1:k} ∈ [m]^k : 1 < s_1 < ⋯ < s_k ≤ m}. Then we will define a posterior probability on the space

E := ⋃_{k=0}^{m−1} ({k} × S_k × Θ^{k+1}).

Let p(k, s_{1:k}, θ_{1:k+1}) be any proper prior probability on E. Our objective is then to consider the posterior

π(k, s_{1:k}, θ_{1:k+1}) ∝ p(x_{1:n, 1:m} | k, s_{1:k}, θ_{1:k+1}) p(k, s_{1:k}, θ_{1:k+1})    (3)

which can be computed pointwise, up to a normalizing constant, in O(mp²n) steps. We assume that we know how to calculate the priors p(s_{1:k}, θ_{1:k+1} | k) and p(k).

One way to cut the cost of the O(mp²n) calculation of the likelihood is to remove the top of the tree (a related idea is used in [26] for the standard coalescent). Suppose we only consider the tree backward in time until the parent of node 2n − g, g ∈ {1, ..., n}. We propose the model:

p^B(x_{1:2n−g, 1:m}, x^B_{2n−g+1:2n−1, 1:m} | k, s_{1:k}, θ_{1:k+1}) = ∏_{j=1}^{k+1} ∏_{l=s_{j−1}}^{s_j − 1} ({ ∏_{i=1}^{2n−g} f_{θ_j}(x_{il} | x_{ν(i)l}) } η_{θ_j}(x^B_{2n−g+1:2n−1, l})),

where B(2n − g + 1 : 2n − 1) denotes the nodes in the cut-off part of the tree, 2n − g + 1 : 2n − 1, that are parents of nodes in 1 : 2n − g, and η_{θ_j}(·) is a joint probability distribution over sequences on these 'boundary' nodes. Thus, the joint distribution of a number of the upper-most g − 1 nodes, at each site, is replaced by the approximation η_{θ_j}. Then one can perform inference from the relevant posterior

π^B(k, s_{1:k}, θ_{1:k+1}) ∝ p^B(x_{1:n, 1:m} | k, s_{1:k}, θ_{1:k+1}) p(k, s_{1:k}, θ_{1:k+1})

using the PMMH method described below. The cost of computing the new likelihood is now O(mp²(n − g)).

In order to sample from the trans-dimensional state-space of (3), we first consider an SMC sampler which only samples on S_k × Θ^{k+1}, for k fixed. We then show how the SMC sampler can be embedded within a PMMH algorithm to target (3). The SMC sampler will be necessary to ensure a good acceptance probability for trans-dimensional moves. Our approach has the advantage over alternative simulation techniques for model selection (see [33]) that the model selection and parameter estimates are simultaneous, which helps to focus computational resources on the important model(s).

For 1 ≤ k < m, and a user-specified T ≥
1, let {ξ_{t,k}}_{0 ≤ t ≤ T} be a sequence of probabilities on S_k × Θ^{k+1}, such that ξ_{0,k}(s_{1:k}, θ_{1:k+1}) = p(s_{1:k}, θ_{1:k+1} | k) and

ξ_{T,k}(s_{1:k}, θ_{1:k+1}) ∝ p(x_{1:n, 1:m} | k, s_{1:k}, θ_{1:k+1}) p(s_{1:k}, θ_{1:k+1} | k).

The remaining sequence of targets {ξ_{t,k}}_{1 ≤ t ≤ T−1} interpolates between the (conditional) posterior and the prior, e.g. via the tempering procedure:

ξ_{t,k}(s_{1:k}, θ_{1:k+1}) ∝ p(x_{1:n, 1:m} | k, s_{1:k}, θ_{1:k+1})^{κ_t} p(s_{1:k}, θ_{1:k+1} | k)

with 0 < κ_1 < ⋯ < κ_{T−1} < 1. The SMC sampler will propagate a collection of N particles from the prior ξ_{0,k} all the way to the posterior ξ_{T,k} via the bridging densities ξ_{t,k}, by means of importance sampling, resampling and MCMC move steps. The tempering procedure aims at controlling the variability of the incremental importance weights, for instance providing robust estimates of the normalising constants p(x_{1:n, 1:m} | k), which is an important attribute for the overall algorithm. The sampler propagates the particles by using a sequence of MCMC kernels of invariant densities ξ_{t,k} (which operate on a fixed-dimensional space). All the details of the specific steps of the SMC sampler are given in the Supporting Information document. We write the probability of all the variables associated to the SMC sampler (which resamples at each time step) as

Ψ_{k,N}(a^{1:N}_{1:T−1}, φ^{1:N}_{0:T}(k)),

where a^{1:N}_{1:T−1} = (a^1_1, ..., a^N_1, ..., a^1_{T−1}, ..., a^N_{T−1}) are the resampled indices, and φ^{1:N}_{0:T}(k) = (φ^1_0(k), ..., φ^N_0(k), ..., φ^1_T(k), ..., φ^N_T(k)), with φ^i_t(k) = (s^i_{t,1:k}, θ^i_{t,1:k+1}), i ∈ [N], t ∈ {0} ∪ [T], is the collection of the N particles as propagated through the sequence ξ_{t,k}.

One can use this SMC sampler within a broader PMMH algorithm to sample from the true target of interest (3). The specific steps of PMMH are given in the Supporting Information document, but briefly, a single iteration of the algorithm is as follows. Given the current state of the Markov chain, one proposes to change k with some proposal kernel q(k′ | k). Conditional on this k′, we run an SMC sampler Ψ_{k′,N}(·) and choose a particle φ^l_T(k′), for some 1 ≤ l ≤ N, with probability proportional to a weight.
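The tempering bridge and the normalising-constant estimate it yields can be illustrated on a toy one-dimensional problem where the evidence is available in closed form. Everything below (the Gaussian model, the temperature ladder, the particle number and function names) is an assumption for illustration, not the paper's sampler:

```python
import math
import random

random.seed(1)
# Toy analogue of the bridge xi_0 (prior) -> xi_T (posterior):
# prior x ~ N(0, 1), observation y = 1 with y | x ~ N(x, 1).
# The exact evidence is N(y; 0, sqrt(2)), so the SMC estimate is checkable.
Y = 1.0
N = 2000
KAPPA = [0.0, 0.25, 0.5, 0.75, 1.0]          # temperatures kappa_t

def log_lik(x):
    return -0.5 * (Y - x) ** 2 - 0.5 * math.log(2 * math.pi)

def smc_log_evidence():
    xs = [random.gauss(0, 1) for _ in range(N)]      # exact draws from xi_0
    log_z = 0.0
    for k_prev, k in zip(KAPPA, KAPPA[1:]):
        # incremental importance weights for the move xi_{t-1} -> xi_t
        logw = [(k - k_prev) * log_lik(x) for x in xs]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        log_z += m + math.log(sum(w) / N)            # running log-evidence
        xs = random.choices(xs, weights=w, k=N)      # multinomial resampling
        # one random-walk Metropolis move per particle, xi_t-invariant
        for i, x in enumerate(xs):
            prop = x + random.gauss(0, 0.5)
            log_a = (-0.5 * prop ** 2 + k * log_lik(prop)
                     + 0.5 * x ** 2 - k * log_lik(x))
            if math.log(random.random()) < log_a:
                xs[i] = prop
    return log_z

TRUE_LOG_Z = -0.5 * Y ** 2 / 2 - 0.5 * math.log(2 * math.pi * 2)
```

The product of the per-step average weights is the unbiased evidence estimate; in the sampler above the same construction yields p_N(x_{1:n, 1:m} | k), the quantity the PMMH acceptance ratio below consumes.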
Acceptance of both the model index k′ and of the proposed change-point times and rates φ^l_T(k′) happens with probability

1 ∧ [p_N(x_{1:n, 1:m} | k′) p(k′) / (p_N(x_{1:n, 1:m} | k) p(k))] × [q(k | k′) / q(k′ | k)],

where p_N(x_{1:n, 1:m} | k) is the SMC (unbiased) estimate of p(x_{1:n, 1:m} | k), the normalizing constant of ξ_{T,k}. The Supporting Information document presents the formula used to calculate p_N(x_{1:n, 1:m} | k). Note that whilst there are many user-set parameters (namely, the temperatures and tuning parameters for the MCMC kernels), their choice can be made adaptively to reduce user involvement (see [34]). In this article we tune the parameters by trial and error.

The advantage of our procedure is that it mitigates having to construct trans-dimensional proposals which need to mix well (see [27] for another recent work that attempts to deal with this issue). We note, however, that the cost of each proposal will be O(NTmp²n), as ξ_{t,k} must be evaluated at each time step of the SMC sampler (see Supporting Information). In addition, note that tailored methods for change-point models (e.g. [35]) do not apply here, as one does not have a convenient way to integrate the likelihood.

3.3 Approximate Bayesian Computation (ABC)

ABC is another methodology that avoids exact computation of the likelihood, at the cost of a biased approximation of the posterior; see for instance [36] for a review. The method is based on accepting simulated data sets that are similar to the observed dataset, where 'similar' is usually assessed using summary statistics sensitive to the parameter(s) of interest. ABC can be unreliable as a tool for model selection. According to [37], the best summary statistics to be used in an ABC approximation to a Bayes factor are ancillary statistics with different mean values under two competing models.
Otherwise, the summary statistic must have enough components to prohibit a parameter under a wrong model from generating summary statistics that are plausible under the true model. However, that summary statistics satisfy the conditions of [37] for model choice in ABC is not easy (or even possible) to verify in our context.

In the numerical examples of Section 4.1, we consider two ABC algorithms which approximate the same ABC posterior. The first algorithm is a PMMH that replaces the SMC sampler of [29] with the SMC sampler of [38, Section 3.3]; see Supporting Information for details. The second ABC algorithm is the ABC-SMC algorithm for model selection appearing in [39, page 190].
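The pseudo-marginal mechanism underpinning the PMMH move over k above can be demonstrated in a few lines: replacing the exact evidences p(x | k) with noisy but unbiased estimates leaves the invariant distribution over the model index untouched. The two-model setup and all numbers below are illustrative assumptions, not the paper's experiment:

```python
import math
import random

random.seed(2)
# Stand-ins for the evidences p(x | k) of two competing models and a
# uniform model prior; the true posterior over k is then (0.3, 0.7).
EVIDENCE = {0: 0.3, 1: 0.7}
PRIOR = {0: 0.5, 1: 0.5}

def noisy_evidence(k, sigma=0.3):
    """Unbiased estimate: Z * LogNormal(-sigma^2/2, sigma) has mean Z."""
    return EVIDENCE[k] * math.exp(random.gauss(-sigma ** 2 / 2, sigma))

def pmmh_over_k(iters=50000):
    k, z_hat = 0, noisy_evidence(0)
    visits_to_1 = 0
    for _ in range(iters):
        kp = 1 - k                    # symmetric proposal q(k' | k)
        zp = noisy_evidence(kp)       # "run the SMC sampler" for model k'
        ratio = (zp * PRIOR[kp]) / (z_hat * PRIOR[k])
        if random.random() < min(1.0, ratio):
            # crucial: the evidence estimate is kept with the chain state
            k, z_hat = kp, zp
        visits_to_1 += k
    return visits_to_1 / iters
```

Despite the noise in each estimate, the chain's marginal over k converges to p(k) p(x | k) / Σ_k′ p(k′) p(x | k′), here 0.7 for k = 1; the key design choice is that a rejected move retains the old estimate rather than refreshing it.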
Results

We compared three algorithms on their performance in Bayesian model selection for four simulated DNA datasets. Within each dataset, the DNA sequences shared a common ancestral binary tree with known topology, unknown sequence states at ancestral nodes, and unknown substitution rates and branch lengths. The first algorithm was our proposed PMMH algorithm outlined in Section 3.2, and we employed three versions of the time machine. Using the notation of Section 3.1.2, these used g = 1 (so in effect the time machine was not implemented at all), g = 4 and g = 8. We also used the two ABC algorithms described in Section 3.3. The PMMH algorithms were not run until they converged fully, but were compared on the basis of results achieved after six hours of computation. Other implementation details of the algorithms may be found in the Supporting Information document.

4.1.1 Base dataset

The base dataset consists of n = 8 simulated DNA sequences (p = 4 types of nucleotide), each of m = 50 sites. The sequences evolved according to a binary tree under a Jukes-Cantor model of DNA evolution with one substitution rate up to site s_1 = 25, and a second rate beyond this single change-point (so k = 1, but for inference we assumed only k ∈ {0, 1}). In the standard Newick notation, the structure of the tree was:

(((Taxon0:1.0,Taxon1:1.0):1.0,(Taxon2:1.0,Taxon3:1.0):1.0):1.0,((Taxon4:1.0,Taxon5:1.0):1.0,(Taxon6:1.0,Taxon7:1.0):1.0):1.0):1.0

We ran the three algorithms to infer k, the location of the change-point s_1 given k = 1, and the substitution rates θ_{1:k+1}. The prior on k was uniform on {0, 1}; the prior on s_1 was uniform, 1/(m − 1); the prior on each substitution rate had expected value 0.4.

The time-machine PMMH with g = 4 outperformed all other algorithms. It sampled from the true model (i.e., k = 1) much more frequently than the incorrect k = 0 model (Table 1). In comparison, g = 8 performed poorly, as expected, since n = 8 for this dataset, so g = 8 implies removing all internal nodes and assuming independent evolution of each sequence. The ABC algorithms did not perform well. The PMMH-ABC algorithm sampled from the two models almost evenly, while the ABC-SMC algorithm had low effective sample sizes ([40], [41]) and actually preferred the wrong model.

In Table 2, we give 95% confidence intervals of estimates of s_1 and of the rates given k = 1. The time-machine PMMH algorithms again provide the best inferences and were able to find the change-point. The PMMH-ABC was more accurate for the substitution rates but less precise. The ABC-SMC algorithm gave unusable output.

We do not present the output for the g = 1 version of the time machine because it performed very poorly. Without removing any nodes from the top of the tree, the variability of p_N(x_{1:n, 1:m} | k′)/p_N(x_{1:n, 1:m} | k) in the acceptance probability of the PMMH was very high when k ≠ k′ (Figure 2 in Section A). Thus, the algorithm accepted jumps between models only rarely and the output was very "sticky". This phenomenon illustrates that the time machine is a cost-saving technique by two measures. First, it reduces the computational complexity of the algorithm. Second, it aids in mixing and facilitates jumping between models.

4.1.2 Further tests

We repeated the above experiment for three more datasets that differed only slightly from the base dataset. We found the results to be similar across the datasets (see Tables 1 and 2 in Section B).
Collectively, these results suggest that when doing Bayesian model selection under these scenarios, ABC approximations should be avoided in favour of our PMMH method, used with the time machine but removing as few nodes as computational considerations permit.
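The Newick string for the simulated tree above can be read with a minimal recursive-descent parser. This sketch is ours (the paper does not describe its tree-handling code) and supports only the name:length subset of Newick that the string uses:

```python
def parse_newick(s):
    """Parse a (name:length) Newick subset into (children, name, length)."""
    s = s.rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # consume ')'
        start = pos                       # optional node label
        while pos < len(s) and s[pos] not in "(),:":
            pos += 1
        name = s[start:pos]
        length = 0.0
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in "(),:":
                pos += 1
            length = float(s[start:pos])
        return (children, name, length)

    return node()

def leaf_names(tree):
    children, name, _ = tree
    return [name] if not children else [x for c in children for x in leaf_names(c)]

def node_count(tree):
    return 1 + sum(node_count(c) for c in tree[0])
```

Applied to the eight-taxon string above, leaf_names returns Taxon0 through Taxon7 and node_count gives 2n − 1 = 15 nodes, matching the node-numbering convention of the model section.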
Real data

Using the publicly available database of [42], we assembled a dataset consisting of n = 6 ACT1 gene DNA sequences (m = 540 sites). We assumed the tree structure given in Figure 1 of Section A, and a Jukes-Cantor model of DNA evolution. We implemented our time-machine PMMH algorithm to infer k, s_{1:k}, and θ_{1:k+1} for cut-off parameter g = 4 (see Supporting Information for further details). The prior on k was a discrete uniform distribution on {0, 1, 2}, and the prior on s_{1:k} was uniform on k-subsets of [m − 1]. We examined the output for each k to get a sense of the algorithm's ability to fully explore the state space of each model (only some values are reported below). Figure 3 suggests good exploration of the state space of k, resulting in an estimated distribution: 0.17 (k = 0), 0.47 (k = 1), and 0.36 (k = 2).

From the 4,946 samples with k = 1, we estimated a 95% highest posterior density interval for s_1 of (194, 199) (see also the histogram in Figure 4). The Z-score for k was −0.60, suggesting that we were still some way off convergence (values close to 0 imply convergence). For the rates θ_1 (before the change-point) and θ_2 (after the change-point), the Z-scores were closer to 0 (that for θ_2 was 0.23), giving stronger evidence for convergence (estimated densities of these parameters are shown in Figure 5).
Discussion
We considered sequence data that originates from a rooted binary tree [2, Chapter 1] of known topology and branch lengths but unknown sequence states at internal nodes, and we allowed the substitution rates in the DNA evolution model to have change-points. We detailed Bayesian parameter inference from such a model with an unknown number of change-points, implying a trans-dimensional posterior density. Computational inference from this model is challenging, and we introduced two novel contributions to facilitate sampling.

Firstly, based on the time machine principle of [26], we showed how the top-most nodes of the binary tree can be replaced with a probability distribution of the sequence evolution model to reduce the cost of computing the likelihood linearly in n (the number of sequences). This approach introduces a bias, but this was found in practice to have a small effect on inferences.

Secondly, we developed a particle marginal Metropolis-Hastings (PMMH) algorithm (Section 3.2) which mitigates having to construct trans-dimensional proposals that need to mix well. We first developed a sequential Monte Carlo (SMC) sampler which only samples on a fixed-dimensional subspace of the full trans-dimensional state-space. We then showed how that SMC sampler can be embedded within the PMMH algorithm to target the full posterior. By employing the time machine within this PMMH, we attained an algorithm that could run with a reduced computational cost and easily jump between models with different numbers of change-points.

We successfully implemented our PMMH to perform inference from the model in a reliable fashion for small to moderately sized datasets. We empirically demonstrated that our PMMH can outperform approximate Bayesian computation (ABC) techniques [44] in terms of precision and accuracy, and we showed that our algorithm can successfully be used to carry out reliable inference on real data.
The success of our PMMH algorithm is largely due to the time machine, which, as we witnessed in Section 4.1, reduces the variance of the acceptance probability and enables the algorithm to jump easily between models. However, based on the output of Section 4.1, it seems that the bias introduced by the time machine reduces the accuracy of the inferred substitution rates.

In future work, one might want to extend the methodology to allow for unknown tree topologies, similar to [15]. Also, future work could attempt to use a more appropriate distribution to approximate the distribution at the top of the tree. We attempted to find approximations in the point processes and coalescent literature, but we were unable to find a better approximation than that which we employed here. From the computational point of view, it will certainly be important to further speed up the algorithm, and great savings could be made by parallelising calculations within the SMC particle method and carefully investigating adaptive procedures for fine-tuning the temperatures and the MCMC kernels. All such efforts could have a big effect on reducing the variance of the estimate of p(x_{1:n, 1:m} | k), thus further improving the mixing of PMMH even with fewer removed nodes. There could then be great scope to apply the method for larger numbers of potential change-points than the relatively small numbers we have tried here.

Acknowledgements
This research was funded by the EPSRC grant "Advanced Stochastic Computation for Inference from Tree, Graph and Network Models" (Ref: EP/K01501X/1). AJ was additionally supported by Singapore MOE grant R-155-000-119-133 and is also affiliated with the Risk Management Institute at the National University of Singapore.
References

[1] Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 368–376.

[2] Felsenstein, J. (2004). Inferring Phylogenies. Sinauer, Sunderland, MA.

[3] Huelsenbeck, J. P. & Suchard, M. A. (2007). A nonparametric method for accommodating and testing across-site rate variation. System. Biol., 975–987.

[4] Huelsenbeck, J. P. & Hillis, D. M. (1993). Success of phylogenetic methods in the four-taxon case. System. Biol., 247–264.

[5] Wakeley, J. (1994). Substitution-rate variation among sites and the estimation of transition bias. Mol. Biol. Evol., 436–442.

[6] Swofford, D. L., Olsen, G. J., Waddell, P. J., & Hillis, D. M. (1996). Phylogenetic inference. In Molecular Systematics, 2nd edition, chap. 5, pp. 407–514. Sinauer Associates, Sunderland, Massachusetts.

[7] Uzzell, T. & Corbin, K. W. (1971). Fitting discrete probability distributions to evolutionary events. Science, 1089–1096.

[8] Nei, M., Chakraborty, R., & Fuerst, P. A. (1976). Infinite allele model with varying mutation rate. Proc. Natl. Acad. Sci., 4164–4168.

[9] Olsen, G. J. (1987). Earliest phylogenetic branchings: comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harbor Symp. Quant. Biol., 825–837.

[10] Yang, Z. (1995). A space-time process model for the evolution of DNA sequences. Genetics, 993–1005.

[11] Felsenstein, J. & Churchill, G. A. (1996). A Hidden Markov Model approach to variation of evolutionary rates among sites. Mol. Biol. Evol., 93–104.

[12] Yang, Z., Goldman, N. & Friday, A. (1994). Comparison of models for nucleotide substitution used in maximum likelihood phylogenetic estimation. Mol. Biol. Evol., 316–324.

[13] Siepel, A. & Haussler, D. (2005). Phylogenetic hidden Markov models. In R. Nielsen, editor, Statistical Methods in Molecular Evolution, pp. 325–351. Springer, New York.

[14] Yang, Z. (1993). Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol., 1396–1401.

[15] Suchard, M. A., Weiss, R. E., Dorman, K. S. & Sinsheimer, J. S. (2003). Inferring spatial phylogenetic variation along nucleotide sequences: a multiple change-point model. J. Amer. Statist. Ass., 427–437.

[16] Pagel, M. & Meade, A. (2004). A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol., 571–581.

[17] Loza-Reyes, E., Hurn, M. A. & Robinson, A. (2014). Classification of molecular sequence data using Bayesian phylogenetic mixture models. Comp. Statist. Data Anal., 81–95.

[18] Wu, C. H., Suchard, M. A., & Drummond, A. J. (2013). Bayesian selection of nucleotide substitution models and their site assignments. Mol. Biol. Evol., 669–688.

[19] Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K. D., Ovcharenko, I., et al. (2003). Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 1391–1394.

[20] Gibbs, R. A., Weinstock, G. M., Metzker, M. L., Muzny, D. M., Sodergren, E. J., et al. (2004). Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 493–521.

[21] Chinwalla, A. T., Cook, L. L., Delehaunty, K. D., Fewell, G. A., Fulton, L. A., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 520–562.

[22] Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., et al. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., (8), 1034–1050.

[23] Mossel, E. & Vigoda, E. (2006). Limitations of Markov chain Monte Carlo algorithms for Bayesian inference of phylogeny. Ann. Appl. Probab., 2215–2234.

[24] Bouchard-Côté, A., Sankararaman, S. & Jordan, M. I. (2012). Phylogenetic inference via sequential Monte Carlo. System. Biol., 579–593.

[25] Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 711–732.

[26] Jasra, A., De Iorio, M., & Chadeau-Hyam, M. (2011). The time machine: a simulation approach for stochastic trees. Proc. R. Soc. A, 2350–2368.

[27] Karagiannis, G. & Andrieu, C. (2013). Annealed importance sampling reversible jump MCMC algorithms. J. Comp. Graph. Statist., 623–648.

[28] Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. R. Statist. Soc. Ser. B, 269–342.

[29] Del Moral, P., Doucet, A., & Jasra, A. (2006). Sequential Monte Carlo samplers. J. R. Statist. Soc. Ser. B, 411–436.

[30] Ma, Y. (2008). Bayesian and MCMC methods for phylogenetic footprinting. PhD thesis, Imperial College London.

[31] Pearl, J. (1982). Reverend Bayes on inference engines: a distributed hierarchical approach. In Proceedings of the Second National Conference on Artificial Intelligence (AAAI-82), Pittsburgh, PA. AAAI Press, Menlo Park, CA.

[32] Del Moral, P. (2004). Feynman-Kac Formulae. Springer, New York.

[33] Zhou, Y., Johansen, A. M. & Aston, J. A. D. (2013). Towards automatic model comparison: an adaptive sequential Monte Carlo approach. arXiv preprint.

[34] Jasra, A., Kantas, N., & Persing, A. (2014). Bayesian inference for partially observed stopped processes. Stat. Comp., 1–20.

[35] Fearnhead, P. & Liu, Z. (2007). Online inference for multiple changepoint problems. J. R. Statist. Soc. Ser. B, 589–605.

[36] Marin, J.-M., Pudlo, P., Robert, C. P. & Ryder, R. (2012). Approximate Bayesian computational methods. Statist. Comp., 1167–1180.

[37] Marin, J.-M., Pillai, N. S., Robert, C. P., & Rousseau, J. (2013). Relevant statistics for Bayesian model choice. J. R. Statist. Soc. Ser. B, DOI: 10.1111/rssb.12056.

[38] Del Moral, P., Doucet, A., & Jasra, A. (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statist. Comp., 1009–1020.

[39] Toni, T., Welch, D., Strelkowa, N., Ipsen, A., & Stumpf, M. P. H. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Statist. Soc. Interface, 187–202.

[40] Kong, A., Liu, J. S., & Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. J. Amer. Statist. Assoc., 278–288.

[41] Liu, J. S. (1996). Metropolized independent sampling with comparison to rejection sampling and importance sampling. Statist. Comp., 113–119.

[42] Weiss, S., Samson, F., Navarro, D., & Casaregola, S. (2013). YeastIP: a database for identification and phylogeny of ascomycetous yeasts. FEMS Yeast Res., 117–125. http://genome.jouy.inra.fr/yeastip/

[43] Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bayesian Statistics 4 (eds. Bernardo, J. M., Berger, J. O., Dawid, A. P., & Smith, A. F. M.). Clarendon Press, Oxford, UK.

[44] Tavaré, S., Balding, D. J., Griffiths, R. C., & Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, 505–518.

[45] Lopes, J. S., Balding, D., & Beaumont, M. A. (2009). PopABC: a program to infer historical demographic parameters. Bioinformatics, 2747–2749.

[46] Guennebaud, G., Jacob, B., et al. (2010). Eigen v3. http://eigen.tuxfamily.org

A Figures
Real Data Case: Phylogeny of a subset of the Saccharomycotina subphylum

Figure 1: This tree is provided by [42]. The labels on the leaves at far right are in the format STRAIN taxa GENE. For example, "CBS 6420 Candida boleticola ACT1" is the label for the ACT1 gene of Candida boleticola, strain CBS 6420. The edge at the bottom of the figure with the number 0 . . .

Base dataset: Variability of log[p^N(x_{n,m} | k)] for g = 1 versus g = 4

Figure 2: The top box plot illustrates the variability of log[p^N(x_{n,m} | k)] for the g = 1 version of our algorithm, and the bottom plot gives the same for the g = 4 version of our algorithm; log[p^N(x_{n,m} | k)] runs along the horizontal axis of each plot. One can see that the variability in the bottom figure is much less than that of the top figure.

Real Data Case: Autocorrelation and trace plot of sampled k

Figure 3: We monitored the convergence of the PMMH implemented in Section 4.2 via autocorrelation and trace plots. The plots illustrate a non-sticky algorithm and low autocorrelation.

Real Data Case: Histogram of sampled s_1 given k = 1

Figure 4: The histogram strongly indicates the position of the single change-point in the numerical example of Section 4.2.

Real Data Case: Kernel density plots of sampled substitution rates given k = 1

Figure 5: At top, we have the kernel density plot of θ_1 (using a Gaussian smoothing kernel), which is obtained using all samples from the PMMH implemented in Section 4.2 where k = 1. Values for θ_1 run along the horizontal axis. We have the same plot for θ_2 at bottom.

B Tables

Table 1: Algorithm diagnostics and model choice. For each example (base dataset; two change-points; subtle change-point; more sites) and each algorithm (time machine with g = 1, time machine with g = 4, PMMH with ABC, ABC-SMC), the table reports the number of samples, the acceptance ratio, the autocorrelation of k, the ESS per model, and P(k | data). All algorithms ran for six hours on a Linux workstation that used five Intel Core i7 CPUs (see the important note regarding ABC-SMC in the Supporting Information document). The column labelled "Accp. ratio" gives the frequency at which a new sample is accepted in the particle marginal Metropolis-Hastings (PMMH) algorithms [28]. The columns labelled "Autocorrelation of k" give the autocorrelation at two time delays (when available). "ESS" is the abbreviation for the effective sample size of the ABC-SMC for model selection of [39]. Note that for the "two change-point" case, the prior on k was a discrete uniform distribution.

Table 2: Inference for true model: confidence intervals. For each example and algorithm, the table reports confidence intervals for the change-point positions s and the substitution rates µ; the leftmost column contains the true parameter values for each example. We also record the number of samples on which each inference is based (i.e., we record how many samples from the true model each algorithm obtained).

C Supporting Information: Algorithm Summaries
C.1 Sequential Monte Carlo (SMC) sampler
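To make the construction concrete before stating it formally, here is a toy, self-contained Python sketch of such a sampler (Algorithm 1 below). The paper's targets ξ_{t,k} require belief propagation on the tree, so a sequence of one-dimensional Gaussian targets stands in for them here, and all names (`smc_sampler`, `log_xi`) are ours, not the paper's.

```python
import numpy as np

def smc_sampler(log_xi, N=2000, T=5, rw_scale=0.5, seed=1):
    """Generic SMC sampler in the spirit of Algorithm 1 ([29]).

    log_xi(t, x) is the log of the (unnormalized, for t >= 1) target at time t,
    a stand-in for xi_{t,k}; xi_0 must be a normalized density that we can
    sample exactly, so that the returned constant estimates the normalizing
    constant of xi_T, cf. equation (4)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)        # Step 1: draw from xi_0 = N(0, 1)
    log_Z = 0.0                       # accumulates log prod_t [(1/N) sum_i W_t^i]
    for t in range(1, T + 1):
        # Weight at the old particles: W_t^i = xi_t(x^i) / xi_{t-1}(x^i)
        logW = log_xi(t, x) - log_xi(t - 1, x)
        m = logW.max()
        log_Z += m + np.log(np.mean(np.exp(logW - m)))
        # Step 2: multinomial resampling with probabilities prop. to W_t^i
        probs = np.exp(logW - m)
        probs /= probs.sum()
        x = x[rng.choice(N, size=N, p=probs)]
        # Step 3: one random-walk Metropolis move, invariant for xi_t
        prop = x + rw_scale * rng.standard_normal(N)
        accept = np.log(rng.uniform(size=N)) < log_xi(t, prop) - log_xi(t, x)
        x = np.where(accept, prop, x)
    return x, float(np.exp(log_Z))

# Toy sequence: xi_t(x) prop. to exp(-x^2 / (2 s_t^2)) with s_t shrinking
# from 1 (a normalized N(0,1) at t = 0) to 0.5; the true normalizing
# constant of xi_T is sqrt(2 * pi * 0.25), roughly 1.25.
s = np.linspace(1.0, 0.5, 6)
def log_xi(t, x):
    out = -0.5 * x**2 / s[t]**2
    return out - 0.5 * np.log(2 * np.pi) if t == 0 else out

particles, Z_hat = smc_sampler(log_xi)
```

Note that, as in Algorithm 1, resampling happens at every time step; in practice one would often resample adaptively, but the unbiasedness of the product-of-mean-weights estimator is what the PMMH construction relies on.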
Consider the sequence {ξ_{t,k}}_{0 ≤ t ≤ T} introduced in Section 3.2 of the main paper, which is a sequence of probabilities known up to a multiplicative constant. The SMC samplers of [29] are designed for sampling from such sequences of distributions. The specific SMC sampler that we employ in this work simulates a collection of N samples (or 'particles') in parallel and sequentially in time using (a) a sequence of MCMC kernels of invariant densities ξ_{t,k} and (b) a resampling technique. The algorithm is summarized here as Algorithm 1, with φ_t^i(k) = (s_t^{i,k}, θ_t^{i,k+1}), i ∈ [N], t ∈ {0} ∪ [T] and T > 1 (the ξ_{t,k} will be calculated via belief propagation [31]). The N outputted samples at time step T provide an approximation of the target ξ_{T,k}. According to [29, Section 3.2.1], the unnormalized weights can be used to obtain an unbiased estimate of the normalizing constant of ξ_{T,k}:

p^N(x_{n,m} | k) = \prod_{t=0}^{T} \left[ \frac{1}{N} \sum_{i=1}^{N} W_t^i \right].  (4)

Algorithm 1
Sequential Monte Carlo (SMC) sampler

• Step 1: For i ∈ [N], sample φ_0^i(k) ∼ ξ_{0,k}(·) and set the unnormalized weight W_0^i = 1. Set t = 1.

• Step 2: For i ∈ [N], sample a_{t−1}^i ∈ [N] from a discrete distribution on [N] with jth probability w_{t−1}^j ∝ W_{t−1}^j. The samples {a_{t−1}^{1:N}} are the indices of the resampled particles at time step (t − 1). For each i, set the normalized weight w_{t−1}^i = N^{−1}.

• Step 3: For i ∈ [N], sample φ_t^i(k) | φ_{t−1}^{a_{t−1}^i}(k) ∼ K_t(· | φ_{t−1}^{a_{t−1}^i}(k)), where K_t is an MCMC kernel of invariant density ξ_{t,k}. Compute the unnormalized weight according to [29, Equation 31]:

W_t^i = \frac{ξ_{t,k}(φ_{t−1}^{a_{t−1}^i}(k))}{ξ_{t−1,k}(φ_{t−1}^{a_{t−1}^i}(k))}.

If t = T, stop. Otherwise, set t = t + 1 and return to the start of Step 2.

C.2 Particle Marginal Metropolis-Hastings (PMMH)
Recall the target (3) of Section 3.1.1 of the main paper: π(k, s^k, θ^{k+1}). Particle Markov chain Monte Carlo (PMCMC) algorithms [28] consider an 'extended target' that yields the true target of interest – in this case, π(k, s^k, θ^{k+1}) – as a marginal. The extended target is constructed in such a way that an SMC algorithm (e.g., Algorithm 1) can be used to sample some of its variables. In the context of this work, we will follow [28] and write an extended target as

π^N(l, k, a^{1:N}_{0:T−1}, φ^{1:N}_{0:T}(k)) = \frac{π(k, φ^l_T(k))}{N^{T+1}} · \frac{Ψ^{k,N}(a^{1:N}_{0:T−1}, φ^{1:N}_{0:T}(k))}{ξ_{0,k}(φ^l_0(k)) \left( \prod_{t=1}^{T} w^{a^l_{t−1}}_{t−1} K_t(φ^l_t(k) | φ^{a^l_{t−1}}_{t−1}(k)) \right)},  (5)

where Ψ^{k,N} is the probability of all the variables associated with the SMC sampler. A PMMH algorithm [28] is a type of PMCMC algorithm that can sample from (5), and we present the details of the procedure as Algorithm 2.

Algorithm 2
Particle marginal Metropolis-Hastings (PMMH)

• Step 0: Set r = 0. Sample k^{(r)} ∼ p(k). All remaining random variables can be sampled from their full conditionals defined by the target (5):
– Sample φ^{(r),1:N}_{0:T}(k^{(r)}), a^{(r),1:N}_{0:T−1} ∼ Ψ^{k^{(r)},N}(·) via Algorithm 1.
– Choose a particle index l^{(r)} with probability ∝ W^{(r),l^{(r)}}_T.
Finally, calculate p^N(x_{n,m} | k^{(r)}) via (4).

• Step 1: Set r = r + 1. Sample k* ∼ q(· | k^{(r−1)}). All remaining random variables can be sampled from their full conditionals defined by the target (5):
– Sample φ^{*,1:N}_{0:T}(k*), a^{*,1:N}_{0:T−1} ∼ Ψ^{k*,N}(·) via Algorithm 1.
– Choose a particle index l* with probability ∝ W^{*,l*}_T.
Finally, calculate p^N(x_{n,m} | k*) via (4).

• Step 2: With acceptance probability

1 ∧ \frac{p^N(x_{n,m} | k*) p(k*)}{p^N(x_{n,m} | k^{(r−1)}) p(k^{(r−1)})} × \frac{q(k^{(r−1)} | k*)}{q(k* | k^{(r−1)})},

set (l^{(r)}, k^{(r)}, φ^{(r),1:N}_{0:T}(k^{(r)}), a^{(r),1:N}_{0:T−1}) = (l*, k*, φ^{*,1:N}_{0:T}(k*), a^{*,1:N}_{0:T−1}). Otherwise, set (l^{(r)}, k^{(r)}, φ^{(r),1:N}_{0:T}(k^{(r)}), a^{(r),1:N}_{0:T−1}) = (l^{(r−1)}, k^{(r−1)}, φ^{(r−1),1:N}_{0:T}(k^{(r−1)}), a^{(r−1),1:N}_{0:T−1}). Return to the beginning of Step 1.

D Supporting Information: Implementation Details for Section 4.1
D.1 Particle marginal Metropolis-Hastings (PMMH)
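Before the implementation details, the accept/reject mechanics of Algorithm 2, combined with the truncated discrete-uniform proposal for k that is described in this section, can be sketched in Python. A three-model toy likelihood, observed through unbiased log-normal noise, stands in for the SMC estimate (4); all names here are ours, not the paper's.

```python
import numpy as np

def propose_k(k, W, A, Omega, rng):
    """Discrete uniform proposal of odd width W centred on k, with any
    out-of-bounds values (below A or above Omega) removed from the support,
    as in Section D.1. Returns k_star and the log proposal-ratio
    log q(k | k_star) - log q(k_star | k) needed by the acceptance step."""
    half = (W - 1) // 2
    def support(c):
        return [j for j in range(c - half, c + half + 1) if A <= j <= Omega]
    S = support(k)
    k_star = S[rng.integers(len(S))]
    return k_star, np.log(len(S)) - np.log(len(support(k_star)))

def pmmh(estimate_log_lik, log_prior, A, Omega, iters=30000, W=3, seed=0):
    """PMMH over the model index k ([28]): estimate_log_lik(k, rng) is a noisy
    but unbiased (on the natural scale) stand-in for the SMC estimate (4);
    the estimate for the current k is stored and reused, never recomputed."""
    rng = np.random.default_rng(seed)
    k = int(rng.integers(A, Omega + 1))
    ll = estimate_log_lik(k, rng)
    chain = np.empty(iters, dtype=int)
    for r in range(iters):
        k_star, log_q_ratio = propose_k(k, W, A, Omega, rng)
        ll_star = estimate_log_lik(k_star, rng)
        log_alpha = ll_star + log_prior(k_star) - ll - log_prior(k) + log_q_ratio
        if np.log(rng.random()) < log_alpha:   # Step 2 of Algorithm 2
            k, ll = k_star, ll_star
        chain[r] = k
    return chain

# Toy check: three models with exact log-likelihoods (0, 1, 0.5), observed
# through unbiased log-normal noise, so PMMH still targets the exact
# posterior, which under a flat prior is softmax(true_ll).
true_ll = np.array([0.0, 1.0, 0.5])
sigma = 0.3
noisy = lambda k, rng: true_ll[k] + rng.normal(-0.5 * sigma**2, sigma)
chain = pmmh(noisy, log_prior=lambda k: 0.0, A=0, Omega=2)
```

The key property illustrated is that replacing the likelihood by any non-negative unbiased estimator leaves the invariant distribution over k exact; only the mixing degrades as the estimator's variance grows, which is precisely what the time machine is used to control.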
The PMMH algorithms [28] which employed the time machine all used N = 20 particles, and the sequential Monte Carlo (SMC) sampler [29] within these algorithms always ran for T = 10 time steps (for the "More sites" example, we set T = 50). The PMMH employing the ABC [44] algorithm of [38] also used N = 20, but M = 20 copies of the data were simulated for each particle in order to obtain the best results. Thus, this latter PMMH actually completed fewer iterations than the former PMMH within six hours.

In the numerical examples considered in this paper, we used the following proposals within our PMMH algorithms; note that these are only suggested proposals and the methodology is still valid with other choices. The number of change-points was propagated from PMMH iteration (i − 1) to iteration i using a discrete uniform distribution which was centered on k^{(i−1)} and had an odd integer width of W > 1. In the event that k^{(i−1)} − (W − 1)/2 < A or k^{(i−1)} + (W − 1)/2 > Ω (with A and Ω being the lower and upper limits of possible values for k, respectively), k^{(i)} was sampled from a discrete uniform distribution with W shortened as needed (i.e., any potential values for k^{(i)} that would be out of bounds were removed from the support).

The SMC samplers within the PMMH propagated the change-points s^k and the components of θ^{k+1} (i.e., the unknown substitution rates) via a Metropolis-Hastings kernel, and the unnormalised SMC sampler weights were calculated according to [29, Equation 31]. Within the kernel, each individual change-point was propagated via a discrete uniform random walk similar to that which propagated k. To avoid selecting the same site twice at SMC iteration t, the sampling consisted of two steps:

1. Sample s_{t,1} conditional on s_{t−1,1}.
2. For j ∈ {2, . . . , k}, sample s_{t,j} conditional on s_{t−1,j}, where s_{t,1:j−1} are removed from the support of the discrete uniform if need be.

Finally, each component θ_{t,j} of θ_{t,k+1} was independently propagated via a log-normal distribution with a mean that depended on the value of θ_{t−1,j}.

D.2 Sequential Monte Carlo (SMC) of [39]

We also implemented the ABC-SMC algorithm for model selection of [39, page 190], with steps labelled MS1–MS3. Within the perturbation kernel, each lth component of θ_j was independently propagated via a log-normal distribution with a mean that depended on the value of θ_j^l from the previous iteration of SMC. Each individual change-point was propagated via a discrete uniform random walk. If the site labels range from one to m, then any values of the support of the random walk that were less than two or greater than m were assigned a zero probability of being chosen. Furthermore, to avoid selecting the same site twice at an iteration, the sampling consisted of the same two steps enumerated above for the PMMH algorithms.

D.3 Time Machine
In belief propagation [31], one first sends messages up from the leaves of a tree to its topmost parent nodes. When at the topmost nodes, it is required to input the marginal of each individual parent node and then send the messages back down to the leaves (i.e., one must input the probability that a parent node is of a particular type at that point in time). We approximated those probabilities with the model's equilibrium frequencies when employing the time machine.
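A minimal sketch of the idea: Felsenstein-style pruning [1, 31] computes, for each node, the probability of the observed leaf data below it given the node's state; at a topmost retained node, the time machine closes the recursion by plugging in the model's equilibrium frequencies as that node's marginal. The Jukes-Cantor-style transition matrix, the nested-tuple tree encoding, and all names below are illustrative assumptions, not the paper's C++ implementation.

```python
import numpy as np

def jc_transition(t, mu=1.0):
    """Jukes-Cantor transition matrix over a branch of length t (an
    illustrative stand-in for the paper's substitution model)."""
    same = 0.25 + 0.75 * np.exp(-4.0 * mu * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * mu * t / 3.0)
    P = np.full((4, 4), diff)
    np.fill_diagonal(P, same)
    return P

def upward_message(node):
    """Send messages up from the leaves: for each state of this node, the
    probability of the observed leaf data below it. Nodes are tuples
    ('leaf', base, branch_len) or ('node', left, right, branch_len)."""
    if node[0] == 'leaf':
        _, base, t = node
        msg = np.zeros(4)
        msg[base] = 1.0           # delta mass on the observed base
        return msg, t
    _, left, right, t = node
    m_l, t_l = upward_message(left)
    m_r, t_r = upward_message(right)
    # Product of the two children's messages propagated along their branches
    return (jc_transition(t_l) @ m_l) * (jc_transition(t_r) @ m_r), t

def site_likelihood(topmost, pi):
    """Close the recursion at a topmost retained node by weighting its
    upward message with the equilibrium frequencies pi, as the time
    machine does for the removed upper portion of the tree."""
    msg, _ = upward_message(topmost)
    return float(pi @ msg)

# Example: a two-leaf 'cherry' observing base 0 at both leaves.
pi = np.full(4, 0.25)             # Jukes-Cantor equilibrium frequencies
cherry = ('node', ('leaf', 0, 0.1), ('leaf', 0, 0.1), 0.0)
lik = site_likelihood(cherry, pi)
```

A quick sanity check on such a sketch is that the likelihoods of all possible leaf configurations sum to one, and that identical leaves are more probable than differing ones over a short branch.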
D.4 Summary Statistic
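The site-wise comparison used as a summary statistic in this section amounts to counting mismatching entries between the simulated and actual datasets. A minimal sketch (the function names are ours, and the data are coded as integer arrays of shape n × m):

```python
import numpy as np

def abc_distance(simulated, actual):
    """Total number of (sequence, site) entries at which the simulated
    dataset differs from the actual one, comparing each of the n simulated
    sequences to its counterpart, as in Section D.4."""
    return int((np.asarray(simulated) != np.asarray(actual)).sum())

def abc_accept(simulated, actual, tol):
    """Deem the datasets sufficiently close if the mismatch count is
    below the specified tolerance level."""
    return abc_distance(simulated, actual) < tol

# Example: two sequences (rows) at three sites (columns), two mismatches.
sim = [[0, 1, 2], [3, 3, 3]]
act = [[0, 1, 1], [3, 0, 3]]
```

Because the sequences are compared position by position against their counterparts, this distance is sensitive to per-site state changes rather than only to aggregate diversity, unlike the mean pairwise difference mentioned above.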
The two ABC algorithms simulated datasets given sampled values of k, s^k, and θ^{k+1}. These model parameters were accepted as output when the simulated data was deemed to be sufficiently close to the actual dataset. Mean pairwise difference is sometimes used in ABC algorithms to compare two genetic sequences (see [45, Section 2.6]). Our datasets contained n genetic sequences, and we compared each of the n genetic sequences from the simulated data to its counterpart in the actual dataset. If the sum of the sites with different values (across all n genetic sequences) was less than a specified tolerance level, then the simulated and actual datasets were deemed to be sufficiently close to one another.

In both of the ABC algorithms implemented here, the terminal tolerance level at step T of the SMC algorithm was (mn/ ) (regardless of how high we set T or N to facilitate jumping between temperatures). However, with the tolerance level set as such, the ABC-SMC of [39] was very fast, and it produced 5,000 samples from all of the tested models within a matter of minutes. We tried increasing N so that the algorithm could use the full six hours allotted, but even when we set N to be a very high number, the resulting inference did not change at all and was consistently extremely poor.

D.5 Parallelisation
The algorithms were implemented in C++, using the OpenMP 3.0 and Eigen [46] libraries. The computation of the likelihood was parallelised across the m sites in the PMMH algorithms which employed the time machine. In the ABC algorithms, the simulation of the sites of the sequences was parallelised across m.

E Supporting Information: Implementation for Section 4.2
Our particle marginal Metropolis-Hastings (PMMH) algorithm [28] which employed the time machine used N = 50 particles, and the sequential Monte Carlo (SMC) sampler [29] within this algorithm ran for T = 150 time steps. The unknown model parameters were propagated from PMMH iteration (i − 1) to iteration i using the same schemes as outlined in Section D.1. For the time machine approximation to the stationary distribution of the tree, we used the same scheme as stated in Section D.3. Finally, the algorithm was implemented in C++, using the OpenMP 3.0 and Eigen [46] libraries, and the computation of the likelihood was again parallelised across m. We set g