[PDF] The Tajima heterochronous n-coalescent: inference from heterochronously sampled molecular data

Abstract

The observed sequence variation at a locus informs about the evolutionary history of the sample and past population size dynamics. The Kingman coalescent is used in a generative model of molecular sequence variation to infer evolutionary parameters. However, it is well understood that inference under this model does not scale well with sample size. Here, we build on recent work based on a lower resolution coalescent process, the Tajima coalescent, to model longitudinal samples. While the Kingman coalescent models the ancestry of labeled individuals, the heterochronous Tajima coalescent models the ancestry of individuals labeled by their sampling time. We propose a new inference scheme for the reconstruction of effective population size trajectories based on this model with the potential to improve computational efficiency. Modeling of longitudinal samples is necessary for applications (e.g. ancient DNA and RNA from rapidly evolving pathogens like viruses) and statistically desirable (variance reduction and parameter identifiability). We propose an efficient algorithm to calculate the likelihood and employ a Bayesian nonparametric procedure to infer the population size trajectory. We provide a new MCMC sampler to explore the space of heterochronous Tajima's genealogies and model parameters. We compare our procedure with state-of-the-art methodologies in simulations and applications.

Full PDF

TThe Tajima heterochronous n -coalescent: inferencefrom heterochronously sampled molecular data Lorenzo Cappello ∗ , Amandine V´eber ∗ , Julia A. Palacios , ∗ Department of Statistics, Stanford University CMAP, CNRS, ´Ecole Polytechnique, I.P. Paris Department of Biomedical Data Science, Stanford MedicineMay 23, 2020

Abstract

The observed sequence variation at a locus informs about the evolutionary history ofthe sample and past population size dynamics. The standard Kingman coalescent model ongenealogies – timed trees that represent the ancestry of the sample – is used in a genera-tive model of molecular sequence variation to infer evolutionary parameters. However, thestate space of Kingman’s genealogies grows superexponentially with sample size n , makinginference computationally unfeasible already for small n . We introduce a new coalescentmodel called Tajima heterochronous n-coalescent with a substantially smaller cardinality ofthe genealogical space. This process allows to analyze samples collected at different times, asituation that in applications is both met ( e.g. ancient DNA and RNA from rapidly evolvingpathogens like viruses) and statistically desirable (variance reduction and parameter iden-tiﬁability). We propose an algorithm to calculate the likelihood efﬁciently and present aBayesian nonparametric procedure to infer the population size trajectory. We provide a newMCMC sampler to explore the space of Tajima’s genealogies and model parameters. Wecompare our procedure with state-of-the-art methodologies in simulations and applications.We use our method to re-examine the scientiﬁc question of how Beringian bison went extinctanalyzing modern and ancient molecular sequences of bison in North America, and to recon-struct population size trajectory of SARS-CoV-2 from viral sequences collected in Franceand Germany. Keywords:

Bayesian nonparametric, Kingman n -coalescent, multi-resolution, ancient DNA, Gaus-sian process. ∗ The authors gratefully acknowledge partial funding from the France-Stanford Center for Interdisciplinary Stud-ies. JAP acknowledges support from National Institutes of Health grant R01-GM-131404 and the Alfred P. SloanFoundation. AV acknowledges partial funding from the chaire program Mathematical Modeling and Biodiversity(Ecole polytechnique, Museum National d’Histoire Naturelle, Veolia Environment, Foundation X). a r X i v : . [ s t a t . M E ] A p r Introduction

Statistical inference of evolutionary parameters from a sample of n DNA sequences accounts forthe dependence among samples and models observed variation through two stochastic processes:an ancestral process of the sample represented by a genealogy g , which depends on the effectivepopulation size trajectory ( N e ( t )) t ≥ , and a mutation process with a given set of parameters µ that, conditionally on g , models the phenomena that have given rise to the sequences. However,state-of-the-art methodologies are not scalable to the amount of data available because the latentspace of genealogies lives in a high dimensional space. In this paper, we tackle the problem ofscalability from a modeling perspective: we propose a new ancestral process for heterochronousdata that dramatically reduces the state space of genealogies. We complement this model with anew algorithm for fast likelihood calculations.Inference for ( N e ( t )) t ≥ has important applications in many ﬁelds, such as genetics, anthro-pology, and public health. In the absence of natural selection, the effective population size canbe used to approximate census population size. While census population size estimates can bedifﬁcult to obtain due to high costs and challenging sampling designs, we can reconstruct pastpopulation sizes from observed signatures of genetic diversity in a sample of the population.For example, one can estimate the population size of a virus from genetic samples in a situationwhere census counts are believed to be inaccurate, as it is common during an epidemic. Infer-ring population size dynamics – timing of population events, growth and decline rates – ratherthan estimating census counts, may be of scientiﬁc interest. For example, Shapiro et al. (2004)reconstructed bison population dynamics, providing new insights into the extinction of Beringianbison. In this paper, we include two studies supporting the motivations highlighted above: inone study we analyze viral samples of SARS-CoV-2, the virus responsible for the coronavirusdisease, and in a second study, we analyze ancient samples of bison in North America (datasetdescribed by Froese et al. (2017)).Both Bayesian and frequentist methods rely on the marginal likelihood computed by integrat-ing over the space G n of tree topologies with n leaves, and over the space of branch lengths. That2 A TT A T A G CG CA T

T..A..T..G..G..A T i m e s1s2 Ancestral sequence

Figure 1:

Coalescence and mutation.

A geneal-ogy of 10 individuals at a locus of 100 sites is de-picted as a bifurcating tree. Six mutations (at differ-ent sites) along the branches of the tree give rise tothe 10 sequences. The 94 sites that do not mutate arerepresented by black dots in the ancestral sequence.The nucleotides at the polymorphic sites are shown,and the colored arrows depict how ancestral sites aremodiﬁed by mutation. is: P ( Y | ( N e ( t )) t ≥ , µ ) = (cid:90) g ∈G n × R n − P ( Y | g , ( N e ( t )) t ≥ , µ ) d π ( g | ( N e ( t )) t ≥ ) , (1)where the n − random variables with values in R + are the times between consecutive coa-lescence events in the genealogy, and π ( · | ( N e ( t )) t ≥ ) denotes the probability distribution on G n × R n − implied by the ancestral process as a function of the past population size trajectory ( N e ( t )) t ≥ . The genealogy g is an auxiliary variable introduced to compute P ( Y | g , N e ( t ) , µ ) because direct calculation of the marginal P ( Y | N e ( t ) , µ ) is intractable. The prevailing consensusin the literature is to compute (1) with π deﬁned as the Kingman-coalescent prior law on leaf-labeled genealogies (formally introduced in Kingman (1982a,b)). However, the cardinality of G n grows superexponentially with n ( |G n | = n !( n − / n − ), creating a computational bottleneckin the calculation of the integral (1).An alternative to the Kingman n -coalescent is to use a lower resolution coalescent process,known as the Tajima n -coalescent (Tajima, 1983; Sainudiin et al., 2015; Palacios et al., 2019).While the state space of the Kingman n -coalescent is in bijection with the set of timed and labeled binary trees with n leaves, the state space of the Tajima n -coalescent is in bijection with the setof timed and unlabeled binary trees with n leaves. The cardinality of the space of the timed andunlabeled binary tree topologies with n leaves is given by the ( n − -th Euler zigzag number(Disanto and Wiehe, 2013), which behaves like /π ) n ( n − when n increases. While thecardinality of the space of Tajima trees still grows superexponentially in n , its rate of growthis drastically smaller than that of the space of Kingman trees. Palacios et al. (2019) show that3hen all sequences are sampled at the same time, employing the Tajima coalescent allows fastinference of ( N e ( t )) t ≥ .The main focus of this work is to develop a scalable model for sequences observed at differenttime points like those at the tips of the genealogy in Figure 1. The need to model heterochronousdata is motivated by both applications and statistical reasons. In applications, viral samples(HIV, inﬂuenza) are routinely collected serially. Ancient DNA studies are another very activearea of research in which the sampling design is intrinsically sequential. At least two statisticalreasons are motivating this model. First, it usually leads to a decrease in the variance of theestimate (Felsenstein and Rodrigo, 1999). In coalescent-based inference, the smaller the numberof extant lineages in a given time interval, the greater the variance of our estimate of the intervallength, and consequently, the greater the variance of any estimate that depends on that lengthsuch as ( N e ( t )) t ≥ . By including heterochronous samples, we can increase the number of activelineages in the past, and thus obtain better estimates. Second, Parag and Pybus (2019) show thatincluding heterochronous samples is necessary (in some cases) to make the parameters describing ( N e ( t )) t ≥ identiﬁable.The Tajima heterochronous n -coalescent fundamentally differs from the Tajima isochronous n -coalescent in that sequences sampled at different times are not exchangeable. The Tajimaisochronous n -coalescent distinguishes between singletons and vintaged lineages, where a sin-gleton lineage refers to a lineage that subtends a leaf in g , and a vintaged lineage refers to alineage that subtends an internal node in g . Singletons are indistinguishable while vintages arelabeled by the ranking of the coalescence event at which they were created. When dealing withheterochronous samples, singletons are instead implicitly labeled by their underlying samplingtimes. To account for this difference, we deﬁne a new Markov chain.The present paper contains three main contributions: ﬁrst, we propose a new lower-resolutioncontinuous-time Markov chain (CTMC) which we call the Tajima heterochronous n -coalescent,that allows to model partially-labeled genealogies of heterochronous samples. Second, we intro-duce a new algorithm to compute the likelihood. Likelihood calculation in Palacios et al. (2019)called BESTT relies on a backtracking algorithm that is computationally unfeasible for sample4izes greater than . Our new algorithm can accommodate up to sequences. Lastly, weintroduce new MCMC proposals for efﬁcient exploration of the space of Tajima genealogies anddo Bayesian nonparametric inference on ( N e ( t )) t ≥ .The main challenge in employing Tajima genealogies for coalescent-based inference is thatthe sequence data can be allocated, i.e. mapped, to a given genealogy, in many possible ways. Theallocation is necessary to compute the likelihood. To ﬁnd all possible maps, Palacios et al. (2019)use a backtracking algorithm that proceeds in a bottom-up fashion: starting from the tips of thetree, the algorithm moves along the tree to the root checking for possible allocations of subsets ofthe data Y to clades of the tree g . Our proposal reverts this process and proceeds in a top to bottomfashion eliminating the backtracking step and reducing its computational complexity. While thereis no exact analytical expression for the computational complexity of the backtracking algorithm,a loose upper bound of the backtracking algorithm is O ( n !) , which we bring down to O ( n ) withour new proposal. The lower bound is of the order of O ( n ) for both algorithms. The algorithmrelies on a novel graphical representation of the data as a tree structure. We note that this tree isrelated to the deﬁnition of the directed acyclic graph (DAG) used in Palacios et al. (2019), withsome important differences. The DAG depends on g whereas the tree we introduced is solely afunction of the data. This implies that we need to deﬁne it only once. Also, the DAG groupssequences differently since it does not incorporate sampling time information.The rest of the paper proceeds as follows. In Section 2, we deﬁne the Tajima heterochronous n -coalescent. In Section 3, we introduce the mutation model we shall assume, describe the data,deﬁne the likelihood and the new algorithm to compute it. Section 4 describes the MCMC al-gorithm for posterior inference, and in Section 5, we present a comprehensive simulation studyoutlininghow the model works and comparing our method to state-of-the-art alternatives. In Sec-tion 6, we analyze modern and ancient bison sequences described in Froese et al. (2017). In Sec-tion 7, we apply our method to SARS-CoV-2 viral sequences collected in France and in Germany.Section 8 concludes. An open-source implementation is given in R package phylodyn , whichis available for download at https://github.com/JuliaPalacios/phylodyn. The Tajima heterochronous n -coalescent The Tajima heterochronous n -coalescent is an inhomogeneous continuous-time Markov chainthat describes the ancestral relationships of a set of n individuals sampled, possibly at differenttimes, from a very large population. The set of ancestral relationships of the sample is representedby a ranked genealogy as the one depicted in Figure 2. Every organism is dated and labeledaccording to the time in which the organism lived (if ancient, by radiocarbon date) or in whichthe living organism was sequenced. In this generalization of the Tajima coalescent, each pair ofextant ancestral lineages merges into a single lineage at an instantaneous rate which depends onthe current effective population size N e ( t ) , and new lineages are added when one of the prescribedsampling times is reached. In this work, we do not model the stochasticity of sampling times butwe condition on them as being ﬁxed.

12 34 5 6 7 89 s1, t11s2t10t9t8t7t6t5t4t3t2 g: ranked tree genealogy Jump chain ( a ( t2 ),b( t2 )) = ((0,0), {9})( a ( t3 ),b( t3 )) = ((0,0), {5,8})( a ( t4 ),b( t4 )) = ((1,0), {5,7})( a ( t5 ),b( t5 )) = ((1,1), {5,6})( a ( t6 ),b( t6 )) = ((1,2), {3,5})( a ( t7 ),b( t7 )) = ((1,3), {3,4})( a ( t8 ),b( t8 )) = ((2,3), {2,3})( a ( s2 ),b( s2 )) = ((3,3), {1,2})( a ( t9 ),b( t9 )) = ((3,0), {1,2})( a ( t10 ),b( t10 )) = ((5,0), {1})( a ( t11 ),b( t11 )) = ((7,0), {}) Figure 2:

Example of a Tajima heterochronous genealogy and its jump chain.

A realization of aTajima heterochronous n -coalescent with n = (7 , and s = ( s , s ) , represented as a ranked tree shapewith coalescence and sampling times, denoted g . The column to the right displays the corresponding jumpchain (see the text for notation). Let us introduce some notation. Let m be the number of sampling time points and n be thetotal number of samples. Let n = ( n , . . . , n m ) denote the number of sequences collected attimes s = ( s , . . . , s m ) , with s = 0 denoting the present time, and s j > s j − for j = 2 , . . . , m (time goes from present towards the past). We refer to the sequences counted in n i as “belonging6o sampling group s i ”. Let t = ( t n +1 , . . . , t ) be the vector of coalescent times with t n +1 =0 < t n < ... < t ; these are the times when two lineages have a common ancestor. Notethat the subscript in t k does not indicate the current number of lineages, as it is often done inthe coalescent literature, but it indicates the number of lineages that have yet to coalesce (somesequences may not have been sampled yet). We use the rank order of the coalescent events(bottom-up) to label the internal nodes of the genealogy. That is, the node corresponding to thecoalescent event occurring at time t n is labeled (see t in Figure 2), the node corresponding tothe coalescence event occurring at time t n − is labeled , etc . We refer to the internal node labelsas vintages ( i.e. , rankings).The Tajima heterochronous n -coalescent is the process ( a ( t ) , b ( t )) t ≥ that keeps track of a ( t ) ,a vector of length m whose j -th position indicates the number of singletons ( i.e. , lineages thathave not been involved in a coalescence event) from sampling group s j at time t , and b ( t ) is the setof vintaged lineages at time t . The process starts at t = 0 in state ( a (0) = ( n , , . . . , , b (0) = ∅ ) , jumps deterministically at every sampling time and jumps stochastically at every randomcoalescent time until it reaches the unique absorbing state ( a ( t ) = (0 , . . . , , b ( t ) = { n − } ) at time t , when all n samples have a single most recent common ancestor at the root (Figure 2).At each sampling time s i , the state of the Tajima coalescent jumps deterministically as follows: ( a ( s i ) , b ( s i )) = ( a ( s i − ) + n i e i , b ( s i − )) , where f ( s i − ) denotes the left-limit of the function f at s i and e i is the i -th unit vector.Let us now turn to the embedded jump chain at coalescent times. At time t i , two extantlineages coalesce to create a new lineage with vintage n + 1 − i . Four types of coalescencetransitions are possible depending on which and how many sampling groups are involved: (1)two singletons of the same sampling group coalesce (up to m possible moves for the chain), (2)two singletons of different sampling groups coalesce (up to m ( m − / possible moves), (3)one singleton lineage and one vintaged lineage coalesce (up to m possible moves), or (4) twovintaged lineages coalesce (only one possibility because for vintages, the sampling informationis irrelevant). Each pair coalesces with the same probability and the transition probabilities at7oalescent times are thus given by P (cid:104) ( a ( t i ) ,b ( t i )) (cid:12)(cid:12)(cid:12) ( a ( t i − ) , b ( t i − )) (cid:105) (2) =  (cid:81) mj =1 (cid:18) a j ( t i − ) a j ( t i − ) − a j ( t i ) (cid:19)(cid:18)(cid:80) mj =1 a j ( t i − ) + | b ( t i − ) | (cid:19) if ( a ( t i ) , b ( t i )) ≺ ( a ( t i − ) , b ( t i − ))0 otherwisewhere ( a ( t i ) , b ( t i )) ≺ ( a ( t i − ) , b ( t i − )) means that ( a ( t i ) , b ( t i )) can be obtained by merging twolineages of ( a ( t i − ) , b ( t i − )) and | b | denotes the cardinality of the set b .Observe that the quantity (cid:80) mj =1 a j ( t i − ) + | b ( t i − ) | appearing in (2) corresponds to the to-tal number of extant lineages just before the event at t i . Furthermore, since only two lineagescoalesce at time t i , at most two terms in the product appearing in the numerator of (2) are notequal to one. Finally, if m = 1 , (2) degenerates into the transition probabilities of the Tajimaisochronous n -coalescent; on the other hand if m = n , the process degenerates into the Kingmanheterochronous n -coalescent since all singletons are uniquely labeled by their sampling times.Figure 2 shows a possible realization from the Tajima heterochronous n -coalescent.To deﬁne the distribution of the holding times, we introduce the following notation. Wedenote the intervals that end with a coalescent event at t k by I ,k and the intervals that end with asampling time within the interval ( t k +1 , t k ) as I i,k where i ≥ is an index tracking the samplingevents in ( t k +1 , t k ) . More speciﬁcally, for every k ∈ { , . . . , n } , we deﬁne I ,k = [max { t k +1 , s j } , t k ) , where the maximum is taken over all s j < t k , (3)and for every i ≥ we set I i,k = [max { t k +1 , s j − i } , s j − i +1 ) with the max taken over all s j − i +1 > t k +1 and s j < t k . (4)We also let n i,k denote the number of extant lineages during the time interval I i,k . For example,8n Figure 2, in ( t , t ) we have I , = [ s , t ) , I , = [ t , s ) and no I i, for i ≥ . The vector ofcoalescent times t is a random vector whose density with respect to Lebesgue measure on R n − can be factorized as the product of the conditional densities of t k − knowing t k , which reads: for k = 3 , ..., n + 1 , p ( t k − | t k , s , n , ( N e ( t )) t ≥ ) = C ,k − N e ( t k − ) exp (cid:40) − (cid:90) I ,k − C ,k − N e ( t ) d t + m (cid:88) i =1 (cid:90) I i,k − C i,k − N e ( t ) d t (cid:41) , (5)where t n +1 = 0 by convention, C i,k := (cid:0) n i,k (cid:1) , and the integral over I i,k − is zero if there are lessthan i sampling times between t k and t k − . The distribution of the holding times deﬁned abovecorresponds to the same distribution of holding times in the heterochronous Kingman coalescent(Felsenstein and Rodrigo, 1999). Although the heterochronous Tajima coalescent takes value ona different state space, it remains true that every pair of extant lineages coalesces at equal rate.Finally, given n , s and t , a complete realization of the Tajima heterochronous n -coalescentchain can be uniquely identiﬁed with a partially labeled binary ranked tree shape g of n =( n , . . . , n m ) samples at ( s , . . . , s m ) with its n − coalescent transitions, so that P ( g | t , s , n ) = n (cid:89) i =2 P (cid:104) ( a ( t i ) , b ( t i )) (cid:12)(cid:12)(cid:12) ( a ( t i − ) , b ( t i − )) (cid:105) . (6)Equation (6) gives the prior probability of the tree topology under the Tajima heterochronous n -coalescent. Putting together (5) and (6), we obtain a prior π ( g | s , n , ( N e ( t )) t ≥ ) π ( g | s , n , ( N e ( t )) t ≥ ) = P ( g | t , s , n ) n +1 (cid:89) k =3 p ( t k − | t k , s , n , ( N e ( t )) t ≥ ) , (7)which can be used in (1). 9 Data and Likelihood

We assume that the observed data Y consists of n sequences at z polymorphic (mutating) sitesat a non-recombining contiguous segment of DNA of organisms with low mutation rate. Underthese assumptions, a widely studied mutation model is the inﬁnite sites model (ISM) (Kimura,1969; Watterson, 1975) with Poissonian mutation, which corresponds to throwing a Poisson pointprocess of mutations on the branches of g at rate µ such that every mutation occurs at a differentsite and no mutations are hidden by a second mutation affecting the same site.An important consequence of the ISM is that Y can be represented as an incidence matrix Y and a frequency counts matrix Y . Y is a k × z matrix with - entries, where indicatesthe ancestral type and indicates the mutant type; k is the number of unique sequences (orhaplotypes) observed in the sample, and the columns correspond to polymorphic sites. Y is a k × m count matrix where the ( i, j ) th entry denotes how many haplotype i sequences belongingto group s j are sampled. For example, the n = 10 sequences deﬁned by the realizations ofthe ancestral and mutation processes depicted in Figure 1 can be summarized into Y and Y displayed in Figure 3(A). Y and Y can alternatively be represented graphically as an augmented perfect phylogeny T . This graphical representation of the data is exploited by our likelihood algorithm. The aug-mented perfect phylogeny representation is an extension of the gene tree or perfect phylogeny (Gusﬁeld, 1991; Grifﬁths and Tavare, 1994; Palacios et al., 2019) representation to the hete-rochronous case. The standard perfect phylogeny deﬁnition leaves out the information carriedby Y . To our knowledge, Gusﬁeld’s approach has never been generalized to the heterochronouscase. In the augmented perfect phylogeny T = ( V , E ) , V is the set of nodes of T , and E is the setof weighted edges. We deﬁne T as follows:1. Each haplotype labels at least one leaf in T . If a haplotype is observed at k different sam-pling times, then k leaves in T will be labeled by the same haplotype. The pair (haplotypelabel, sampling group) uniquely labels each leaf node.10 Y1 hahbhchdhehf Y2 s1321010 s2001101 V0V2 V1V3 V4 V5 V6 V7 V8 l3l2 l1 T (hb, s1) (ha, s1) (hf, s2)(hc, s1) (hc, s2) l4 l6l5(hd, s2) (he, s2) (A) (B) |E2|=1|E1|=1|E6|=0|E5|=1|E7|=1|E3|=0|E4|=0|E6|=0 (C) |V2|=2|V1|=4|V6|=1|V5|=3|V7|=1|V3|=1|V4|=1|V6|=1|E8|=2 |V8|=1 Figure 3:

Incidence matrix, frequency matrix and perfect phylogeny representation . Panel (A): datais summarized as an incidence matrix Y ( h denotes the haplotypes, l the segregating sites, the colorscorrespond to those depicted in Figure 1) and a matrix of frequencies Y ( s denotes the sampling group).Panel (B): T denotes the perfect phylogeny corresponding to Y and Y ; each of the 6 polymorphic siteslabels exactly one edge. When an edge has multiple labels, the order of the labels is irrelevant. Each leafnode is labeled by a pair (haplotype, sampling time), with each haplotype possibly labeling more than oneleaf nodes. Panel (C): | E i | corresponds to the number of mutations along the edge subtending node V i in(B) and | V i | corresponds to the number of sequences descending from V i in (B), see the text for details.

2. Each of the z polymorphic sites labels exactly one edge. When multiple sites label the sameedge, the order of the labels along the edge is arbitrary. Some external edges (edges sub-tending leaves) may not be labeled, indicating that they do not carry additional mutationsto their parent node.3. For any pair (haplotype h k , sampling group), the labels of the edges along the unique pathfrom the root to the leaf h k specify all the sites where h k has the mutant type.Figure 3(B) plots T corresponding to Y and Y displayed in Figure 3(A). Observe that T includes sampling information in the leaf labels. In the example, h C labels two leaves becauseit is observed at times s and s . The corresponding edges E and E are unlabeled, i.e. , nomutations are allocated to those edges because the underlying nodes carry identical sequences(same haplotype). We “augment” Gusﬁeld’s perfect phylogeny because the sampling informationis crucial in the likelihood calculation. T implicitly carries some quantitative information that can be quickly summarized. We de-note the number of observed sequences subtended by an internal node V by | V | . If V is a11eaf node, | V | denotes the frequency of the haplotype h observed at the corresponding samplingtime s . Similarly, we denote the number of mutation labels assigned to an edge E by | E | . If nomutations are assigned to E , then | E | = 0 . For parsimony, the edge that connects node V i to itsparent node is denoted by E i . See Figure 3(C) for an example.Gusﬁeld (1991) gives an algorithm to construct the “traditional perfect phylogeny” T ’ inlinear time. Constructing T from T ’ is straightforward since all we need is to incorporate thesampling information and add leaf nodes if a haplotype is observed at multiple sampling times.If we drew T (cid:48) from the data in Figure 3, it would not have node V , but only a single node V labeled by haplotype h C . A description of the algorithm can be found in the supplementarymaterial. The crucial step needed to compute the likelihood of a Tajima genealogy g is to sum over allpossible allocations of mutations to its branches. This can be efﬁciently done by exploiting theaugmented perfect phylogeny representation of the data T and by ﬁrst mapping nodes of T tosubtrees of g . We stress that the need for an allocation step arises only when working withTajima genealogies. In Kingman’s coalescent, tree leaves are labeled by the sequences to whichthey correspond, and so there is a unique possible allocation. In Tajima’s coalescent, leaves areunlabeled, creating potential symmetries in the tree, and so we have to scan all the possible waysin which the observed sequences may be allocated to g . Let a denote a possible mapping of nodes of T to subtrees of g . a is encoded as a vector of length n − , where the i -th entry gives the node in T which is mapped to the subtree with vintage i , g i (including the branch that subtends vintage i ). Our algorithm ﬁrst maps all non-singleton nodes V of T to subtrees of g , that is, only nodes such that | V | > are entries of a . Singleton nodes in T ( V ∈ V such that | V | = 1 ) are treated separately and are initially excluded from the allocationstep. For example, Figure 4 shows a possible vector a whose entries are the non-singleton nodes12 a = ( V2 , V5 , V1 , V5 , V0 , V1 , V0 , V0 , V0 )

12 34 5 6 7 89

Figure 4:

A possible allocation of non-singletonnodes of V to subtrees of g . For a given alloca-tion a (bottom ﬁgure), we display how subtrees in g (identiﬁed by the vintage tag at their root – blacknumber in the top ﬁgure) are allocated to the nodesof T . Each color depicts an allocation of a subtree toa node: V (red), V (blue), V (green) and V (yel-low). V , V , V , and V of T of Figure 3. We note that nodes can appear more than once in a , meaningthat they can be mapped to more than one subtree. On the other hand, a single node V i is notnecessarily mapped to all the vintages, leaves and internal branches of g j ; different nodes may bemapped to some subtrees of g j (including external branches), leading to a situation where V i ismapped to only a subset of the vintages and branches constituting g j . For example, in Figure 4, V is mapped to g and g , but V is mapped to g , a subtree of both g and g ; hence V is onlymapped to the green part of g and g as depicted in the Figure.The precise mapping of nodes in T to subtrees of g described below is needed to allocatemutations in T to branches of g . We will explain the allocation of mutations on g for a given a inthe next subsection.We now deﬁne an algorithm to efﬁciently ﬁnd all possible mappings a for a given g . Weencode the set of all possible a , as an a × ( n − matrix A , where each row is a possible a ( n − columns) and the number of rows a is equal to the number of possible allocations. Togenerate A , the algorithm proceeds recursively from top to bottom in g , by sweeping throughsubtrees in g and matching them to nodes in T according to parent-offspring relationships andnumber of descendants in both T and g . To be more precise, the algorithm is initialized by settingthe × ( n − A matrix to A = ( V , . . . , V ) , i.e. , V is mapped to all subtrees in g . The13lgorithm proceeds iteratively, adding and removing rows from A , iterating over an index i goingfrom n − to . The ﬁrst step is to deﬁne A ( i ) , the set of node allocations in the i -th column of A . Then for all V ∈ A ( i ) , the algorithm iterates through the following steps: deﬁne T V as the setof child nodes of V that have | g i | descendants. If the number of child nodes of V is at least , V is also included in T V . If T V = ∅ , for example if V is a leaf node, the algorithm does nothing. If | T V | = 1 , the algorithm replaces V by the element of T V in the columns I of A corresponding toall subtrees of g i . If | T V | > , the matrix A is augmented by stacking | T V | − copies of A V ( , I ) ,the submatrix of A obtained by extracting all the row vectors whose I -th elements are V . Theoriginal submatrix A V ( , I ) is referred to as A (1) V ( , I ) , and A (2) V ( , I ) , . . . , A ( | T V | ) V ( , I ) denote itscopies. Lastly, the algorithm replaces V by the ﬁrst element of T V in A (1) V ( , I ) , by the secondelement of T V in A (2) V ( , I ) and so on, until the last element of T V is substituted in A ( | T V | ) V ( , I ) .The simple rule described above is fast to compute but it leads to incorrect allocations be-cause nodes may be mapped a redundant number of times. For example, it is easy to see thatimplementing the algorithm above, we could deﬁne an allocation a where node V is allocatedto all subtrees of size two; however, V should be allocated at most once. This issue can beavoided by noting that internal nodes in V should appear in each a a number of times equal totheir number of child nodes minus one, while leaf nodes, say V (cid:48) ∈ V , should appear | V (cid:48) | − times. Hence, we complete each iteration by eliminating rows of A where this rule is violated.A second elimination rule is needed to account for the constraints imposed by the sampling timeinformation: rows are eliminated when their assignments involve nodes labeled by a samplingtime s (cid:48) “matched” to subtrees of g that have leaf branches terminating at a different samplingtime. Algorithm 1 in the Appendix summarizes the above description.Figure 5 gives examples of possible allocations of T to two different genealogies g and g ’.The second genealogy g ’ differs from g by the order of the coalescent events and which areinverted. g and g ’ share the common allocation a = ( V , V , V , V , V , V , V , V , V ) ; however, g has a second possible allocation a = ( V , V , V , V , V , V , V , V , V ) that it is not compatiblewith g ’. This difference is due to the fact that V has three descendants belonging to samplinggroup s , while g has two subtrees with leaves sampled at s , and g ’ has only one. We note that14 g g’a = ( V2 , V5 , V1 , V5 , V0 , V1 , V0 , V0 , V0 ) a = ( V2 , V5 , V1 , V5 , V0 , V1 , V0 , V0 , V0 ) a = ( V5 , V2 , V5 , V1 , V1 , V0 , V0 , V0 , V0 ) A ={ a ,a } A ={ a }

12 34 5 6 7 89

Figure 5:

Example of allocations for two distinct genealogies . Two possible samples from the Tajimaheterochronous n -coalescent. Below we list all the possible allocations of nodes of T to g and g ’. Thetwo genealogies differ solely by the inversion of the coalescent events and . This change gives rise todifferences in the possible allocations: for example, V can be mapped to the subtree deﬁned by node in g but not in g ’. singleton nodes also need to be allocated, both in a and a . We will elaborate on this point inthe next subsection. To calculate the likelihood, we assume the ISM of mutations and that mutations occur accordingto a Poisson point process with rate µ on the branches of g , where µ is the total mutation rate. Tocompute the likelihood we need to map mutations in T to branches of g and this is done for eachmapping a i of non-singleton nodes of T to subtrees of g . For every V in T such that | V | > ,we deﬁne E V as the set formed by the edges in T that subtend singleton children of V and, withthe exception of V = V , E V in addition includes the edge that subtends V . For the example inFigure 3(B), E V = { E , E , E } . Let V ∗ be the set of all V ∈ V such that | V | > , then thelikelihood function is deﬁned as P ( Y | g , N e , µ ) = a (cid:88) i =1 P ( Y , a i | g , N e , µ )= a (cid:88) i =1 (cid:89) V ∈ V ∗ P ( V, E V , a i | g , N e , µ ) , (8)15here we recall that a is the number of possible allocations, we have written N e = ( N e ( t )) t ≥ ,and P ( V, E V , a i | g , N e , µ ) is the probability of observing the mutations of the E V edges alongthe corresponding branches of g deﬁned by the mapping a i as follows.If V has no singleton child nodes, then E V = { E } and P ( V, { E } , a i | g , N e , µ ) ∝ ( µl ) | E | e − µ T , (9)where l is the length of the branch in g that subtends g j , j is the largest index such that a i,j = V , and T denotes the length of the subtree in g to which V is mapped in a i (as described inSubsection 3.2.1). For example, considering V in Figure 4, we have T = 2 t n + ( t n − − t n ) and l = ( t n − − t n ) is the length of the branch connecting vintage to vintage .If node V has singleton child nodes, P ( V, { E, E ch , . . . , E ch k } , a i | g , N e , µ ) ∝ ( µl ) | E | e − µ T (cid:88) R ∈ Π( E V ) k (cid:89) j =1 ( µl R j ) | E chj | , (10)where the ﬁrst term on the r.h.s is deﬁned exactly as the quantity on the r.h.s. of (9), while the sec-ond term corresponds to the probability of all possible different matchings between R , . . . , R k ,the ﬁrst k indexes such that a i,R j = V , and | E ch | , | E ch | , . . . , | E ch k | , the k numbers of mutationsobserved on the edges E ch , . . . , E ch k leading to the child nodes of V . In this expression, Π( E V ) is the set of all possible such matchings R .Before deﬁning Π( E V ) more precisely, we make two observations. First, not all matchingsare possible since not all leaf branches terminate at the same time (heterochronous sampling).Second, it is enough to consider the allocations that contribute to distinct likelihood values, i.e. allocations for which the underlying samples are “distinguishable” in the sense that they have adifferent number of mutations.We deﬁne Π( E V ) as the set of all possible “distinct matchings of number of observed sin-gleton mutations to singleton branches”, that is, allocations which lead to a distinct likelihoodvalues. To construct Π( E V ) , we ﬁrst partition the singleton edges E ch , . . . , E ch k according tothe sampling times of the corresponding nodes V ch , . . . , V ch k . Let k i be the number of nodes16n { V ch , . . . , V ch k } with sampling time s i , i.e. , the size of each subset of the partition. We thenfurther partition these subsets by grouping together the edges carrying the same number of muta-tions (deﬁned as | E ch | , . . . , | E ch k | ). For each given sampling time s j , let k (1) j , . . . , k ( m j ) j denotethe cardinalities of the m j sub-subsets obtained by this procedure, so that k j = (cid:80) m j h =1 k ( h ) j . Thecardinality of Π( E V ) is then | Π( E V ) | = m (cid:89) j =1 k j ! k (1) j ! . . . k ( m j ) j ! , (11)where the product in (11) is the number of permutations with repetition of the different edgesthat are compatible with the data in terms of sampling times and numbers of mutations carried.Note that Equation (11) is not the same as Equation (6) in Palacios et al. (2019) because herewe account for the different sampling groups. It degenerates into Equation (6) in Palacios et al.(2019) in the isochronous case.Lastly, we note that knowing a priori the full matrix A allows to compute efﬁciently thelikelihood (8) via a sum-product algorithm. Indeed, for each V ∈ V ∗ there may be several rows a of A such that P ( V, E V , a | g , N e , µ ) is the same, due to the fact that V is mapped to the samesubtree in all these allocations. For such a V , one could compute the likelihood corresponding tothese r allocations a (cid:48) , . . . , a (cid:48) r in the following way: r (cid:88) i =1 (cid:89) V ∈ V ∗ P ( V, E V , a (cid:48) i | g , N e , µ )= P ( V, E V , a (cid:48) | g , N e , µ ) r (cid:88) i =1 (cid:89) V (cid:48) ∈ V ∗ \{ V } P ( V (cid:48) , E V (cid:48) , a (cid:48) i | g , N e , µ ) . (12)The exact sum-product formulation of (8) is speciﬁc to the observed Y and A . In Section 2 we have introduced a new prior for genealogies and in Section 3, we have expoundedhow to compute the likelihood of heterochronous data Y generated by a Poisson process of mu-17ations superimposed on this new genealogy. We ﬁnally need to specify a prior distribution on ( N e ( t )) t ≥ to complete our Bayesian model. In this paper, we follow Palacios and Minin (2013),who place a Gaussian process (GP) prior on (log( N e ( t ))) t ≥ (the logarithm is used to ensure that N e ( t ) ≥ for all t ). We thus have: Y | g , µ, ( N e ( t )) t ≥ , n , s ∼ Poisson process g | ( N e ( t )) t ≥ , s , n ∼ Tajima heterochronous n -coalescent (13) (log( N e ( t ))) t ≥ | τ ∼ GP (0 , C ( τ )) τ ∼ Gamma ( α, β ) where C ( τ ) is the covariance function of the Gaussian process. As in Palacios and Minin (2013),for computational convenience we use Brownian motion with covariance elements Cov(log( N e ( t )) , log( N e ( t (cid:48) )) = τ min( t, t (cid:48) ) for any t, t (cid:48) > as our GP prior. From (13), the posterior distribution can be written as π ((log( N e ( t ))) t ≥ , τ, g | Y , µ ) ∝ P ( Y | g , (log( N e ( t ))) t ≥ , µ ) π ( g | (log( N e ( t ))) t ≥ ) π ((log( N e ( t ))) t ≥ | τ ) π ( τ ) , (14)which we approximate via MCMC methods. Full conditionals are not available, and so we useMetropolis-within-Gibbs updates. At each MCMC iteration, we jointly update ((log( N e ( t ))) t ≥ , τ ) via a Split Hamiltonian Monte Carlo (HMC) (Shahbaba et al., 2014) suitably adapted to phylo-dynamics inference by Lan et al. (2015); then we update the topology g and t . We propose twoMetropolis steps to update g and t . The latter may also be combined in a single step. The tran-sitions for g and t are tailored to the Tajima n -coalescent genealogies. To update g , we employthe scheme in Palacios et al. (2019). To update t , we propose a new sampler, which allows topropose branch lengths that account for the observed sampling times constraints, an issue speciﬁcto heterochronous samples under the ISM assumption and detailed in the next subsection.18 .1 Constraints imposed by the ISM hypothesis Under the ISM hypothesis, mutations partition the observed sequences into two sets: the se-quences that carry the mutations and the sequences that do not. This recursive partitioning ofthe sequences is graphically represented by T . As a consequence, not all topologies g and not allvectors t are compatible with the data, i.e. have posterior probability or density greater than 0.The combinatorial constraints imposed by the ISM on the space of topologies are discussed indetail in Cappello and Palacios (2020).The constraints on t solely arise in the heterochronous case. First note that the deﬁnition of theTajima heterochronous n -coalescent implies that there can be at most n − coalescence eventsbefore s , at most n + n − events before s , and so on. Moreover, if there are shared mutationsbetween some (but not all) samples with different sampling times, the maximum number ofcoalescent events between the involved sampling times is further restricted. In the example ofFigure 3(A), there is a shared mutation l between 3 samples with sampling time s and a samplewith sampling time s > s . Out of the 7 samples obtained at time s , the 3 samples that share the l mutation could coalesce ﬁrst some time between s and s (at most coalescent events amongthe 4 sequences descending from node V ), but they need to coalesce with the sample at time s in node V before they coalesce with the other 4 samples collected at time s (those can coalesceat most 3 times between s and s ). Therefore, there are at most 5 coalescent events before s .To encode the constraints imposed by the sampling information, we deﬁne a vector c oflength m , where the i th entry denotes the maximum number of coalescent events that can happen(strictly) before time s i for given Y , s and n . Trivially c =0 because there are no samples. Letus stress that c does not deﬁne the number of coalescent events in a given interval (a quantitydetermined by t ), but it is the maximum number of coalescent events that can happen before eachsampling time to ensure compatibility with the data. In the example of Figure 3(A), we have c = (0 , . Note that c is and not n − . In the online supplementary material, we providea greedy search algorithm to deﬁne c . 19 .2 Coalescent times updates Let ∆ t := ( t n − t n +1 , . . . , t − t ) be the vector of intercoalescence times, and (∆ t i ) i ∈ I thesubvector of elements of ∆ t at positions I ⊆ { , . . . , n − } . The proposal is generated in threesteps. First, we uniformly sample the number of intercoalescent times proposal moves – i.e. , thecardinality of I , then we uniformly choose which times to modify – i.e. , we deﬁne I , and lastly,we sample the proposals (∆ t i ) (cid:48) i ∈ I . The ﬁrst two steps balance between fast exploration of thecoalescent times state-space and a high acceptance probability – few changes are expected tolead to higher acceptance rates while many changes are expected to lead to faster exploration ofthe state space. In our implementation we limit the maximum possible number of intercoalescenttimes moves to a ﬁxed number Z (cid:28) n − . Lastly, we sample new states (∆ t i ) (cid:48) , for i ∈ I froma truncated normal with mean ∆ t i and standard deviation σ ∆ t i . The left tail is truncated by aparameter lo i , and the right tail is left unbounded. Three reasons motivate this choice: it haspositive support, it can be centered and scaled around the current ∆ t i using a single parameter σ , and we can set the lower bound lo i to ensure that only compatible times t (cid:48) are proposed. Toset the values of lo i , we rely on c , the vector that speciﬁes the maximum number of coalescentevents possible before each sampling time. We note that the elements of c can be used to indexcoalescent times. In particular, t n − c i denotes the time of the ( c i + 1) th coalescent event. Forexample in Figure 2, t n − c = t is the ﬁrst coalescent event, and t n − c = t is the sixth coalescentevent. Given c , lo i is set to lo i = max j =1 ,...,m { , { [ s j − ( t n − c j − ∆ t i )] ( i ≤ c j + 1) }} , (15)where ( i ≤ c j + 1) is an indicator function. Equation (15) ensures the proposal t (cid:48) n − c j ≥ s j forall j . Indeed, note that t n − c j = (cid:80) c j +1 k =1 ∆ t k . Hence, if ( t n − c j − ∆ t i ) − s j > for any given j such that i ≤ c j + 1 , then the proposed value of (∆ t i ) (cid:48) could be zero and still t (cid:48) n − c j would be acompatible time. In this case, we do not need to impose any restriction on the lower bound of thetruncated normal. On the other hand, if the vector considered in (15) has one or more positivevalues, the proposed value (∆ t i ) (cid:48) should be large enough to ensure that for all sampling times20 j , there will never be more than c j coalescent events before s j . In other words, we truncate theproposal distribution support to ensure the compatibility of t ’. We discuss how to set Z and σ inSection 5.The transition density of coalescent times is given by k ( t , t (cid:48) ) = 1 Z (cid:18) n − | I | (cid:19) − (cid:89) i ∈ I Truncated N (∆ t i , σ ∆ t i , lo i , ∞ ) , (16)with Truncated N (∆ t i , σ it i , lo i , ∞ ) denoting a truncated normal density function with mean ∆ t i ,standard deviation σ ∆ t i , lower bound lo i and upper bound ∞ . We explore the ability of our procedure to reconstruct ( N e ( t )) t ≥ in simulation across a rangeof demographic scenarios which capture realistic and challenging population size trajectories en-countered in applications. The code for simulations and inference is implemented in R package phylodyn , which is available for download at https://github.com/JuliaPalacios/phylodyn. Simulation setup.

Given n , s , and ( N e ( t )) t ≥ , we simulate genealogies under the Tajimaheterochronous n -coalescent (Section 2). Given a realized g and ﬁxed µ , we draw M mutationsfrom a Poisson distribution with parameter µL ( L is the length of the tree g : the sum of all branchlengths of g ) and place them independently and uniformly at random along the branches of thetimed genealogy. Y , Y and T are then constructed as described in Section 3.1. We simulategenealogies with three population scenarios:1. A bottleneck (“bottleneck”): N e ( t ) =  if t ∈ [0 , . , . if t ∈ [0 . , . , if t ∈ [0 . , ∞ ) . (17)21. An instantaneous drop (“drop”): N e ( t ) =  . if t ∈ [0 , . , if t ∈ [0 . , ∞ ) . (18)3. Two periods of constant population size with an exponential growth in between (“exp”): N e ( t ) =  if t ∈ [0 , . ,

10 exp(2 − t ) if t ∈ [0 . , . , . if t ∈ [0 . , ∞ ) . (19)For each scenario, we generated genealogies with three numbers of leaves ( n = 14 , , )and different n , s as summarized in Table 1. The mutation parameter is varied to analyze theeffect of the number of segregating sites on the quality of the estimation.Table 1: Summary of parameter values used in simulations.

List of parameters n , s , µ anddemographic scenarios ( N e ( t )) t ≥ used to simulate data. We report the realized number M ofmutations for each of the 9 data sets ( mutations). n = 14 n = 35 n = 70 B o ttl e n ec k n (5,5,4) (10,10,10,5) (10,10,10,10,5,10,5,5,5) s (0,.11,.32) (0,0.045,0.11,0.32) (0,0.045,0.075,0.11,0.2,0.25,0.31, 0.35,0.45) µ

15 30 18 mutations 122 186 252 D r op n (8,3,3) (10,10,10,5) (15,10,10,15,10,5,5) s (0,0.4,0.6) (0,0.2,0.4,0.6) (0,0.1,0.2,0.4,0.47,0.6,0.8) µ

12 12 12 mutations 121 127 190 E xp n (14) (20,5,5,5) (20,15,10,10,10,5) s (0) (0,0.11,0.16,0.255) (0,0.05,0.07,0.11,0.21,0.26) µ

15 22 22 mutations 66 174 254

We empirically assess the accuracy of our estimates with three commonly used criteria. Theﬁrst one is the sum of relative errors (SRE).

SRE = k (cid:88) i =1 | (cid:98) N e ( v i ) − N e ( v i ) | N e ( v i ) , where ( v , . . . , v k ) is a regular grid of k time points, (cid:98) N e ( v i ) is the posterior median of N e at22ime v i and N e ( v i ) is the value of the true trajectory at time v i . The second criterion is the meanrelative width, deﬁned by M RW = 1 k k (cid:88) i =1 | ˆ N . ( v i ) − ˆ N . ( v i ) | N ( v i ) , where ˆ N . ( v i ) and ˆ N . ( v i ) are respectively the . and . quantiles of the posterior dis-tribution of N ( v i ) . Lastly, we consider the envelope measure deﬁned by EN V = 1 k k (cid:88) i =1 { ˆ N . ( v i ) ≤ N e ( v i ) ≤ ˆ N . ( v i ) } , which measures the proportion of the curve that is covered by the 95% credible region. In thissimulation study we ﬁx k = 100 , v = 0 and v k = . t . MCMC tuning parameters.

The posterior approximation is sensitive to both the initial valuesof ( g , ( N e ( t )) t ≥ ) and the MCMC parameters. We initialize g with the serial UPGMA (Drum-mond and Rodrigo, 2000). In addition to the usual MCMC parameters such as chain length,burnin and thinning, there are three parameters speciﬁc to our method: the HMC step size (cid:15) ,the maximum number of intercoalescent times proposals ( Z ), and the standard deviation σ thatparametrizes the transition kernel k ( t , t (cid:48) ) . While all three parameters contribute to the mixing ofthe Markov chain and acceptance rates, in our experience, (cid:15) and σ are the most inﬂuential. Insettings similar to the ones analyzed here (time scale, type of trajectory patterns, and mutationrate), parameter values (cid:15) ∈ [0 . , . , Z ∈ { , , } , and σ ∈ [0 . , . lead to a similar mix-ing of the Markov chain and accuracy (w.r.t the metrics considered). We based these guidelineson extensive simulation studies on the datasets considered (Table 1), which we believe to berepresentative of a broad set of settings encountered in applications. In our simulations, we set (cid:15) = 0 . , Z = 2 , and σ = 0 . for the “bottleneck” and “drop” trajectories, and we set (cid:15) = 0 . , Z = 2 , and σ = 0 . for the “exp” trajectories.Inference is carried out with × iterations for n = 14 , × iterations for n = 35 and × iterations for n = 70 . Posterior distributions are approximated after a burn-in period of23 × iterations and after thinning every iterations. Comparison to other methods.

To our knowledge, no other method simultaneously deals withheterochronous data, assumes the ISM, samples Tajima genealogies and does Bayesian nonpara-metric inference. All available methodologies rely on the Kingman coalescent coupled with ﬁnitesites mutation models, with the Jukes-Cantor model (Jukes and Cantor, 1969) being the closestto the ISM. Hence, it is not fully possible to isolate the impact of using Tajima’s genealogies inlieu of Kingman’s. Nevertheless, we include some alternative estimates for completeness. Wecompare our results to two popular methodologies implemented in BEAST (Drummond et al.,2012): the Bayesian Skyline (SKY) (Drummond et al., 2005) and the Gaussian Markov RandomField Skyride (GMRF) (Minin et al., 2008). For the SKY and GMRF, we use the Jukes Cantormutation model (Jukes and Cantor, 1969), and approximate posterior distributions with it-erations after a burn-in period of iterations and after thinning every iterations. We alsocompare our results to an oracle estimator that infers ( N e ( t )) t ≥ from the “true” g . The oracleestimation is obtained using the method of Palacios and Minin (2012) with the same Gaussianprocess prior on ( N e ( t )) t ≥ . Note that the goal of the comparison is not to determine whetherour method is superior, rather see if the performance of Tajima-based inference is in line with theresults obtained through two popular Kingman-based methods in some challenging populationscenarios. Results.

The results of the nine curves estimated with our method are plotted in Figure 6. Thesupplementary material includes the plots for SKY and GMRF. True trajectories are depictedas dashed lines, posterior medians as black lines and credible regions as gray shaded areas.Table 2 summarizes SRE, MRW, and ENV for the simulated data sets achieved with our method(“Tajima”), SKY, GMRF, and “Oracle”. SKY estimates for “Exp” n = 14 are not includedbecause we could not obtain convergent runs. Accuracy increases with sample size: credibleregions shrink substantially in all three scenarios. As n increases, posterior medians track moreclosely the true trajectories. It is well known in the literature that abrupt population size changesare the most difﬁcult to recover. The “drop” and “bottleneck” scenarios are less accurate for n = 14 , as exhibited by the wider credible region. We recover the bottleneck (panel ﬁrst row and24rst column), but we do not recover the instantaneous drop (panel ﬁrst row and third column).Table 2: Simulation: performance comparison between Tajima and Oracle models.

We computethree statistics - envelope (ENV), sum of relative errors (SRE) and mean relative width (MRW) - for threepopulation trajectories (Bottle, Exp, Drop) and three population sizes ( n = 14 , , ). Tajima refers tothe estimation of ( N e ( t )) t ≥ through our model, SKY refers to Drummond et al. (2005), GMRF to Mininet al. (2008). Oracle refers to the method of Palacios and Minin (2012) (known g ). Bold depicts the methodwith the best performance (excluding the “oracle”). SKY “Exp” n = 14 results are not included becausewe could not obtain convergent runs. % ENV SRE MRWn Oracle Tajima SKY GMRF Oracle Tajima SKY GMRF Oracle Tajima SKY GMRF B o ttl e

14 100

97 92 450.21 127.47

93 92 121.27

91 111.01 109.02 E xp

14 100 -

35 100 91

100 100

70 100

100 100 100 D r op

14 100

98 99 25.65

100 100

99 17.48 31.24

Table 5 largely conﬁrms the analysis of Figure 6. Recall that SKY and GMRF rely on King-man coalescent rather than Tajima coalescent; SKY and GMRF methods assume a different mu-tation model, whereas Oracle relies on knowing the true g rather than computing its posterior.First, no method unequivocally outperforms the others. The oracle methodology is the methodwith the best overall performance more frequently. Surprisingly, the advantage of knowing g isnot as big as one would expect. Both SKY and GMRF have much narrower credible regionsfor the bottleneck trajectory. On the other hand, Tajima has the best overall performance inthe “drop” trajectory (low SRE and MRW). Note that ENV is not always an indicator ofaccuracy because it can be achieved with a very wide credible region.Lastly, note that no method outperforms the others. This is consistent with theoretical ex-pectations as we are comparing two resolutions of the same ancestral process. Reassuringly,Tajima-based estimates are competitive with Kingman-based estimates. The current simulationstudy cannot single out the beneﬁt of employing Tajima vs Kingman topologies because no avail-able implementations rely on an identical mutation model and MCMC scheme. This analysis isout of the current scope and will be the subject of future research.25 ime to present Time to present Time to present n=14 n=35 n=70 N e ( t ) N e ( t ) N e ( t ) “ D r op ”“ E x p ”“ B o tt l e ” − + + − + + − + + − + − + − + − + − + − + Figure 6:

Simulation: effective population size posterior medians from different trajectories andsample sizes. ( N e ( t )) t ≥ posterior distribution from simulated data with three population size trajectories(rows) - bottleneck (“Bottle”), exponential growth (“Exp”) and instantaneous fall (“Drop”) - differentsample sizes (columns) - n = 14 , n = 35 and n = 70 . Posterior medians are depicted as solid black linesand 95% Bayesian credible intervals are depicted by shaded areas. n and s are depicted by the heat mapsat the bottom of each panel: the squares along the time axis depicts the sampling time, while the intensityof the black color depicts the number of samples. More details are given in Table 1. Recent advances in molecular and sequencing technologies allow recovering genetic materialfrom ancient specimens (P¨a¨abo et al., 2004). In this section, we analyze modern and ancientbison sequences. These mammals offer a case study of a population experiencing a populationgrowth followed by a decline. It was a long-standing question whether the drop was instigatedby human intervention or by environmental changes. Shapiro et al. (2004) ﬁrst reconstructedthe genetic history of Beringian bisons. Their estimate for the start of the decline supports theenvironmental hypothesis, in particular, they suggest that the decline may be due to environmentalevents preceding the last glacial maxima (LGM). This data-set has been the subject of extensive26 n s ( years ago) T Figure 7:

Bison study: data (Froese et al., 2017).

Perfect phylogeny T of bison sequences selectedfrom Froese et al. (2017) data-set. Node labels de-pict the number of sequences subtending that node.The mutations are allocated along the edges of T (all of them are single digits). Sampling informationare not written in this Figure. The two vectors n and s are represented by the columns to the right of T .Sampling times are obtained by radiocarbon dating.The scale is number of years before present. research in the past decade.We analyze new bison data recently described by Froese et al. (2017). We ﬁt our coalescentmodel to these sequences and estimate population size dynamics. To our knowledge, there is nophylodynamics analysis of this data set in the literature. Two motivations underlie this study:ﬁrst, Shapiro et al. (2004) sequences include base pairs from the mitochondrial control re-gion, while Froese et al. (2017) provide the full mitochondrial genome ( base pairs afteralignment); second, we are interested in testing whether the previously published overwhelm-ing evidence in favor of the environmentally induced population decline is conﬁrmed by thisnew data. Froese et al. (2017) data comprises sequences ( modern and ancient). DNAwas extracted from bison specimens from Canada (28, three locations), USA (9, two locations);Siberia (7, three locations), and unknown locations (5). It includes sequences of Bison priscus(extinct ancient bison), Bison latifrons (extinct ancient bison), Bison bison (modern bison),and Bos grunniens (control group).We selected out of sequences. We removed the control group sequences and theSiberian sequences to analyze samples from a single population (Froese et al. (2017) (Figure 1)suggested population structure). We removed the Bison latifrons sequence because it has ambiguities i.e. , sites in a sequence that cannot be unambiguously assigned to a unique nucleotide27 e ( t ) Time to present Time to present Time to present

Shapiro et al. (2004) Tajima GMRF + + + + + Figure 8:

Bison in North America: effective population size “expected trajectory” and posteriormedian estimates from Froese et al. (2017) data set.

The ﬁrst panel depicts a sketch of a “consensuspopulation trajectory” obtained from the phylodynamics study of the data of Shapiro et al. (2004) inFaulkner et al. (2020). The second and third panels display estimated posterior medians of ( N e ( t )) t ≥ (asblack curves) obtained from n = 38 ancient and modern sequences from North America specimens inFroese et al. (2017) data. The second panel corresponds to our method and the third panel to GMRF. Theposterior medians are depicted as solid black curves and the 95% Bayesian credible regions are depictedby shaded areas. n and s are depicted by the heat maps at the bottom of the last two panels: the squaresalong the time axis depicts the sampling time, while the intensity of the black color depicts the number ofsamples. More details are given in Figure 7. basis at sites where all the other samples have valid entries. Out of the observed polymorphicsites, we retain sites compatible with the ISM assumption. To encode data in the − in-cidence matrix representation Y , we use the root of the UPGMA tree reconstructed using R function upgma ( phanghorn ) as the ancestral state. Figure 7 displays the perfect phylogeny T and the vectors s and n .For our inference procedure, we set (cid:15) = 0 . , Z = 2 , σ = 0 . , and approximated theposterior distribution with . × iterations after a burn-in of × and after thinning every iterations. As a comparison, we ran GMRF on BEAST and approximated the posteriordistribution with × iterations after a burn-in of × and after thinning every iteration.We used the default values for all GMRF hyperparameters. We initialized both methods with thesame genealogy (serial UPGMA). To compute the likelihood, we used the BEAST mutation rateestimate per site per year of . × − .The ﬁrst panel of Figure 8 plots a summary of the effective population size pattern recoveredby a recent analysis of Shapiro et al. (2004) data by Faulkner et al. (2020). While the precisetimings and the details of the trajectory differ from method to method, the broad patterns are28onsistent. The population peak is estimated to be between 41.6 and 47.3 kya. The timing of thestart of the decline is the main feature of interest. We plot the posterior medians (black lines) of ( N e ( t )) t ≥ along with the credible regions (gray area) obtained from posterior samples bysampling Tajima’s trees (“Tajima”, second panel) and Kingman’s trees (“GMRF”, third panel).Both our method and GMRF recover the pattern described in the ﬁrst panel. We detect thepopulation decline only up to about kya ago, afterward the median trajectory is quite ﬂat whilethe credible regions are wide. This can be explained by the fact that we have no samples from kya to . kya. On the other hand, GMRF detects more clearly the population decline. TheGMRF median time estimate of the population peak is . kya, while the median time estimatefor our method is . kya. Thus, the estimates of the main event of interest, the populationdecline, are practically identical. The difference between the estimates obtained analyzing data differ substantially from the estimates of a population peak between 41.6 and 47.3 kyaobtained analyzing the data.The LGM in the Northern hemisphere reached its peak between . and kya (Clark et al.,2009). Hence, the analysis of the data still supports the hypothesis of a decline that initiatedbefore the LGM. However, our estimates suggest an initial decline much closer to the LGM peakthan the analysis of the data. Human arrival in North America via the Berigian bridge routeshould have happened around − kya (Llamas et al., 2016). Therefore, despite the mismatchof the timing, the human-induced decline hypothesis has little evidence also according to ouranalysis of this new dataset. SARS-CoV-2 is the virus causing the pandemic of novel coronavirus disease in 2019-2020 andit is of interest to explore the utility of viral molecular sequences for surveillance during the out-break of the epidemic. Here, we analyze whole genome sequences collected in France, and sequences collected in Germany that were made publicly available in the GISAID EpiCovdatabase (Shu and McCauley, 2017). We note that our estimates may not reﬂect the whole coun-tries effective population sizes but simply local effective population trajectories of the locations29 e ( t ) Time Time

Tajima GMRF N e ( t ) F r a n ce G e r m a n y . . . Jan −

03 Jan −

18 Feb −

02 Feb −

17 Mar −

03 Mar − . . . Jan −

03 Jan −

18 Feb −

02 Feb −

17 Mar −

03 Mar − − − Jan −

02 Jan −

17 Feb −

01 Feb −

16 Mar −

02 Mar − − − Dec −

31 Jan −

15 Jan −

30 Feb −

15 Mar −

01 Mar − Figure 9: N e ( t ) posterior median estimates from SARS-CoV-2 GISAIDdata sets from France and Germany. Black curves in the ﬁrst row panels depict estimated posteriormedians of ( N e ( t )) t ≥ obtained from n = 32 viral samples from Germany. Black curves in the secondrow panels depict estimated posterior medians of ( N e ( t )) t ≥ obtained from n = 123 viral samples fromFrance. The left column corresponds to our method’s results and the second column to GMRF results. Theposterior medians are depicted as solid black lines and the 95% Bayesian credible regions are depicted byshaded areas. n and s are depicted by the heat maps at the bottom of the last two panels: the squaresalong the time axis depicts the sampling time, while the intensity of the black color depicts the number ofsamples. in which our samples were obtained. We only analyzed high coverage sequences with more than25000 base pairs and performed multiple sequence alignment with Mafft (Katoh and Standley,2013). To encode nucleotide data as binary sequences Y , we used the GenBank MN908947(Wu et al., 2020) sequence as ancestral reference and eliminated sites that were not present inthe ancestral sequence. The numbers of variable sites observed are and for France andGermany respectively. The observed patterns of mutations in both datasets are compatible withthe ISM (no site was further removed). The Gisaid reference numbers of the sequences includedin this study and data access acknowledgment are included in the supplementary material. Wenote that observed differences may be caused by sequencing errors and these are being ignoredin our study. The heat maps included in each panel of Figure 9 show the sampling frequencyinformation. In the French dataset, out of samples were collected in March (at least onesample every day from / / to / / ), in February (spread over different dates),

30n January (spread over days, oldest sample dated / / ). In the German dataset, out of samples were collected in March (spread over different dates and / / last samplingday), in February (spread over dates), in January (oldest sample / / ). We include ineach dataset the reference sequence.For our inference procedure, we set (cid:15) = 0 . , Z = 2 , σ = 0 . , and approximate theposterior distribution with . × iterations after a burn-in of × and after thinning every iterations. For comparison, we ran GMRF on BEAST assuming the HKY mutation model(Hasegawa M, 1985) as proposed in previous studies (Scire et al., 2020) and approximate theposterior distribution with × iterations after a burn-in of × and after thinning every iteration. We used the default values for all GMRF hyperparameters. We initialized bothmethods with the serial UPGMA genealogy (Drummond and Rodrigo, 2000). BEAST estimatesa mutation rate of . × − mutations per site per year in the French dataset, and . × − mutations per site per year in the German dataset. Our estimate follows the method discussedby Rambaut et al. (2016) obtained by regressing the Hamming distance of the sequences to theancestral reference sequence on time difference between the sampling times and the referencesampling time. We estimated a mutation rate of . × − mutations per site per year in theFrench dataset, and . × − mutations per site per year in the German dataset.We show the estimates of effective population size with our method in the ﬁrst column ofFigure 9 and with BEAST in the second column. Results for Germany correspond to the ﬁrst rowand for France to the second row. Both analyses of the French dataset exhibit exponential growthfrom mid-December of to the end of February (Tajima estimate of median population peakis 2020/02/29, GMRF estimate is 2020/03/1). Following the exponential growth, both methodssuggest a decline. Both analyses of the German dataset recover nearly constant trajectories,possibly due to sampling time concentration in mid-march and spatial sampling concentration inDuesseldorf (see online supplementary material for details).A ﬁnal remark. Our estimates should be interpreted as estimates of genetic diversity overtime and not as number of infections. Our model ignores recombination, population structureand selection. Viruses tend to exhibit antigenic drifts, selective sweeps, and tend to cluster spa-31ially following migration events (Rambaut et al., 2008). All these aspects may hinder the use ofcoalescent-based models to analyze viral population size dynamics. Indeed, the scientiﬁc knowl-edge on this virus is still limited and the validity of our model assumptions to SARS-CoV-2 is anactive area of research. We have introduced a new methodology for Bayesian nonparametric inference of population sizetrajectory from heterochronous DNA sequences collected at a single non-recombining locus. Themain focus of this work is scalability. In this respect, we developed a fast alternative to the King-man’s coalescent that can be used for nonparametric inference of serially sampled sequences.We also developed a fast algorithm to compute the likelihood of Tajima genealogies, which is initself a relevant contribution to the literature.We applied our method to a recent data set including modern and ancient bison sequences.There has been a lot of interest in determining whether the decline in the bison population washuman-induced or climate-induced. Genetic evidence supported the environmental hypothesis,estimating the population peak to be approximately 45kya. Our analysis reconstructed a similarpopulation size pattern. However, we estimated the peak to be about . kya. These analysesconﬁrm that the population decline started sometimes before the LGM. We believe that this bringsfurther genetic evidence to the environmentally induced population decline hypothesis.This paper makes important steps in the direction of a more scalable coalescent-based infer-ence. However, the Tajima heterochronous n -coalescent has some limitations which need to beaddressed. An obvious one is that we do not model population structure. In the ancient bisonapplication, we removed the Beringian sequences, keeping only North-American ones, becausepopulation structure violated the assumption of the standard coalescent, and consequently of anyof its “resolutions”. We have also stressed the importance of this feature in the analysis of viraldata.In addition, throughout the paper, we assumed the inﬁnite site model of mutation. This pre-vented us from analyzing the original bison data of Shapiro et al. (2004), as well as many other32ata that violate the ISM assumption. Given the promise of Tajima based inference shown here,incorporating other mutation models seems to be an interesting avenue of research. Appendix

Algorithm 1

Description of the algorithm to deﬁne the allocation matrix

Inputs: T , sOutput: A

1. Initialize A = ( V , . . . , V ) For i = n − to do (a) Deﬁne A ( i ) unique nodes in the i th column of A (b) For all V ∈ A ( i ) do i. Deﬁne T V , set of (non-singleton) child nodes of V having | g i | descendantsii. Include V in T V if it has more than two child nodesiii. Deﬁne I , set of vintages corresponding to all subtrees of g i iv. If | T V | = 0 : do nothing Else if | T V | = 1 : set column A V ( · , I ) equal to T V Else if | T V | > : copy A V ( , I ) | T V | − times, attach the copies to A and seteach copy equal to one element of T V v. Eliminate rows in A where V appears too frequently (rule in the paper)vi. Eliminate rows not compatible with s and t

3. Return A . References

Cappello, L. and Palacios, J. A. (2020), ‘Sequential importance sampling for multi-resolutionKingman-Tajima coalescent counting’, Annals of Applied Statistics in press .Clark, P. U., Dyke, A. S., Shakun, J. D., Carlson, A. E., Clark, J., Wohlfarth, B., Mitrovica,J. X., Hostetler, S. W. and McCabe, A. M. (2009), ‘The last glacial maximum’, Science (5941), 710–714.Disanto, F. and Wiehe, T. (2013), ‘Exact enumeration of cherries and pitchforks in ranked treesunder the coalescent model’, Mathematical biosciences (2), 195–200.33rummond, A. J., Rambaut, A., Shapiro, B. and Pybus, O. G. (2005), ‘Bayesian coalescent infer-ence of past population dynamics from molecular sequences’, Molecular biology and evolution (5), 1185–1192.Drummond, A. and Rodrigo, A. G. (2000), ‘Reconstructing genealogies of serial samples underthe assumption of a molecular clock using serial-sample UPGMA’, Molecular Biology andEvolution (12), 1807–1815.Drummond, A., Suchard, M., Xie, D. and Rambaut, A. (2012), ‘Bayesian phylogenetics withBEAUti and the BEAST 1.7’, Molecular Biology and Evolution (8), 1969–1973.Faulkner, J. R., Magee, A. F., Shapiro, B. and Minin, V. N. (2020), ‘Horseshoe-based Bayesiannonparametric estimation of effective population size trajectories’, Biometrics in press .Felsenstein, J. and Rodrigo, A. G. (1999), Coalescent approaches to HIV population genetics, in‘The Evolution of HIV’, Johns Hopkins University Press, pp. 233–272.Froese, D., Stiller, M., Heintzman, P. D., Reyes, A. V., Zazula, G. D., Soares, A. E., Meyer, M.,Hall, E., Jensen, B. J., Arnold, L. J. et al. (2017), ‘Fossil and genomic evidence constrains thetiming of bison arrival in North America’, Proceedings of the National Academy of Sciences (13), 3457–3462.Grifﬁths, R. C. and Tavare, S. (1994), ‘Sampling theory for neutral alleles in a varying envi-ronment’, Philosophical Transactions of the Royal Society of London. Series B: BiologicalSciences (1310), 403–410.Gusﬁeld, D. (1991), ‘Efﬁcient algorithms for inferring evolutionary trees’, Networks (1), 19–28.Hasegawa M, Kishino H, Y. T. (1985), ‘Dating of the human-ape splitting by a molecular clockof mitochondrial DNA’, Journal of Molecular Evolution , 160–164.Jukes, T. H. and Cantor, C. R. (1969), ‘Evolution of protein molecules’, Mammalian proteinmetabolism (21), 132.Katoh, K. and Standley, D. M. (2013), ‘Mafft multiple sequence alignment software version 7:improvements in performance and usability’, Molecular biology and evolution (4), 772–780.Kimura, M. (1969), ‘The number of heterozygous nucleotide sites maintained in a ﬁnite popula-tion due to steady ﬂux of mutations’, Genetics (4), 893.Kingman, J. F. (1982a), ‘On the genealogy of large populations’, Journal of Applied Probability (A), 27–43.Kingman, J. F. C. (1982b), ‘The coalescent’, Stochastic processes and their applications (3), 235–248. 34an, S., Palacios, J. A., Karcher, M., Minin, V. N. and Shahbaba, B. (2015), ‘An efﬁ-cient Bayesian inference framework for coalescent-based nonparametric phylodynamics’,Bioinformatics (20), 3282–3289.Llamas, B., Fehren-Schmitz, L., Valverde, G., Soubrier, J., Mallick, S., Rohland, N., Norden-felt, S., Valdiosera, C., Richards, S. M., Rohrlach, A. et al. (2016), ‘Ancient mitochondrialDNA provides high-resolution time scale of the peopling of the americas’, Science advances (4), e1501385.Minin, V. N., Bloomquist, E. W. and Suchard, M. A. (2008), ‘Smooth skyride through a roughskyline: Bayesian coalescent-based inference of population dynamics’, Molecular biology andevolution (7), 1459–1471.P¨a¨abo, S., Poinar, H., Serre, D., Jaenicke-Despr´es, V., Hebler, J., Rohland, N., Kuch, M., Krause,J., Vigilant, L. and Hofreiter, M. (2004), ‘Genetic analyses from ancient DNA’, Annu. Rev.Genet. , 645–679.Palacios, J. A. and Minin, V. N. (2012), Integrated nested Laplace approximation for Bayesiannonparametric phylodynamics, in ‘Proceedings of the Twenty-Eighth Conference on Uncer-tainty in Artiﬁcial Intelligence’, UAI’12, AUAI Press, Arlington, Virginia, United States,pp. 726–735.Palacios, J. A. and Minin, V. N. (2013), ‘Gaussian process-based Bayesian nonparametric infer-ence of population size trajectories from gene genealogies’, Biometrics (1), 8–18.Palacios, J. A., V´eber, A., Cappello, L., Wang, Z., Wakeley, J. and Ramachandran, S.(2019), ‘Bayesian estimation of population size changes by sampling Tajimas trees’, Genetics (2), 967–986.Parag, K. V. and Pybus, O. G. (2019), ‘Robust design for coalescent model inference’, Systematicbiology (5), 730–743.Rambaut, A., Lam, T. T., Max Carvalho, L. and Pybus, O. G. (2016), ‘Exploring the temporalstructure of heterochronous sequences using tempest (formerly path-o-gen)’, Virus evolution (1), vew007.Rambaut, A., Pybus, O. G., Nelson, M. I., Viboud, C., Taubenberger, J. K. and Holmes, E. C.(2008), ‘The genomic and epidemiological dynamics of human inﬂuenza a virus’, Nature (7195), 615–619.Sainudiin, R., Stadler, T. and V´eber, A. (2015), ‘Finding the best resolution for the Kingman–Tajima coalescent: theory and applications’, Journal of Mathematical Biology (6), 1207–1247.Scire, J., Vaughan, T. G. and Stadler, T. (2020), ‘Phylodynamic analyses based on 93 genomes’.35hahbaba, B., Lan, S., Johnson, W. O. and Neal, R. M. (2014), ‘Split Hamiltonian Monte Carlo’,Statistics and Computing (3), 339–349.Shapiro, B., Drummond, A. J., Rambaut, A., Wilson, M. C., Matheus, P. E., Sher, A. V., Pybus,O. G., Gilbert, M. T. P., Barnes, I., Binladen, J. et al. (2004), ‘Rise and fall of the beringiansteppe bison’, Science (5701), 1561–1565.Shu, Y. and McCauley, J. (2017), ‘Gisaid: Global initiative on sharing all inﬂuenza data–fromvision to reality’, Eurosurveillance (13).Tajima, F. (1983), ‘Evolutionary relationship of dna sequences in ﬁnite populations’, Genetics (2), 437–460.Watterson, G. (1975), ‘On the number of segregating sites in genetical models without recombi-nation’, Theoretical Population Biology (2), 256–276.Wu, F., Zhao, S., Yu, B., Chen, Y.-M., Wang, W., Song, Z.-G., Hu, Y., Tao, Z.-W., Tian, J.-H., Pei,Y.-Y. et al. (2020), ‘A new coronavirus associated with human respiratory disease in china’,Nature (7798), 265–269. 36 UPPLEMENTARY MATERIAL

Algorithm Description: Augmented Perfect Phylogeny.

The algorithm below uses Gusﬁeld’sperfect phylogeny as an input, duplicates nodes corresponding to haplotypes that are sam-pled more than once, and returns the augmented perfect phylogeny T . Algorithm 2

Deﬁne

TInputs: T ’ (Gusﬁeld, 1991), s , Y Output: T For i = 1 to k doIf h i is observed at multiple sampling times (from Y ):[let w.l.o.g. r be the number of sampling groups in which h i is observed, and s i , . . . , s i r the corresponding sampling times](a) Take the leaf node V (cid:48) in T ’ labeled by h i (each haplotype labels a unique node inGusﬁeld T ’)(b) If | E (cid:48) | = 0 : make r − copies of V ( r − nodes with edges connecting them to thesame parent of V with no edge labels). Then label each of these nodes uniquely witha pair ( h i , s i ) , . . . , ( h i , s i r ) Else if | E (cid:48) | ≥ : create r new nodes with unlabeled edges connecting them to V (cid:48) .Then label each of these nodes in a unique way with a pair ( h i , s i ) , . . . , ( h i , s i r ) Else if h i is observed at a single sampling time (from Y ):(a) Identify V (cid:48) in T ’ labeled h i (b) Label V (cid:48) with a pair ( h i , its corresponding sampling time)2. Return T . Algorithm Description: Computing c.

We compute c through greedy search. The idea is sim-ple: it is not possible to build a compatible topology g conditionally on an incompatiblevector t . We initially assume that ISM does not impose any constraints on t , check if wecan build a compatible topology, if we are, it will mean that indeed ISM does not imposeconstraints, if not, it will mean that we need to add some constraints. We continue iter-atively until we manage to sample a compatible g . To do this process, we consider onesampling group at a time. We deﬁne a vector add of length m whose i th entry is the num-ber of coalescent events that happens before s i . Note that if we are interested in sampling g (ignoring branch information), add is the only time information we need. We can sam-ple compatible g ’s through a simple extension of an Algorithm 2 in Cappello and Palacios372020). We refer to that paper for details. Algorithm 3

Deﬁne cInputs: T , sOutput: c

1. Initialize c = ( n − , n + n − , . . . , (cid:80) m − i =1 n i − , n − For i = 1 to m − do (a) Set add = (0 , . . . , add i = 0 , add i +1 = (cid:80) ii =1 n i − , . . . , add m = (cid:80) ii =1 n i − (b) Given add , try to sample a compatible topology g (c) If g compatible: set c i = add i +1 Else if g not compatible: set add i +1 = a i +1 − , . . . , add m = add m − and return to(b)3. Return c . 38 imulation study: Plots of the estimates obtained from BEAST of the methods GMRF and SKYfor the examples discussed in Section 5.

Time to present Time to present Time to present n=14 n=35 n=70 N e ( t ) N e ( t ) N e ( t ) “ D r op ”“ E x p ” S K Y “ D r op ”“ E x p ”“ B o tt l e ” − + − + − + − + − + − + − + − + − + − + − + − + − + − + − + − + − + − + N e ( t ) N e ( t ) N e ( t ) G R M F Not available

Figure 10:

Simulation: effective population size posterior medians from different trajectories andsample sizes. ( N e ( t )) t ≥ posterior distribution from simulated data of under three population size trajecto-ries (rows) - bottleneck (“Bottle”), exponential growth (“Exp”) and instantaneous fall (“Drop”) - differentsample sizes (columns) - n = 14 , n = 35 and n = 70 . Posterior medians are depicted as solid black linesand 95% Bayesian credible intervals are depicted by shaded areas. n and s are depicted by the heat mapsat the bottom of each panel: the squares along the time axis depicts the sampling time, while the intensityof the black color depicts the number of samples. Top three rows panels depict estimates obtained throughSKY, bottom three rows depicts estimates obtained through GMRF. More details are given in Table 1. SARS-CoV-2 Molecular Data Description:

Data set used in the study in Section 7. We ac-knowledge the following sequence submitting laboratories to Gisaid.org: • Charit´e Universit¨atsmedizin Berlin, Institute of Virology. Victor M Corman, JuliaSchneider, Talitha Veith, Barbara M¨uhlemann, Markus Antwerpen, Christian Drosten,39oman W¨olfel. • Bundeswehr Institute of Microbiology. Mathias C Walter, Markus H Antwerpen andRoman W¨olfel. • Center of Medical Microbiology, Virology, and Hospital Hygiene, University of Dues-seldorf. Ortwin Adams, Marcel Andree, Alexander Dilthey, Torsten Feldt, SandraHauka, Torsten Houwaart, Bjrn-Erik Jensen, Detlef Kindgen-Milles, Malte KohnsVasconcelos, Klaus Pfeffer, Tina Senff, Daniel Strelow, Jrg Timm, Andreas Walker,Tobias Wienemann. • CNR Virus des Infections Respiratoires - France SUD. Antonin Bal, Gregory Destras,Gwendolyne Burﬁn, Solenne Brun, Carine Moustaud, Raphaelle Lamy, AlexandreGaymard, Maude Bouscambert-Duchamp, Florence Morﬁn-Sherpa, Martine Valette,Bruno Lina, Laurence Josset. • National Reference Center for Viruses of Respiratory Infections, Institut Pasteur,Paris. M´elanie Albert, Marion Barbet, Sylvie Behillil, M´eline Bizard, Angela Brise-barre, Flora Donati, Fabiana Gambaro, Etienne Simon-Lori`ere, Vincent Enouf, MaudVanpeene, Sylvie van der Werf, L`ea Pilorge. ••