The shape of a seed bank tree

Abstract

We derive the asymptotic behavior of the total, active and inactive branch lengths of the seed bank coalescent, when the size of the initial sample grows to infinity. Those random variables have important applications for populations evolving under some seed bank effects, such as plants and bacteria, and for some cases of structured populations like metapopulations. The proof relies on the study of the tree at a stopping time corresponding to the first time that a deactivated lineage reactivates. We also give conditional sampling formulas for the random partition and we study the system at the time of the first deactivation of a lineage. All these results provide a good picture of the different regimes and behaviors of the block-counting process of the seed bank coalescent.



Adrián González Casanova (a), Lizbeth Peñaloza (b), Arno Siri-Jégousse (b)
(a) Instituto de Matemáticas, (b) IIMAS, Departamento de Probabilidad y Estadística. Universidad Nacional Autónoma de México.

September 28, 2020


Keywords : Seed bank, Structured coalescent, Branch lengths, Sampling formula.

MSC2010 : 60J95 (primary), 60C05, 60J28, 60F15, 92D25.

Seeds, cysts and other forms of dormancy generate seed banks, which store genetic information that can be temporarily lost from a population at a certain time and resuscitate later. Having a seed bank is a prevalent evolutionary strategy which has important consequences. For example, in the case of bacteria, it buffers against the selective pressure caused by environmental variability and at the same time increases genetic variation [10, 15, 17].

A first attempt to construct a probabilistic model to study this phenomenon is due to Kaj, Krone and Lascoux [12]. They considered a modified Wright-Fisher model in which each individual chooses its parent from the individuals at several generations in the past, and not only from the previous one. This construction has an important technical complication arising from the loss of the Markov property. A new model was defined and studied in [1] to avoid this issue. It consists of a two-level discrete Markov chain, which again generalizes the Wright-Fisher model.

Figure 1: The discrete seed bank model. In this picture $N = 5$, $M = 3$ and $\lfloor \varepsilon N \rfloor = 1$, i.e., in each generation four plants are produced by active individuals, one seed germinates and one new seed is produced.

Consider a haploid population of fixed size $N$ which supports a seed bank of constant size $M$. The $N$ active individuals are called plants and the $M$ dormant individuals are called seeds. Let $0 \le \varepsilon \le 1$ with $\lfloor \varepsilon N \rfloor \le M$. The $N$ plants from generation 0 produce $N$ individuals in generation 1 by multinomial sampling (as in the Wright-Fisher model). However, $N - \lfloor \varepsilon N \rfloor$ randomly chosen of these individuals are plants and $\lfloor \varepsilon N \rfloor$ are seeds. Then $\lfloor \varepsilon N \rfloor$ uniformly (without replacement) sampled seeds from the seed bank in generation 0 become plants in generation 1. The $\lfloor \varepsilon N \rfloor$ seeds produced by the plants in generation 1 take the place of the seeds that germinate.
Thus, we have again $N$ plants and $M$ seeds in generation 1 (see Figure 1). This random mechanism is repeated independently to produce the next generations. Observe that this model has, unlike [12], non-overlapping reproduction events.

If we let $N$ (and $M$) go to infinity and rescale time, the stochastic process that describes the limiting gene genealogy of a sample taken from the seed bank model is called the seed bank coalescent [1]. Apart from populations of plants or bacteria, it is remarkable that the seed bank coalescent is a convenient genealogical model for some metapopulations [14]. In fact, it was independently introduced in that context and named the peripatric coalescent. It corresponds to a special modification of the structured coalescent in which small colonies can emerge from a main population and merge again with it. The seed bank coalescent is a structured coalescent with an active part, having the dynamics of a Kingman coalescent, and a dormant part where the lineages are frozen. Lineages can activate or deactivate at certain rates; see Figure 2 for an illustration.

In this paper we study the asymptotic behavior of some functionals of the seed bank tree. These can be useful for genetic applications, and they also shed light on the connections between theory and applications. As an illustration, there is a close relation between the shape of the genealogical tree of a sample of size $n$ and the number of mutations observed in it. More precisely, suppose that mutations appear in the genealogy by simply superimposing a Poisson process on the ancestral lineages (as in the infinite sites model, see Chapter 1.4 in [7]). Then, the shape of the tree determines the distribution of the data obtained by DNA sequencing and thus, it can be inferred from it. For example, conditionally on the total length of the coalescent, denoted by $L_n$, the number of mutations observed in the sample has Poisson distribution with parameter $\mu L_n$, where $\mu$ is the mutation rate.
Thus, if we know the asymptotic behavior of the total length of the tree we can deduce the asymptotic behavior of the number of mutations. This is the key tool for obtaining a Watterson-type estimator for the mutation rate, see [7]. Not surprisingly, asymptotics of the total length of many classical coalescents have been studied, e.g. in [6, 3, 13, 5].

In [1], it was established that the time to the most recent common ancestor of a sample of size $n$ in the seed bank coalescent is of order $\log \log n$. This is an important difference with the classical Kingman coalescent, whose height is finite. In our study, we establish that the total length of the tree built from a sample of $n$ plants and zero seeds is of the same order as that of the Kingman coalescent, behaving like $\log n$, but with a different multiplicative constant depending on the activation and deactivation parameters of the model. Moreover, we show that the total active length behaves precisely like the total length of the Kingman coalescent. This means that it is technically very hard to distinguish between the null Kingman model and the alternative seed bank model, unless the dormant individuals can mutate while in the seed bank, which is actually the case in the metapopulation framework described in [14]. To discriminate between the null and seed bank models, some finer results such as sampling formulas can also be derived. We are able to describe the seed bank tree in detail as it undergoes different phases. Indeed, it can be said that we describe the shape of the seed bank tree.

Our results also have practical implications. In [16], Maughan observed experimentally that a population of bacteria undergoing dormancy typically does not have a significantly different number of mutations.
Our findings agree with this observation and offer new insights on the reason for this: most of the mutations occur in the Kingman phase, i.e. shortly before the leaves of the tree, and in this part of the ancestral tree dormancy is irrelevant. On the other hand, populations suffering a significant amount of mutations while being in the dormant state would be expected to have a higher evolutionary rate. This remark together with [16] suggests that the mutations that occur to individuals in latent state play a minor role (at least number-wise). This is opposed to previous works suggesting that the normal rate of molecular evolution of bacteria with a seed bank is evidence that mutations affecting dormant individuals are frequent [16].
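The link between tree length and mutation counts recalled in the introduction can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes the null Kingman model (each pair of lineages merging at rate 1), uses the standard fact that the expected total length is then $2\sum_{k=1}^{n-1} 1/k$, and all function names are hypothetical.

```python
import random


def kingman_total_length(n, rng):
    """Total branch length of a Kingman n-coalescent (each pair merges at rate 1)."""
    # While k lineages remain, the holding time is Exp(k(k-1)/2) and k branches grow.
    return sum(k * rng.expovariate(k * (k - 1) / 2.0) for k in range(2, n + 1))


def poisson_sample(mean, rng):
    """Poisson(mean) draw: count unit-rate exponential arrivals falling in [0, mean]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival <= mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def watterson_estimate(num_mutations, n):
    """Divide the mutation count by E[L_n] = 2 * (1 + 1/2 + ... + 1/(n-1)); needs n >= 2."""
    expected_length = 2.0 * sum(1.0 / k for k in range(1, n))
    return num_mutations / expected_length


def mean_estimate(n, mu, replicates, seed=0):
    """Average the estimator over independent trees with Poisson(mu * L_n) mutations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        length = kingman_total_length(n, rng)
        total += watterson_estimate(poisson_sample(mu * length, rng), n)
    return total / replicates
```

Calling `mean_estimate(50, 1.5, 400)` should return a value close to the assumed mutation rate 1.5, since the estimator is unbiased under this null model. Under a seed bank model the normalisation by the expected length would have to change, which is where the asymptotics of the total length enter.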

We study some relevant stopping times of the seed bank coalescent, leading to a complete description of the shape of the tree and explaining how long the genealogies spend in successive dynamical phases, as is detailed precisely in Table 1 and Figure 3.

Let us now define properly the seed bank coalescent. Fix $n \in \mathbb{N}$ and let $\mathcal{P}_n$ be the set of partitions of $[n] := \{1, 2, \dots, n\}$. Then, the set of marked partitions $\mathcal{P}_n^{\{p,s\}}$ is built from $\mathcal{P}_n$ by adding a flag (either $p$ for a plant or $s$ for a seed) to each block of the partition. For example, for $n = 7$, a partition with four blocks of sizes 3, 1, 2 and 1 carrying the flags $p$, $s$, $s$ and $p$, such as $\pi = \{\{1,2,3\}^p, \{4\}^s, \{5,6\}^s, \{7\}^p\}$, is an element of $\mathcal{P}_7^{\{p,s\}}$. The seed bank $n$-coalescent $(\Pi_n(t))_{t \ge 0}$, with deactivation intensity $c_1 > 0$ and activation intensity $c_2 > 0$, is the continuous-time Markov chain with values in $\mathcal{P}_n^{\{p,s\}}$ having the following dynamics. As for the Kingman coalescent, each pair of plant blocks merges at rate 1, independently of each other. Moreover, any block can change its flag, from $p$ to $s$ at rate $c_1$, and vice versa at rate $c_2$; see Figure 2 for an illustration.

The block-counting process of the seed bank $n$-coalescent is the two-dimensional Markov chain $(N_n(t), M_n(t))_{t \ge 0}$ with values in $([n] \cup \{0\}) \times ([n] \cup \{0\})$ and the following transition rates: for $t \ge 0$, $(N_n(t), M_n(t))$ jumps from $(i, j)$ to

$(i-1, j)$ at rate $\binom{i}{2}$ (coalescence),
$(i-1, j+1)$ at rate $c_1 i$ (deactivation),
$(i+1, j-1)$ at rate $c_2 j$ (activation).

Note that $N_n(t)$ can have either an upward jump if a seed becomes a plant, or a downward jump if there is a coalescence event or a plant becomes a seed. Also observe that each jump has size one. In the sequel, we suppose that $N_n(0) = n$ and $M_n(0) = 0$.

Table 1.

Stopping time $\tau$ | Asymptotics of $\tau$ | Asymptotics of $N_n(\tau)$ | Asymptotics of $M_n(\tau)$
$\gamma_n$ | $2(1-Y)/(Yn)$ | $Yn$ | $1$
$\theta_n$ | $T/\log n$ | $Z \log n$ | $2c_1 \log n$
$\sigma_n$ | $\log\log n$ | $1$ | $0$

$Y$ is a Beta$(2c_1, 1)$ distributed random variable, $T$ is an exponential random variable with parameter $2c_1c_2$ and $Z$ is a Fréchet random variable with shape parameter 1 and scale parameter $4c_1c_2$.

For $i \in [n]$, we denote by $\tau_n^i$ the reaching time of the level $i$ by the process $N_n$, i.e. $\tau_n^n = 0$ and
$$\tau_n^i = \inf\{t \ge 0 : N_n(t) = i\}. \qquad (1)$$
Furthermore, let $\gamma_n$ and $\theta_n$ be, respectively, the first time that some plant becomes a seed and the first time that some seed becomes a plant, i.e.
$$\gamma_n = \inf\{t > 0 : M_n(t-) < M_n(t)\} = \inf\{t > 0 : M_n(t) = 1\} \qquad (2)$$
and
$$\theta_n = \inf\{t > 0 : M_n(t-) > M_n(t)\}. \qquad (3)$$
Finally, denote by $\sigma_n$ the time to the most recent common ancestor, already studied in [1],
$$\sigma_n = \inf\{t > 0 : N_n(t) + M_n(t) = 1\} = \inf\{t > 0 : N_n(t) = 1, M_n(t) = 0\}.$$
We first obtain asymptotic results on the random variables $\gamma_n$ and $\theta_n$ and the size of the system at those times. The results obtained in Sections 2 and 3 are summarized in Table 1 and Figure 3.

In Section 4 we analyze the total length
$$L_n = A_n + I_n \qquad (4)$$
where the active length is defined by
$$A_n = \int_0^{\sigma_n} N_n(t)\,dt \qquad (5)$$
and the inactive length by
$$I_n = \int_0^{\sigma_n} M_n(t)\,dt. \qquad (6)$$
Our main result is stated as follows.

Theorem 1.1.

Consider the seed bank coalescent starting with $n$ plants and no seeds. Then,
$$\lim_{n \to \infty} \frac{L_n}{\log n} = 2\left(1 + \frac{c_1}{c_2}\right)$$
in probability.

The symbol $A_n \overset{p}{\sim} B_n$ means that $A_n/B_n \to 1$ in probability, and $A_n \overset{D}{\sim} X B_n$ means that $A_n/B_n \to X$ in distribution. The symbol $A_n \asymp B_n$ means that $C_1 B_n \le \mathbb{E}[A_n] \le C_2 B_n$ for some constants $C_1, C_2$.

Interestingly, numerical techniques of [11] used to study the total length for fixed $n$ show that the balance between active and inactive lengths also holds exactly in expectation for every sample size $n$:
$$c_1 \mathbb{E}[A_n] = c_2 \mathbb{E}[I_n].$$
The behavior of both $A_n$ and $I_n$ is obtained by considering those variables before and after the time of the first activation $\theta_n$. Hence, the results of Section 3 are key tools for the forthcoming proofs. Theorem 1.1 also gives an immediate corollary on the number of active and inactive mutations on the seed bank tree.

Corollary 1.2.

Consider the seed bank coalescent starting with $n$ plants and no seeds. Let $S_n$ be the number of mutations in the seed bank tree and let $\mu$ be the mutation rate. Then
$$\lim_{n \to \infty} \frac{S_n}{\log n} = 2\mu\left(1 + \frac{c_1}{c_2}\right)$$
in probability.

Finally, in Section 5, we establish a sampling formula which is inspired by Watterson's ideas in [18] and which helps us to understand the fine configuration of the blocks of a seed bank coalescent at given times.
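The block-counting dynamics and the branch lengths $A_n$ and $I_n$ defined above can be explored with a short simulation. The following is a minimal sketch, not the authors' code: it assumes the three transition rates $\binom{i}{2}$, $c_1 i$ and $c_2 j$ of the block-counting process, starts from $n$ plants and no seeds, and stops when a single active lineage remains. The function name is hypothetical.

```python
import random


def seed_bank_branch_lengths(n, c1, c2, rng):
    """Simulate (N_n, M_n) from (n, 0) until one active lineage remains.

    Returns (A_n, I_n): the time integrals of the numbers of plants and of seeds.
    """
    plants, seeds = n, 0
    active_length = inactive_length = 0.0
    while plants + seeds > 1:
        coalescence = plants * (plants - 1) / 2.0  # rate binom(i, 2)
        deactivation = c1 * plants                 # rate c1 * i
        activation = c2 * seeds                    # rate c2 * j
        total_rate = coalescence + deactivation + activation
        holding_time = rng.expovariate(total_rate)
        active_length += plants * holding_time
        inactive_length += seeds * holding_time
        # Choose the next event with probability proportional to its rate.
        u = rng.random() * total_rate
        if u < coalescence:
            plants -= 1
        elif seeds == 0 or u < coalescence + deactivation:
            plants, seeds = plants - 1, seeds + 1
        else:
            plants, seeds = plants + 1, seeds - 1
    return active_length, inactive_length
```

Averaging the two returned values over many runs gives Monte Carlo estimates of $\mathbb{E}[A_n]$ and $\mathbb{E}[I_n]$, which can be compared with the balance relation $c_1 \mathbb{E}[A_n] = c_2 \mathbb{E}[I_n]$. Since the normalisation in Theorem 1.1 is logarithmic, the ratio $L_n / \log n$ should be expected to approach its limit only slowly.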

We start with the study of $\gamma_n$, the time of the first deactivation defined in (2), and the size of the system at this time. Observe that, if $N_n(0) = n$ and $M_n(0) = 0$, there are $n - N_n(\gamma_n) - 1$ coalescence events before time $\gamma_n$ and we can write
$$\gamma_n = \sum_{i = N_n(\gamma_n) + 1}^{n} V_i$$
where the $V_i$'s are independent exponential random variables with respective parameters $\binom{i}{2} + c_1 i$.

We start with an easy limit result on the variable $N_n(\gamma_n)$. Note that, by considering $c_1$ as a mutation rate, $n - N_n(\gamma_n) - 1$ […]

Proposition 2.1.

Consider a seed bank coalescent starting with $n$ plants and no seeds. Then,
$$\lim_{n \to \infty} \frac{N_n(\gamma_n)}{n} = Y$$
in distribution, where $Y \sim$ Beta$(2c_1, 1)$.

Proof. Let $z \in (0, 1)$. Then
$$P(N_n(\gamma_n) \le zn) = \prod_{i = \lfloor zn \rfloor + 1}^{n} \frac{\binom{i}{2}}{\binom{i}{2} + c_1 i} = \prod_{i = \lfloor zn \rfloor}^{n-1} \frac{i}{i + 2c_1} = \exp\left\{ -\sum_{i = \lfloor zn \rfloor}^{n-1} \log\left(1 + \frac{2c_1}{i}\right) \right\}.$$
Using that $\log(1 + x) \sim x$ near 0, we obtain
$$P(N_n(\gamma_n) \le zn) \sim \exp\left\{ -\sum_{i = \lfloor zn \rfloor}^{n-1} \frac{2c_1}{i} \right\} \sim \exp\left\{ -2c_1 \log\left(\frac{1}{z}\right) \right\} = z^{2c_1},$$
which is the distribution function of a Beta$(2c_1, 1)$ random variable.

Now, let us establish the asymptotic behavior of the time of the first deactivation, $\gamma_n$.

Proposition 2.2.

Consider a seed bank coalescent starting with $n$ plants and no seeds. Then,
$$\lim_{n \to \infty} n\gamma_n = \Gamma := \frac{2(1 - Y)}{Y} \qquad (7)$$
in distribution, where $Y$ is Beta$(2c_1, 1)$ distributed. The density function of $\Gamma$ is
$$f_\Gamma(x) = c_1 \left(\frac{2}{2 + x}\right)^{2c_1 + 1} \quad \text{for } x \ge 0.$$
In particular, if $c_1 > 1/2$, then the expectation of $\Gamma$ is finite,
$$\mathbb{E}[\Gamma] = \frac{2}{2c_1 - 1},$$
and if $c_1 > 1$, the variance of $\Gamma$ is finite,
$$\mathrm{Var}(\Gamma) = \frac{4c_1}{(c_1 - 1)(2c_1 - 1)^2}.$$

Proof.

Let $G_n(0) = 0$ and, for $t \in (0, 1)$,
$$G_n(t) = \sum_{i = \lfloor (1-t)n \rfloor + 1}^{n} V_i = \sum_{i = \lfloor (1-t)n \rfloor + 1}^{n} \frac{2 e_i}{i(i - 1 + 2c_1)},$$
where the $e_i$'s are i.i.d. standard exponential random variables. With this notation, we obtain
$$\gamma_n = G_n(1 - N_n(\gamma_n)/n).$$
We first show that, for any $t \in (0, 1)$,
$$\lim_{n \to \infty} (nG_n(s))_{s \le t} = \left(\frac{2s}{1 - s}\right)_{s \le t} \qquad (8)$$
in distribution, in the sense of weak convergence in the path space $D[0, t]$. To this end, let us first establish that, for a fixed $t \in (0, 1)$,
$$\lim_{n \to \infty} nG_n(t) = \frac{2t}{1 - t} \qquad (9)$$
in $L^2$. By definition, we have that
$$\mathbb{E}[nG_n(t)] = \sum_{i = \lfloor (1-t)n \rfloor + 1}^{n} \frac{2n}{i(i - 1 + 2c_1)} \sim \frac{2}{n} \sum_{i = \lfloor (1-t)n \rfloor + 1}^{n} \frac{1}{(i/n)^2}.$$
By a Riemann sum argument, we obtain that
$$\mathbb{E}[nG_n(t)] \sim 2\int_{1-t}^{1} \frac{1}{x^2}\,dx = \frac{2t}{1 - t}.$$
Now, by the independence of the random variables $e_i$,
$$\mathrm{Var}(nG_n(t)) = \sum_{i = \lfloor (1-t)n \rfloor + 1}^{n} \frac{4n^2}{i^2(i - 1 + 2c_1)^2} \sim \sum_{i = \lfloor (1-t)n \rfloor + 1}^{n} \frac{4n^2}{i^4}.$$
Again, by a Riemann sum argument, we obtain that $\mathrm{Var}(nG_n(t))$ converges to 0 as $n \to \infty$. This gives (9).

To obtain (8) we follow the same steps as those of Proposition 6.1 in [4], with $\alpha = 2$. Then, the proof of (7) follows by adapting the alternative proof of Theorem 5.2 in [4], p. 1713, taking $\alpha = 2$ and the limit variable $\sigma$ being $1 - Y$ and Beta$(1, 2c_1)$ distributed.

The distribution function of $\Gamma$ is given by
$$P(\Gamma \le x) = P\left(Y \ge \frac{2}{2 + x}\right) = 1 - \left(\frac{2}{2 + x}\right)^{2c_1}$$
for $x \ge 0$. We get the density by differentiating. The moments of $\Gamma$ are obtained by computing
$$\mathbb{E}[\Gamma^k] = \int_0^\infty k x^{k-1} P(\Gamma > x)\,dx = \int_0^\infty k x^{k-1} \left(\frac{2}{2 + x}\right)^{2c_1} dx.$$
In particular, the $k$th moment is finite for $c_1 > k/2$.

In this section we study $\theta_n$, the first time that a seed becomes a plant, which we introduced in (3). We also provide some limit laws for $N_n(\theta_n)$ and $M_n(\theta_n)$. Observe that from time zero up to time $\theta_n$ only two types of events occur, either coalescence or deactivation. Recall the successive reaching times of the chain $N_n$, denoted by $(\tau_n^i)_{i=1}^n$ and defined in (1).

Proposition 3.1. Consider a seed bank coalescent starting with $n$ plants and no seeds. Then, the following asymptotics hold.
$$\lim_{n \to \infty} \frac{N_n(\theta_n)}{\log n} = Z \qquad (10)$$
in distribution, where $Z$ is a Fréchet random variable with shape parameter 1 and scale parameter $4c_1c_2$, with distribution function $P(Z \le z) = \exp\{-4c_1c_2/z\}$. Also
$$\lim_{n \to \infty} \frac{M_n(\theta_n)}{\log n} = 2c_1 \qquad (11)$$
in probability. Finally,
$$\lim_{n \to \infty} (\log n)\,\theta_n = T \qquad (12)$$
in distribution, where $T$ is an exponential random variable with parameter $2c_1c_2$.

The proof of (11) is obtained by combining Lemmas 3.2 and 3.6. The proof of (10) and (12) is obtained by combining Lemmas 3.3 and 3.7 which appear in the sequel. We get these results by coupling the seed bank coalescent with two simpler models.

The coloured seed bank coalescent (see Definition 4.2 in [1]) is a marked coalescent where additionally each element of $[n]$ has a flag indicating its color: white or blue. Movements and mergers of the blocks of the coloured coalescent follow the same dynamics as those of the classical seed bank coalescent. Additionally, if a block activates, each individual inside this block gets the color blue. In other cases colors remain unchanged.

As in [1], we start with all individuals colored white, so the color blue only appears after a reactivation event, and we also use the notation $\underline{N}_n(t)$ (resp. $\underline{M}_n(t)$) for the number of white plants (resp. white seeds) at time $t$, starting with $n$ (white) plants and zero seeds. The reaching times of $\underline{N}_n$ are denoted by $\underline{\tau}_n^n = 0$ and, for $i \in [n-1]$,
$$\underline{\tau}_n^i = \inf\{t > 0 : \underline{N}_n(t) = i\}.$$
Note that, on the event $\{\tau_n^i < \theta_n\}$, we have $\underline{\tau}_n^i = \tau_n^i$ a.s., and in general the stochastic bound
$$\underline{\tau}_n^{i-1} - \underline{\tau}_n^i \le_{st} \tau_n^{i-1} - \tau_n^i \qquad (13)$$
holds.

This model is of particular use to prove that the number of seeds that "survive" up to moment $\theta_n$ is of order $\log n$. More precisely, as in [1], consider the independent Bernoulli random variables $B_n^i = \mathbf{1}_{\{\text{deactivation at } \underline{\tau}_n^i\}}$, for $i \in [n-1]$. Then
$$P(B_n^i = 1) = \frac{c_1(i+1)}{\binom{i+1}{2} + c_1(i+1)} = \frac{2c_1}{i + 2c_1}, \qquad (14)$$
independently of the number of seeds in the system. It is clear that, almost surely for any $t \ge 0$, $M_n(t) \le \sum_{i=1}^{n-1} B_n^i$. This and Bienaymé-Chebyshev's inequality lead to the following straightforward result.

Lemma 3.2. For any $\varepsilon > 0$,
$$P\left(\sup_{t \ge 0} M_n(t) > 2c_1(1 + \varepsilon)\log n\right) \le \frac{1}{2c_1\varepsilon^2 \log n}. \qquad (15)$$
In particular, for any $\varepsilon > 0$,
$$\lim_{n \to \infty} P(M_n(\theta_n) \le 2c_1(1 + \varepsilon)\log n) = 1.$$

The bounded seed bank coalescent is a modification of the original seed bank coalescent, where only $m$ seeds can be accumulated in the bank. Thus, when the bank is full, a deactivating lineage disappears instead of moving to the bank. In our case, we start with $n$ plants and $m$ seeds (the bank is full from the beginning).

Denote by $\bar{N}_{n,m}(t)$ (resp. $\bar{M}_{n,m}(t)$) the number of plants (resp. seeds) at time $t$ in the bounded coalescent starting with $n$ plants and $m$ seeds. The block-counting process of the bounded coalescent with parameters $c_1, c_2 > 0$ has the following transition rates: for $i \le n$ and $j \le m$, $(\bar{N}_{n,m}(t), \bar{M}_{n,m}(t))$ jumps from $(i, j)$ to

$(i-1, j)$ at rate $\binom{i}{2} + c_1 i\,\mathbf{1}_{\{j = m\}}$,
$(i-1, j+1)$ at rate $c_1 i\,\mathbf{1}_{\{j < m\}}$, […]

Lemma 3.3. Recall $T$ and $Z$ from Proposition 3.1. We have that
$$\lim_{n \to \infty} P(\theta_n \log n \le t) \le P(T \le t) \qquad (16)$$
and
$$\lim_{n \to \infty} P(N_n(\theta_n) > z \log n) \le P(Z > z). \qquad (17)$$

Proof.

Fix $\varepsilon > 0$ and denote $\lfloor 2c_1(1 + \varepsilon)\log n \rfloor$ by $m_n$. On the event $\{M_n(\theta_n) \le m_n\}$, which occurs asymptotically with probability 1 by Lemma 3.2, the variable $\theta_n$ is bounded from below, stochastically, by the random variable $\bar{\theta}_{n,m_n}$ defined by
$$\bar{\theta}_{n,m_n} = \inf\{t \ge 0 : \bar{M}_{n,m_n}(t-) > \bar{M}_{n,m_n}(t)\}$$
and having exponential distribution with parameter $c_2 m_n$. Then, for $t > 0$,
$$P(\theta_n \log n \le t) = P(\theta_n \log n \le t, M_n(\theta_n) \le m_n) + o(1) \le P(\bar{\theta}_{n,m_n} \log n \le t) + o(1) = 1 - \exp\left\{ -t\,\frac{c_2 \lfloor 2c_1(1 + \varepsilon)\log n \rfloor}{\log n} \right\} + o(1).$$
So, for any $\varepsilon > 0$,
$$\lim_{n \to \infty} P(\theta_n \log n \le t) \le P(T \le t(1 + \varepsilon)). \qquad (18)$$
This gives (16).

To prove (17), observe that, on the event $\{M_n(\theta_n) \le m_n\}$, the variable $N_n(\theta_n)$ is bounded from above, stochastically, by the random variable $\bar{N}_{n,m_n}(\bar{\theta}_{n,m_n})$. So,
$$P(N_n(\theta_n) > z \log n) \le P(\bar{N}_{n,m_n}(\bar{\theta}_{n,m_n}) > z \log n) + P(M_n(\theta_n) > m_n). \qquad (19)$$
Let us study the asymptotics of $\bar{N}_{n,m_n}(\bar{\theta}_{n,m_n})$. We have that
$$P(\bar{N}_{n,m_n}(\bar{\theta}_{n,m_n}) \le z \log n) = \prod_{i = \lfloor z \log n \rfloor + 1}^{n} \frac{\binom{i}{2} + c_1 i}{\binom{i}{2} + c_1 i + c_2 m_n} = \exp\left\{ -\sum_{i = \lfloor z \log n \rfloor + 1}^{n} \log\left(1 + \frac{2c_2 m_n}{i(i - 1 + 2c_1)}\right) \right\} \sim \exp\left\{ -2c_2 m_n \sum_{i = \lfloor z \log n \rfloor + 1}^{n} \frac{1}{i^2} \right\}.$$
By a Riemann sum argument, we know that
$$\lim_{n \to \infty} m_n \sum_{i = \lfloor z \log n \rfloor + 1}^{n} \frac{1}{i^2} = 2c_1(1 + \varepsilon)\int_z^\infty \frac{1}{x^2}\,dx = \frac{2c_1(1 + \varepsilon)}{z}. \qquad (20)$$
Since $P(Z \le z) = \exp\{-4c_1c_2/z\}$, we obtain, by taking the limits in (19), that
$$\lim_{n \to \infty} P(N_n(\theta_n) > z \log n) \le P(Z > z/(1 + \varepsilon)),$$
which implies (17).

The bounded seed bank coalescent is also useful to bound $N_n(t)$ from above, for any $t \ge 0$. Let $(K_n(t))_{t \ge 0}$ stand for the block-counting process of the Kingman coalescent starting with $n$ lineages. Let $(\chi_i(t))_{i \ge 1}$ be a sequence of i.i.d. Bernoulli variables of parameter $1 - \exp(-c_2 t)$. Those variables are more easily understood as $\chi_i(t) = \mathbf{1}_{\{e_i\, \ldots\}}$ […]

Lemma 3.4.

For $a > b \ge 0$ such that $a + b > 1$,
$$\lim_{n \to \infty} P\left(\tau_n^{(a)} \le (\log n)^{-b}\right) = 1.$$

Proof. Denote $\lfloor 2c_1(1 + \varepsilon)\log n \rfloor$ by $m_n$ and let $E_n = \{\sup_t M_n(t) \le m_n\}$. We start by observing that
$$P(\tau_n^{(a)} > (\log n)^{-b}) = P(\tau_n^{(a)} > (\log n)^{-b}, E_n) + P(\tau_n^{(a)} > (\log n)^{-b}, E_n^c).$$
From (15), we get that
$$P(E_n^c) \le \frac{1}{2c_1\varepsilon^2 \log n}.$$
So it just remains to control the probability on the event $E_n$. Recall $(K_n(t))_{t \ge 0}$ and $(\chi_i(t))_{i \ge 1}$ from (21). Let $\omega_{n,a} = \inf\{t > 0 : K_n(t) \le \tfrac{1}{2}(\log n)^a\}$. Observe that
$$\{\tau_n^{(a)} > t, E_n\} = \{N_n(t) > (\log n)^a, E_n\} \subset \left\{K_n(t) + \sum_{i=1}^{m_n} \chi_i(t) > (\log n)^a\right\} \subset \left\{K_n(t) > \tfrac{1}{2}(\log n)^a\right\} \cup \left\{\sum_{i=1}^{m_n} \chi_i(t) > \tfrac{1}{2}(\log n)^a\right\} = \{\omega_{n,a} > t\} \cup \left\{\sum_{i=1}^{m_n} \chi_i(t) > \tfrac{1}{2}(\log n)^a\right\}.$$
Taking $t = (\log n)^{-b}$, we obtain
$$P(\tau_n^{(a)} > (\log n)^{-b}, E_n) \le P(\omega_{n,a} > (\log n)^{-b}) + P\left(\sum_{i=1}^{m_n} \chi_i((\log n)^{-b}) > \tfrac{1}{2}(\log n)^a\right).$$
An elementary calculation on sums of independent exponential variables shows that $\mathbb{E}[\omega_{n,a}]$ is of order $(\log n)^{-a}$. So, Markov's inequality for $\omega_{n,a}$ gives
$$P(\omega_{n,a} > (\log n)^{-b}) \le C(\log n)^{b - a}$$
for some constant $C > 0$, which converges to 0 whenever $b < a$. On the other hand, Markov's inequality applied to a binomial random variable with parameters $\lfloor 2c_1(1 + \varepsilon)\log n \rfloor$ and $1 - \exp(-c_2(\log n)^{-b})$ (whose expectation is of order $(\log n)^{1-b}$) leads to
$$P\left(\sum_{i=1}^{m_n} \chi_i((\log n)^{-b}) > \tfrac{1}{2}(\log n)^a\right) \le C(\log n)^{1 - b - a}.$$
This quantity converges to 0 as $a + b > 1$.

Remark 3.5.

The rate of coalescence is quadratic with respect to the number of plants while the rate of deactivation (resp. the rate of activation) is linear with respect to the number of plants (resp. the number of seeds). The latter lemma suggests that, until time $\tau_n^{(a)}$, for $a > 1/2$, the block-counting process $(N_n(t))_{t \ge 0}$ behaves similarly to that of the Kingman coalescent. However, at time $\tau_n^{(1/2)}$, the system reaches a level of $\sqrt{\log n}$ plants and the times of decay are no longer close to those of the Kingman coalescent. Indeed, at this time, we claim that the number of seeds is of order $\log n$ and the coalescence events no longer dominate the dynamics. The seed bank coalescent then enters a mixed regime with coalescence and activation occurring at the same speed.

We now provide the lower bound for $M_n(\theta_n)$. This result, combined with Lemma 3.2, provides the convergence (11) in Proposition 3.1.

Lemma 3.6.

For any $\varepsilon > 0$ and $a > 1$,
$$\lim_{n \to \infty} P(M_n(\tau_n^{(a)}) > 2c_1(1 - \varepsilon)\log n) = 1, \qquad (22)$$
which implies that
$$\lim_{n \to \infty} P(M_n(\theta_n) > 2c_1(1 - \varepsilon)\log n) = 1. \qquad (23)$$

Proof.

Let us first note that (17) implies that
$$\lim_{n \to \infty} P(N_n(\theta_n) < (\log n)^a) = 1,$$
which, thanks to the monotonicity of $(N_n(t))_{t \ge 0}$ until time $\theta_n$, is equivalent to
$$\lim_{n \to \infty} P\left(\theta_n > \tau_n^{(a)}\right) = 1.$$
Due to the monotonicity of $(M_n(t))_{t \ge 0}$ until time $\theta_n$, (22) implies (23).

Now, on the event $\{\theta_n > \tau_n^{(a)}\}$, we have
$$M_n(\tau_n^{(a)}) = \sum_{i = \lfloor (\log n)^a \rfloor}^{n-1} B_n^i$$
where the $B_n^i$'s are the Bernoulli random variables introduced in (14). So,
$$P(M_n(\tau_n^{(a)}) < 2c_1(1 - \varepsilon)\log n) = P(M_n(\tau_n^{(a)}) < 2c_1(1 - \varepsilon)\log n, \theta_n > \tau_n^{(a)}) + o(1) \le P\left(\sum_{i = \lfloor (\log n)^a \rfloor}^{n-1} B_n^i < 2c_1(1 - \varepsilon)\log n\right) + o(1) = P\left(\sum_{i=1}^{n-1} B_n^i < 2c_1(1 - \varepsilon)\log n + \sum_{i=1}^{\lfloor (\log n)^a \rfloor - 1} B_n^i\right) + o(1).$$
It is easy to convince oneself that $\sum_{i=1}^{\lfloor (\log n)^a \rfloor - 1} B_n^i$ is of order $\log\left((\log n)^a\right)$. The latter probability therefore converges to 0 thanks to Bienaymé-Chebyshev's inequality.

We are now able to end the overview of the system at time $\theta_n$. The following result, combined with Lemma 3.3, provides the convergences (10) and (12) in Proposition 3.1.

Lemma 3.7. Recall $T$ and $Z$ from Proposition 3.1. We have that
$$\lim_{n \to \infty} P(N_n(\theta_n) \le z \log n) \le P(Z \le z), \qquad (24)$$
which implies that
$$\lim_{n \to \infty} P(\theta_n \log n > t) \le P(T > t). \qquad (25)$$

Proof.

Fix $\varepsilon > 0$ and denote $\lfloor 2c_1(1 - \varepsilon)\log n \rfloor$ by $m_n$. Also denote $\tau_n^{\lfloor z \log n \rfloor}$ by $\hat{\tau}_n$. First observe that
$$P(N_n(\theta_n) \le z \log n) = P(\theta_n \ge \hat{\tau}_n).$$
So it is enough to prove that
$$\lim_{n \to \infty} P(\theta_n \ge \hat{\tau}_n) \le P(Z \le z). \qquad (26)$$
For any $t \ge 0$, define $X(t)$ to be the number of reactivations until time $t$. Let $E_i$ be an exponential random variable with parameter $c_2 i$, that can be understood as the minimum of $i$ independent exponential random variables with parameter $c_2$. Then, for any $a > 1$,
$$P(\theta_n \ge \hat{\tau}_n) = P(X(\hat{\tau}_n) = 0) = P(X(\hat{\tau}_n) - X(\tau_n^{(a)}) = 0, X(\tau_n^{(a)}) = 0) \le P(X(\hat{\tau}_n) - X(\tau_n^{(a)}) = 0 \mid X(\tau_n^{(a)}) = 0) \le P(E_{M_n(\tau_n^{(a)})} > \hat{\tau}_n - \tau_n^{(a)}).$$
The latter inequality follows by observing that if there are no activations in the time interval $[\tau_n^{(a)}, \hat{\tau}_n]$, then none of the $M_n(\tau_n^{(a)})$ seeds present at time $\tau_n^{(a)}$ have activated. Hence,
$$P(\theta_n \ge \hat{\tau}_n) \le \mathbb{E}\left[e^{-c_2(\hat{\tau}_n - \tau_n^{(a)})M_n(\tau_n^{(a)})}\right] = \mathbb{E}\left[e^{-c_2(\hat{\tau}_n - \tau_n^{(a)})M_n(\tau_n^{(a)})}\mathbf{1}_{\{M_n(\tau_n^{(a)}) > m_n\}}\right] + \mathbb{E}\left[e^{-c_2(\hat{\tau}_n - \tau_n^{(a)})M_n(\tau_n^{(a)})}\mathbf{1}_{\{M_n(\tau_n^{(a)}) \le m_n\}}\right] \le \mathbb{E}\left[e^{-c_2 m_n(\hat{\tau}_n - \tau_n^{(a)})}\right] + P(M_n(\tau_n^{(a)}) \le m_n).$$
So, by denoting for simplicity $n_z = \lfloor z \log n \rfloor$ and $n_a = \lfloor (\log n)^a \rfloor$, and by (13), where $\underline{\tau}_n^i$ denotes the reaching times of the number of white plants, we obtain
$$P(\theta_n \ge \hat{\tau}_n) \le \mathbb{E}\left[e^{-c_2 m_n \sum_{i = n_z + 1}^{n_a}(\underline{\tau}_n^{i-1} - \underline{\tau}_n^i)}\right] + P(M_n(\tau_n^{(a)}) \le m_n). \qquad (27)$$
Since the variables $\underline{\tau}_n^{i-1} - \underline{\tau}_n^i$ are independent and exponentially distributed, we have
$$\mathbb{E}\left[e^{-c_2 m_n \sum_{i = n_z + 1}^{n_a}(\underline{\tau}_n^{i-1} - \underline{\tau}_n^i)}\right] = \prod_{i = n_z + 1}^{n_a} \frac{\binom{i}{2} + c_1 i}{\binom{i}{2} + c_1 i + c_2 m_n} = \exp\left\{ -\sum_{i = n_z + 1}^{n_a} \log\left(1 + \frac{2c_2 m_n}{i(i - 1 + 2c_1)}\right) \right\}.$$
Now, we can use equivalences:
$$\mathbb{E}\left[e^{-c_2 m_n \sum_{i = n_z + 1}^{n_a}(\underline{\tau}_n^{i-1} - \underline{\tau}_n^i)}\right] \sim \exp\left\{ -\sum_{i = n_z + 1}^{n_a} \frac{2c_2 m_n}{i^2} \right\}.$$
A limit similar to that given in (20) implies that
$$\lim_{n \to \infty} \mathbb{E}\left[e^{-c_2 m_n \sum_{i = n_z + 1}^{n_a}(\underline{\tau}_n^{i-1} - \underline{\tau}_n^i)}\right] = e^{-4c_1c_2(1 - \varepsilon)/z} = P(Z \le z/(1 - \varepsilon)). \qquad (28)$$
Plugging (28) and (22) into (27), and observing that the result is true for any $\varepsilon > 0$, we get (26).

A very similar path is followed to obtain (25). For some $t > 0$, let $t_n = t(\log n)^{-1}$ and for some $b > 1$, let $s_n = (\log n)^{-b}$. As before, we get
$$P(\theta_n \log n > t) = P(\theta_n > t_n) = P(X(t_n) = 0) \le e^{-c_2 m_n(t_n - s_n)} + P(M_n(s_n) \le m_n).$$
The first term converges to $P(T > t(1 - \varepsilon))$ and the second to 0. To get the latter, first use (16) to see that
$$\lim_{n \to \infty} P(\theta_n > s_n) = 1.$$
Then, just choose $a > b$ such that Lemma 3.4 holds, and use (22). Since the result is true for any $\varepsilon > 0$, we get (25).
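Several closed-form expressions used in the proofs of Sections 2 and 3 lend themselves to a quick numerical sanity check. The sketch below is illustrative only and its function names are hypothetical: it evaluates the exact product from the proof of Proposition 2.1 (to be compared with the limit $z^{2c_1}$), the two sides of identity (14), and the sum of the Bernoulli means $2c_1/(i + 2c_1)$, whose growth like $2c_1 \log n$ underlies Lemma 3.2.

```python
import math


def prob_first_deactivation_level(n, z, c1):
    """Exact product prod_{i=floor(zn)}^{n-1} i / (i + 2*c1); assumes floor(zn) >= 1."""
    start = math.floor(z * n)
    log_prob = sum(math.log(i / (i + 2.0 * c1)) for i in range(start, n))
    return math.exp(log_prob)


def deactivation_probability(i, c1):
    """Middle expression of (14): c1(i+1) / (binom(i+1, 2) + c1(i+1))."""
    return c1 * (i + 1) / ((i + 1) * i / 2.0 + c1 * (i + 1))


def expected_deactivations(n, c1):
    """Sum of the Bernoulli means 2*c1 / (i + 2*c1) for i = 1, ..., n-1."""
    return sum(2.0 * c1 / (i + 2.0 * c1) for i in range(1, n))
```

For instance, `prob_first_deactivation_level(100000, 0.5, 1.0)` can be compared with $0.5^{2} = 0.25$, and `expected_deactivations(n, c1) / (2 * c1 * math.log(n))` can be watched as `n` grows; the ratio approaches 1 only at a logarithmic pace.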

In this section, we study the total branch length $L_n$ of the seed bank coalescent starting with $n$ plants and no seeds as defined in (4) and prove Theorem 1.1 by combining the upcoming Theorems 4.1 and 4.2.

Consider the active length defined in (5). We prove that this variable has the same asymptotics as the total length of the Kingman coalescent.

Theorem 4.1.

Consider the seed bank coalescent starting with n plants and no seeds.Then, lim n →∞ A n log n = 2 in probability. roof. Recall the notation τ ( a ) n = τ (cid:98) (log n ) a (cid:99) n and consider some a ∈ (1 / , A n inthree parts A n = (cid:90) θ n N n ( t ) dt , A n = (cid:90) τ ( a ) n θ n N n ( t ) dt and A n = (cid:90) σ n τ ( a ) n N n ( t ) dt. Here we will work on the event { θ n ≤ τ ( a ) n } . On the complementary event, the proof ismore easily following the same steps. The result is obtained from (29), (30) and (31) inthe sequel.i) Let us ﬁrst prove that lim n →∞ A n log n = 2 (29)in probability. Observe that, between times 0 and θ n , only coalescence or deactivationevents occur. This implies that we can rewrite A n as follows, A n = n (cid:88) i = N n ( θ n )+1 iE i , where, given M n ( τ in ), the E i ’s are independent exponential random variables with respec-tive parameters (cid:0) i (cid:1) + c i + c M n ( τ in ). Let h n = (cid:80) n − i =1 2 i +2 c . By proving that E [ | A n − h n | ] = o (log n ) , we get the desired result. To this. Observe that the variable A n is stochastically boundedby the length of a Kingman coalescent with freezing, that is H n = n (cid:88) i =2 iV i , where the V i ’s, as in Section 2, are independent exponential random variables with re-spective parameters (cid:0) i (cid:1) + c i . This is true because the seeds “accelerate” the jump times.To be precise consider the following coupling. Let V i = min { E ( c ) i , E ( d ) i } where E ( c ) i isexponential with parameter (cid:0) i (cid:1) and E ( d ) i is exponential with parameter c i . Now let E ( a ) i,m be exponential with parameter c m . Construct a process ( ˜ N n ( t ) , ˜ M n ( t )) t ≥ , equal in dis-tribution to ( N n ( t ) , M n ( t )) t ≥ up to time θ n , recursively, using these exponential randomvariables. 
That is, $(\tilde N_n(t),\tilde M_n(t))$ jumps from $(i,m)$ to
$$\begin{cases}(i-1,m), & \text{if } \min\{E_i^{(c)},E_i^{(d)},E_{i,m}^{(a)}\}=E_i^{(c)},\\ (i-1,m+1), & \text{if } \min\{E_i^{(c)},E_i^{(d)},E_{i,m}^{(a)}\}=E_i^{(d)},\\ (0,0), & \text{otherwise.}\end{cases}$$
Here $(0,0)$ represents a cemetery state. Note that, in distribution, $(\tilde N_n(t),\tilde M_n(t))=(N_n(t),M_n(t))\mathbf 1_{\{\theta_n>t\}}$. Thus, by writing $(\tilde\tau_n^i)_{i=1}^n$ for the successive jump times of the new process and
$$\tilde r_n=\sup\Big\{i\ge 1:\ \min\big\{E_i^{(c)},E_i^{(d)},E^{(a)}_{i,\tilde M_n(\tilde\tau_n^i)}\big\}=E^{(a)}_{i,\tilde M_n(\tilde\tau_n^i)}\Big\},$$
we obtain that
$$A_n^1=\sum_{i=\tilde r_n+1}^{n} i\,V_i\ \le\ \sum_{i=2}^{n} i\,V_i=H_n,$$
where the first equality is in distribution and the others hold almost surely. The first equality is true because, although the $V_i$'s are variables with the "wrong" parameter, they are not independent of $\tilde r_n$, and this dependence "accelerates" these exponential random variables. Hence,
$$E[|A_n^1-h_n|]\le E[H_n-A_n^1]+E[|H_n-h_n|].$$
The second term is bounded thanks to the $L^1$-convergence of sums of independent exponential variables. For the first term, note that $E[H_n]=h_n$, since $i/\big(\binom{i}{2}+c_1 i\big)=2/(i-1+2c_1)$. Therefore
$$E[H_n-A_n^1]=E\Big[H_n-E\big[A_n^1\,\big|\,N_n(\theta_n),(M_n(\tau_n^i))_{i\ge 1}\big]\Big]
=h_n-E\Bigg[\sum_{i=N_n(\theta_n)+1}^{n}\frac{2}{i-1+2c_1+2c_2\frac{M_n(\tau_n^i)}{i}}\Bigg]
\le h_n-E\Bigg[\sum_{i=N_n(\theta_n)+1}^{n}\frac{2}{i-1+2c_1+2c_2\frac{\sup_t M_n(t)}{i}}\Bigg].$$
Then, denote $\lfloor 2c_1(1+\varepsilon_1)\log n\rfloor$ by $m_n$ and $\lfloor(\log n)^{1+\varepsilon_2}\rfloor$ by $a_n$, for some $\varepsilon_1,\varepsilon_2>0$. Now, set the event $E_n=\{\sup_t M_n(t)\le m_n,\ N_n(\theta_n)\le a_n\}$. We obtain that
$$E[H_n-A_n^1]\le h_n-E\Bigg[\mathbf 1_{E_n}\sum_{i=N_n(\theta_n)+1}^{n}\frac{2}{i-1+2c_1+2c_2\frac{\sup_t M_n(t)}{i}}\Bigg]
\le h_n-P(E_n)\sum_{i=a_n+1}^{n}\frac{2}{i-1+2c_1+2c_2\frac{m_n}{i}}
\le h_n-P(E_n)\sum_{i=a_n+1}^{n}\frac{2}{i-1+2c_1+2c_2\frac{m_n}{a_n+1}}.$$
Since $\frac{m_n}{a_n+1}\le C(\log n)^{-\varepsilon_2}$ for some constant $C$ and $P(E_n)$ converges to 1 (thanks to Proposition 3.1), we get that
$$E[H_n-A_n^1]=o(\log n).$$
The $L^1$-convergence is thus obtained. This implies (29).

ii) Let us now prove that
$$\lim_{n\to\infty}\frac{A_n^2}{\log n}=0 \tag{30}$$
in probability. It is clear that, almost surely,
$$A_n^2\le\tau_n^{(a)}\big(N_n(\theta_n)+M_n(\theta_n)\big).$$
Combining Proposition 3.1 and Lemma 3.4 (choosing $b<a$), we obtain the result.

iii) Finally, let us prove that
$$\lim_{n\to\infty}\frac{A_n^3}{\log n}=0 \tag{31}$$
in probability. To this end, define $U_0=N_n(\tau_n^{(a)})=\lfloor(\log n)^a\rfloor$ (by definition), $V_0=M_n(\tau_n^{(a)})$ (which, by Lemma 3.2, is stochastically bounded by $2c_1(1+\varepsilon)\log n$), and, for any $k\ge 1$, define $U_k$ (resp. $V_k$) as the number of plants (resp. seeds) at the $k$th event after time $\tau_n^{(a)}$. Each event can be a coalescence, an activation or a deactivation. Note that the increments of $U_k$ and $V_k$ are in $\{-1,0,1\}$. Let $S_n$ be the number of jump times during the interval $(\tau_n^{(a)},\sigma_n]$, i.e.
$$S_n=\inf\{k\ge 0:\ U_k+V_k=1\}.$$
With this notation, the active branch length on this time interval can be written as
$$A_n^3=\sum_{k=0}^{S_n-1}U_k E_k,$$
where, conditional on $U_k$ and $V_k$, the $E_k$'s are independent exponential random variables with respective parameters $\binom{U_k}{2}+c_1U_k+c_2V_k$. So, we have
$$E[A_n^3]=E\Bigg[\sum_{k=0}^{S_n-1}\frac{U_k}{\binom{U_k}{2}+c_1U_k+c_2V_k}\Bigg].$$
Now define
$$D_n:=\big|\{k\ge 0:\ U_{k+1}-U_k=-1,\ V_{k+1}-V_k=1\}\big|$$
as the number of deactivations during this time interval, and observe that
$$E[D_n]=E\Bigg[\sum_{k=0}^{S_n-1}\frac{c_1U_k}{\binom{U_k}{2}+c_1U_k+c_2V_k}\Bigg].$$
This implies that
$$E[A_n^3]=\frac{1}{c_1}E[D_n].$$
So, it is enough to study the expectation of $D_n$. We decompose
$$D_n=\sum_{i=2}^{N_n(\tau_n^{(a)})+M_n(\tau_n^{(a)})}D_n^i,$$
where $D_n^i$ is the number of deactivations occurring while the total number of lineages equals $i$, that is,
$$D_n^i:=\big|\{k\ge 0:\ U_{k+1}-U_k=-1,\ V_{k+1}-V_k=1,\ U_k+V_k=i\}\big|.$$
We will bound $E[D_n]$ thanks to the following model from Definition 4.9 of [1].

Let $(\widehat N_n(t),\widehat M_n(t))_{t\ge 0}$ be a process having the same transitions as $(N_n(t),M_n(t))_{t\ge 0}$ whenever $\widehat N_n(t)\ge\sqrt{\widehat N_n(t)+\widehat M_n(t)}$. Otherwise, coalescence events are not permitted. For any $i\ge 2$, by Lemma 4.10 of [1], $E[D_n^i]\le E[\widehat D_n^i]$, where $\widehat D_n^i$ stands for the number of deactivations in this model while $\widehat N_n(t)+\widehat M_n(t)=i$. In what follows we give an idea of why $E[\widehat D_n^i]=O(i^{-1/2})$, implying that $E[D_n]=O((\log n)^{1/2})$, and hence proving (31).

Details of the proof, which are unfortunately quite tedious, can be found inside the proofs of Lemmas 4.10 and 4.11 of [1]. In the sequel, suppose that $c_1=c_2=1$, for the sake of simplicity.

Fix $i\ge 2$. The largest values of $\widehat D_n^i$ arise when coalescences are not permitted. Thus suppose that at time $t$, $\widehat N_n(t)+\widehat M_n(t)$ reaches $i$, with $\widehat N_n(t-)=\lfloor\sqrt i\rfloor+1\ge\sqrt{i+1}$. This means that $\widehat N_n(t)=\lfloor\sqrt i\rfloor\le\sqrt i$. Reactivations are then needed to allow a new coalescence. Conditional on this configuration, the probability that $\widehat D_n^i$ equals 0 is equivalent to
$$\frac{i-\lfloor\sqrt i\rfloor}{i}\times\frac{\binom{\lfloor\sqrt i\rfloor}{2}}{\binom{\lfloor\sqrt i\rfloor}{2}+\lfloor\sqrt i\rfloor}\sim 1-\frac{3}{\sqrt i}=:p_i.$$
This corresponds approximately to the probability of one reactivation, followed by one coalescence before one deactivation. So we have the following almost sure bound
$$\widehat D_n^i\le\sum_{j=1}^{G_i-1}\Delta_j,$$
where $G_i$ is a geometric random variable of parameter $p_i$ and the $\Delta_j$'s give the number of deactivations between successive visits of the state $\lfloor\sqrt i\rfloor$. The time during which coalescence is not allowed is stochastically bounded from above by the time that a random walk, started at zero, going up one unit at rate $i-\sqrt i$ (the rate of a reactivation) and down at rate $\sqrt i$ (the rate of a deactivation), spends below level $\sqrt i$. The random walk has ballistic speed of order $i$. In particular, it reaches the level $\sqrt i$ after $\sqrt i/i=1/\sqrt i$ units of time on average. During the period in which coalescence events are not allowed there are always fewer than $\sqrt i$ plants, each of which deactivates at rate $c_1$ ($=1$). We conclude that, for any $j$,
$$E[\Delta_j]\le\sqrt i\cdot\frac{1}{\sqrt i}=1.$$
This uniform bound implies that
$$E[\widehat D_n^i]\le E[G_i-1]\,E[\Delta_1]=O\Big(\frac{1}{\sqrt i}\Big),$$
since $E[G_i-1]\sim\frac{3}{\sqrt i}$.

Consider the inactive length defined in (6).
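The limit for the active length proved above, and the corresponding statement for the inactive length, can be explored numerically. The following is a minimal Python sketch that is not part of the paper: it runs a standard Gillespie simulation of the block-counting process $(N_n(t),M_n(t))$ started from $(n,0)$ with coalescence rate $\binom{N}{2}$, deactivation rate $c_1N$ and activation rate $c_2M$, stopped when a single lineage remains. The function names `simulate_lengths` and `centering_h` are illustrative choices, and the stopping rule (one lineage left) is an assumption about how $\sigma_n$ is defined.

```python
import math
import random


def simulate_lengths(n, c1, c2, rng):
    """Simulate (N, M) from (n, 0) until one lineage is left.

    Returns (active_length, inactive_length), i.e. the integrals of the
    number of plants and of the number of seeds over the simulated time.
    """
    plants, seeds = n, 0
    active = 0.0
    inactive = 0.0
    while plants + seeds > 1:
        coal = plants * (plants - 1) / 2.0   # pair coalescence among plants
        deact = c1 * plants                  # a plant becomes a seed
        react = c2 * seeds                   # a seed becomes a plant
        total = coal + deact + react
        hold = rng.expovariate(total)        # exponential holding time
        active += plants * hold
        inactive += seeds * hold
        u = rng.random() * total
        if u < coal:
            plants -= 1
        elif u < coal + deact or seeds == 0:  # guard against rounding at the boundary
            plants -= 1
            seeds += 1
        else:
            plants += 1
            seeds -= 1
    return active, inactive


def centering_h(n, c1):
    """Centering sequence h_n = sum_{i=1}^{n-1} 2 / (i + 2 c1)."""
    return sum(2.0 / (i + 2.0 * c1) for i in range(1, n))


if __name__ == "__main__":
    rng = random.Random(2020)
    n, c1, c2 = 2000, 1.0, 1.0
    runs = [simulate_lengths(n, c1, c2, rng) for _ in range(50)]
    mean_active = sum(a for a, _ in runs) / len(runs)
    mean_inactive = sum(i for _, i in runs) / len(runs)
    print("mean active / log n  :", mean_active / math.log(n))
    print("mean inactive / log n:", mean_inactive / math.log(n))
    print("h_n / log n          :", centering_h(n, c1) / math.log(n))
```

Because the normalisation is only logarithmic, the ratios printed for moderate $n$ can sit noticeably away from their limits; the sketch is meant for qualitative exploration rather than as a verification of the theorems.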

Theorem 4.2.

Consider the seed bank coalescent starting with $n$ plants and no seeds. Then,
$$\lim_{n\to\infty}\frac{I_n}{\log n}=\frac{2c_1}{c_2}$$
in probability.

Proof. Divide $I_n$ into two parts
$$I_n^1=\int_0^{\theta_n}M_n(t)\,dt\qquad\text{and}\qquad I_n^2=\int_{\theta_n}^{\sigma_n}M_n(t)\,dt.$$
It is easy to prove that $I_n^1/\log n$ converges to 0 in probability by observing that, almost surely, $I_n^1\le M_n(\theta_n)\cdot\theta_n$, and using Proposition 3.1.

To study $I_n^2$, we approximate it by the accumulated time for the $M_n(\theta_n)$ seeds to activate, namely
$$\tilde I_n=\sum_{k=1}^{M_n(\theta_n)}\frac{e_k}{c_2},$$
where the $e_k$'s are i.i.d. standard exponential random variables. The asymptotics of this random variable are easily obtained by reproducing the arguments of Section 2. First, by Proposition 3.1, we have that $M_n(\theta_n)/\log n\to 2c_1$ in probability. Second, we use the functional law of large numbers for sums of exponential variables. This leads to the desired result,
$$\lim_{n\to\infty}\frac{\tilde I_n}{\log n}=\frac{2c_1}{c_2}$$
in probability.

Finally, the difference between $I_n^2$ and $\tilde I_n$ can be bounded by $I_{N_n(\theta_n)}+I_{M_n(\theta_n)}$. Indeed, the variable $I_{N_n(\theta_n)}$ bounds the inactive length resulting from the plants present at time $\theta_n$, and the variable $I_{M_n(\theta_n)}$ bounds the inactive length resulting from the seeds present at time $\theta_n$ that activate and deactivate again. The expectation of this bound is of order $\log\log n$, as can be seen by repeating the earlier arguments of this proof.

Consider the seed bank coalescent at time $\theta_n$ and go back, through the active part of the genealogical tree, until time zero, when there are $n$ active lineages and zero inactive lineages. During this period of time we observe $n-N_n(\theta_n)$ events divided into two types: branching inside one lineage (corresponding to a coalescence) and appearance of a new lineage (corresponding to a deactivation). When there are $k$ lineages, the probability that a branching event occurs is
$$\frac{\binom{k+1}{2}}{\binom{k+1}{2}+c_1(k+1)}=\frac{k}{k+2c_1},$$
whereas the probability that a new lineage appears is $\frac{2c_1}{k+2c_1}$.
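The simplification of the branching probability to $k/(k+2c_1)$ can be checked with exact rational arithmetic. The short Python sketch below is an illustration only; the helper name `backward_step_probabilities` is hypothetical, and the rates used are the ones stated in the text (coalescence at rate $\binom{k+1}{2}$ and deactivation at rate $c_1(k+1)$ when $k+1$ lineages are active).

```python
from fractions import Fraction


def backward_step_probabilities(k, c1):
    """Return (p_branch, p_new) for the backward step seen with k lineages.

    The step is produced by an event among k + 1 active lineages:
    a coalescence (rate binom(k+1, 2)) or a deactivation (rate c1 * (k+1)).
    """
    c1 = Fraction(c1)
    coal = Fraction((k + 1) * k, 2)      # binom(k + 1, 2)
    deact = c1 * (k + 1)
    p_branch = coal / (coal + deact)
    return p_branch, 1 - p_branch


if __name__ == "__main__":
    c1 = Fraction(3, 7)
    for k in (1, 2, 5, 10):
        p_branch, p_new = backward_step_probabilities(k, c1)
        # Compare with the closed forms k/(k+2c1) and 2c1/(k+2c1).
        print(k, p_branch == Fraction(k) / (k + 2 * c1), p_new == 2 * c1 / (k + 2 * c1))
```

Using `Fraction` avoids floating-point rounding, so the comparison with the closed forms is an exact equality rather than an approximate one.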
These two probabilities reveal a connection with the classical Hoppe's urn and the Chinese restaurant process (with parameter $2c_1$), which are the key tools to prove Ewens' sampling formula for the law of the allele frequency spectrum in the neutral model, see Chapter 1.3 in [7]. However, in our case, the initial configuration is made of a random number $N_n(\theta_n)$ of tables (old lineages) with one client in each. By applying results of [18], we can obtain a conditional sampling formula for observing a given configuration of lineages that passed through the seed bank and lineages that did not deactivate (until time $\theta_n$).

Now, let $k\le n$ be a positive integer. We define the sets
$$A(k,n)=\Big\{a_i,b_i\ge 0,\ i\in[n]:\ \sum_{i=1}^{n}a_i=k\ \text{ and }\ \sum_{i=1}^{n}i(a_i+b_i)=n\Big\}$$
and
$$\bar A(k,n)=\Big\{a_i\ge 0,\ i\in[n]:\ \sum_{i=1}^{n}a_i=k\ \text{ and }\ \sum_{i=1}^{n}i\,a_i\le n\Big\}.$$
From equation (3.3.2) in [18], we obtain the next theorem.

Theorem 5.1.

Let $O_i$ be the number of "old" blocks of size $i$ (i.e. active blocks of size $i$ at time $\theta_n$) and let $R_i$ be the number of "recent" blocks of size $i$ (i.e. inactive blocks of size $i$ at time $\theta_n$). Then
$$P\big(O_1=a_1,\dots,O_n=a_n,R_1=b_1,\dots,R_n=b_n\,\big|\,N_n(\theta_n)\big)\overset{a.s.}{=}\frac{(n-N_n(\theta_n))!\,N_n(\theta_n)!}{(N_n(\theta_n)+2c_1)^{(n-N_n(\theta_n))}}\prod_{i=1}^{n}\frac{1}{a_i!}\prod_{j=1}^{n}\frac{1}{b_j!}\Big(\frac{2c_1}{j}\Big)^{b_j}, \tag{32}$$
with $(a_i,b_i)_{i\in[n]}\in A(N_n(\theta_n),n)$. The notation $x^{(n)}$ stands for the ascending factorial, that is, $x^{(n)}=x(x+1)\cdots(x+n-1)$.

Remark 5.2.

From the latter result and Proposition 3.1, we can obtain an approximate sampling formula for large $n$:
$$P(O_1=a_1,\dots,O_n=a_n,R_1=b_1,\dots,R_n=b_n)
=\int_0^{\infty}P\big(O_1=a_1,\dots,O_n=a_n,R_1=b_1,\dots,R_n=b_n\,\big|\,N_n(\theta_n)=\lfloor z\log n\rfloor\big)\,P\big(N_n(\theta_n)=\lfloor z\log n\rfloor\big)\,dz$$
$$\sim\prod_{i=1}^{n}\frac{1}{a_i!}\prod_{j=1}^{n}\frac{1}{b_j!}\Big(\frac{2c_1}{j}\Big)^{b_j}\times\int_0^{\infty}\frac{\Gamma(n-z\log n+1)\,\Gamma(z\log n+1)\,\Gamma(z\log n+2c_1)}{\Gamma(n+2c_1)}\cdot\frac{4c_1c_2}{z^2}\,e^{-\frac{4c_1c_2}{z}}\,dz,$$
which does not depend on the non-observable variable $N_n(\theta_n)$. The variables $O_i$ and $R_i$ can be inferred if we are able to decide whether a present individual has visited the seed bank or not. This provides a possible method for estimating the parameters of the seed bank model.

From (32), we obtain the probability generating function of the old and recent blocks.

Corollary 5.3.

Let $O_1,\dots,O_n,R_1,\dots,R_n$ be random variables with joint distribution given by (32). Then, their (conditional) probability generating function is
$$E\Bigg[\prod_{i=1}^{n}t_i^{O_i}\prod_{j=1}^{n}s_j^{R_j}\,\Bigg|\,N_n(\theta_n)\Bigg]=\frac{(n-N_n(\theta_n))!\,N_n(\theta_n)!}{(N_n(\theta_n)+2c_1)^{(n-N_n(\theta_n))}}\times\sum_{(a_1,\dots,a_n,b_1,\dots,b_n)\in A(N_n(\theta_n),n)}\ \prod_{i=1}^{n}\frac{t_i^{a_i}}{a_i!}\prod_{j=1}^{n}\frac{1}{b_j!}\Big(\frac{2c_1s_j}{j}\Big)^{b_j}. \tag{33}$$

Following the idea of Watterson [18], we use two artificial variables, $u\in(-1,1)$ and $v\in(-1,1)$. For any $(a_i,b_i)_{i\in[n]}\in A(k,n)$,
$$\prod_{i=1}^{n}(uv^i)^{a_i}\prod_{j=1}^{n}(v^j)^{b_j}=u^{\sum_{i=1}^{n}a_i}\,v^{\sum_{i=1}^{n}i(a_i+b_i)}=u^kv^n.$$
Now, let $c_{k,n}$ be the coefficient of $u^kv^n$ in $\exp\Big\{\sum_{i=1}^{n}uv^it_i+\sum_{j=1}^{\infty}\frac{2c_1}{j}s_jv^j\Big\}$. We can rewrite (33) as
$$E\Bigg[\prod_{i=1}^{n}t_i^{O_i}\prod_{j=1}^{n}s_j^{R_j}\,\Bigg|\,N_n(\theta_n)\Bigg]=\frac{(n-N_n(\theta_n))!\,N_n(\theta_n)!}{(N_n(\theta_n)+2c_1)^{(n-N_n(\theta_n))}}\,c_{N_n(\theta_n),n}. \tag{34}$$
From this relation, we obtain the probability generating function of the lineages that have not gone through the seed bank at time $\theta_n$.

Corollary 5.4.

Let $O_i$ be the number of "old" blocks of size $i$ (i.e. active blocks of size $i$ at time $\theta_n$). Then, the joint probability generating function of $O_1,O_2,\dots,O_n$ is
$$E\Bigg[\prod_{i=1}^{n}t_i^{O_i}\,\Bigg|\,N_n(\theta_n)\Bigg]=\sum_{(a_1,\dots,a_n)\in\bar A(N_n(\theta_n),n)}\frac{N_n(\theta_n)!}{a_1!\,a_2!\cdots a_n!}\,t_1^{a_1}t_2^{a_2}\cdots t_n^{a_n}\,\frac{\binom{2c_1+n-z-1}{n-z}}{\binom{2c_1+n-1}{n-N_n(\theta_n)}}, \tag{35}$$
where $z=\sum_{i=1}^{n}i\,a_i$.

Proof.

First, we write explicitly the term $c_{k,n}$ when $s_j=1$ for all $j$. Observe that
$$\exp\Bigg\{\sum_{i=1}^{n}uv^it_i+\sum_{j=1}^{\infty}\frac{2c_1}{j}v^j\Bigg\}=(1-v)^{-2c_1}\exp\Bigg\{u\sum_{i=1}^{n}v^it_i\Bigg\}=(1-v)^{-2c_1}\sum_{k=0}^{\infty}\frac{\big[u\sum_{i=1}^{n}v^it_i\big]^k}{k!}.$$
It follows that the coefficient of $u^k$ in the latter expression is
$$\frac{\big[\sum_{i=1}^{n}v^it_i\big]^k}{k!}(1-v)^{-2c_1}=\frac{\big[\sum_{i=1}^{n}v^it_i\big]^k}{k!}\Bigg(\sum_{j=0}^{\infty}\binom{2c_1+j-1}{j}v^j\Bigg).$$
Now, we need to find the coefficient of $v^n$ in the latter expression. First, observe that
$$\Bigg[\sum_{i=1}^{n}v^it_i\Bigg]^k=\sum_{a_1+\dots+a_n=k}\frac{k!}{a_1!\,a_2!\cdots a_n!}\,t_1^{a_1}t_2^{a_2}\cdots t_n^{a_n}\,v^z,$$
where $z=\sum_{i=1}^{n}i\,a_i$. Hence, for $z\le n$, the coefficient of $v^{n-z}$ in the expression $\sum_{j=0}^{\infty}\binom{2c_1+j-1}{j}v^j$ is $\binom{2c_1+n-z-1}{n-z}$. So,
$$c_{k,n}=\frac{1}{k!}\sum_{(a_1,\dots,a_n)\in\bar A(k,n)}\frac{k!}{a_1!\,a_2!\cdots a_n!}\,t_1^{a_1}t_2^{a_2}\cdots t_n^{a_n}\binom{2c_1+n-z-1}{n-z}.$$
Thus, substituting $c_{N_n(\theta_n),n}$ and $s_j=1$ for all $j$ in (34), we obtain the result.

From the previous corollary we obtain the joint distribution of the lineages which have not gone through the seed bank at time $\theta_n$:
$$P\big[O_1=a_1,\dots,O_n=a_n\,\big|\,N_n(\theta_n)\big]\overset{a.s.}{=}\frac{N_n(\theta_n)!}{a_1!\,a_2!\cdots a_n!}\,\frac{\binom{2c_1+n-z-1}{n-z}}{\binom{2c_1+n-1}{n-N_n(\theta_n)}}$$
when $(a_1,\dots,a_n)\in\bar A(N_n(\theta_n),n)$.

Now, by taking $t_i=t^i$ and $s_j=1$ for all $i,j\in[n]$ in (34), and finding the corresponding coefficient $c_{N_n(\theta_n),n}$, we obtain the conditional probability generating function of the number of lineages at time zero that have not been through the seed bank until time $\theta_n$:
$$E\Big[t^{\sum_{i=1}^{n}iO_i}\,\Big|\,N_n(\theta_n)\Big]=\sum_{z=N_n(\theta_n)}^{n}t^z\,\frac{\binom{2c_1+n-z-1}{n-z}\binom{z-1}{z-N_n(\theta_n)}}{\binom{2c_1+n-1}{n-N_n(\theta_n)}}.$$
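Setting $t=1$ in the last generating function shows that the weights $\binom{2c_1+n-z-1}{n-z}\binom{z-1}{z-N_n(\theta_n)}/\binom{2c_1+n-1}{n-N_n(\theta_n)}$, $z=N_n(\theta_n),\dots,n$, should sum to one. The Python sketch below, which is only an illustration and not part of the paper, checks this with exact rational arithmetic for fixed values of $n$, $N_n(\theta_n)$ and $c_1$. The helper names `rising_binomial` and `old_mass_pmf` are hypothetical; the generalized binomial coefficient $\binom{x+m-1}{m}$ for real $x$ is evaluated as $x^{(m)}/m!$.

```python
from fractions import Fraction
from math import comb


def rising_binomial(x, m):
    """Generalized binomial coefficient binom(x + m - 1, m) = x^{(m)} / m!."""
    out = Fraction(1)
    for j in range(m):
        out *= (Fraction(x) + j) / (j + 1)
    return out


def old_mass_pmf(n, n_old, c1):
    """Conditional law of Z = sum_i i * O_i given N_n(theta_n) = n_old >= 1.

    Returns a dict z -> P(Z = z | N_n(theta_n) = n_old) for z = n_old, ..., n.
    """
    c1 = Fraction(c1)
    # binom(2 c1 + n - 1, n - n_old), written as a rising factorial ratio
    denom = rising_binomial(2 * c1 + n_old, n - n_old)
    pmf = {}
    for z in range(n_old, n + 1):
        # binom(2 c1 + n - z - 1, n - z) * binom(z - 1, z - n_old)
        weight = rising_binomial(2 * c1, n - z) * comb(z - 1, z - n_old)
        pmf[z] = weight / denom
    return pmf


if __name__ == "__main__":
    pmf = old_mass_pmf(n=30, n_old=4, c1=Fraction(3, 2))
    print("total mass:", sum(pmf.values()))
    print("mean number of lineages in old blocks:", float(sum(z * p for z, p in pmf.items())))
```

The same function can be used to tabulate how much of the sample sits in old blocks for different values of $c_1$, under the assumption that $N_n(\theta_n)$ is treated as known.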
Finally, from (34), by taking $t_i=1$ for all $i\in[n]$, and from (35), we can find the conditional expectations of $O_j$ and $R_j$ for all $j=1,2,\dots,n-N_n(\theta_n)$,
$$E\big(O_j\,\big|\,N_n(\theta_n)\big)=N_n(\theta_n)\,\frac{\binom{2c_1+n-j-1}{n-j-N_n(\theta_n)+1}}{\binom{2c_1+n-1}{n-N_n(\theta_n)}}$$
and
$$E\big(R_j\,\big|\,N_n(\theta_n)\big)=\frac{2c_1}{j}\,\frac{\binom{2c_1+n-j-1}{n-j-N_n(\theta_n)}}{\binom{2c_1+n-1}{n-N_n(\theta_n)}}.$$

Acknowledgement.

This project was partially supported by CoNaCyT grant FC-2016-1946.

References

[1] Blath J., González-Casanova A., Kurt N. and Wilke-Berenguer M. A new coalescent for seed bank models. Ann. Appl. Probab., 26(2):857–891, 2016.
[2] Billingsley P. Convergence of Probability Measures. Second Edition. Wiley, New York, 1999.
[3] Delmas J.-F., Dhersin J.-S. and Siri-Jégousse A. Asymptotic results on the length of coalescent trees. Ann. Appl. Probab., 18(3):997–1025, 2008.
[4] Dhersin J.-S., Freund F., Siri-Jégousse A. and Yuan L. On the length of an external branch in the Beta-coalescent. Stochastic Process. Appl., 123(5):1691–1715, 2013.
[5] Diehl C.S. and Kersting G. Tree lengths for general Λ-coalescents and the asymptotic site frequency spectrum around the Bolthausen-Sznitman coalescent. Ann. Appl. Probab., 29(5):2700–2743, 2019.
[6] Drmota M., Iksanov A., Möhle M. and Rösler U. Asymptotic results concerning the total branch length of the Bolthausen-Sznitman coalescent. Stochastic Process. Appl., 117(10):1404–1421, 2007.
[7] Durrett R. Probability Models for DNA Sequence Evolution.
[8] Freund F. and Siri-Jégousse A. The minimal observable clade size of exchangeable coalescents. Accepted for publication in Braz. J. Probab. Stat., 2020.
[9] Freund F. and Siri-Jégousse A. The impact of genetic diversity statistics on model selection between coalescents. Accepted for publication in Comput. Stat. Data Anal., 2020.
[10] González-Casanova A., Aguirre-von-Wobeser E., Espín G., Servín-González L., Kurt N., Spanò D., Blath J. and Soberón-Chávez G. Strong seed bank effects in bacterial evolution. J. Theor. Biol., 356:62–70, 2014.
[11] Hobolth A., Siri-Jégousse A. and Bladt M. Phase-type distributions in population genetics. Theor. Pop. Biol., 127:16–32, 2019.
[12] Kaj I., Krone S. and Lascoux M. Coalescent theory for seed bank models. J. Appl. Probab., 38:285–300, 2001.
[13] Kersting G. The asymptotic distribution of the length of beta-coalescent trees. Ann. Appl. Probab., 22(5):2086–2107, 2012.
[14] Lambert A. and Ma C. The coalescent in peripatric metapopulations. J. Appl. Probab., 52(2):538–557, 2015.
[15] Lennon J.T. and Jones S.E. Microbial seed banks: the ecological and evolutionary implications of dormancy. Nat. Rev. Microbiol., 9(2):119, 2011.
[16] Maughan H. Rates of molecular evolution in bacteria are relatively constant despite spore dormancy. Evolution, 61:280–288, 2007.
[17] Tellier A., Laurent S.J.Y., Lainer H., Pavlidis P. and Stephan W. Inference of seed bank parameters in two wild tomato species using ecological and genetic data. Proc. Natl. Acad. Sci., 108:17052–17057, 2011.
[18] Watterson G.A. Lines of descent and the coalescent. Theor. Pop. Biol., 26(1):77–92, 1984.