Nonparametric estimation of the division rate of an age-dependent branching process
Marc Hoffmann and Adélaïde Olivier
Abstract.
We study the nonparametric estimation of the branching rate B(x) of a supercritical Bellman-Harris population: a particle with age x has a random lifetime governed by B(x); at its death time, it gives rise to k ≥ 2 children with lifetimes governed by the same division rate, and so on. The process is observed on the time interval [0, T]. Asymptotics are taken as T → ∞; the data are stochastically dependent and one has to face simultaneously censoring, bias selection and non-ancillarity of the number of observations. In this setting, under appropriate ergodicity properties, we construct a kernel-based estimator of B(x) that achieves the rate of convergence exp(−λ_B β/(2β+1) T), where λ_B is the Malthus parameter and β > 0 is the local smoothness of B(x) in a vicinity of x. We prove that this rate is optimal in a minimax sense and we relate it explicitly to classical nonparametric models such as density estimation observed on an appropriate (parameter dependent) scale. We also shed some light on the fact that estimation with kernel estimators based on data alive at time T only is not sufficient to obtain optimal rates of convergence, a phenomenon which is specific to nonparametric estimation and that has been observed in other related growth-fragmentation models.

Mathematics Subject Classification (2010): 35A05, 35B40, 45C05, 45K05, 82D60, 92D25, 62G05, 62G20.
Keywords: Growth-fragmentation, cell division, nonparametric estimation, bias selection, minimax rates of convergence, Bellman-Harris processes.
1. Introduction
1.1. Motivation.
Structured models have been paid particular attention over the last few years, both from a probabilistic and an applied analysis angle, in particular with a view toward a better understanding of population evolution in mathematical biology (see for instance the textbook by Perthame [21] and the references therein). In this context, a more specific focus and need for statistical methods has emerged recently (e.g.
Doumic et al. [9, 8, 7] and the references therein) and this is the topic of the present paper. If x denotes a so-called structuring variable – for instance age, size, any measure of variability or DNA content of a cell or bacteria – and if n(t, x) denotes the number or density of cells at time t of a population starting from a single ancestor at time t = 0, a sound mathematical model can be obtained by specifying an evolution equation for n(t, x). Consider for instance the paradigmatic problem of age-dependent cell division, where the evolution of n(t, x) is given by the simplest transport-fragmentation equation

(1)  ∂/∂t n(t, x) + ∂/∂x n(t, x) + B(x) n(t, x) = 0,  n(t, 0) = m ∫_0^∞ B(y) n(t, y) dy, t > 0,  n(0, x) = δ_0,

where δ_0 denotes the Dirac mass at point 0. In this model, each cell dies according to a division rate x ↦ B(x) that depends on its age x only (a living cell of age x has probability B(x)dx of dying in the interval [x, x + dx]) and, at its time of death, it gives rise to m ≥ 2 children on average. The parameters (m, B) specify the so-called age-dependent model.

In this seemingly simple context, we wish to draw statistical inference on the division rate function x ↦ B(x) and on m in the most rigorous way, when we observe the evolution of the population through time and when the shape of the function B can be arbitrary, to within a prescribed smoothness class, i.e. in a nonparametric setting. In order to do so, we transfer the deterministic description (1) into a probabilistic model that consists of a system of (non-interacting) particles specified by a probability distribution p on the integers (the offspring distribution) and a probability density f on [0, ∞). A particle has a random lifetime drawn according to f(x)dx; at the time of its death, it gives rise to k children with probability p_k (with p_0 = p_1 = 0), each child having independent lifetimes distributed as f(x)dx, and so on. The resulting process is a classical supercritical Bellman-Harris process, see for instance the textbooks of Harris [12] or Athreya and Ney [2]. It is described by a piecewise deterministic Markov process

(2)  X(t) = (X_1(t), X_2(t), . . .),  t ≥ 0,

with values in ∪_{k≥0} [0, ∞)^k, where the X_i(t)'s denote the (ordered) ages of the living particles at time t. The formal link between X(t) and n(t, x) is obtained via n(t, x) = E[Σ_{i=1}^∞ δ_{X_i(t)=x}], which has to be understood in a weak (measure) sense, i.e.
the empirical measure (in expectation) of the particle system solves Equation (1); we refer to [20]. The correspondence between (m, B) and (f, p) is given by

(3)  B(x) = f(x) / (1 − ∫_0^x f(s) ds),  x ∈ [0, ∞),  and  m = Σ_{k≥0} k p_k,

provided everything is well defined. Under fairly reasonable assumptions described below, it is one-to-one between B and f, but not between m and p. We are interested in the nonparametric estimation of x ↦ B(x), which is nothing but the hazard rate function of the lifetime density f of each particle, and also in the mean offspring m, the whole distribution p being considered as a nuisance parameter.

1.2. Objectives and results.
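For illustration, the correspondence (3) between a lifetime density f and the division rate B is the classical hazard-rate transform. A minimal numerical sketch (our own helper names, using a plain midpoint rule) is:

```python
import numpy as np

def hazard_from_density(f, x, dx=1e-4):
    """B(x) = f(x) / (1 - int_0^x f(s) ds), cf. (3)."""
    s = np.arange(dx / 2, x, dx)            # midpoint grid on [0, x]
    cdf = float(np.sum(f(s)) * dx)
    return f(x) / (1.0 - cdf)

def density_from_hazard(B, x, dx=1e-4):
    """Inverse transform: f(x) = B(x) exp(- int_0^x B(y) dy)."""
    s = np.arange(dx / 2, x, dx)
    cum = float(np.sum(B(s)) * dx)
    return B(x) * np.exp(-cum)

# Sanity check: an exponential lifetime f(u) = b e^{-b u} has constant hazard b.
b = 1.5
f_exp = lambda u: b * np.exp(-b * u)
assert abs(hazard_from_density(f_exp, 2.0) - b) < 1e-3
assert abs(float(density_from_hazard(lambda u: b * np.ones_like(u), 2.0)) - f_exp(2.0)) < 1e-6
```

In the sequel the paper works with B directly; the round trip above only makes the bijection between B and f in (3) explicit.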
Observation schemes.
We assume we observe the whole trajectory (X(t), t ∈ [0, T]), where T > 0, with asymptotics taken as T → ∞. If we denote by T_T the population of individuals that are born before T and observed up to time T, and if (ζ_u^T, u ∈ T_T) denotes the values of the ages of the different individuals of T_T (at their time of death or at time T), we wish to draw inference on B(x) based on

{X(t), t ∈ [0, T]} = {ζ_u^T, u ∈ T_T}.

Although the lifetimes of the individuals are independent (and identically distributed) with common density f, this is no longer the case for the population (ζ_u^T, u ∈ T_T) considered as a whole: the tree structure plays a crucial role and we have to face several non-trivial difficulties:

1) Bias selection: particles with small lifetimes are more often observed than particles with large lifetimes, since the observation of the process is stopped along all the branches at the fixed time T, as illustrated in Figure 1.

2) Censoring: if ∂T_T ⊂ T_T denotes the population of individuals alive at time T (in red in Figure 1), they are censored in our observation scheme (we observe their lifetime only up to time T) but contribute to the whole estimation process at the same level as the population
Figure 1.
The effect of bias selection. Simulation of a binary (p_2 = 1, so m = 2) age-dependent tree with B given in Section 4, up to time T = 8 (|T_T| = 145). Left: the size of each segment represents the lifetime of an individual. Individuals alive at time T are represented in red. Right: genealogical representation of the same realisation of the tree.

˚T_T ⊂ T_T of individuals born and dead before T: due to the supercriticality of the process (m > 1) we have |T_T| ≈ |˚T_T| ≈ |∂T_T| as T grows to infinity, and this affects the statistical analysis, see Section 2.2 below.

3) Non-ancillarity: the number of observations |T_T| that governs the amount of statistical information is random and its distribution depends on B: we essentially have fewer observations if B is small (particles split at a slow rate) than if B is large (particles split at a fast rate). This means that |T_T| is not ancillary in the terminology of Fisher: it is not possible to ignore its randomness (by conditioning upon its value for instance) without losing some statistical information. We refer to the Encyclopedia of Statistics [17] for more details.

Main results.
We first study in Section 2 the behaviour of empirical measures of the form

E_T(V, g) = |V|^{−1} Σ_{u∈V} g(ζ_u^T),  with V = ˚T_T or ∂T_T,

for suitable test functions g. From the classical study of supercritical branching processes, it is known that |˚T_T| ≈ |∂T_T| ≈ e^{λ_B T}, where λ_B > 0 is the Malthus parameter of the population. We show that E_T(˚T_T, g) and E_T(∂T_T, g) converge to their respective limits with rate exp(−λ_B T/2), with some uniformity in B and g, as shown in Theorems 3 and 4 below. For the proof, we heavily rely on the recent studies of Cloez [5] and Bansaye et al. [3], two key references for this paper, adjusting the tools developed in [3] to the non-Markovian case: the essential ingredient is the use of many-to-one formulae that reduce the problem to studying the evolution of a particle picked at random along the genealogical tree (Propositions 10 and 11). The rate of convergence to equilibrium of this tagged particle, which governs the rates of convergence for statistical estimators, is obtained by a simple coupling argument (Proposition 12).

These preliminary results enable us to address the main issue of the paper: we construct in Section 3 a nonparametric estimator B̂_T(x) of B(x) that achieves the rate of convergence exp(−λ_B β/(2β+1) T) for pointwise error, uniformly over functions B with local smoothness of order β > 0, thanks in part to statistical tools developed in Löcherbach [18]. This result is obtained under the restriction that convergence to equilibrium of a tagged particle is faster than the growth of the tree. Otherwise, we still have a rate of convergence, but we do not have (nor believe in) its optimality. We bypass the aforementioned bias selection difficulty 1) by weighting a kernel estimator by a de-biasing factor that depends on preliminary estimators of λ_B and m. These estimators (essentially) converge with rate exp(−λ_B T/2) as shown in Proposition 5. As for the censoring part 2), we base our nonparametric kernel estimator on E_T(˚T_T, g) and not on E_T(∂T_T, g), since that latter quantity would lead to a suboptimal rate of convergence as discussed in Section 3.3. Finally, the non-ancillarity issue 3) is solved by specifying a random bandwidth for the kernel that also depends on the preliminary estimation of λ_B. This last point requires extra efforts in order to show a form of stability that is detailed in Proposition 17.

The statistical study of branching processes goes back to Athreya and Keiding [1] for deriving maximum likelihood theory in the case of a parametric (constant) division rate, relying on the fact that the number of living cells is then a Markov process, a property we lose here for a non-constant division rate x ↦ B(x). The textbook of Guttorp [11] gives an account of existing parametric methods in the 1990's. In the early 2000's the regularity in the sense of the LAN and LAMN property was established in the comprehensive study of Löcherbach [18, 19]; see also Hyrien [15] for statistical computational methods, Johnson et al. [16] for Bayesian analysis, and Delmas and Marsalle [6] in discrete time. In nonparametric estimation, only few results exist; we mention the case when the dynamics between jumps are driven by a diffusion in Höpfner et al. [14]. To the best of our knowledge, our study provides the first fully nonparametric approach in continuous time for supercritical branching processes which are piecewise deterministic. Admittedly, the Bellman-Harris model is a toy model for the study of population dynamics, but we believe that the present contribution sheds some light on the intrinsic difficulties that need to be solved in more elaborate models like the cell division equation, for which only simplified statistical models have been considered so far (in discrete time or under additional deterministic or stochastic noise like in e.g. [9, 8, 7]).
Concerning bias selection, density estimation when observing a biased sample has been studied at length in a nonparametric framework by Efromovich [10].

1.3. Organisation of the paper.
In Section 2, we define our rigorous statistical framework by means of continuous time rooted trees (Section 2.1) and study the convergence properties of the biased empirical measures E_T(˚T_T, g) and E_T(∂T_T, g) in Section 2.3. We start by deriving heuristically the respective limits of the empirical measures in Section 2.2 (that can also be found in Cloez [5] and Bansaye et al. [3]) in order to shed some light on the specific methods of proof in the subsequent study of rates of convergence. We construct in Section 3 the estimators of m, λ_B and B(x) and state our statistical results, together with a discussion of the extensions and limitations of our findings. Section 4 tackles the problem of numerical implementation on simulated data, advocating for a reasonable use of our estimators in practice. Section 5 is devoted to the proofs. An appendix (Section 6) contains auxiliary useful results.

2. Rate of convergence for biased empirical measures
2.1. Continuous time rooted trees.
It will prove more convenient to work with a representation of (X(t))_{t≥0} in terms of a continuous time rooted tree. We need some notation and closely follow Bansaye et al. [3]. Let

U = ∪_{k≥0} (N*)^k,

with N* = {1, 2, . . .} and (N*)^0 = {∅}, denote the infinite genealogical tree. We use throughout the following standard notation: for u = (u_1, u_2, . . . , u_m) and v = (v_1, . . . , v_n) in U, we write uv = (u_1, . . . , u_m, v_1, . . . , v_n) for the concatenation, we identify ∅u, u∅ and u, we write u ⪯ v if there exists w such that uw = v, and u ≺ v if moreover w ≠ ∅. For u = (u_1, u_2, . . . , u_m), we also write |u| = m.

Given a family (ν_u, u ∈ U) of integers representing the number of children of the individuals u ∈ U, we construct an ordered rooted tree T ⊂ U as follows:
i) ∅ ∈ T,
ii) If v ∈ T, u ⪯ v implies u ∈ T,
iii) For every u ∈ T, we have uj ∈ T if and only if 1 ≤ j ≤ ν_u.

For a family (ζ_u, u ∈ U) of nonnegative numbers representing the lifetimes of the individuals u ∈ U, we set

(4)  b_u = Σ_{v≺u} ζ_v  and  d_u = b_u + ζ_u

for the times of birth and death of the individual u ∈ U. Let 𝕌 = U × [0, ∞). A continuous time rooted tree is then a subset 𝕋 of 𝕌 such that
(i) (∅, 0) ∈ 𝕋,
(ii) The projection T of 𝕋 on U is an ordered rooted tree,
(iii) There exists a family (ζ_u, u ∈ U) of nonnegative numbers such that (u, s) ∈ 𝕋 if and only if b_u ≤ s < d_u, where (b_u, d_u) are defined by (4).

We now work on some probability space (Ω, F, P). In this setting, we have the following

Definition 1 (The Bellman-Harris model). A random continuous time rooted tree is a Bellman-Harris model with offspring distribution p = (p_k)_{k≥0} and division rate B : [0, ∞) → [0, ∞) if
(i) The family of the numbers of children (ν_u, u ∈ U) are independent random variables with common distribution p.
(ii) The family of lifetimes (ζ_u, u ∈ U) are independent random variables such that

(5)  P(ζ_u ≥ x) = exp(−∫_0^x B(y) dy),  x ≥ 0,  with  ∫_0^∞ B(x) dx = ∞,

(iii) The families of random variables (ν_u, u ∈ U) and (ζ_u, u ∈ U) are independent.

Going back to the process (X(t))_{t≥0} defined in (2), we have an identity between point measures on (0, ∞) that reads

Σ_{i≥1} 1_{{X_i(t) > 0}} δ_{X_i(t)} = Σ_{u∈T} 1_{{t ∈ [b_u, d_u)}} δ_{t−b_u}.

The following assumption will be in force in the paper:
Assumption 2.
The offspring distribution p = (p_k)_{k≥0} satisfies

p_0 = p_1 = 0,  2 ≤ m = Σ_{k≥2} k p_k < ∞,  Σ_{k≥2} k² p_k < ∞  and  m̄ = Σ_{i≠j} Σ_{k≥i∨j} p_k < ∞.

The technical condition m̄ < ∞ is needed for the so-called many-to-one formulae, see Proposition 11 below.
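The model of Definition 1 under Assumption 2 is straightforward to simulate; the following sketch (our own code, not from the paper) samples each lifetime by accumulating the hazard in (5) until it crosses an independent Exp(1) level, and splits the observed individuals into the interior population ˚T_T and the censored boundary population ∂T_T:

```python
import math, random

def sample_lifetime(B, dx=1e-3):
    """P(zeta >= x) = exp(-int_0^x B(y) dy), cf. (5): accumulate the
    cumulative hazard until it exceeds an independent Exp(1) level."""
    level, x, acc = -math.log(random.random()), 0.0, 0.0
    while acc < level:
        acc += B(x + dx / 2) * dx          # midpoint rule on [x, x + dx]
        x += dx
    return x

def simulate_tree(B, p, T):
    """Grow a Bellman-Harris tree up to time T.  Returns the interior
    individuals (born and dead before T: birth time, lifetime, offspring
    number) and the boundary individuals (alive at T: birth time, age T - b_u)."""
    ks, probs = zip(*sorted(p.items()))
    interior, boundary, births = [], [], [0.0]
    while births:
        b = births.pop()
        zeta = sample_lifetime(B)
        if b + zeta <= T:                  # u belongs to the interior population
            k = random.choices(ks, weights=probs)[0]
            interior.append((b, zeta, k))
            births.extend([b + zeta] * k)
        else:                              # u alive at T: censored at age T - b
            boundary.append((b, T - b))
    return interior, boundary

random.seed(1)
interior, boundary = simulate_tree(lambda x: 1.0 + x, p={2: 1.0}, T=3.0)
assert boundary and all(0.0 <= age <= 3.0 for _, age in boundary)
assert all(b + z <= 3.0 for b, z, _ in interior)
```

The binary offspring law p = {2: 1.0} and the rate B(x) = 1 + x are illustrative choices only; any (B, p) complying with Definition 1 and Assumption 2 can be plugged in.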
2.2. The limiting objects.
In order to extract information about x ↦ B(x), we consider the empirical distribution over the lifetimes indexed by some V_T ⊂ T_T for a test function g, that is

E_T(V_T, g) = |V_T|^{−1} Σ_{u∈V_T} g(ζ_u^T),

and expect a law of large numbers as T → ∞. Without much of a surprise, it turns out that, depending on whether ζ_u^T = ζ_u or not, i.e. whether the data are still alive at time T and therefore censored or not, we have a different limit. More precisely, define

˚T_T = {u ∈ T, b_u < T and d_u ≤ T}  and  ∂T_T = {u ∈ T, b_u ≤ T < d_u},

i.e. the set of particles that are born and die before T, and the set of particles alive at time T, so that T_T = ˚T_T ∪ ∂T_T.

We need some notation. Introduce the Malthus parameter λ_B > 0, defined as the solution to

(6)  ∫_0^∞ B(x) e^{−λ_B x − ∫_0^x B(y) dy} dx = 1/m.

To a division rate function x ↦ B(x) satisfying the properties of Definition 1, we associate its lifetime density

f_B(x) = B(x) exp(−∫_0^x B(y) dy),  x ≥ 0,

and the biased lifetime density

f_{H_B}(x) = m e^{−λ_B x} f_B(x),  x ≥ 0,

which in turn uniquely defines a biased division rate

(7)  H_B(x) = m e^{−λ_B x} f_B(x) / (1 − m ∫_0^x e^{−λ_B y} f_B(y) dy).

Finally, we define the limiting measures

(8)  ∂E_B(g) = (λ_B m/(m − 1)) ∫_0^∞ g(x) e^{−λ_B x} e^{−∫_0^x B(y) dy} dx

and

(9)  ˚E_B(g) = m ∫_0^∞ g(x) e^{−λ_B x} f_B(x) dx = ∫_0^∞ g(x) f_{H_B}(x) dx.

It is known that E_T(∂T_T, g) → ∂E_B(g) and E_T(˚T_T, g) → ˚E_B(g) in probability as T → ∞, see Appendix 6.1 for heuristics and references. We establish in Theorems 3 and 4 in the next Section 2.3 a rate of convergence with some uniformity in B. The rate is linked to λ_B and the geometric ergodicity of an auxiliary one-dimensional Markov process with infinitesimal generator

(10)  A_{H_B} g(x) = g′(x) + H_B(x)(g(0) − g(x)),

densely defined on continuous functions vanishing at infinity, and that represents the value of a branch along the tree picked uniformly at random at each branching event.

2.3. Convergence results for biased empirical measures.
Notation.
For constants b, C > 0, introduce the sets

L_C = {g : [0, ∞) → R, sup_x |g(x)| ≤ C}

and

B_{b,C} = {B : [0, ∞) → [0, ∞), ∀x ≥ 0, b ≤ B(x) ≤ b max{C, 1}}.

For a family Γ_T = (Γ_T(γ))_{T≥0} of real-valued random variables, with distribution depending on some parameter γ ∈ G, we say that Γ_T is G-tight for the parameter γ if

sup_{T>0, γ∈G} P(|Γ_T(γ)| ≥ K) → 0 as K → ∞.

Results.
We have a trade-off between the growth rate λ_B of the tree, E[|T_T|] ≈ e^{λ_B T}, and the convergence to equilibrium of the Markov process with infinitesimal generator A_{H_B} defined in (10) above. More precisely, we show in Proposition 12 below the estimate

|P^t_{H_B} g(x) − ∫_0^∞ g(y) µ_B(y) dy| ≤ 2 sup_y |g(y)| e^{−ρ_B t}  for every x ∈ (0, ∞).

Here, (P^t_{H_B})_{t≥0} denotes the semigroup associated to A_{H_B}, µ_B its unique invariant probability, and

ρ_B = inf_x H_B(x),

where H_B(x) is the biased division rate defined in (7) above. The rates of convergence of the biased empirical measures E_T(˚T_T, g) and E_T(∂T_T, g) to their limits ˚E_B(g) and ∂E_B(g), respectively defined by (9) and (8), are governed by λ_B and ρ_B: define

(11)  v_T(B) = e^{−min{ρ_B, λ_B/2} T}  if λ_B ≠ 2ρ_B,  and  v_T(B) = T^{1/2} e^{−λ_B T/2}  if λ_B = 2ρ_B.

We have:
Theorem 3 (Rate of convergence for particles living at time T). Work under Assumption 2. For every b, C, C′ > 0,

v_T(B)^{−1} (E_T(∂T_T, g) − ∂E_B(g))

is B_{b,C} × L_{C′}-tight for the parameter (B, g).

Theorem 4 (Rate of convergence for particles dying before T). In the same setting as Theorem 3,

v_T(B)^{−1} (E_T(˚T_T, g) − ˚E_B(g))

is B_{b,C} × L_{C′}-tight for the parameter (B, g).

Several comments are in order:
About the rate of convergence and the class B_{b,C}: the restriction B ∈ B_{b,C} enables us to obtain uniform convergence results. This is important for the subsequent statistical analysis. However, this can be relaxed if only L_{C′}-tightness is sought, provided B complies with the conditions of Definition 1 and Assumption 2 and ρ_B > 0. In the same direction, the rate v_T(B) can be improved replacing ρ_B = inf_x H_B(x) in (11) by

(12)  ρ*_B = sup{ρ : ∀x, t > 0, |P^t_{H_B} g(x) − ∫_0^∞ g(y) µ_B(y) dy| ≤ 2 sup_y |g(y)| e^{−ρt}},

and we have in particular ρ*_B ≥ ρ_B.
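To make the interplay between λ_B and ρ_B concrete, the following sketch (our own helper names) solves the Malthus equation (6) by bisection on a truncated grid and evaluates ρ_B = inf_x H_B(x) through (7). For a constant rate B ≡ b with mean offspring m, one checks λ_B = (m−1)b and H_B ≡ mb, so that ρ_B > λ_B:

```python
import numpy as np

def _cumint(y, dx):
    """Cumulative trapezoidal integral of the sampled function y, from 0."""
    return np.concatenate(([0.0], np.cumsum((y[:-1] + y[1:]) * dx / 2)))

def malthus(B, m, xmax=60.0, n=60001):
    """Bisection for lambda_B in (6): int_0^inf B(x) e^{-lam x - int_0^x B} dx = 1/m."""
    x = np.linspace(0.0, xmax, n)
    dx, Bx = x[1] - x[0], B(x)
    cumB = _cumint(Bx, dx)
    def I(lam):
        g = Bx * np.exp(-lam * x - cumB)
        return np.sum((g[:-1] + g[1:]) * dx / 2)
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if I(mid) > 1.0 / m else (lo, mid)   # I decreases in lam
    return (lo + hi) / 2

def rho(B, m, lam, xmax=10.0, n=10001):
    """rho_B = inf_x H_B(x) with H_B given by (7), restricted to the grid
    points where the denominator is not numerically negligible."""
    x = np.linspace(0.0, xmax, n)
    dx, Bx = x[1] - x[0], B(x)
    fH = m * Bx * np.exp(-lam * x - _cumint(Bx, dx))   # biased density f_{H_B}
    surv = 1.0 - _cumint(fH, dx)
    mask = surv > 1e-3
    return float(np.min(fH[mask] / surv[mask]))

m = 2.0
lam = malthus(lambda x: 1.5 * np.ones_like(x), m)      # constant B = 1.5
assert abs(lam - (m - 1.0) * 1.5) < 1e-3               # lambda_B = (m-1) b
assert abs(rho(lambda x: 1.5 * np.ones_like(x), m, lam) - m * 1.5) < 0.05
```

This confirms numerically that constant division rates fall in the fast-ergodicity regime ρ_B > λ_B discussed below.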
About the tightness: what we need in order to handle the random normalisation in E_T(˚T_T, g) is actually the convergence of e^{λ_B T} |˚T_T|^{−1}. This convergence still holds in probability but not necessarily in L²(P), so we only have tightness in Theorem 3 (and in Theorem 4 for the same reason). However, if we replace E_T(˚T_T, g) by

(1/E[|˚T_T|]) Σ_{u∈˚T_T} g(ζ_u^T),

then we have a bound in L²(P) together with a control on g, see Proposition 15 below. Such a finer control is mandatory for the subsequent statistical analysis, since we need to pick a function g that depends on T and that mimics the behaviour of the Dirac mass δ_x, see Section 3 below.

3. Statistical estimation
3.1. Construction of an estimation procedure.
Estimation of m and λ_B. To a particle sitting at node u ∈ ˚T_T, we associate its number of children ν_u (see Definition 1). Note that the knowledge of T_T enables us to reconstruct ν_u for every u ∈ ˚T_T. This enables us to define an estimator for m by setting

(13)  m̂_T = |˚T_T|^{−1} Σ_{u∈˚T_T} ν_u

on the set |˚T_T| ≠ 0 and 2 otherwise. In order to estimate λ_B, we first observe that for Id(x) = x, we can write

˚E_B(Id) = m ∫_0^∞ x (B(x) + λ_B) e^{−∫_0^x (B(y)+λ_B) dy} dx − m λ_B ∫_0^∞ x e^{−λ_B x} e^{−∫_0^x B(y) dy} dx
= m ∫_0^∞ e^{−∫_0^x (B(y)+λ_B) dy} dx − (m − 1) ∂E_B(Id)
= (m − 1)/λ_B − (m − 1) ∂E_B(Id),

the second equality being obtained by integration by parts and the last one by using the definition (6) of λ_B. So we obtain the representation

λ_B = ((m − 1)^{−1} ˚E_B(Id) + ∂E_B(Id))^{−1},

and this yields the estimator

(14)  λ̂_T = ((m̂_T − 1)^{−1} |˚T_T|^{−1} Σ_{u∈˚T_T} ζ_u + |∂T_T|^{−1} Σ_{u∈∂T_T} ζ_u^T)^{−1}.

The following convergence result for λ̂_T is then a consequence of Theorems 3 and 4.

Proposition 5.
In the same setting as Theorem 3, with v_T(B) given in (11) above, we have that

e^{λ_B T/2} (m̂_T − m)  and  T^{−1} v_T(B)^{−1} (λ̂_T − λ_B)

are B_{b,C}-tight for the parameter B.
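In code, with the interior offspring numbers (ν_u, u ∈ ˚T_T), the interior lifetimes (ζ_u, u ∈ ˚T_T) and the censored ages (ζ_u^T, u ∈ ∂T_T) as plain lists, the moment estimators (13) and (14) read as follows (a minimal sketch under our notational assumptions; for a constant rate B ≡ b with m = 2, both limits ˚E_B(Id) and ∂E_B(Id) equal 1/(2b), so the plug-in value returns λ_B = b exactly):

```python
def estimate_m(offspring):
    """(13): empirical mean offspring number over the interior population."""
    return sum(offspring) / len(offspring) if offspring else 2.0

def estimate_lambda(lifetimes, censored_ages, m_hat):
    """(14): hat-lambda_T = ((m_hat - 1)^{-1} * mean interior lifetime
                             + mean censored age)^{-1}."""
    mean_in = sum(lifetimes) / len(lifetimes)
    mean_out = sum(censored_ages) / len(censored_ages)
    return 1.0 / (mean_in / (m_hat - 1.0) + mean_out)

# Toy consistency check in the constant case B = b, m = 2 described above.
b = 2.0
assert estimate_m([2, 2, 2]) == 2.0
assert abs(estimate_lambda([1 / (2 * b)], [1 / (2 * b)], 2.0) - b) < 1e-12
```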
Reconstruction formula for B(x). An estimator B̂_T : [0, ∞) → R of B is a random function

B̂_T(x) = B̂_T(x, (X(t))_{t∈[0,T]}),  x ∈ [0, ∞),

that is measurable as a function of (X(t))_{t∈[0,T]} but also as a function of x. By (3), we have

B(x) = f_B(x) / (1 − ∫_0^x f_B(y) dy),

and from the definition ˚E_B(g) = m ∫_0^∞ g(x) e^{−λ_B x} f_B(x) dx we obtain the formal reconstruction formula

(15)  B(x) = ˚E_B(m^{−1} e^{λ_B ·} δ_x(·)) / (1 − ˚E_B(m^{−1} e^{λ_B ·} 1_{{· ≤ x}})),

where δ_x(·) denotes the Dirac function at x. Therefore, substituting m and λ_B by the estimators defined in (13) and (14) and taking g as a weak approximation of δ_x, we obtain a strategy for estimating B(x), replacing furthermore ˚E_B(·) by its empirical version E_T(˚T_T, ·).

Construction of a kernel estimator and function spaces.
Let K : R → R be a kernel function. For h > 0, set K_h(x) = h^{−1} K(h^{−1} x). In view of (15), we define the estimator

B̂_T(x) = E_T(˚T_T, m̂_T^{−1} e^{λ̂_T ·} K_h(x − ·)) / (1 − E_T(˚T_T, m̂_T^{−1} e^{λ̂_T ·} 1_{{· ≤ x}}))

on the set E_T(˚T_T, m̂_T^{−1} e^{λ̂_T ·} 1_{{· ≤ x}}) ≠ 1 and 0 otherwise. Thus B̂_T(x) is specified by the choice of the kernel K and the bandwidth h > 0. Note that the observations (ζ_u^T, u ∈ ∂T_T) only occur in the estimator λ̂_T of λ_B.

We need the following property on K:

Assumption 6.
The kernel K : R → R is differentiable with compact support and, for some integer n ≥ 1, we have ∫_{−∞}^{∞} x^k K(x) dx = 1_{{k=0}} for k = 0, 1, . . . , n.

Assumption 6 will enable us to have nice approximation results over smooth functions B, described in the following way: for a compact interval D ⊂ (0, ∞) and β > 0, with β = ⌊β⌋ + {β}, 0 < {β} ≤ 1 and ⌊β⌋ an integer, let H^β_D denote the Hölder space of functions g : D → R possessing a derivative of order ⌊β⌋ that satisfies

(16)  |g^{(⌊β⌋)}(y) − g^{(⌊β⌋)}(x)| ≤ c(g) |x − y|^{{β}}.

The minimal constant c(g) such that (16) holds defines a semi-norm |g|_{H^β_D}. We equip the space H^β_D with the norm

‖g‖_{H^β_D} = sup_x |g(x)| + |g|_{H^β_D}

and the balls

H^β_D(L) = {g : D → R, ‖g‖_{H^β_D} ≤ L},  L > 0.

Convergence results for B̂_T(x). We are ready to give our main result, namely a rate of convergence of B̂_T(x) for x restricted to a compact interval D, uniformly over Hölder balls H^β_D(L) of (known) smoothness β intersected with B_{b,C}. Define

(17)  w_T(B) = T^{1_{{λ_B = 2ρ_B}}} exp(−min{λ_B, ρ_B} (β − (λ_B/ρ_B − 1)_+/2)/(2β + 1) T),

and note that when ρ_B ≥ λ_B, we have w_T(B) = e^{−λ_B β/(2β+1) T} ≈ E[|T_T|]^{−β/(2β+1)}.

Theorem 7 (Upper rate of convergence). Specify B̂_T with a kernel satisfying Assumption 6 for some n > 0 and

(18)  h = ĥ_T = exp(−(λ̂_T/(2β + 1)) T)

for some β ∈ [1/2, n). For every b, C > 0, L > 0, every compact interval D in (0, ∞) (with non-empty interior) and every x ∈ D,

w_T(B)^{−1} (B̂_T(x) − B(x))

is B_{b,C} ∩ H^β_D(L)-tight for the parameter B.

We have a partial optimality result in a minimax sense. Define

B^+_{b,C} = {B ∈ B_{b,C}, λ_B ≤ ρ_B}  and  B^−_{b,C} = {B ∈ B_{b,C}, ρ_B ≤ λ_B},

so that B_{b,C} = B^+_{b,C} ∪ B^−_{b,C}. We then have the following
Theorem 8 (Lower rate of convergence over B^+_{b,C}). Let D be a compact interval in (0, ∞). For every x ∈ D and every positive b, C, β, L, there exists C′ > 0 such that

lim inf_{T→∞} inf_{B̂_T} sup_B P(e^{λ_B β/(2β+1) T} |B̂_T(x) − B(x)| ≥ C′) > 0,

where the supremum is taken among all B ∈ B^+_{b,C} ∩ H^β_D(L) and the infimum is taken among all estimators.

We observe a conflict between the growth rate λ_B of the tree and its rate of convergence to equilibrium ρ_B. On B^+_{b,C} we retrieve the expected usual optimal rate of convergence exp(−λ_B β/(2β+1) T) ≈ E[|T_T|]^{−β/(2β+1)}, whereas if ρ_B ≤ λ_B, we obtain the deteriorated rate

exp(−min{λ_B, ρ_B} (β − (λ_B/ρ_B − 1)/2)/(2β + 1) T),

and this rate is presumably not optimal, as discussed at length in Section 3.3 below.

3.3. Discussion of the results.
Rates of convergence.
The “parametric case” of a constant division rate B(x) = b with b > 0 is well understood: the process t ↦ |∂T_t|, i.e. the number of cells alive at time t, is then Markov. In that setting, explicit (asymptotic) information bounds are available (Athreya and Keiding [1]). In particular, the model is regular with asymptotic Fisher information of order e^{λ_B T}, thus the best-achievable (normalised) rate of convergence is e^{−λ_B T/2}. This is consistent with the minimax rate exp(−λ_B β/(2β+1) T) that we obtain for the class H^β_D(L) ∩ B^+_{b,C}, and we retrieve the parametric rate by formally setting β = ∞ in the previous formula.

However, this rate is strongly parameter dependent, in the sense that it also depends on B via λ_B. This dependence is severe, since λ_B appears at the same level as the smoothness exponent β/(2β+1) in the rate exponent λ_B β/(2β+1). For instance, in the simplest case of a constant function B(x) = b for every x ≥ 0, we have λ_B = (m − 1)b, and we see that B (b here) plays at the same level as β/(2β+1). This also has a non-trivial technical cost in establishing rates of convergence for the estimator B̂_T(x): in order to minimise the bias-variance trade-off, the (log-)bandwidth has to be chosen as −(λ_B/(2β+1)) T (1 + o(1)) exactly, and this is achieved by the plug-in rule −(λ̂_T/(2β+1)) T thanks to Proposition 17. We then have to carefully check that our estimator is not too sensitive to this further approximation, and this requires the analysis of the smoothness of the process h ↦ B̂_{T,h}(x), where h is the bandwidth of B̂_T(x), as shown in Proposition 17.

Fast convergence to equilibrium in B^+_{b,C} versus slow convergence in B^−_{b,C}. While we have an optimal rate of convergence over B^+_{b,C}, the situation is unclear over B^−_{b,C}. First, the convergence rate to equilibrium ρ_B should be replaced by an estimator, and that would lead to extraneous difficulties. Even if we knew ρ_B, optimising the bias-variance trade-off in the proof of Theorem 7 would not lead to the expected rate exp(−min{λ_B, ρ_B} β/(2β+1) T) but to an intermediate rate that reads

(19)  exp(−min{λ_B, ρ_B} (min{max{ρ_B/λ_B, 1/2}, 1} β)/(2 min{max{ρ_B/λ_B, 1/2}, 1} β + 1) T),

and that continuously deteriorates as ρ_B separates from λ_B from below. Let us also mention that the classes B^+_{b,C} and B^−_{b,C} are never trivial. To that end, define

(20)  B_{b,m} = {B ∈ B_{b, m/(m−1)}, ∀x ≥ 0, B′(x) − B(x)² ≤ 0},

where m = Σ_{k≥0} k p_k is the mean number of children at each branching event.

Proposition 9.
For any b > 0, we have B_{b,m} ⊂ B^+_{b, m/(m−1)}. For every C > m(m+2)b/(m−1), β > 0 and any compact interval D ⊂ (0, ∞), there exists B ∈ H^β_D such that B ∈ B^−_{b,C} and B ∉ B^+_{b,C}.

In the proof of Proposition 9 below we show some versatility in the choice of functions B that yield either fast or slow convergence to equilibrium. Finally, one could (at least formally) replace ρ_B by ρ*_B, the optimal geometric rate of convergence to equilibrium defined in (12) above, but that would only improve on the rate of convergence (19) by replacing ρ_B with ρ*_B, which we do not know how to estimate, neither analytically nor statistically, and the obtained result would still presumably not be optimal. This suggests a totally different estimation strategy – that we do not have at the moment – whenever convergence to equilibrium is slow.

Other loss functions. If K ⊂ ˚D is a closed interval (˚D denotes the interior of D), then Theorem 7 also holds uniformly in x ∈ K. So we also have that

w_T(B)^{−2} ∫_K (B̂_T(x) − B(x))² dx

is B_{b,C} ∩ H^β_D(L)-tight for the parameter B. For integrated squared-error loss, we could weaken the smoothness constraint B ∈ H^β_D(L) to Sobolev smoothness (see e.g. [24]) when the smoothness is measured in L²-norm. An extension of Theorem 8 can be obtained likewise.

Smoothness adaptation.
Our estimator B̂_T(x) is not β-adaptive, in the sense that the choice of the B^+_{b,C}-optimal (log-)bandwidth −(λ̂_T/(2β+1)) T still depends on β, which is unknown in principle. In the numerical implementation Section 4 below, we address this issue from a practical point of view. However, a theoretical result is still needed. The classical analysis of adaptive (or other) kernel methods à la Lepski, for instance, shows that this boils down to proving concentration inequalities of the type

(21)  P(|E_T(˚T_T, g_h) − ˚E_B(g_h)| ≥ e^{−λ_B T/2} c(q, T)) ≤ e^{−q λ_B T},  q > 0,

where, for 0 < h^{−1} ≤ e^{λ_B T}, the test function g_h has the form g_h(y) = h^{−1/2} g(h^{−1}(x − y)) with x ∈ D and g ∈ L_C. The threshold c(q, T) should be of order q λ_B T and would inflate the risk by a slow term (of order T). By a suitable choice of q, it would then be possible to obtain adaptation for β in compact intervals. Concentration inequalities like (21) have been explored in [4] in discrete time. To the best of our knowledge, such inequalities are not yet available in continuous time and lie beyond the scope of the paper.

Information from ˚T_T versus ∂T_T.
In the regime B ∈ B⁺_{b,C}, having

∂E_B(g) = λ_B m/(m−1) ∫_0^∞ g(x) e^{−λ_B x} exp(−∫_0^x B(y) dy) dx,

and ignoring the fact that the constants m and λ_B are unknown (or rather, knowing that they can be estimated at the superoptimal rate e^{−λ_B T/2}), we can anticipate that, by picking a suitable test function g mimicking a delta function g(x) ≈ δ_x, the information about B(x) can only be inferred through exp(−∫_0^x B(y) dy), which imposes to further take a derivative, hence some ill-posedness. We can briefly make all these arguments more precise (still in the regime B ∈ B⁺_{b,C}): we assume that we have estimators m̂_T of m and λ̂_T of λ_B (using the ones defined in (13) and (14), or obtained by any other means) that converge with rate T e^{−λ_B T/2}, as in Proposition 5. Consider the quantity

f̂_{h,T}(x) = −E_T( ∂T_T, (m̂_T − 1)/(λ̂_T m̂_T) (K_h)′(x − ·) )

for a kernel satisfying Assumption 6. By Theorem 3 and integrating by parts, we readily see that

(22) f̂_{h,T}(x) → −∂E_B( (m − 1)/(λ_B m) (K_h)′(x − ·) ) = ∫_0^∞ K_h(x − y) f_{B+λ_B}(y) dy

in probability as T → ∞, where f_{B+λ_B} is the density associated to the division rate x ↦ B(x) + λ_B. On the one hand, it is not difficult to show that Proposition 15 (used in the proof of Theorem 7 below) remains valid when substituting T̊_T by ∂T_T, so we expect (although this is not formally established) the rate of convergence in (22) to be of order h^{−3/2} e^{−λ_B T/2}, since we take the derivative of the kernel K_h. On the other hand, the limit ∫_0^∞ K_h(x − y) f_{B+λ_B}(y) dy approximates f_{B+λ_B}(x) with an error of order h^β if B ∈ H^β_D. Balancing the two error terms in h, we see that we can estimate f_{B+λ_B}(x) with an error of (presumably optimal) order exp(−λ_B β/(2β+3) T).
Due to the fact that the denominator in representation (3) can be estimated with parametric error rate exp(−λ_B T/2) (possibly up to polynomially slow terms in T), we end up with the rate of estimation exp(−λ_B β/(2β+3) T) for B(x) as well, and this can be related to an ill-posed problem of order 1 (see for instance [24]). This phenomenon, namely the structure of an ill-posed problem of order 1 in restriction to data alive at time T, has already been observed in other settings: for the estimation of a size-division rate from living cells at a given large time in Doumic et al. [9, 8], or for the estimation of the dislocation measure of a homogeneous fragmentation in Hoffmann and Krell [13]. Note also that this phenomenon does not appear in parametric estimation, since the numbers of data in T̊_T and ∂T_T are of the same order of magnitude (or, put differently, the rates in Theorems 3 and 4 are the same and govern the rate of estimation of a one-dimensional parameter).

4. Numerical implementation
We assume that each cell u ∈ U has exactly two children at each division (p₂ = 1). This can model the evolution of a population of cells reproducing by binary division, as described deterministically by (1). We pick a trial division rate B defined analytically by

B(x) = x − x + x + …  if 0 ≤ x ≤ …,   … − exp(−(x − …))  if x > …,

and represented in Figure 2 (bold red line). We have b ≤ B(x) ≤ (m/(m−1)) b… for any x ≥ 0, with b = 0.… and m = 2, and the lifetime density f_B is non-increasing (except in a vicinity of zero). Given T > 0, we simulate ζ_∅ with probability density f_B and set d_∅ = ζ_∅. For u ∈ U such that d_u > T, we do not simulate the lifetimes of its descendants, since they are not in the observation scheme T̊_T ∪ ∂T_T. For u ∈ U such that d_u ≤ T, we simulate ζ_{u0} and ζ_{u1} independently with probability density f_B; we set d_{u0} := d_u + ζ_{u0} and d_{u1} := d_u + ζ_{u1}. Using the R software, we generate M = 100 trees up to time T = 23, so that the mean number of observations |T̊_T| is sufficiently large. (Note that for a binary tree, we always have the identity |∂T_T| = |T̊_T| + 1.) Figure 1 represents a typical observation scheme, in continuous and discrete representation. The (random) number of observations fluctuates a lot, as shown in Table 1, where some elementary statistics (Min., 1st Qu., Med., Mean, 3rd Qu., Max., Std. dev.) are given.
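The simulation scheme just described can be sketched in a few lines. The rate trial_B below is a hypothetical stand-in (the paper's piecewise example uses different constants), and lifetimes with density f_B(x) = B(x) exp(−∫_0^x B(y) dy) are drawn by thinning a dominating exponential clock.

```python
import math, random

random.seed(1)

B_MAX = 2.0

def trial_B(x):
    """Hypothetical bounded division rate (the paper's piecewise example differs)."""
    return 0.5 + 1.5 * (1.0 - math.exp(-x * x))   # values in [0.5, 2.0]

def sample_lifetime():
    """Draw a lifetime with hazard rate trial_B by thinning a rate-B_MAX clock."""
    t = 0.0
    while True:
        t += random.expovariate(B_MAX)
        if random.random() <= trial_B(t) / B_MAX:
            return t

def simulate_tree(T):
    """Lifetimes of interior cells (dead before T) and ages at T of boundary cells."""
    interior, boundary = [], []
    births = [0.0]                    # birth times of cells still to process
    while births:
        b_u = births.pop()
        zeta = sample_lifetime()
        if b_u + zeta <= T:           # the cell divides before T: two children
            interior.append(zeta)
            births += [b_u + zeta, b_u + zeta]
        else:
            boundary.append(T - b_u)  # age at time T of a cell alive at T
    return interior, boundary

interior, boundary = simulate_tree(T=8.0)
# deterministic binary-tree identity: |boundary cells| = |interior cells| + 1
assert len(boundary) == len(interior) + 1
```

As in the text, descendants of a cell alive at T are never simulated, so the cost is proportional to the number of observed cells only.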
Table 1.
Fluctuations of the number of observations |T̊_T| for M = 100 Monte-Carlo continuous trees observed up to time T = 23.

We take a Gaussian kernel K(x) = (2π)^{−1/2} exp(−x²/2), and the bandwidth ĥ_T is chosen according to the rule of thumb 1.06 σ̂ |T̊_T|^{−1/5}, where σ̂ is the empirical standard deviation of (ζ_u, u ∈ T̊_T). We also implemented standard cross-validation, with less success. We evaluate B̂_T on a regular grid of D = [0.…, ….5] with mesh Δx = 0.01. For each sample we compute the empirical error

e_i = ‖B̂_T^{(i)} − B‖_{Δx} / ‖B‖_{Δx},   i = 1, …, M,

where ‖·‖_{Δx} denotes the discrete norm over the numerical sampling. Table 2 displays the mean empirical error ē = M^{−1} Σ_{i=1}^{M} e_i together with the empirical standard deviation (M^{−1} Σ_{i=1}^{M} (e_i − ē)²)^{1/2}.

T            13     15     17     19      21      23
Mean |T̊_T|   652    1 847  5 202  14 634  41 151  115 760
ē            …      …      …      …       …       …
Std. dev.    …      …      …      …       …       …

Table 2. Mean empirical relative error ē and its standard deviation, with respect to T, for the division rate B reconstructed over the interval D = [0.…, ….5] by the estimator B̂_T.

The comparison of the density of interest f_B and the biased density f_{H_B} in Figure 2 highlights the bias selection, since f_{H_B} gives more weight to small lifetimes than f_B. The error deteriorates as x grows, since the biased density f_{H_B} (bold blue line; we approximate the Malthus parameter using (6) and find λ_B ≈ …) carries less and less mass there. The larger T, the better the reconstruction at a visual level, as shown in Figure 2, where 95%-level confidence bands are built so that, for each point x, the lower and upper bounds include 95% of the estimators (B̂_T^{(i)}(x), i = 1, …, M). Close to 0, B(x) does not lie in the confidence band: our estimator exhibits a large bias there, and this is presumably due to a boundary effect. The error is close to exp(−2λ_B T/5), as expected: indeed, for a kernel of order n, the bias term in density estimation is of order h^{β∧(n+1)}. Given that B is smooth in our example, we rather expect exp(−λ_B (n+1)/(2(n+1)+1) T) = exp(−2λ_B T/5) for the kernel of order n = 1 that we use here, and this is consistent with what we observe in Figure 3.

Figure 2.
Reconstruction of B over D = [0.…, ….5] with 95%-level confidence bands constructed over M = 100 Monte-Carlo continuous trees. In bold red line: x ↦ B(x); in bold blue line: f_{H_B}; in blue line: f_B (on the same y-axis scale). Left: T = 15. Right: T = 23.

5. Proofs
For a locally integrable B : [0, ∞) → [0, ∞) such that ∫_0^∞ B(y) dy = ∞, recall that we set

f_B(x) = B(x) e^{−∫_0^x B(y) dy},  x ≥ 0.

Recall that H_B is characterised by

f_{H_B}(x) = m e^{−λ_B x} f_B(x),  x ≥ 0.

Preliminaries.
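As a numerical illustration of the two densities just recalled, the sketch below computes f_B by quadrature for a hypothetical rate trial_B (an assumption of this example, not the paper's rate), then solves the Malthus equation ∫_0^∞ m e^{−λx} f_B(x) dx = 1 for λ_B by bisection; this is exactly the relation that makes f_{H_B} = m e^{−λ_B x} f_B a probability density.

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal rule (avoids relying on np.trapz, removed in NumPy 2.0)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

m = 2.0                                    # binary division
x = np.linspace(0.0, 40.0, 200_001)        # f_B is negligible beyond x = 40 here

def trial_B(t):
    """Hypothetical division rate, bounded between 0.5 and 2.0."""
    return 0.5 + 1.5 * (1.0 - np.exp(-t * t))

B = trial_B(x)
# cumulative integral of B, then the lifetime density f_B = B * exp(-int B)
cumB = np.concatenate(([0.0], np.cumsum(0.5 * (B[1:] + B[:-1]) * np.diff(x))))
f_B = B * np.exp(-cumB)

def malthus_gap(lam):
    """m * (Laplace transform of f_B at lam) - 1: positive iff lam < lambda_B."""
    return m * trapz(np.exp(-lam * x) * f_B, x) - 1.0

lo, hi = 0.0, 5.0
for _ in range(60):                        # bisection for the Malthus parameter
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if malthus_gap(mid) > 0 else (lo, mid)
lam_B = 0.5 * (lo + hi)

f_HB = m * np.exp(-lam_B * x) * f_B        # biased density, integrates to 1
```

Comparing f_HB with f_B on a plot reproduces the bias-selection effect discussed in Section 4: the factor m e^{−λ_B x} inflates the weight of small lifetimes and suppresses large ones.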
Many-to-one formulae.
For u ∈ U, we write ζ_u^t for the age of the cell u at time t ∈ I_u = [b_u, d_u), i.e. ζ_u^t = (t − b_u) 1_{{t ∈ I_u}}. We extend ζ_u^t over [0, b_u) by setting ζ_u^t = ζ_{u(t)}^t, where u(t) is the ancestor of u living at time t, defined by u(t) = v if v ⪯ u and (v, t) ∈ T. For t ≥ d_u we set ζ_u^t = ζ_u. Note that ζ_u^T = ζ_u on the event {u ∈ T̊_T}.

Let (χ_t)_{t ≥ 0} and (χ̃_t)_{t ≥ 0} denote the one-dimensional Markov processes with infinitesimal generators (densely defined on continuous functions vanishing at infinity) A_B and A_{H_B} respectively,
Figure 3.
The log-average relative empirical error over M = 100 Monte-Carlo continuous trees vs. T (i.e. the log-rate) for x ↦ B(x), reconstructed over D = [0.…, ….5] with x ↦ B̂_T(x) (dashed blue line), compared to the expected log-rate (solid red line).

where A_B g(x) = g′(x) + B(x)(g(0) − g(x)), and such that P(χ_0 = 0) = P(χ̃_0 = 0) = 1. We also denote by (P^t_{H_B})_{t ≥ 0} the Markov semigroup associated to A_{H_B}.

Proposition 10 (Many-to-one formulae). For any g ∈ L_C, we have

(23) E[ Σ_{u ∈ ∂T_T} g(ζ_u^T) ] = (e^{λ_B T}/m) E[ g(χ̃_T) B(χ̃_T)^{−1} H_B(χ̃_T) ],

and

(24) E[ Σ_{u ∈ T̊_T} g(ζ_u^T) ] = E[ Σ_{u ∈ T̊_T} g(ζ_u) ] = (1/m) ∫_0^T e^{λ_B s} E[ g(χ̃_s) H_B(χ̃_s) ] ds.

In order to compute rates of convergence, we will also need many-to-one formulae over pairs of individuals. We can pick two individuals in the same lineage, or over forks, i.e. over pairs of individuals that are not in the same lineage. If u, v ∈ U, u ∧ v denotes their most recent common ancestor. Define F_U = {(u, v) ∈ U × U : |u ∧ v| < |u| ∧ |v|} and F_T = F_U ∩ T. Introduce also m̄ = Σ_{i ≠ j} Σ_{k ≥ i∨j} p_k, which is finite by Assumption 2.

Proposition 11 (Many-to-one formulae over pairs).
For any g ∈ L C , we have E (cid:104) (cid:88) u,v ∈ ∂ T T ,u (cid:54) = v g ( ζ Tu ) g ( ζ Tv ) (cid:105) = ¯ mm (cid:90) T e λ B s (cid:16) e λ B ( T − s ) P T − sH B (cid:0) g H B B (cid:1) (0) (cid:17) P sH B H B (0) ds, (25) E (cid:104) (cid:88) ( u,v ) ∈FT ∩ ˚ T T g ( ζ u ) g ( ζ v ) (cid:105) = ¯ mm (cid:90) T e λ B s (cid:18) (cid:90) T − s e λ B t P tH B ( gH B )(0) dt (cid:19) P sH B H B (0) ds, (26) and E (cid:2) (cid:88) u,v ∈ ˚ T T ,u ≺ v g ( ζ u ) g ( ζ v ) (cid:3) = (cid:90) T e λ B s (cid:16) (cid:90) T − s e λ B t P tH B (cid:0) gH B (cid:1) (0) dt (cid:17) P sH B ( gH B )(0) ds. (27)The identity (25) is a particular case of Lemma 3.9 of Cloez [5]. In order to obtain identity(26), we closely follow the method of Bansaye et al. [3]. Although the setting in [3] is much moregeneral than ours, it formally only applies for exponential renewal times (corresponding to constantfunctions B ) so we need to slightly accommodate their proof. The same ideas enable us to prove(27). This is set out in details in the appendix. Geometric ergodicity of the auxiliary Markov process.
Define the probability measure

μ_B(x) dx = c_B exp(−∫_0^x H_B(y) dy) dx,  x ≥ 0.

We have fast convergence of P^T_{H_B} toward μ_B as T → ∞. More precisely,
Let ρ_B = inf_x H_B(x). For any B ∈ B_{b,C}, g ∈ L_{C′}, t ≥ 0 and x ∈ (0, ∞), we have

| P^t_{H_B} g(x) − ∫_0^∞ g(y) μ_B(y) dy | ≤ 2 sup_y |g(y)| exp(−ρ_B t).
First, one readily checks that ∫_0^∞ A_{H_B} f(x) μ_B(x) dx = 0 for any continuous f, and since moreover P^t_{H_B} is Feller, it admits μ_B(x) dx as an invariant probability. It is now sufficient to show

‖Q^{x,t}_B − μ_B‖_TV ≤ exp(−ρ_B t),

where Q^{x,t}_B denotes the law at time t of the Markov process with infinitesimal generator A_{H_B} started from x at time t = 0, and ‖·‖_TV is the total variation distance between probability measures. Let N(ds dz) be a Poisson random measure with intensity ds ⊗ dz on [0, ∞) × [0, ∞). Define on the same probability space two random processes (Y_t)_{t ≥ 0} and (Z_t)_{t ≥ 0} such that

Y_t = x + t − ∫_0^t ∫_0^∞ Y_{s−} 1_{{z ≤ H_B(Y_{s−})}} N(ds dz),  t ≥ 0,
Z_t = Z_0 + t − ∫_0^t ∫_0^∞ Z_{s−} 1_{{z ≤ H_B(Z_{s−})}} N(ds dz),  t ≥ 0,

where Z_0 is a random variable with distribution μ_B. We have that both (Y_t)_{t ≥ 0} and (Z_t)_{t ≥ 0} are Markov processes with generator A_{H_B}, driven by the same Poisson random measure. Moreover, if N has an atom in [0, t) × [0, inf_x H_B(x)], then Y and Z both jump to 0 at this atom and coincide from then on. It follows that

P(Y_t ≠ Z_t) ≤ P( ∫_0^t ∫_0^{inf_x H_B(x)} N(ds dz) = 0 ) = exp(−inf_x H_B(x) t) = exp(−ρ_B t).

Observing that Y_t and Z_t have distributions Q^{x,t}_B and μ_B respectively, we conclude thanks to the fact that ‖Q^{x,t}_B − μ_B‖_TV ≤ P(Y_t ≠ Z_t). □
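The coupling in the proof lends itself to a small Monte Carlo check (a discretized sketch; the rate H below is a hypothetical stand-in with inf H = 0.5). Both copies are thinned against the same marks, so any mark falling below ρ = inf H resets both simultaneously, and the probability of remaining uncoupled by time T is at most e^{−ρT}.

```python
import math, random

random.seed(2)

def H(age):
    """Hypothetical jump rate of the auxiliary age process, inf H = 0.5."""
    return 1.0 + 0.5 * math.sin(age)

RHO, H_MAX, DT, T_END = 0.5, 1.5, 0.01, 3.0

def still_uncoupled():
    """Run the two coupled copies until T_END; True if they never coalesced."""
    y, z, t = 2.0, 0.0, 0.0                  # two different starting ages
    while t < T_END:
        if random.random() < DT * H_MAX:     # a Poisson mark in this time slice
            u = random.random() * H_MAX      # vertical coordinate of the mark
            if u <= H(y): y = 0.0            # both copies thinned by the SAME mark
            if u <= H(z): z = 0.0
            if y == z:                       # simultaneous reset: coupled forever
                return False
        y += DT; z += DT; t += DT
    return True

n = 5000
p_uncoupled = sum(still_uncoupled() for _ in range(n)) / n
# the proof's bound: P(no coalescence by T_END) <= exp(-RHO * T_END)
```

The empirical non-coupling frequency is typically well below exp(−ρ T), since marks between ρ and min(H(Y), H(Z)) can also trigger coalescence; the proof only uses the guaranteed part below ρ.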
Proof of Theorems 3 and 4.
In order to ease notation, when no confusion is possible, we abbreviate B_{b,C} by B and L_C by L.
Writing e min { λ B / ,ρ B } T (cid:0) E T ( ∂ T T , g ) − ∂ E B ( g ) (cid:1) = e λ B T (cid:12)(cid:12) ∂ T T (cid:12)(cid:12) e (min { λ B / ,ρ B }− λ B ) T (cid:88) u ∈ ∂ T T (cid:0) g ( ζ Tu ) − ∂ E B ( g ) (cid:1) , Theorem 3 is then a consequence of the following two facts: first we claim that(28) e λ B T (cid:12)(cid:12) ∂ T T (cid:12)(cid:12) − → W B in probability as T → ∞ , uniformly in B ∈ B , where the random variable W B satisfies P ( W B >
0) = 1, and second, for B ∈ B and g ∈ L , we claim that the following estimate holds:(29) E (cid:104)(cid:16) (cid:88) u ∈ ∂ T T (cid:0) g ( ζ Tu ) − ∂ E B ( g ) (cid:1)(cid:17) (cid:105) (cid:46) e (2 λ B − min { λ B , ρ B } ) T , where (cid:46) means up to a constant (possibly varying from line to line) that only depends on B and L and up to a multiplicative slow term of order T in the case λ B = 2 ρ B . Step 1 . The convergence (28) is a consequence of the following lemma:
Lemma 13.
For every B ∈ B , there exists (cid:102) W B with P ( (cid:102) W B >
0) = 1 such that (30) E (cid:104)(cid:16) | ∂ T T | E (cid:2) | ∂ T T | (cid:3) − (cid:102) W B (cid:17) (cid:105) → as T → ∞ , uniformly in B ∈ B and (31) κ − B e λ B T E (cid:2) | ∂ T T | (cid:3) → as T → ∞ , uniformly in B ∈ B , where κ − B = λ B mm − (cid:82) ∞ exp( − (cid:82) x H B ( y ) dy ) dx . Lemma 13 is well known, and follows from classical renewal arguments, see Chapter 6 in thebook of Harris [12]. Only the uniformity in B ∈ B requires an extra argument, but with a uniformversion of the key renewal theorem of [23], it readily follows from the proof of Harris, so we omitit. Note that (30) and (31) entail the convergence e λ B T | ∂ T T | − → κ B (cid:102) W − B = W B in probabilityas T → ∞ uniformly in B ∈ B , and this entails (28). Step 2 . We now turn to the proof of (29). Without loss of generality, we may (and will) assumethat ∂ E B ( g ) = 0. We have E (cid:2)(cid:0) (cid:88) u ∈ ∂ T T g ( ζ Tu ) (cid:1) (cid:3) = E (cid:2) (cid:88) u ∈ ∂ T T g ( ζ Tu ) (cid:3) + E (cid:2) (cid:88) u,v ∈ ∂ T T ,u (cid:54) = v g ( ζ Tu ) g ( ζ Tv ) (cid:3) = I + II, say. By (23) in Proposition 10, we write I = e λ B T m E (cid:104) g ( (cid:101) χ T ) B ( (cid:101) χ T ) − H B ( (cid:101) χ T ) (cid:105) ≤ e λ B T m (cid:90) ∞ g ( x ) H B ( x ) B ( x ) µ B ( x ) dx + e λBT m (cid:12)(cid:12)(cid:12) P TH B (cid:0) g H B B (cid:1) (0) − (cid:90) ∞ g ( x ) H B ( x ) B ( x ) µ B ( x ) dx (cid:12)(cid:12)(cid:12) . Since g ∈ L and B ∈ B , we successively have m − (cid:90) ∞ g ( x ) H B ( x ) B ( x ) µ B ( x ) dx (cid:46) g ( x ) H B ( x ) B ( x ) (cid:46) . Note that for B ∈ B , we have H B ( x ) = B ( x ) (cid:82) ∞ x B ( y ) e − λ B ( y − x ) exp( − (cid:82) yx B ( u ) du ) dy ≤ b max { C, } λ B + b max { C, } . 
We also have λ B ≤ λ (cid:101) B as soon as B ( x ) ≤ (cid:101) B ( x ) for all x (see for instance the proof of Propo-sition 9) so inf B ∈B λ B > (cid:12)(cid:12)(cid:12) P TH B (cid:0) g H B B (cid:1) (0) − (cid:90) ∞ g ( x ) H B ( x ) B ( x ) µ B ( x ) dx (cid:12)(cid:12)(cid:12) (cid:46) , and we conclude that I (cid:46) e λ B T ≤ e (2 λ B − min { λ B , ρ B } ) T . By (25) of Proposition 11 we have II = ¯ me λ B T m (cid:90) T e − λ B s (cid:16) P T − sH B (cid:0) g H B B (cid:1) (0) (cid:17) P sH B H B (0) ds. Since B ∈ B and g ∈ L , the estimates P sH B H B (0) (cid:46) | g ( x ) | H B ( x ) B ( x ) (cid:46) g ( x ) H B ( x ) B ( x ) which has vanishing integral under µ B , we obtain (cid:12)(cid:12) P T − sH B (cid:0) g H B B (cid:1) (0) (cid:12)(cid:12) (cid:46) e − ρ B ( T − s ) hence | II | (cid:46) e λ B T (cid:90) T e − λ B s e − ρ B ( T − s ) ds (cid:46) (cid:26) e λ B T if 2 ρ B ≥ λ B e λ B − ρ B ) T if 2 ρ B < λ B , up to a multiplicative slow term of order T when 2 ρ B = λ B . Note also that the estimate is uniformin B ∈ B since inf B ∈B λ B > B ∈B ρ B >
0. We conclude | II | (cid:46) e (2 λ B − min { λ B , ρ B } ) T . (cid:3) Proof of Theorem 4.
The proof goes along the same line but is slightly more intricate. First, weimplicitly work on the event { (cid:12)(cid:12) ˚ T T (cid:12)(cid:12) ≥ } which has probability that goes to 1 as T → ∞ , uniformlyin B ∈ B . We again write e min { λ B / ,ρ B } T (cid:0) E ( ˚ T T , g ) − ˚ E B ( g ) (cid:1) = e λ B T (cid:12)(cid:12) ˚ T T (cid:12)(cid:12) e (min { λ B / ,ρ B }− λ B ) T (cid:88) u ∈ ˚ T T (cid:0) g ( ζ Tu ) − ˚ E B ( g ) (cid:1) , and we claim that(32) e λ B T (cid:12)(cid:12) ˚ T T (cid:12)(cid:12) − → W (cid:48) B > T → ∞ , uniformly in B ∈ B , where W (cid:48) B satisfies P ( W (cid:48) B >
0) = 1 and that the following estimate holds:(33) E (cid:104)(cid:16) (cid:88) u ∈ ˚ T T (cid:0) g ( ζ Tu ) − ˚ E B ( g ) (cid:1)(cid:17) (cid:105) (cid:46) e (2 λ B − min { λ B , ρ B } ) T , uniformly in B ∈ B and g ∈ L . In the same way as in the proof of Theorem 3, (32) is a consequenceof the following classical result, which can be obtained in the same way as for Lemma 13 and proofwhich we omit. Lemma 14.
For every B ∈ B , there exists (cid:102) W (cid:48) B > with P ( (cid:102) W (cid:48) B >
0) = 1 such that E (cid:104)(cid:16) | ˚ T T | E (cid:2) | ˚ T T | (cid:3) − (cid:102) W (cid:48) B (cid:17) (cid:105) → as T → ∞ , uniformly in B ∈ B and ( κ (cid:48) B ) − e λ B T E (cid:2) | ˚ T T | (cid:3) → as T → ∞ , ONPARAMETRIC ESTIMATION IN AGE DEPENDENT BRANCHING PROCESSES 19 uniformly in B ∈ B , where ( κ (cid:48) B ) − = λ B m (cid:82) ∞ exp( − (cid:82) x H B ( y ) dy ) dx . It remains to prove (33). We again assume without loss of generality that ˚ E B ( g ) = 0 and weplan to use the following decomposition:(34) E (cid:2)(cid:0) (cid:88) u ∈ ˚ T T g ( ζ u ) (cid:1) (cid:3) = I + II + III, with I = E (cid:2) (cid:88) u ∈ ˚ T T g ( ζ u ) (cid:3) ,II = E (cid:2) (cid:88) ( u,v ) ∈FT ∩ ˚ T T g ( ζ u ) g ( ζ v ) (cid:3) and III = 2 E (cid:2) (cid:88) u,v ∈ ˚ T T ,u ≺ v g ( ζ u ) g ( ζ v ) (cid:3) . Step 1 . By (24) of Proposition 10, we have I = 1 m (cid:90) T e λ B s E (cid:2) g ( (cid:101) χ s ) H B ( (cid:101) χ s ) (cid:3) ds, In the same way as for the term I in the proof of Theorem 3, we readily check that g ∈ L and B ∈ B guarantee that E (cid:2) g ( (cid:101) χ s ) H B ( (cid:101) χ s ) (cid:3) (cid:46) I (cid:46) e λ B T ≤ e (2 λ B − min { λ B , ρ B } ) T . Step 2 . By (26) of Proposition 11, we have II = ¯ mm (cid:90) T e λ B s (cid:16) (cid:90) T − s e λ B t P tH B ( gH B )(0) dt (cid:17) P sH B ( H B )(0) ds. We work as for the term II in the proof of Theorem 3: we successively have P sH B ( H B )(0) (cid:46) (cid:12)(cid:12) P tH B ( gH B )(0) | (cid:46) exp( − ρ B t ) by Proposition 12 and the fact that gH B has vanishing integralunder µ B . Therefore | II | (cid:46) (cid:90) T e λ B s (cid:16) (cid:90) T − s e ( λ B − ρ B ) t dt (cid:17) ds (cid:46) (cid:26) e λ B T if 2 ρ B ≥ λ B e λ B − ρ B ) T if 2 ρ B < λ B up to a multiplicative slow term of order T when 2 ρ B = λ B . We conclude | II | (cid:46) e (2 λ B − min { λ B , ρ B } ) T likewise. Step 3 . 
By (27) of Proposition 11, we have | III | ≤ (cid:90) T e λ B s (cid:12)(cid:12)(cid:12) (cid:90) T − s e λ B t P tH B (cid:0) gH B (cid:1) (0) dt (cid:12)(cid:12)(cid:12) P sH B ( | g | H B )(0) ds. In the same way as for the term II , we have | P sH B ( | g | H B )(0) | (cid:46) | P tH B ( gH B )(0) | (cid:46) exp( − ρ B t ).Therefore | III | (cid:46) (cid:90) T e λ B s (cid:0) (cid:90) T − s e ( λ B − ρ B ) t dt (cid:1) ds (cid:46) e λ B T ≤ e (2 λ B − min { λ B , ρ B } ) T . (cid:3) Proof of Proposition 5.
Conditional on ˚ T T , the random variables ( ν u , u ∈ ˚ T T ) are inde-pendent, with common distribution p k . It follows that E (cid:2) ( (cid:98) m T − m ) | ˚ T T (cid:3) ≤ | ˚ T T | − (cid:88) k k p k . Since e λ B T | ˚ T T | − is B -tight thanks to Lemma 14, we obtain the result for e λ B T/ ( (cid:98) m T − m ). The B -tightness of T − v T ( B ) − ( (cid:98) λ T − λ B ) is a consequence of Theorem 3 and 4, together with theconvergence of the preliminary estimators (cid:98) m T . For M > E T (Id , ∂ T T ) − ∂ E B (Id) = (cid:0) E T (min { Id , M } , ∂ T T ) − ∂ E B (min { Id , M } ) (cid:1) + (cid:16) E T (cid:0) ( · − M ) {· >M } , ∂ T T (cid:1) − ∂ E B (cid:0) ( · − M ) {· >M } (cid:1)(cid:17) = I + II say. We choose M = M T = 2 T and we apply Theorem 3 for the test functions g T ( x ) =min { x, M T } /M T which are uniformly bounded in T to get the B -tightness of T − v T ( B ) − I . Since ζ u ≤ T when u ∈ ∂T T , we also have | II | = ∂ E B (cid:0) ( · − T ) {· > T } (cid:1) and we deduce that v T ( B ) − II is B -tight. We study in the same way E T (Id , ˚ T T ) to conclude.5.4. Proof of Theorem 7.
The proof of Theorem 7 goes along the classical line of a bias-varianceanalysis in nonparametrics (see for instance the classical textbook [24]). However, we have twokind of extra difficulties: first we have to get rid of the random bandwidth (cid:98) h T = exp( − (cid:98) λ T β +1 T )defined in (18) (actually the most delicate part of the proof) and second, we have to get rid of thepreliminary estimators (cid:98) m T and (cid:98) λ T .The point x ∈ (0 , ∞ ) where we estimate B ( x ) is fixed throughout, and further omitted in thenotation. We first need a slight extension of Theorem 4 – actually of the estimate (33) – in orderto accommodate test functions g = g T such that g T → δ x weakly as T → ∞ . For a function g : [0 , ∞ ) → R let | g | = (cid:90) ∞ | g ( y ) | dy, | g | = (cid:90) ∞ g ( y ) dy and | g | ∞ = sup y | g ( y ) | denote the usual L p -norms over [0 , ∞ ) for p = 1 , , ∞ . Define also(35)Φ T ( B, g ) = | g | + inf ≤ v ≤ T (cid:0) | g | e λ B v + | g | ∞ e (2( λ B − ρ B ) + − λ B ) v (cid:1) + | g | | g | ∞ if λ B ≤ ρ B | g | + | g | ∞ e ( λ B − ρ B ) T + | g | | g | ∞ if λ B > ρ B . Proposition 15.
In the same setting as Theorem 4, we have, for any g ∈ L , E (cid:104)(cid:16) (cid:88) u ∈ ˚ T T (cid:0) g ( ζ Tu ) − ˚ E B ( g ) (cid:1)(cid:17) (cid:105) (cid:46) e ( λ B − ρ B ) + T | g | ∞ + e λ B T Φ T (cid:0) B, g (cid:1) , (36) where the symbol (cid:46) means here uniformly in B ∈ B and independently of g . Let us briefly comment on Proposition 15. If g is bounded and compactly supported with (cid:82) g = 1, consider the function g h T ( y ) = h − T g (cid:0) h − T ( x − y ) (cid:1) that mimics the Dirac function δ x for h T →
0. It is noteworthy that in the left-hand side of (36), g h T ( ζ Tu ) is of order h − T while theright-hand side is of order e λ B T h − T if we pick ω = h − T (allowed as soon as h − T ≤ e λ B T ). We canthus expect to gain a crucial factor h T thanks to averaging over ˚ T T . ONPARAMETRIC ESTIMATION IN AGE DEPENDENT BRANCHING PROCESSES 21
Proof.
We carefully revisit the estimate (33) in the proof of Theorem 4 keeping up with the samenotation and assuming with no loss of generality that ˚ E B ( g ) = 0. Recall decomposition (34). Step 1 . For the term I , we insert (cid:82) ∞ g ( y ) H B ( y ) µ B ( y ) dy = mc B (cid:82) ∞ g ( y ) e − λ B y f B ( y ) dy to obtain I = IV + V, where IV (cid:46) e λ B T (cid:90) ∞ g ( y ) e − λ B y f B ( y ) dy and | V | ≤ m (cid:90) T e λ B s (cid:12)(cid:12)(cid:12) P sH B (cid:0) g H B (cid:1) (0) − (cid:90) ∞ g ( y ) H B ( y ) µ B ( y ) dy (cid:12)(cid:12)(cid:12) ds. Clearly, | IV | (cid:46) e λ B T | g | . By Proposition 12, we further infer | V | (cid:46) | g | ∞ (cid:90) T e λ B s e − ρ B s ds (cid:46) | g | ∞ e ( λ B − ρ B ) + T . Step 2 . For the term II , using P sH B ( H B )(0) (cid:46) II (cid:46) e λ B T (cid:90) T e − λ B s (cid:16) (cid:90) s e λ B t P tH B ( gH B )(0) dt (cid:17) ds. A new difficulty appears here, since the crude bound(37) | P tH B ( gH B )(0) | (cid:46) | g | ∞ exp( − ρ B t )given by Proposition 12 does not yield to the correct order for small value of t because of the term | g | ∞ . We need the following refinement (for small values of t ), based on a renewal argument andproved in Appendix: Lemma 16.
For every t ≥ and g ∈ L , we have (cid:12)(cid:12) P tH B (cid:0) gH B (cid:1) (0) (cid:12)(cid:12) (cid:46) | g ( t ) | e − λ B t + | g | uniformly in B ∈ B . Let v ∈ [0 , T ] be arbitrary. For 0 ≤ s ≤ v , by Lemma 16 we obtain I s = (cid:16) (cid:90) s e λ B t | P tH B ( gH B )(0) | dt (cid:17) (cid:46) (cid:16) (cid:90) s | g ( t ) | dt + | g | (cid:90) s e λ B t dt (cid:17) (cid:46) | g | e λ B s . For s ≥ v , we have by (37) I s (cid:46) I v + | g | ∞ (cid:16) (cid:90) sv e ( λ B − ρ B ) t dt (cid:17) (cid:46) I v + | g | ∞ (cid:0) e λ B − ρ B ) s { λ B >ρ B } + ( s − v ) { λ B ≤ ρ B } { s ≥ v } (cid:1) . On the one hand, (cid:82) v e − λ B s I s ds (cid:46) | g | e λ B v and on the other hand (cid:82) Tv e − λ B s I s ds is less than I v (cid:90) Tv e − λ B s ds + | g | ∞ (cid:16) (cid:90) Tv e − λ B s e λ B − ρ B ) + s ds + (cid:90) Tv e − λ B s ( s − v ) ds { λ B ≤ ρ B } (cid:17) (cid:46) | g | e λ B v + | g | ∞ e − λ B v if λ B ≤ ρ B | g | e λ B v + | g | ∞ e ( λ B − ρ B ) v if ρ B ≤ λ B ≤ ρ B | g | e λ B v + | g | ∞ e ( λ B − ρ B ) T if λ B ≥ ρ B , whence for every v ∈ [0 , T ], we derive | II | (cid:46) e λ B T (cid:16) | g | e λ B v + | g | ∞ (cid:0) e ( − λ B +2( λ B − ρ B ) + ) v { λ B ≤ ρ B } + e ( λ B − ρ B ) T { λ B > ρ B } (cid:1)(cid:17) . Step 3 . Finally going back to Step 3 in the proof of Theorem 4 we readily obtain | III | (cid:46) (cid:90) T e λ B s P sH B (cid:0) | g | H B (cid:1) (0) (cid:90) T − s e λ B t (cid:12)(cid:12) P tH B (cid:0) gH B (cid:1) (0) (cid:12)(cid:12) dtds (cid:46) (cid:90) T e λ B s ( | g ( s ) | e − λ B s + | g | ) | g | ∞ (cid:90) T − s e λ B t e − ρ B t dtds by applying Lemma 16 for the term involving P sH B and the estimate (37) for the term involving P tH B , therefore | III | (cid:46) e λ B T | g | | g | ∞ . 
(cid:3) Proposition 15 enables us to obtain the next result which is the key ingredient to get rid ofthe random bandwidth (cid:98) h T , thanks to the fact that it is concentrated around its estimated value h T ( β ) = e − β +1 λ B T . To that end, define, for C > C C = (cid:8) g : R → R , supp( g ) ⊂ [0 , C ] and sup y | g ( y ) | ≤ C (cid:9) . Denote by C C (later abbreviated by C ) the subset of C C of functions that are moreover differ-entiable, with derivative uniformly bounded by C . For h > g h ( y ) = h − g (cid:0) h − ( x − y ) (cid:1) .Finally, for a, b ≥ a ± b ] = [( a − b ) + , a + b ]. Recall from Section 3.2 that v T ( B ) = e − min { ρ B ,λ B / } T if λ B (cid:54) = 2 ρ B and T / e − λ B T/ otherwise. Proposition 17.
Assume that β ≥ / . Define (cid:36) B = min { max { , λ B /ρ B } , } . For every κ > , v T ( B ) − sup h ∈ [ h T ( β )(1 ± κT v T ( B ))] (cid:12)(cid:12) E T (cid:0) ˚ T T , h (cid:36) B / f g h ) − ˚ E B ( h (cid:36) B / f g h ) (cid:12)(cid:12) is B × L × C -tight for the parameter ( B, f, g ) .Proof. Step 1 . Define f g h = f g h − ˚ E B ( f g h ). Writing v T ( B ) − (cid:16) E T (cid:0) ˚ T T , h (cid:36) B / f g h ) − ˚ E B ( h (cid:36) B / f g h ) (cid:17) = e λ B T | ˚ T T | e (min { ρ B ,λ B / }− λ B ) T ( T − / ) { λB =2 ρB } (cid:88) u ∈ ˚ T T h (cid:36) B / f g h ( ζ u ) , we see as in the proof of Theorem 4 that thanks to Lemma 14, it is enough to prove the B -tightnessof sup h ∈ [ h T ( β )(1 ± κT v T ( B ))] | V Th | = sup s ∈ [0 , | V Th s | , where V Th = e (min { ρ B ,λ B / }− λ B ) T ( T − / ) { λB =2 ρB } (cid:88) u ∈ ˚ T T h (cid:36) B / f g h ( ζ u ) , and h s = h T ( β ) (cid:0) − κT v T ( B ) (cid:1) + 2 sκh T ( β ) T v T ( B ) , s ∈ [0 , . Step 2 . We claim that(38) sup
T > E (cid:2) ( V Th ) (cid:3) < ∞ E (cid:2)(cid:0) V Th t − V Th s (cid:1) (cid:3) ≤ C (cid:48) ( t − s ) for s, t ∈ [0 , , ONPARAMETRIC ESTIMATION IN AGE DEPENDENT BRANCHING PROCESSES 23 for some constant C (cid:48) > T nor B ∈ B . Then, by Kolmogorov continuitycriterion, this implies in particular thatsup T > sup B ∈B E (cid:2) sup s ∈ [0 , | V Th s | (cid:3) < ∞ hence the result (see for instance [22] to track the constant and obtain a uniform version of thecontinuity criterion). We have V Th t − V Th s = e (min { ρ B , λ B }− λ B ) T ( T − { λB =2 ρB } ) (cid:88) u ∈ ˚ T T (cid:16) ∆ s,t ( h (cid:36) B / f g h )( ζ u ) − ˚ E B (cid:0) ∆ s,t ( h (cid:36) B / f g h ) (cid:1)(cid:17) where ∆ s,t ( h (cid:36) B / f g h )( y ) = h (cid:36) B / t f ( y ) g h t ( y ) − h (cid:36) B / s f ( y ) g h s ( y ) . By Proposition 15, we derive that E [( V Th t − V Th s ) ] is less than e − λ B T | ∆ s,t ( h (cid:36) B / f g h ) | ∞ + Φ T (cid:0) B, ∆ s,t ( h (cid:36) B / f g h ) (cid:1) if λ B ≤ ρ B e − ρ B T | ∆ s,t ( h (cid:36) B / f g h ) | ∞ + Φ T (cid:0) B, ∆ s,t ( h (cid:36) B / f g h ) (cid:1) if ρ B ≤ λ B ≤ ρ B e − ( λ B − ρ B ) T | ∆ s,t ( h (cid:36) B / f g h ) | ∞ + e − ( λ B − ρ B ) T Φ T (cid:0) B, ∆ s,t ( h (cid:36) B / f g h ) (cid:1) if λ B ≥ ρ B (we ignore the slow term in the limiting case λ B = 2 ρ B ) and the remainder of the proof amountsto check that each term in the estimate above has order ( t − s ) uniformly in T and B ∈ B . Step 3 . For every y , we have∆ s,t (cid:0) h (cid:36) B / f g h (cid:1) ( y ) = ( h t − h s ) ∂ h (cid:0) h (cid:36) B / f ( y ) g h ( y ) (cid:1) | h = h (cid:63) ( y ) for some h (cid:63) ( y ) ∈ [min { h t , h s } , max { h t , h s } ]. 
Observe now that since g ∈ C and f ∈ L , we have ∂ h (cid:0) h (cid:36) B / f g h ( y ) (cid:1) = ( (cid:36) B − h (cid:36) B − f ( y ) g (cid:0) h − ( x − y ) (cid:1) − h (cid:36) B − ( x − y ) f ( y ) g (cid:48) (cid:0) h − ( x − y ) (cid:1) therefore, for small enough h (which is always the case for T large enough, uniformly in B ∈ B )and since | x − y | (cid:46) h thanks to the fact that g is compactly supported, we obtain | ∂ h (cid:0) h (cid:36) B / f g h ( y ) (cid:1) | (cid:46) h (cid:36) B / − [0 ,C ] (cid:0) h − ( x − y ) (cid:1) . Assume with no loss of generality that s ≤ t so that h s ≤ h ( y ) (cid:63) ≤ h t . It follows that (cid:12)(cid:12) ∆ s,t (cid:0) h (cid:36) B / f g h )( y ) (cid:12)(cid:12) (cid:46) ( h t − h s ) h (cid:63) ( y ) (cid:36) B / − [0 ,C ] (cid:0) h (cid:63) ( y ) − ( x − y ) (cid:1) ≤ ( h t − h s ) h (cid:36) B / − s [0 ,C ] (cid:0) h − t ( x − y ) (cid:1) . Using that h t − h s = 2( t − s ) κT h T ( β ) v T ( B ), we successively obtain (cid:12)(cid:12) ∆ s,t ( h (cid:36) B / f g h ) (cid:12)(cid:12) ∞ (cid:46) ( h t − h s ) h (cid:36) B − s (cid:46) ( t − s ) T v T ( B ) h T ( β ) (cid:36) B − , (cid:12)(cid:12) ∆ s,t ( h (cid:36) B / f g h ) (cid:12)(cid:12) (cid:46) ( h t − h s ) h (cid:36) B − s h t (cid:46) ( t − s ) T v T ( B ) h T ( β ) (cid:36) B − , (cid:12)(cid:12) ∆ s,t ( h (cid:36) B / f g h ) (cid:12)(cid:12) (cid:46) ( h t − h s ) h (cid:36) B − s h t (cid:46) ( t − s ) T v T ( B ) h T ( β ) (cid:36) B , (cid:12)(cid:12) ∆ s,t ( h (cid:36) B / f g h ) (cid:12)(cid:12) (cid:12)(cid:12) ∆ s,t (cid:0) h (cid:36) B / f g h ) (cid:12)(cid:12) ∞ (cid:46) ( t − s ) T v T ( B ) h T ( β ) (cid:36) B − . Step 4 . Recall that h T ( β ) = e − λ B T/ (2 β +1) . When λ B ≤ ρ B , we have v T ( B ) = e − λ B T/ and (cid:36) B = 1. 
By definition of $\Phi_T$ in (35), together with the estimates of Steps 2 and 3, we obtain
\[
\mathbb{E}\big[(V^T_{h_t}-V^T_{h_s})^2\big] \lesssim e^{-\lambda_B T}\,\big|\Delta_{s,t}(h^{1/2} f g_h)\big|_\infty^2 + \Phi_T\big(B, \Delta_{s,t}(h^{1/2} f g_h)\big)
\]
\[
\lesssim (t-s)^2\, T^2 \Big( e^{\lambda_B (\frac{1}{2\beta+1}-1) T} + e^{-\lambda_B T} + e^{-\lambda_B (\frac{1}{2\beta+1}+1) T} e^{\lambda_B v} + e^{\lambda_B (\frac{1}{2\beta+1}-1) T} e^{-\lambda_B v} \Big),
\]
which is of order $(t-s)^2$ uniformly in $T>0$, taking $v=0$ for instance. When $\rho_B \le \lambda_B \le 2\rho_B$, we still have $v_T(B) = e^{-\lambda_B T/2}$ but now $\varpi_B = \lambda_B/\rho_B$. It follows that $\mathbb{E}[(V^T_{h_t}-V^T_{h_s})^2]$ is of order
\[
e^{-\rho_B T}\,\big|\Delta_{s,t}(h^{\lambda_B/2\rho_B} f g_h)\big|_\infty^2 + \Phi_T\big(B, \Delta_{s,t}(h^{\lambda_B/2\rho_B} f g_h)\big)
\]
\[
\lesssim (t-s)^2\, T^2 \Big( e^{\lambda_B (\frac{2-\lambda_B/\rho_B}{2\beta+1}-1) T}\big(e^{-\rho_B T} + e^{(\lambda_B-\rho_B) v}\big) + e^{\lambda_B (\frac{1-\lambda_B/\rho_B}{2\beta+1}-1) T} + e^{-\lambda_B (\frac{\lambda_B/\rho_B}{2\beta+1}+1) T} e^{\lambda_B v} \Big),
\]
and this last term is again of order $(t-s)^2$ uniformly in $T>0$ since $1 \le \lambda_B/\rho_B \le 2$, taking $v=0$ for instance. Finally, when $2\rho_B \le \lambda_B$, we have $v_T(B) = e^{-\rho_B T}$ and $\varpi_B = 2$. This entails
\[
\mathbb{E}\big[(V^T_{h_t}-V^T_{h_s})^2\big] \lesssim e^{-(\lambda_B-2\rho_B) T}\,\big|\Delta_{s,t}(h f g_h)\big|_\infty^2 + e^{-(\lambda_B-2\rho_B) T}\,\Phi_T\big(B, \Delta_{s,t}(h f g_h)\big)
\]
\[
\lesssim (t-s)^2\, T^2 \big( e^{-(\lambda_B+2\rho_B) T} + e^{-\lambda_B (\frac{1}{2\beta+1}+1) T} + e^{-2\rho_B T} \big),
\]
and these terms are all again of order $(t-s)^2$ uniformly in $T$.

Step 5. It remains to show $\sup_{T>0} \mathbb{E}[(V^T_h)^2] < \infty$ in order to complete the proof of (38). By Step 2 and the definition of $\varpi_B$, we readily have
\[
\mathbb{E}\big[(V^T_h)^2\big] \lesssim
\begin{cases}
e^{-\lambda_B T}\,|h^{1/2} f g_h|_\infty^2 + \Phi_T(B, h^{1/2} f g_h) & \text{if } \lambda_B \le \rho_B,\\[2pt]
e^{-\rho_B T}\,|h^{\lambda_B/2\rho_B} f g_h|_\infty^2 + \Phi_T(B, h^{\lambda_B/2\rho_B} f g_h) & \text{if } \rho_B \le \lambda_B \le 2\rho_B,\\[2pt]
e^{-(\lambda_B-2\rho_B) T}\,|h f g_h|_\infty^2 + e^{-(\lambda_B-2\rho_B) T}\,\Phi_T(B, h f g_h) & \text{if } \lambda_B \ge 2\rho_B.
\end{cases}
\]
When $\lambda_B \le \rho_B$, since $h$ is of order $h_T(\beta)$, we have
\[
\mathbb{E}\big[(V^T_h)^2\big] \lesssim e^{-\lambda_B T} h_T(\beta)^{-1} + 1 + h_T(\beta)\, e^{\lambda_B v} + h_T(\beta)^{-1} e^{-\lambda_B v}
\]
for every $v \in [0,T]$, and the choice $v = \frac{1}{2\beta+1} T$ entails $\mathbb{E}[(V^T_h)^2] \lesssim 1$. When $\rho_B \le \lambda_B \le 2\rho_B$, we have
\[
\mathbb{E}\big[(V^T_h)^2\big] \lesssim e^{-\rho_B T} h_T(\beta)^{\frac{\lambda_B}{\rho_B}-2} + h_T(\beta)^{\frac{\lambda_B}{\rho_B}-1} + h_T(\beta)^{\frac{\lambda_B}{\rho_B}}\, e^{\lambda_B v} + h_T(\beta)^{\frac{\lambda_B}{\rho_B}-1}\, e^{(\lambda_B-\rho_B) v}.
\]
The first term is bounded as soon as $\beta \ge 1/2$, and the choice $v = \frac{\lambda_B}{\rho_B(2\beta+1)} T$ for the last two terms entails $\mathbb{E}[(V^T_h)^2] \lesssim 1$. Finally, when $2\rho_B \le \lambda_B$ we have $\mathbb{E}[(V^T_h)^2] \lesssim e^{-(\lambda_B-2\rho_B)T} + 1$, and this term is bounded likewise. Eventually (38) is established and Proposition 17 is proved. $\Box$

We now get rid of the preliminary estimators $\widehat m_T$ and $\widehat\lambda_T$. Remember that the target rate of convergence for $\widehat B_T(x)$ is
\[
w_T(B) = T^{\frac12\mathbf 1\{\lambda_B = 2\rho_B\}}\exp\Big(-\min\{\lambda_B, 2\rho_B\}\,\frac{2\beta - (\lambda_B/\rho_B - 1)_+}{2(2\beta+1)}\,T\Big).
\]
Lemma 18.
Assume that $\beta > 1/2$. Let either $G_T(y) = g_{\widehat h_T}(y)$ with $g \in \mathcal C^1$ compactly supported, or $G_T(y) = \mathbf 1_{\{y \le x\}}$, for $y \in [0,\infty)$. Then
\[
w_T(B)^{-1}\Big(\mathcal E_T\big(\mathring{\mathcal T}_T,\, \widehat m_T^{-1} e^{\widehat\lambda_T\,\cdot}\, G_T\big) - \mathcal E_T\big(\mathring{\mathcal T}_T,\, m^{-1} e^{\lambda_B\,\cdot}\, G_T\big)\Big)
\]
is $\mathcal B$-tight for the parameter $B$.

Proof. For $u \in \mathring{\mathcal T}_T$ with lifetime $\zeta_u$, define
\[
\gamma_T(u) = w_T(B)^{-1}\big(\widehat m_T^{-1} e^{\widehat\lambda_T \zeta_u} - m^{-1} e^{\lambda_B \zeta_u}\big)\, G_T(\zeta_u).
\]
Lemma 18 amounts to showing that $|\mathring{\mathcal T}_T|^{-1}\sum_{u \in \mathring{\mathcal T}_T}\gamma_T(u)$ is $\mathcal B$-tight. Set $h_T(\beta) = \exp(-\lambda_B\frac{1}{2\beta+1}T)$ and note that
\[
w_T(B)^{-1} = \big(T^{-1/2}\big)^{\mathbf 1\{\lambda_B = 2\rho_B\}}\, e^{\min\{\rho_B,\lambda_B/2\}\,T}\, h_T(\beta)^{\varpi_B/2} = v_T(B)^{-1}\, h_T(\beta)^{\varpi_B/2},
\]
where $\varpi_B = \min\{\max\{1, \lambda_B/\rho_B\}, 2\}$. We first treat the case $G_T(y) = g_{\widehat h_T}(y)$.

Step 1. By Proposition 5, we have $\widehat\lambda_T = \lambda_B + T v_T(B)\, r_T$ and $\widehat m_T^{-1} = m^{-1} + e^{-\lambda_B T/2}\, r'_T$, where both $r_T$ and $r'_T$ are $\mathcal B$-tight. We then have the decomposition
\[
\gamma_T(u) = w_T(B)^{-1}\widehat m_T^{-1}\big(e^{\widehat\lambda_T\zeta_u} - e^{\lambda_B\zeta_u}\big)\, g_{\widehat h_T}(\zeta_u) + w_T(B)^{-1}\big(\widehat m_T^{-1} - m^{-1}\big)\, e^{\lambda_B\zeta_u}\, g_{\widehat h_T}(\zeta_u)
\]
\[
= T\, h_T(\beta)^{\varpi_B/2}\, \widehat m_T^{-1} r_T\, \zeta_u\, e^{\vartheta_T\zeta_u}\, g_{\widehat h_T}(\zeta_u) + w_T(B)^{-1} e^{-\lambda_B T/2} e^{\lambda_B\zeta_u}\, r'_T\, g_{\widehat h_T}(\zeta_u) = I + II,
\]
say, with $\vartheta_T \in [\min\{\lambda_B,\widehat\lambda_T\}, \max\{\lambda_B,\widehat\lambda_T\}]$. Since $g$ is supported in $[0,C]$ and $\widehat m_T^{-1}$, $\vartheta_T$ and $\widehat h_T$ are $\mathcal B$-tight, we can write
\[
|I| \le T\, h_T(\beta)^{\varpi_B/2}\, \widehat m_T^{-1} r_T\, (C\widehat h_T + x)\, e^{\vartheta_T(C\widehat h_T + x)}\, |g_{\widehat h_T}(\zeta_u)| = T\, h_T(\beta)^{\varpi_B/2}\, |g_{\widehat h_T}(\zeta_u)|\, \widetilde r_T
\]
and
\[
|II| \le h_T(\beta)^{\varpi_B/2}\, e^{\lambda_B(C\widehat h_T + x)}\, r'_T\, |g_{\widehat h_T}(\zeta_u)| = h_T(\beta)^{\varpi_B/2}\, |g_{\widehat h_T}(\zeta_u)|\, \widetilde r'_T,
\]
where $\widetilde r_T$ and $\widetilde r'_T$ are $\mathcal B$-tight.

Step 2. We are left to proving the tightness of $T\, h_T(\beta)^{\varpi_B/2}\, |g_{\widehat h_T}(\zeta_u)|$ when averaging over $\mathring{\mathcal T}_T$, that is to say the tightness of $T\, h_T(\beta)^{\varpi_B/2}\, \mathcal E_T(\mathring{\mathcal T}_T, |g_{\widehat h_T}|)$. We plan to use Proposition 17. For $\kappa > 0$, on the event
\[
\mathcal A_{T,\kappa} = \big\{\widehat h_T \in \mathcal I_{T,\kappa}\big\}, \qquad \mathcal I_{T,\kappa} = \big[h_T(\beta)\big(1 \pm \kappa\, T\, v_T(B)\big)\big],
\]
we have $T\, h_T(\beta)^{\varpi_B/2}\, \mathcal E_T(\mathring{\mathcal T}_T, |g_{\widehat h_T}|) \le III + IV$, with
\[
III = T\, h_T(\beta)^{\varpi_B/2}\, \sup_{h \in \mathcal I_{T,\kappa}}\mathring{\mathbb E}_B(|g_h|)
\]
and
\[
IV = T\, h_T(\beta)^{\varpi_B/2}\,\big(h_T(\beta)(1 - \kappa T v_T(B))\big)^{-\varpi_B/2}\sup_{h \in \mathcal I_{T,\kappa}}\big|\mathcal E_T\big(\mathring{\mathcal T}_T, h^{\varpi_B/2}|g_h|\big) - \mathring{\mathbb E}_B\big(h^{\varpi_B/2}|g_h|\big)\big|
\lesssim T\,\sup_{h \in \mathcal I_{T,\kappa}}\big|\mathcal E_T\big(\mathring{\mathcal T}_T, h^{\varpi_B/2}|g_h|\big) - \mathring{\mathbb E}_B\big(h^{\varpi_B/2}|g_h|\big)\big|.
\]
Concerning the main term $III$, we write
\[
\mathring{\mathbb E}_B(|g_h|) = m\int_0^\infty h^{-1}\big|g\big(h^{-1}(x-y)\big)\big|\, e^{-\lambda_B y} f_B(y)\, dy \le m\,\sup_y\big(e^{-\lambda_B y} f_B(y)\big)\int_0^\infty |g(y)|\, dy \lesssim 1 \quad \text{uniformly in } B \in \mathcal B,
\]
so we have a bound that does not depend on $h$ and we readily conclude $III \lesssim \mathbf 1_{\mathcal A_{T,\kappa}}$. For the remainder term $IV$, we apply Proposition 17 and obtain the $\mathcal B$-tightness of $IV$ (which actually goes to $0$ at a fast rate) on $\mathcal A_{T,\kappa}$.

Step 3. It remains to control the probability of $\mathcal A_{T,\kappa}$. By Proposition 5, we have $\widehat\lambda_T = \lambda_B + T v_T(B)\, r_T$, where $r_T$ is $\mathcal B$-tight. It follows that
\[
\mathbb P(\mathcal A^c_{T,\kappa}) = \mathbb P\big(|\widehat h_T - h_T(\beta)| \ge \kappa\, h_T(\beta)\, T\, v_T(B)\big) = \mathbb P\big(\big|1 - e^{-(\widehat\lambda_T - \lambda_B)\frac{1}{2\beta+1}T}\big| \ge \kappa\, T\, v_T(B)\big) = \mathbb P\big(\big|\tfrac{T}{2\beta+1}\, r_T\, e^{-\vartheta_T\frac{1}{2\beta+1}T}\big| \ge \kappa\big),
\]
where both $|\vartheta_T| \le |\widehat\lambda_T - \lambda_B|$ and $r_T$ are tight, and this term can be made arbitrarily small by taking $\kappa$ large enough.

The case $G_T(y) = \mathbf 1_{\{y \le x\}}$ is obtained in the same way and is actually much simpler, since there is no factor $\widehat h_T^{-1}$ in Step 2, which is therefore straightforward, and there is also no need for Step 3. We omit the details. $\Box$

Proof of Theorem 7.
We are ready to prove the main result of the paper. The key ingredient is Proposition 17.
Step 1.
In view of Lemma 18 with test function $g = K$, it is now sufficient to prove Theorem 7 replacing $\widehat B_T(x)$ by $\widetilde B_T(x)$, where
\[
\widetilde B_T(x) = \frac{\mathcal E_T\big(\mathring{\mathcal T}_T,\, m^{-1} e^{\lambda_B\,\cdot}\, K_{\widehat h_T}(x-\cdot)\big)}{1 - \mathcal E_T\big(\mathring{\mathcal T}_T,\, m^{-1} e^{\lambda_B\,\cdot}\,\mathbf 1_{\{\cdot \le x\}}\big)}.
\]
Since $(x,y) \mapsto x/(1-y)$ is Lipschitz continuous on compact sets that are bounded away from $\{y = 1\}$, this simply amounts to showing the $\mathcal B$-tightness of
\[
(39)\qquad w_T(B)^{-1}\Big(\mathcal E_T\big(\mathring{\mathcal T}_T,\, m^{-1} e^{\lambda_B\,\cdot}\,\mathbf 1_{\{\cdot \le x\}}\big) - \mathring{\mathbb E}_B\big(m^{-1} e^{\lambda_B\,\cdot}\,\mathbf 1_{\{\cdot \le x\}}\big)\Big)
\]
and
\[
(40)\qquad w_T(B)^{-1}\Big(\mathcal E_T\big(\mathring{\mathcal T}_T,\, m^{-1} e^{\lambda_B\,\cdot}\, K_{\widehat h_T}(x-\cdot)\big) - f_B(x)\Big),
\]
where $w_T(B)^{-1} = (T^{-1/2})^{\mathbf 1\{\lambda_B = 2\rho_B\}}\, e^{\min\{\rho_B,\lambda_B/2\}T}\, h_T(\beta)^{\varpi_B/2} = v_T(B)^{-1} h_T(\beta)^{\varpi_B/2}$. We readily obtain the $\mathcal B$-tightness of (39) by applying Theorem 4 with test function $g(y) = m^{-1} e^{\lambda_B y}\,\mathbf 1_{\{y \le x\}}$, since $v_T(B) \ll w_T(B)$ (we even have convergence to $0$).

Step 2. We turn to the main term (40). For $h > 0$, introduce the notation
\[
K_h f_B(x) = \mathring{\mathbb E}_B\big(m^{-1} e^{\lambda_B\,\cdot}\, K_h(x-\cdot)\big) = \int_0^\infty K_h(x-y)\, f_B(y)\, dy.
\]
For $\kappa > 0$, on the event $\mathcal A_{T,\kappa} = \{\widehat h_T \in \mathcal I_{T,\kappa}\}$ with $\mathcal I_{T,\kappa} = [h_T(\beta)(1 \pm \kappa T v_T(B))]$, introducing the approximation term $K_h f_B(x)$, we obtain a bias-variance bound that reads
\[
\big|\mathcal E_T\big(\mathring{\mathcal T}_T,\, m^{-1} e^{\lambda_B\,\cdot}\, K_{\widehat h_T}\big) - f_B(x)\big| \le I + II,
\]
with
\[
I = \sup_{h \in \mathcal I_{T,\kappa}}\big|K_h f_B(x) - f_B(x)\big|
\]
and
\[
II = \sup_{h \in \mathcal I_{T,\kappa}}\big|\mathcal E_T\big(\mathring{\mathcal T}_T,\, m^{-1} e^{\lambda_B\,\cdot}\, K_h\big) - \mathring{\mathbb E}_B\big(m^{-1} e^{\lambda_B\,\cdot}\, K_h\big)\big|.
\]
The term $I$ is treated by the following classical argument in nonparametric estimation: since $B \in \mathcal H^\beta_D(L)$, we also have $f_B \in \mathcal H^\beta_D(L')$ for another constant $L'$ that only depends on $D$, $L$ and $\beta$. Write $\beta = \lfloor\beta\rfloor + \{\beta\}$ with $\lfloor\beta\rfloor$ a non-negative integer and $\{\beta\} > 0$. By a Taylor expansion up to order $\lfloor\beta\rfloor$ (recall that the number $n_0$ of vanishing moments of $K$ in Assumption 6 satisfies $n_0 > \beta$), we obtain
\[
I \lesssim \sup_{h \in \mathcal I_{T,\kappa}} h^\beta = \big(h_T(\beta)(1 + \kappa T v_T(B))\big)^\beta \lesssim w_T(B),
\]
see for instance Proposition 1.2 in Tsybakov [24]. This term has the right order whenever $\lambda_B \le 2\rho_B$ and is negligible otherwise.

Step 3. We further bound the term $II$ on $\mathcal A_{T,\kappa}$ as follows:
\[
|II| \le \big(h_T(\beta)(1 - \kappa T v_T(B))\big)^{-\varpi_B/2}\sup_{h \in \mathcal I_{T,\kappa}}\big|\mathcal E_T\big(\mathring{\mathcal T}_T,\, h^{\varpi_B/2} m^{-1} e^{\lambda_B\,\cdot}\, K_h\big) - \mathring{\mathbb E}_B\big(h^{\varpi_B/2} m^{-1} e^{\lambda_B\,\cdot}\, K_h\big)\big|.
\]
By assumption, we have $\beta \ge 1/2$, so by Proposition 17 applied to $f(y) = m^{-1} e^{\lambda_B y}\,\mathbf 1_{\{y \le x+C\}} \in L^\infty_{C+x}$ and $g = K \in \mathcal C^1_{C+x}$, we conclude that $v_T(B)^{-1} h_T(\beta)^{\varpi_B/2}\, |II|$ is $\mathcal B$-tight. The fact that $v_T(B)^{-1} h_T(\beta)^{\varpi_B/2} = w_T(B)^{-1}$ enables us to conclude.

Step 4. It remains to control the probability of $\mathcal A_{T,\kappa}$. This is done exactly in the same way as for Step 3 in the proof of Lemma 18. $\Box$
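The bias bound in Step 2 above ($I \lesssim h^\beta$, via a Taylor expansion and the vanishing moments of $K$) can be illustrated numerically. The sketch below is illustrative only; the exponential density, the Epanechnikov kernel and the bandwidths are arbitrary choices, not objects from the proof. It checks that for a twice-differentiable density the smoothing bias $|K_h * f(x) - f(x)|$ scales like $h^2$, so halving $h$ divides the bias by about $4$:

```python
import numpy as np

def epanechnikov(u):
    # order-2 kernel: integrates to 1, zero first moment
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def trapezoid(y, x):
    # simple trapezoidal quadrature (avoids NumPy version differences)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def smoothed(f, x, h, grid):
    # K_h * f (x) = \int h^{-1} K(h^{-1}(x - y)) f(y) dy
    kh = epanechnikov((x - grid) / h) / h
    return trapezoid(kh * f(grid), grid)

f = lambda y: np.exp(-y)                  # Exp(1) density on [0, infinity)
x = 2.0                                   # estimation point, away from the boundary
grid = np.linspace(0.0, 10.0, 200001)
bias = lambda h: abs(smoothed(f, x, h, grid) - f(x))
ratio = bias(0.2) / bias(0.1)             # expect about 2**2 = 4 for a C^2 density
print(ratio)
```

For a $\beta$-Hölder density with $\beta < 2$ the same experiment would produce a ratio close to $2^\beta$, in line with the bound $I \lesssim h^\beta$.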
Proof of Theorem 8.
We will actually prove a slightly stronger result, by restricting the supremum in $B$ to a neighbourhood of an arbitrary function $B_0$, provided $B_0$ is an element of the set $\mathcal B_{b,m}$ defined in (20) and slightly smoother in $\mathcal H^\beta_D$ norm (and not identically equal to the maximal element of $\mathcal B_{b,m}$). (Remember also that $\mathcal B_{b,m} \subset \mathcal B^+_{b,\,mb/(m-1)}$ by Proposition 9.)

Remember that the evolution of the Bellman–Harris model can be described by a piecewise deterministic Markov process $X(t) = (X_1(t), X_2(t), \ldots)$, $t \ge 0$, with values in $S = \bigcup_{k \ge 1}[0,\infty)^k$, where the $X_i(t)$ denote the (ordered) ages of the living particles at time $t$. Following Löcherbach [18], we write $D([0,\infty), S)$ for the Skorokhod space of càdlàg functions $\varphi : [0,\infty) \to S$ and introduce the subset $\Omega \subset D([0,\infty), S)$ of functions $\varphi$ such that:
(i) there is an increasing sequence of jump times $T_0 = 0 < T_1 < T_2 < \cdots$ such that the restriction $\varphi|_{[T_k, T_{k+1})}$ is continuous with values in $[0,\infty)^{l_{k,\varphi}}$ for some $l_{k,\varphi} \ge 1$ and every $k \ge 0$;
(ii) $\ell(\varphi(T_k)) \ne \ell(\varphi(T_{k+1}))$ for every $k \ge 0$, where we set $\ell(x) = \sum_{k \ge 1} k\,\mathbf 1\{x \in [0,\infty)^k\}$ for $x \in S$.
We endow $\Omega$ with its Borel sigma-field $\mathcal F$, its canonical process $X_t(\varphi) = (\varphi_1(t), \varphi_2(t), \ldots)$ and its canonical filtration $(\mathcal F_t)_{t \ge 0}$ (modified in order to be right-continuous). By Proposition 3.3 of Löcherbach [18], there is a unique probability measure $\mathbb P_B$ on $(\Omega, \mathcal F, (\mathcal F_t)_{t \ge 0})$ such that $X$ is strongly Markov under $\mathbb P_B$ with $\mathbb P_B(X(0) = 0) = 1$ (i.e. we start with one common ancestor with age $0$ at time $0$) and such that the random continuous time rooted tree associated to $X$ via
\[
\sum_{i \ge 1}\mathbf 1\{X_i(t) > 0\}\,\delta_{X_i(t)} = \sum_{u \in \mathcal T}\mathbf 1\{t \in [b_u, d_u)\}\,\delta_{t - b_u}
\]
is a Bellman–Harris process according to Definition 1. The strategy for proving the lower bound is a classical two-point information inequality; we nevertheless need to be careful, since the target lower bound rate $e^{-\lambda_B\frac{\beta}{2\beta+1}T}$ is parameter dependent in a non-trivial way.

Step 1. Let $\delta > 0$. Fix $B_0 \in \mathcal B_{b,m} \cap \mathcal H^\beta_D(L-\delta)$ and $x \in D$. Then, for large enough $T$, setting $h_T(B_0) = e^{-\lambda_{B_0}\frac{1}{2\beta+1}T}$, we construct a perturbation $B_T$ of $B_0$ defined by
\[
B_T(y) = B_0(y) + a\, h_T(B_0)^{\beta+1}\, K_{h_T(B_0)}(y - x), \qquad y \in [0,\infty),
\]
for some nonnegative smooth kernel $K$ with compact support such that $K(0) = 1$ and some $a = a_{\delta,K} > 0$ small enough so that $B_T \in \mathcal B_{b,m} \cap \mathcal H^\beta_D(L)$ for every $T \ge 0$. Such a choice is always possible (provided $B_0$ is not identically equal to the maximal element of the class in a neighbourhood of $x$, which we may and will assume from now on) thanks to the assumption $\|B_0\|_{\mathcal H^\beta_D} \le L - \delta$; it then suffices to impose $\|a\, h_T^{\beta+1} K_{h_T}(\cdot - x)\|_{\mathcal H^\beta_D} \le \delta$, which is easily obtained by picking $a_{\delta,K}$ sufficiently small. Also, by construction, we have $B_0(y) \le B_T(y)$ for every $y \ge 0$, hence $\lambda_{B_0} \le \lambda_{B_T}$ (compare the proof of Proposition 12 (ii)), and at $y = x$ the lower estimate $|B_0(x) - B_T(x)| = a_{\delta,K}\, h_T(B_0)^\beta$ holds; this quantity is of order $e^{-\lambda_{B_0}\frac{\beta}{2\beta+1}T}$.

Step 2. Abusing notation slightly, we further write $\mathbb P_B$ for $\mathbb P_B|_{\mathcal F_T}$, i.e. the measure in restriction to the $\sigma$-field generated by the observation $(X(t))_{0 \le t \le T}$. Since $B_0, B_T \in \mathcal B_{b,m} \cap \mathcal H^\beta_D(L)$, for an arbitrary estimator $\widehat B_T(x)$ and any constant $C' > 0$,
\[
\max_{B \in \{B_0, B_T\}}\mathbb P_B\big(e^{\lambda_B\frac{\beta}{2\beta+1}T}\,|\widehat B_T(x) - B(x)| \ge C'\big)
\]
\[
\ge \tfrac12\Big(\mathbb P_{B_0}\big(e^{\lambda_{B_0}\frac{\beta}{2\beta+1}T}|\widehat B_T(x) - B_0(x)| \ge C'\big) + \mathbb P_{B_T}\big(e^{\lambda_{B_T}\frac{\beta}{2\beta+1}T}|\widehat B_T(x) - B_T(x)| \ge C'\big)\Big)
\]
\[
\ge \tfrac12\,\mathbb E_{B_0}\Big[\mathbf 1_{\{e^{\lambda_{B_0}\frac{\beta}{2\beta+1}T}|\widehat B_T(x) - B_0(x)| \ge C'\}} + \mathbf 1_{\{e^{\lambda_{B_T}\frac{\beta}{2\beta+1}T}|\widehat B_T(x) - B_T(x)| \ge C'\}}\Big] - \tfrac12\,\|\mathbb P_{B_0} - \mathbb P_{B_T}\|_{TV}.
\]
By the triangle inequality, we have
\[
e^{\lambda_{B_0}\frac{\beta}{2\beta+1}T}|\widehat B_T(x) - B_0(x)| + e^{\lambda_{B_T}\frac{\beta}{2\beta+1}T}|\widehat B_T(x) - B_T(x)| \ge e^{\min\{\lambda_{B_0},\lambda_{B_T}\}\frac{\beta}{2\beta+1}T}\,|B_0(x) - B_T(x)| \ge a_{K,\delta}
\]
by Step 1, so if we pick $C' < a_{K,\delta}/2$, one of the two indicators within the expectation above must be equal to one with full $\mathbb P_{B_0}$-probability. In that case,
\[
\max_{B \in \{B_0, B_T\}}\mathbb P_B\big(e^{\lambda_B\frac{\beta}{2\beta+1}T}|\widehat B_T(x) - B(x)| \ge C'\big) \ge \tfrac12\big(1 - \|\mathbb P_{B_0} - \mathbb P_{B_T}\|_{TV}\big),
\]
and Theorem 8 is thus proved if $\limsup_{T \to \infty}\|\mathbb P_{B_0} - \mathbb P_{B_T}\|_{TV} < 1$.
Step 3.
By Pinsker's inequality, we have $\|\mathbb P_{B_0} - \mathbb P_{B_T}\|_{TV} \le \big(\tfrac12\,\mathbb E_{B_0}\big[\log\tfrac{d\mathbb P_{B_0}}{d\mathbb P_{B_T}}\big]\big)^{1/2}$. By Theorem 3.5 in [18], the measures $\mathbb P_{B_0}$ and $\mathbb P_{B_T}$ are equivalent on $\mathcal F_T$ and we have
\[
\log\Big(\frac{d\mathbb P_{B_T}}{d\mathbb P_{B_0}}\Big) = \sum_{u \in \mathring{\mathcal T}_T}\log\Big(\frac{B_T}{B_0}(\zeta_u)\Big) - \int_0^T\sum_{u \in \partial\mathcal T_s}(B_T - B_0)(\zeta^s_u)\, ds,
\]
where $\zeta^t_u$ denotes the age of the cell $u$ at time $t \in I_u = [b_u, d_u)$. Using $-\log(1+x) \le -x + x^2$ if $x \ge -1/2$ and setting $\varepsilon_T(y) = a_{K,\delta}\, h_T(B_0)^{\beta+1}\, K_{h_T(B_0)}(y - x)$, we further infer
\[
\|\mathbb P_{B_0} - \mathbb P_{B_T}\|_{TV}^2 \le \tfrac12\Big(\mathbb E_{B_0}\Big[\sum_{u \in \mathring{\mathcal T}_T}\frac{\varepsilon_T^2}{B_0^2}(\zeta_u)\Big] - \mathbb E_{B_0}\Big[\sum_{u \in \mathring{\mathcal T}_T}\frac{\varepsilon_T}{B_0}(\zeta_u)\Big] + \int_0^T\mathbb E_{B_0}\Big[\sum_{u \in \partial\mathcal T_s}\varepsilon_T(\zeta^s_u)\Big]\, ds\Big)
= \frac{1}{2m}\int_0^T e^{\lambda_{B_0}s}\,\mathbb E_{B_0}\Big[\frac{\varepsilon_T^2}{B_0^2}(\widetilde\chi_s)\, H_{B_0}(\widetilde\chi_s)\Big]\, ds
\]
by (23) and (24) in Proposition 10 and the fact that the last two terms cancel. We now use the same kind of estimates as in the proof of Proposition 15, Step 1, with test function $g = \varepsilon_T^2/B_0^2$ to finally get
\[
\|\mathbb P_{B_0} - \mathbb P_{B_T}\|_{TV}^2 \lesssim e^{\lambda_{B_0}T}\,\big|B_0^{-2}\varepsilon_T^2\big|_1 + \big|B_0^{-2}\varepsilon_T^2\big|_\infty \lesssim a_{K,\delta}^2,
\]
and this term can be made arbitrarily small by picking $a_{K,\delta}$ small enough.

5.6.
Proof of Proposition 9.
Pick $B \in \mathcal B_{b,m}$. We need to prove that $\lambda_B \le \rho_B = \inf_x H_B(x)$. By representation (3), we have
\[
H_B(x) = \frac{m e^{-\lambda_B x} f_B(x)}{1 - m\int_0^x e^{-\lambda_B y} f_B(y)\, dy} = \frac{m e^{-\lambda_B x} B(x)\, e^{-\int_0^x B(y)dy}}{1 - m\int_0^x e^{-\lambda_B y} B(y)\, e^{-\int_0^y B(u)du}\, dy}.
\]
Set
\[
G_B(x) = m e^{-\lambda_B x} B(x)\, e^{-\int_0^x B(y)dy} - \lambda_B\Big(1 - m\int_0^x e^{-\lambda_B y} B(y)\, e^{-\int_0^y B(u)du}\, dy\Big).
\]
The statement $\lambda_B \le \rho_B$ is equivalent to proving that $\inf_{x \ge 0} G_B(x) \ge 0$. We first claim that $B(x) \le \widetilde B(x)$ for every $x \in (0,\infty)$ implies $\lambda_B \le \lambda_{\widetilde B}$. Indeed, in that case one can construct on the same probability space two random variables $\tau_B$ with density $f_B$ and $\tau_{\widetilde B}$ with density $f_{\widetilde B}$ such that $\tau_B \ge \tau_{\widetilde B}$. It follows that $\phi_B(\lambda) = \mathbb E[e^{-\lambda\tau_B}] \le \phi_{\widetilde B}(\lambda) = \mathbb E[e^{-\lambda\tau_{\widetilde B}}]$ for every $\lambda \ge 0$. Also, $\phi_B$ and $\phi_{\widetilde B}$ are both non-increasing, vanish at infinity, and $\phi_B(0) = \phi_{\widetilde B}(0) = 1 > 1/m$. Consequently, the values $\lambda_B$ and $\lambda_{\widetilde B}$ such that $\phi_B(\lambda_B) = \phi_{\widetilde B}(\lambda_{\widetilde B}) = 1/m$ necessarily satisfy $\lambda_B \le \lambda_{\widetilde B}$, hence the claim. Now, for constant functions $B(x) \equiv \alpha$, we clearly have $\lambda_B = (m-1)\alpha$, and this enables us to infer
\[
\lambda_B \le (m-1)\sup_x B(x).
\]
Remember now that $B \in \mathcal B_{b,m}$ implies $b \le B(x) \le \frac{m}{m-1}b$ for every $x \ge 0$. Therefore
\[
(41)\qquad \lambda_B \le (m-1)\,\frac{m}{m-1}\, b = mb \le m B(0),
\]
and $G_B(0) = m B(0) - \lambda_B \ge 0$. Moreover,
\[
G'_B(x) = m e^{-\lambda_B x}\, e^{-\int_0^x B(y)dy}\,\big(B'(x) - B(x)^2\big) \le 0,
\]
since $B'(x) \le B(x)^2$ for every $B \in \mathcal B_{b,m}$. So $G_B$ is non-increasing, $G_B(0) \ge 0$ and $G_B(x) \to 0$ as $x \to \infty$. Since $G_B(\infty) = 0$, we conclude $\inf_{x \ge 0} G_B(x) \ge 0$.

Finally, the class $\mathcal B^-_{b,C}$ is non-trivial when $C > mb/(m-1)$: pick $0 < x_0 \le x_1$, $mb/(m-1) < c \le C$, and let $B(x) = b$ for $x \le x_0$, $B(x) = c$ for $x \ge x_1$, with any smooth continuation between $x_0$ and $x_1$ bounded above by $C$ and below by $b$. Then, having $b, c$ such that $2m(m+2)b/(m-1) < c$ and suitable choices for $x_0$ and $x_1$ implies $\rho_B < \lambda_B/2$. Having $2mb/(m-1) > c$ and suitable choices for $x_0, x_1$ implies $\rho_B < \lambda_B \le 2\rho_B$. The computations, based on the same kind of estimates, are rather tedious but not difficult. We omit the details.

6. Appendix
6.1. Heuristics for the convergences to the limits (9) and (8).

Information from $\mathcal E_T(\partial\mathcal T_T, g)$. Heuristically, we postulate for large $T$ the approximation
\[
\mathcal E_T(\partial\mathcal T_T, g) \sim \frac{1}{\mathbb E[|\partial\mathcal T_T|]}\,\mathbb E\Big[\sum_{u \in \partial\mathcal T_T} g(\zeta^T_u)\Big].
\]
Then a classical result based on renewal theory (see Theorem 17.1, pp. 142-143 of [12]) gives the estimate
\[
(42)\qquad \mathbb E\big[|\partial\mathcal T_T|\big] \sim \kappa_B\, e^{\lambda_B T},
\]
where $\lambda_B > 0$ is the Malthus parameter and $\kappa_B > 0$ a constant depending on $B$ and $m$ (see [12] and also Lemma 13 below). As for the numerator, call $\chi_t$ the age of a particle at time $t$ along a branch of the tree picked at random uniformly at each branching event. The process $(\chi_t)_{t \ge 0}$ is a Markov process with values in $[0,\infty)$ with infinitesimal generator
\[
(43)\qquad \mathcal A_B\, g(x) = g'(x) + B(x)\big(g(0) - g(x)\big),
\]
densely defined on continuous functions vanishing at infinity. Assume for simplicity that each cell $u \in \mathcal U$ has exactly $m$ children at each division. It is then relatively straightforward to obtain the identity
\[
(44)\qquad \mathbb E\Big[\sum_{u \in \partial\mathcal T_T} g(\zeta^T_u)\Big] = \mathbb E\big[m^{N_T}\, g(\chi_T)\big],
\]
where $N_t = \sum_{s \le t}\mathbf 1\{\chi_s - \chi_{s-} < 0\}$ is the counting process associated to $(\chi_t)_{t \ge 0}$; see Proposition 10 in a general setting. Putting together (42) and (44), we thus expect
\[
\mathcal E_T(\partial\mathcal T_T, g) \sim \kappa_B^{-1}\, e^{-\lambda_B T}\,\mathbb E\big[m^{N_T}\, g(\chi_T)\big],
\]
and we anticipate that the term $e^{-\lambda_B T}$ should somehow be compensated by the term $m^{N_T}$ within the expectation. To that end, following Cloez [5] (and also Bansaye et al. [3] when $B$ is constant), one introduces an auxiliary "biased" Markov process $(\widetilde\chi_t)_{t \ge 0}$, with generator $\mathcal A_{H_B}$ for a biasing function $H_B(x)$ characterised by
\[
(45)\qquad f_{H_B}(x) = m e^{-\lambda_B x} f_B(x), \qquad x \ge 0,
\]
where $f_B(x) = B(x)\exp(-\int_0^x B(y)dy)$ denotes the density associated to the division rate $B$, as follows from (3) or (5). This implies
\[
H_B(x) = \frac{m e^{-\lambda_B x} f_B(x)}{1 - m\int_0^x e^{-\lambda_B y} f_B(y)\, dy}.
\]
Furthermore, this choice (and this choice only, see Proposition 10) enables us to obtain
\[
(46)\qquad e^{-\lambda_B T}\,\mathbb E\big[m^{N_T}\, g(\chi_T)\big] = \frac{1}{m}\,\mathbb E\Big[g(\widetilde\chi_T)\,\frac{H_B(\widetilde\chi_T)}{B(\widetilde\chi_T)}\Big]
\]
with $\widetilde\chi_0 = 0$ under $\mathbb P$. Moreover $(\widetilde\chi_t)_{t \ge 0}$ is geometrically ergodic, with invariant probability $c_B\exp(-\int_0^x H_B(y)dy)\, dx$ (see Proposition 12). We further anticipate
\[
\mathbb E\Big[g(\widetilde\chi_T)\,\frac{H_B(\widetilde\chi_T)}{B(\widetilde\chi_T)}\Big] \sim c_B\int_0^\infty g(x)\, \frac{H_B(x)}{B(x)}\, e^{-\int_0^x H_B(y)dy}\, dx = m\, c_B\int_0^\infty g(x)\, e^{-\lambda_B x}\, \frac{f_B(x)}{B(x)}\, dx,
\]
assuming everything is well-defined, since $H_B(x)\exp(-\int_0^x H_B(y)dy) = f_{H_B}(x) = m e^{-\lambda_B x} f_B(x)$ by (45). Finally, we have $\kappa_B^{-1} c_B = \lambda_B\frac{m}{m-1}$ by Lemma 13, which enables us to conclude
\[
\mathcal E_T(\partial\mathcal T_T, g) \sim \partial\mathbb E_B(g), \quad \text{where} \quad \partial\mathbb E_B(g) = \lambda_B\,\frac{m}{m-1}\int_0^\infty g(x)\, e^{-\lambda_B x}\, e^{-\int_0^x B(y)dy}\, dx.
\]
Unfortunately, the statistical information extracted from $\mathcal E_T(\partial\mathcal T_T, g)$ does not enable us to obtain classical optimal rates of convergence, since the form of $\partial\mathbb E_B(g)$ involves an antiderivative of $B$, leading to so-called ill-posedness. This is discussed at length in Section 3.3. We thus investigate in a second step the statistical information we can get from $\mathring{\mathcal T}_T$.

Information from $\mathcal E_T(\mathring{\mathcal T}_T, g)$. The situation is a bit different if we allow for data in $\mathring{\mathcal T}_T$. Note first that $\zeta^T_u = \zeta_u$ on the event $u \in \mathring{\mathcal T}_T$. We also have in that case a many-to-one formula that now reads
\[
(47)\qquad \mathbb E\Big[\sum_{u \in \mathring{\mathcal T}_T} g(\zeta^T_u)\Big] = \mathbb E\Big[\sum_{u \in \mathring{\mathcal T}_T} g(\zeta_u)\Big] = \frac{1}{m}\int_0^T e^{\lambda_B s}\,\mathbb E\big[g(\widetilde\chi_s)\, H_B(\widetilde\chi_s)\big]\, ds,
\]
where $(\widetilde\chi_t)_{t \ge 0}$ is the one-dimensional auxiliary Markov process with generator $\mathcal A_{H_B}$, see (43), where $H_B$ is characterised by (45) above. Assuming again ergodicity, we approximate the right-hand side of (47) and obtain
\[
\mathbb E\Big[\sum_{u \in \mathring{\mathcal T}_T} g(\zeta_u)\Big] \sim \frac{c_B}{m}\,\frac{e^{\lambda_B T}}{\lambda_B}\int_0^\infty g(x)\, H_B(x)\, e^{-\int_0^x H_B(u)du}\, dx = c_B\,\frac{e^{\lambda_B T}}{\lambda_B}\int_0^\infty g(x)\, e^{-\lambda_B x} f_B(x)\, dx,
\]
since $H_B(x)\exp(-\int_0^x H_B(y)dy) = f_{H_B}(x) = m e^{-\lambda_B x} f_B(x)$ by (45). We again have an approximation of the type (42) with another constant $\kappa'_B$, see Lemma 14, and we eventually expect
\[
\mathcal E_T(\mathring{\mathcal T}_T, g) \sim \mathring{\mathbb E}_B(g), \quad \text{where} \quad \mathring{\mathbb E}_B(g) = \frac{c_B}{\lambda_B\,\kappa'_B}\int_0^\infty g(x)\, e^{-\lambda_B x} f_B(x)\, dx = m\int_0^\infty g(x)\, e^{-\lambda_B x} f_B(x)\, dx
\]
as $T \to \infty$, where the last equality stems from the identity $c_B = \lambda_B\,\kappa'_B\, m$, which can be readily derived by picking $g = 1$ and using (45) together with the fact that $f_{H_B}$ is a density function.

6.2. Proof of Proposition 10.
We start with a continuous time rooted tree which is a Bellman–Harris process in the sense of Definition 1, so we have random variables $(\zeta_u, \nu_u, u \in \mathcal U)$ satisfying properties (i), (ii) and (iii) of the definition. For $u \in \mathcal U$ and $t \ge 0$, let
\[
\Lambda^u_t = \sum_{v \prec u}\log(\nu_v), \qquad t \ge b_u.
\]
Let $\vartheta = (\vartheta_k)_{k \ge 0}$ with $\vartheta_k \in \mathcal U$ be such that $|\vartheta_k| = k$ for every $k \ge 0$ (with $\vartheta_0 = \emptyset$) and $\vartheta_k \preceq \vartheta_l$ for $k \le l$. We associate to $\vartheta$ a counting process $(N_t)_{t \ge 0}$ via the relationship
\[
b_{\vartheta_{N_t}} \le t < d_{\vartheta_{N_t}}, \qquad t \ge 0.
\]
This enables us to further obtain a "tagged process of age" such that $\chi_t = \zeta^t_{\vartheta_{N_t}}$ for $t \in I_{\vartheta_{N_t}}$, and also a process $(\Lambda_t)_{t \ge 0}$ that encodes the genealogy of the tagged branch:
\[
\Lambda_t = \sum_{k=1}^{N_t}\log(\nu_{\vartheta_k}), \qquad t \ge 0.
\]

Step 1. Let us pick $\vartheta$ at random along the genealogical tree $\mathcal T$. This means that if $\mathcal H_n$ denotes the sigma-field generated by $(\zeta_u, \nu_u, u \in \mathcal T, |u| \le n)$, then on the event $\{t \in I_u\}$ (i.e. the particle $u$ is living at time $t$), we have (or rather, we set)
\[
\mathbb P\big(\vartheta_{N_t} = u \,\big|\, \mathcal H_{|u|}\big) = \prod_{v \prec u}\nu_v^{-1} = e^{-\Lambda^u_t}.
\]
It is not difficult to see that $(\chi_t)_{t \ge 0}$ is a Markov process with generator $\mathcal A_B$. By definition of $(\chi_t)_{t \ge 0}$ and $(\Lambda_t)_{t \ge 0}$, it follows that $\mathbb E[e^{\Lambda_T} g(\chi_T)]$ can be rewritten as
\[
\sum_{u \in \mathcal U}\mathbb E\big[e^{\Lambda_T} g(\chi_T)\,\mathbf 1_{\{T \in I_u,\, u = \vartheta_{N_T}\}}\big] = \sum_{u \in \mathcal U}\mathbb E\big[e^{\Lambda^u_T} g(\zeta^T_u)\,\mathbf 1_{\{T \in I_u,\, u = \vartheta_{N_T}\}}\big] = \sum_{u \in \mathcal U}\mathbb E\big[g(\zeta^T_u)\,\mathbf 1_{\{T \in I_u\}}\big],
\]
where the last equality is obtained by conditioning with respect to $\mathcal H_{|u|}$.

Step 2. For $j \ge 1$, let $\tau_j = \inf\{t \ge 0,\, N_t \ge j\} - \inf\{t \ge 0,\, N_t \ge j-1\}$ denote the durations between the jumps of $(\chi_t)_{t \ge 0}$, so that
\[
e^{\Lambda_T} g(\chi_T) = \sum_{k=0}^\infty e^{\sum_{j=1}^k\log(\nu_{\vartheta_j})}\, g\Big(T - \sum_{j=1}^k\tau_j\Big)\,\mathbf 1_{\{\sum_{j=1}^k\tau_j \le T < \sum_{j=1}^{k+1}\tau_j\}}.
\]
By properties (i)-(iii) of Definition 1, the $\tau_i$ are independent with common distribution $f_B(x)dx$, and independent of the $\nu_{\vartheta_k}$, which are independent with common distribution $(p_k)_{k \ge 1}$. We thus infer that $\mathbb E[e^{\Lambda_T} g(\chi_T)]$ is equal to
\[
\sum_{k=0}^\infty\sum_{h_j \ge 1,\, j \le k} e^{\sum_{j=1}^k\log(h_j)}\prod_{j=1}^k p_{h_j}\int_{[0,\infty)^{k+1}} g\Big(T - \sum_{j=1}^k t_j\Big)\,\mathbf 1_{\{\sum_{j=1}^k t_j \le T < \sum_{j=1}^{k+1} t_j\}}\prod_{j=1}^{k+1} f_B(t_j)\, dt_1\cdots dt_{k+1}.
\]
We set $F_B(x) = 1 - \int_0^x f_B(y)\, dy$ and $q_k = m^{-1} k\, p_k$, so that $(q_k)_{k \ge 1}$ defines a probability distribution. Using $f_{H_B}(x) = m e^{-\lambda_B x} f_B(x)$, we can rewrite the preceding formula so that
\[
e^{-\lambda_B T}\,\mathbb E\big[e^{\Lambda_T} g(\chi_T)\big] = \sum_{k=0}^\infty\sum_{h_j \ge 1,\, j \le k}\prod_{j=1}^k q_{h_j}\int_{[0,\infty)^k} g\Big(T - \sum_{j=1}^k t_j\Big)\,\mathbf 1_{\{T - \sum_{j=1}^k t_j \ge 0\}}\, e^{-\lambda_B(T - \sum_{j=1}^k t_j)}\, F_B\Big(T - \sum_{j=1}^k t_j\Big)\prod_{j=1}^k f_{H_B}(t_j)\, dt_1\cdots dt_k.
\]

Step 3. Putting $W_B(x) = m e^{-\lambda_B x} F_B(x)/F_{H_B}(x)$, we finally obtain the representation
\[
e^{-\lambda_B T}\,\mathbb E\big[e^{\Lambda_T} g(\chi_T)\big] = \frac{1}{m}\,\mathbb E\big[g(\widetilde\chi_T)\, W_B(\widetilde\chi_T)\big],
\]
where $(\widetilde\chi_t)_{t \ge 0}$ is a Markov process with generator $\mathcal A_{H_B}$ that can be constructed in the same way as $(\chi_t)_{t \ge 0}$, substituting $f_B$ with $f_{H_B}$. Straightforward computations give $W_B(x) = \frac{H_B(x)}{B(x)}$. Putting together the three steps, we have proved
\[
\sum_{u \in \mathcal U}\mathbb E\big[g(\zeta^T_u)\,\mathbf 1_{\{T \in I_u\}}\big] = \mathbb E\big[e^{\Lambda_T} g(\chi_T)\big] = \frac{e^{\lambda_B T}}{m}\,\mathbb E\Big[g(\widetilde\chi_T)\,\frac{H_B(\widetilde\chi_T)}{B(\widetilde\chi_T)}\Big].
\]
Noticing that $\sum_{u \in \mathcal U}\mathbb E[g(\zeta^T_u)\mathbf 1_{\{T \in I_u\}}]$ is nothing but $\mathbb E[\sum_{u \in \partial\mathcal T_T} g(\zeta^T_u)]$ establishes (23).

Step 4. By definition of the set $\mathring{\mathcal T}_T$,
\[
\mathbb E\Big[\sum_{u \in \mathring{\mathcal T}_T} g(\zeta_u)\Big] = \sum_{u \in \mathcal U}\mathbb E\big[g(\zeta_u)\,\mathbf 1_{\{b_u + \zeta_u \le T\}}\,\mathbf 1_{\{u \in \mathcal T\}}\big].
\]
We denote by $\mathcal F_t$ the sigma-field generated by $(\zeta^s_u, u \in \partial\mathcal T_s, s \le t)$ and we note that $d_u$ is, on the event $\{u \in \mathcal T\}$, a stopping time for the filtration $(\mathcal F_t)_{t \ge 0}$. Conditioning with respect to $\mathcal F_{b_u}$, using that the $\zeta_u$ are independent of $\mathcal F_{b_u}$, we successively obtain
\[
\mathbb E\Big[\sum_{u \in \mathring{\mathcal T}_T} g(\zeta_u)\Big] = \sum_{u \in \mathcal U}\mathbb E\Big[\mathbf 1_{\{u \in \mathcal T\}}\int_0^\infty g(x)\,\mathbf 1_{\{b_u + x \le T\}}\, B(x)\, e^{-\int_0^x B(y)dy}\, dx\Big]
\]
\[
= \sum_{u \in \mathcal U}\mathbb E\Big[\mathbf 1_{\{u \in \mathcal T\}}\int_0^\infty\Big(\int_0^y g(x)\, B(x)\,\mathbf 1_{\{b_u + x \le T\}}\, dx\Big)\, B(y)\, e^{-\int_0^y B(z)dz}\, dy\Big]
\]
\[
= \sum_{u \in \mathcal U}\mathbb E\Big[\mathbf 1_{\{u \in \mathcal T\}}\int_0^{\zeta_u} g(x)\, B(x)\,\mathbf 1_{\{b_u + x \le T\}}\, dx\Big] = \sum_{u \in \mathcal U}\mathbb E\Big[\mathbf 1_{\{u \in \mathcal T\}}\int_{b_u}^{d_u} g(\zeta^s_u)\, B(\zeta^s_u)\,\mathbf 1_{\{s \le T\}}\, ds\Big],
\]
using that $\zeta^s_u = s - b_u$ for $s \in I_u$ in order to obtain the last equality. Finally, observing that $\{s \in I_u\} = \{u \in \partial\mathcal T_s\}$, we infer
\[
\mathbb E\Big[\sum_{u \in \mathring{\mathcal T}_T} g(\zeta_u)\Big] = \int_0^\infty\mathbb E\Big[\sum_{u \in \partial\mathcal T_s} g(\zeta^s_u)\, B(\zeta^s_u)\Big]\,\mathbf 1_{\{s \le T\}}\, ds.
\]
Using (23) completes the proof of (24).
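The many-to-one formula (23) can be sanity-checked by simulation in the simplest possible setting: constant division rate $B(x) \equiv \alpha$ and exactly $m$ children per division. Then $f_B$ is the $\mathrm{Exp}(\alpha)$ density, $\lambda_B = (m-1)\alpha$, $f_{H_B}(x) = m\alpha\, e^{-m\alpha x}$, so $H_B \equiv m\alpha$, and (23) with $g \equiv 1$ reduces to $\mathbb E[|\partial\mathcal T_T|] = e^{(m-1)\alpha T}$. A minimal Monte Carlo sketch (illustrative parameter choices, not from the source):

```python
import random
import math

def alive_at(T, alpha=1.0, m=2, rng=None):
    # population alive at time T in a Bellman-Harris tree with constant
    # division rate alpha (Exp(alpha) lifetimes) and m children per split
    rng = rng or random
    stack, alive = [0.0], 0              # birth times of particles to process
    while stack:
        b = stack.pop()
        d = b + rng.expovariate(alpha)   # death time of this particle
        if d > T:
            alive += 1                   # still alive at time T
        else:
            stack.extend([d] * m)        # replaced by m newborns
    return alive

rng = random.Random(0)
n, T, alpha, m = 20000, 1.0, 1.0, 2
mean = sum(alive_at(T, alpha, m, rng) for _ in range(n)) / n
# the many-to-one identity (23) with g = 1 predicts E|dT_T| = exp((m-1)*alpha*T)
print(mean, math.exp((m - 1) * alpha * T))
```

With $T = 1$, $\alpha = 1$, $m = 2$ the predicted mean is $e \approx 2.718$, and the empirical mean over $20000$ runs matches it to within Monte Carlo error.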
Proof of (26) of Proposition 11.
Whenever $(u,v) \in \mathcal F\mathcal U$, there exist $w$, $\widetilde u$ and $\widetilde v \in \mathcal U$, together with integers $i \ne j$, such that $u = wi\widetilde u$ and $v = wj\widetilde v$. Conditioning with respect to $\mathcal F_{d_w}$, using the branching property between descendants of $w$ and the strong Markov property at time $d_w$, we have
\[
\mathbb E\Big[\sum_{(u,v) \in \mathcal F\mathcal T\cap\mathring{\mathcal T}_T} g(\zeta_u)\, g(\zeta_v)\Big] = \sum_{(u,v) \in \mathcal F\mathcal U}\mathbb E\big[g(\zeta_u)\,\mathbf 1_{\{d_u\,\cdots\,\}}\cdots
\]

Let $\tau$ denote the first jump time of the process $(\widetilde\chi_t)_{t \ge 0}$. Conditioning on $\{\tau > t\}$ and applying the strong Markov property yields
\[
P^t_{H_B}\big(gH_B\big)(0) = g(t)\, H_B(t)\,\mathbb P(\tau > t) + \int_0^t P^{t-u}_{H_B}\big(gH_B\big)(0)\, f_{H_B}(u)\, du.
\]
The function $t \mapsto u(t) = P^t_{H_B}(gH_B)(0)$ satisfies a renewal equation of the form $u = u_0 + u \star f_{H_B}$, with locally bounded initial condition $u_0 = gH_B\,\mathbb P(\tau > \cdot)$ and renewal distribution $f_{H_B}(y)dy$. Its unique solution is given by
\[
P^t_{H_B}\big(gH_B\big)(0) = g(t)\, H_B(t)\,\mathbb P(\tau > t) + \int_0^t g(t-s)\, H_B(t-s)\,\mathbb P(\tau > t-s)\, d\,\mathbb E[\widetilde N_s],
\]
where $\widetilde N_t = \sum_{s \le t}\mathbf 1\{\widetilde\chi_s - \widetilde\chi_{s-} < 0\}$ is the counting process associated to $(\widetilde\chi_t)_{t \ge 0}$. By construction, we have $\mathbb E[\widetilde N_t] = \mathbb E\big[\int_0^t H_B(\widetilde\chi_s)\, ds\big]$ and $\mathbb P(\tau > t) = \int_t^\infty f_{H_B}(y)\, dy = m\int_t^\infty e^{-\lambda_B y} f_B(y)\, dy \le m e^{-\lambda_B t}$, therefore
\[
\big|P^t_{H_B}\big(gH_B\big)(0)\big| \le m\, |g(t)|\, e^{-\lambda_B t}\, |H_B|_\infty + |H_B|_\infty^2\int_0^t |g(u)|\, du,
\]
and we obtain the desired estimate thanks to the fact that $H_B$ is uniformly bounded over $\mathcal B$.

Acknowledgements. We are grateful to V. Bansaye and M. Doumic for helpful discussions and comments. The illuminating coupling argument for proving Proposition 12 was indicated to us by N. Fournier. The suggestions of two referees helped to considerably improve a former version of this work.
Part of this work was completed while M.H. was visiting Humboldt-Universität zu Berlin. The research of M.H. is partly supported by the Agence Nationale de la Recherche (Blanc SIMI 1 2011 project CALIBRATION).

References

[1] K. B. Athreya and N. Keiding. Estimation theory for continuous-time branching processes. Sankhyā: The Indian Journal of Statistics, Series A, 39 (1977) 101-123.
[2] K. B. Athreya and P. Ney. Branching processes. Springer-Verlag, New York, 1972.
[3] V. Bansaye, J.-F. Delmas, L. Marsalle and V. C. Tran. Limit theorems for Markov processes indexed by continuous time Galton-Watson trees. The Annals of Applied Probability, 21 (2011) 2263-2314.
[4] S. V. Bitseki Penda, H. Djellout and A. Guillin. Deviation inequalities, moderate deviations and some limit theorems for bifurcating Markov chains with application. The Annals of Applied Probability, 24 (2014) 235-291.
[5] B. Cloez. Limit theorems for some branching measure-valued processes. hal-00598030 (2011).
[6] J.-F. Delmas and L. Marsalle. Detection of cellular aging in a Galton-Watson process. Stochastic Processes and their Applications, 120 (2010) 2495-2519.
[7] M. Doumic, M. Hoffmann, N. Krell and L. Robert. Statistical estimation of a growth-fragmentation model observed on a genealogical tree. Bernoulli, 21 (2015) 1760-1799.
[8] M. Doumic, M. Hoffmann, P. Reynaud-Bouret and V. Rivoirard. Nonparametric estimation of the division rate of a size-structured population. SIAM Journal on Numerical Analysis, 50 (2012) 925-950.
[9] M. Doumic, B. Perthame and J. P. Zubelli. Numerical solution of an inverse problem in size-structured population dynamics. Inverse Problems, 25 (2009) 25pp.
[10] S. Efromovich. Density estimation for biased data. The Annals of Statistics, 32 (2004) 1137-1161.
[11] P. Guttorp. Statistical inference for branching processes. Wiley, 1991.
[12] T. Harris. The theory of branching processes. Springer-Verlag, New York, 1963.
[13] M. Hoffmann and N. Krell.
Statistical analysis of self-similar fragmentation chains. Bernoulli, 17 (2011) 395-423.
[14] R. Höpfner, M. Hoffmann and E. Löcherbach. Nonparametric estimation of the death rate in branching diffusions. Scandinavian Journal of Statistics, 29 (2002) 665-690.
[15] O. Hyrien. Pseudo-likelihood estimation for discretely observed multitype Bellman-Harris branching processes. Journal of Statistical Planning and Inference, 137 (2007) 1375-1388.
[16] R. Johnson, V. Susarla and J. van Ryzin. Bayesian nonparametric estimation for age-dependent branching processes. Stochastic Processes and their Applications, 9 (1979) 307-318.
[17] J. Kiefer. Conditional inference. In: Encyclopedia of Statistical Science, Volume 2 (1972) 103-109, John Wiley, New York.
[18] E. Löcherbach. Likelihood ratio processes for Markovian particle systems with killing and jumps. Statistical Inference for Stochastic Processes, 5 (2002a) 153-177.
[19] E. Löcherbach. LAN and LAMN for systems of interacting diffusions with branching and immigration. Annales de l'Institut Henri Poincaré, 38 (2002b) 59-90.
[20] K. Oelschläger. Limit theorems for age-structured populations. The Annals of Probability, 18 (1990) 290-318.
[21] B. Perthame. Transport equations arising in biology. Birkhäuser Frontiers in Mathematics, 2007.
[22] R. L. Schilling. Sobolev embeddings for stochastic processes. Expositiones Mathematicae, 18 (2000) 239-242.
[23] B. Tsirelson. From uniform renewal theorem to uniform large and moderate deviations for renewal-reward processes. Electronic Communications in Probability, 18 (2013) 1-13.
[24] A. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics, Springer-Verlag, New York, 2009.

Marc Hoffmann, CEREMADE, CNRS-UMR 7534, Université Paris-Dauphine, Place du maréchal De Lattre de Tassigny, 75775 Paris Cedex 16, France.
E-mail address: [email protected]

Adélaïde Olivier, CEREMADE, CNRS-UMR 7534, Université Paris-Dauphine, Place du maréchal De Lattre de Tassigny, 75775 Paris Cedex 16, France.

E-mail address: