Practical targeted learning from large data sets by survey sampling
P. Bertail, A. Chambaz, E. Joly
Modal'X, Université Paris Ouest Nanterre
August 27, 2018
Abstract
We address the practical construction of asymptotic confidence intervals for smooth (i.e., path-wise differentiable), real-valued statistical parameters by targeted learning from independent and identically distributed data in contexts where the sample size is so large that it poses computational challenges. We observe some summary measure of all data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is carried out from the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE) which enables the construction of the confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples where the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results.

keywords: semiparametric inference; survey sampling; targeted minimum loss estimation (TMLE)
1 Introduction

Large data sets are ubiquitous nowadays. They pose computational and theoretical challenges. We consider the particular problem of carrying out inference based on semiparametric models by targeted learning [19, 22] from large data sets. We mainly deal with the fact that the sample size $N$ is, say, huge. Even if we also take advantage of easy-to-handle summary measures of the observations, we do not consider the specific difficulties yielded by the messiness of real big data. This is why we use the expression "large data sets" instead of "big data".

Confronted with large data sets, many learning algorithms fail to provide an answer in a reasonable time, if at all. Following [3], we overcome this computational limitation by (i) selecting $n$ among $N$ observations with unequal probabilities and (ii) adapting targeted learning to this smaller, tamed data set.

Specifically, our objective is to enable the construction of a confidence interval with given asymptotic level for a statistical parameter $\psi_0 \equiv \Psi(P_0)$ based on a sample $O_1, \ldots, O_N$ of a (huge number) $N$ of independent and identically distributed (i.i.d.) random variables drawn from $P_0 \in \mathcal{M}$, where $\Psi : \mathcal{M} \to \mathbb{R}$ maps a set $\mathcal{M}$ of measures including possible distributions of $O$ to the real line. We focus on the case that the functional $\Psi$ is smooth in the following sense. For every $P \in \mathcal{M}$, there exist a wide class of one-dimensional paths $\{P_t : t \in ]-c, c[\} \subset \mathcal{M}$ with $P_t|_{t=0} = P$ and an influence function $D(P) \in L_0^2(P)$ such that, for all $|t| < c$,

$$\Psi(P_t) = \Psi(P) + \int D(P)\,(dP_t - dP) + o(t) = \Psi(P) + \int D(P)\, dP_t + o(t). \quad (1)$$

Here, we denote $L_0^2(P)$ the set of centered and square-integrable measurable functions relative to $P$.

Condition (1) trivially holds when $\Psi$ is linear. If, for instance, $\Psi$ is given by $\Psi(P) \equiv \int f\, dP$ for some measurable function $f$ integrable with respect to (wrt) all elements of $\mathcal{M}$, then (1) holds with $D(P) \equiv f - \Psi(P)$ (without the $o$-term). Even in the very simple example where $f$ is the identity and $\mathcal{M}$ consists of probability measures, hence $\Psi(P_0) = E_{P_0}[O]$, it may be computationally difficult, if not impossible, to build a confidence interval for $\psi_0 = \Psi(P_0)$ using all $N$ observations, merely because it may be very challenging to access all of them in the context of large data sets.

Typical examples of functionals satisfying (1) include pathwise differentiable functionals as introduced in [24, Section 25.3]. We will give two examples of such functionals. Pathwise differentiability differs slightly from Gateaux, Hadamard and Fréchet differentiability. It is one of the key notions in the theory of semiparametric inference.

We overcome the computational hurdle by resorting to survey sampling, specifically to rejective sampling based on Poisson sampling with unequal inclusion probabilities. It is a particular case of sampling without replacement (we refer to [15] for an overview of sampling without replacement). Survey sampling can also rely on the so-called sampling entropy [2, 7, 13], but we do not follow this path. Also known as Sampford sampling, rejective Poisson sampling has been thoroughly studied for the last five decades since the publication of the seminal articles [14, 18]. The key object in the analysis of Sampford sampling is the Horvitz-Thompson (HT) empirical measure. Asymptotic normality of estimators based on the HT empirical measure was first established in [14]. A functional version for the cumulative distribution function was obtained by [26].
Our analysis hinges on the recent study of the HT empirical measure from the viewpoint of empirical processes theory carried out in [3] (we refer the reader to this article for additional references). For instance, [8, 9] show practically how to implement confidence bands for model-assisted estimators of the mean when the variable of interest is functional and storage capacities are limited (with applications to electricity consumption curves). In that case, survey sampling techniques are an interesting alternative to signal compression techniques.

The joint use of survey sampling techniques in conjunction with semiparametric models for inference is not new [5, 6]. To the best of our knowledge, however, this is the first attempt to take advantage of survey sampling to enable targeted learning when the data set is so large that computational problems arise. In contrast to naive sub-sampling, sampling designs with unequal probabilities offer a control over the efficiency of estimators. In this light, we propose an alternative to the so-called online version of targeted learning [21].

Organization.
Section 2 presents our procedure for practical targeted learning from large data sets by survey sampling and the central limit theorem which enables the construction of confidence intervals. Section 3 illustrates Section 2 with two examples, where the parameters of interest are variable importance measures of a (binary or continuous) exposure on an outcome. Section 4 summarizes the results of a simulation study. The proofs are given in appendix.
Throughout the article, we denote $\mu f \equiv \int f\, d\mu$ and $\|f\|_{2,\mu} \equiv (\mu f^2)^{1/2}$ for any measure $\mu$ and function $f$ (measurable and square-integrable wrt $\mu$).

2 Targeted learning from large data sets by survey sampling

2.1 Rejective sampling
Let $n(N)$ be a deterministic, user-supplied number of observations to select by survey sampling. It is a practical, computationally tractable sample size, as opposed to the unpractical, huge $N$. Because our results are asymptotic, we impose that, as $N \to \infty$, $n(N) \to \infty$ and $n(N)/N \to 0$. In the rest of this article, we simply write $n$ for $n(N)$.

We employ a specific survey sampling scheme called rejective sampling [14, 3]. The random selection of observations from the complete data set can depend on easily accessible summary measures $V_1, \ldots, V_N \in \mathcal{V}$ attached to $O_1, \ldots, O_N$. Typically, $V_1, \ldots, V_N$ take finitely many different values or are low-dimensional, and the implementation of the database is structured/organized based on the values of $V_1, \ldots, V_N$.

Let $h$ be a (measurable) function on $\mathcal{V}$ such that $h(\mathcal{V}) \subset [c(h), \infty)$ for some constant $c(h) > 0$. For each $1 \leq i \leq N$, define $p_i \equiv n h(V_i)/N$. For $N$ large enough, $p_1, \ldots, p_N \in (0,1)$. Rejective sampling then unfolds in two steps:

• $\varepsilon_1, \ldots, \varepsilon_N$ are independently drawn, conditionally on $V_1, \ldots, V_N$, from the Bernoulli distributions with parameters $p_1, \ldots, p_N$, respectively;
• $(\eta_1, \ldots, \eta_N)$ is drawn, conditionally on $V_1, \ldots, V_N$, from the conditional distribution of $(\varepsilon_1, \ldots, \varepsilon_N)$ given $\sum_{i=1}^N \varepsilon_i = n$.

The subset of $n$ observations randomly selected by rejective sampling is $\{O_i : \eta_i = 1,\ 1 \leq i \leq N\}$. It is associated with the so-called HT empirical measure defined by

$$P_N^R \equiv \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\mathrm{Dirac}(O_i). \quad (2)$$

Note that $P_N^R$ is not necessarily a probability measure. However, if $h \equiv 1$, then $P_N^R$ is a probability measure, and rejective sampling is equivalent to selecting $n$ observations among $O_1, \ldots, O_N$ uniformly.

For computational reasons, it is not desirable that the event "$\sum_{i=1}^N \varepsilon_i = n$" be too unlikely. Lemma 3.1 in [14] shows that the conditional probability of the event "$\sum_{i=1}^N \varepsilon_i = k$" is maximized when $k$ equals the conditional expectation of $\sum_{i=1}^N \varepsilon_i$, in which case the conditional probability is asymptotically equivalent to $(2\pi \sum_{i=1}^N p_i(1-p_i))^{-1/2}$. Because the conditional expectation of $n^{-1}\sum_{i=1}^N \varepsilon_i$ equals $n^{-1}\sum_{i=1}^N p_i = N^{-1}\sum_{i=1}^N h(V_i)$, which converges $P_0$-almost surely to $E_{P_0}[h(V)]$, it is thus good practice to choose the function $h$ in such a way that $E_{P_0}[h(V)]$ be close to 1. When $V_1, \ldots, V_N$ take finitely many different values, it is easy to estimate $E_{P_0}[h(V)]$ accurately on an independent sample and, therefore, to adapt $h$ so that $E_{P_0}[h(V)] \approx 1$.
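For concreteness, here is a minimal sketch of the scheme in Python (the function name and toy choices are ours, not the paper's). The conditioning on $\sum_{i=1}^N \varepsilon_i = n$ is implemented by naive acceptance-rejection, which is practical precisely when $E_{P_0}[h(V)] \approx 1$, as discussed above; Sampford's algorithm [18] and Pareto sampling (see Section 4) are more efficient alternatives.

```python
import numpy as np

def rejective_sample(V, h, n, rng, max_tries=10_000):
    """Select n indices among N by rejective (conditional Poisson) sampling.

    V : array of summary measures V_1, ..., V_N
    h : vectorized function on V, bounded away from 0, with E[h(V)] close to 1
    n : sub-sample size, with n << N
    Returns the selected indices and the inclusion probabilities p_i = n h(V_i) / N.
    """
    N = len(V)
    p = n * h(V) / N                      # unequal inclusion probabilities
    assert np.all((p > 0) & (p < 1)), "N must be large enough for all p_i in (0, 1)"
    for _ in range(max_tries):            # naive acceptance-rejection on the event sum(eps) = n
        eps = rng.random(N) < p           # Poisson (independent Bernoulli) sampling
        if eps.sum() == n:
            return np.flatnonzero(eps), p
    raise RuntimeError("acceptance-rejection failed; consider Sampford or Pareto sampling")

rng = np.random.default_rng(0)
V = rng.integers(1, 4, size=100_000)      # toy summary measures with values in {1, 2, 3}
idx, p = rejective_sample(V, h=lambda v: np.ones_like(v, dtype=float), n=1000, rng=rng)
w = 1.0 / (len(V) * p[idx])               # HT weights eta_i / (N p_i) of the selected observations
```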
2.2 Practical, targeted estimator

Assume that we have constructed $P_n^* \in \mathcal{M}$ targeted to $\psi_0$ in the sense that

$$P_N^R D(P_n^*) = o_P(1/\sqrt{n}). \quad (3)$$

We define $\psi_n^* \equiv \Psi(P_n^*)$ as our substitution estimator. This construction frames $\psi_n^*$ in the paradigm of the targeted minimum loss estimation methodology [23, 22].

Consider a class $\mathcal{F}$ of functions mapping a measured space $\mathcal{X}$ to $\mathbb{R}$. Set $\delta > 0$ and let $d$ be a semi-metric or a norm on $\mathcal{F}$. We denote $N(\varepsilon, \mathcal{F}, d)$ the $\varepsilon$-covering number of $\mathcal{F}$ wrt $d$, i.e., the minimum number of $d$-balls of radius $\varepsilon$ needed to cover $\mathcal{F}$. The corresponding entropy integral for $\mathcal{F}$ evaluated at $\delta$ is $J(\delta, \mathcal{F}, d) \equiv \int_0^\delta \sqrt{\log N(\varepsilon, \mathcal{F}, d)}\, d\varepsilon$.

Let $R : \mathcal{M} \times \mathcal{M} \to \mathbb{R}$ be given by

$$R(P, P') \equiv \Psi(P') - \Psi(P) - \int D(P)(dP' - dP) \quad (4)$$

where the influence function $D(P)$ is defined before (1). The real number $R(P_n^*, P_0)$ can be interpreted as a second-order term in an expansion of $\psi_n^* = \Psi(P_n^*)$ around $P_0$. By (1), we focus on functionals $\Psi$ such that $R(P, P_t) = o(t)$ for a wide class of one-dimensional paths $\{P_t : t \in ]-c, c[\} \subset \mathcal{M}$ such that $P_t|_{t=0} = P$. This statement is clarified in the examples of Section 3.

We suppose the existence of $\mathcal{F} \subset \{D(P) : P \in \mathcal{M}\}$ satisfying the three following assumptions:

A1 (complexity) $\mathcal{F}$ is separable, for every $f \in \mathcal{F}$, $P_0 f^2 h^{-1} < \infty$, and $J(1, \mathcal{F}, \|\cdot\|_{2,P_0}) < \infty$.

A2 For every $f, f' \in \mathcal{F}$, if

$$\rho_N(f, f')^2 \equiv \frac{1}{N}\sum_{i=1}^N (f(O_i) - f'(O_i))^2 \quad (5)$$

then, $P_0$-almost surely,

$$\sup_{f \neq f' \in \mathcal{F}} \left|\frac{\rho_N(f, f')}{\|f - f'\|_{2,P_0}} - 1\right| \xrightarrow[N \to \infty]{} 0.$$

A3 (first order convergence)
With $P_0$-probability tending to 1, $D(P_n^*) \in \mathcal{F}$, and there exists $f_0 \in \mathcal{F}$ such that $\|D(P_n^*) - f_0\|_{2,P_0} = o_P(1)$. Moreover, one knows a conservative estimator $\Sigma_n$ of $\sigma_0^2 \equiv P_0 f_0^2 h^{-1}$.

Under A1, we can define $\Sigma : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ given by

$$\Sigma(f, f') \equiv P_0 f f' h^{-1}. \quad (6)$$

In particular, $\sigma_0^2$ in A3 equals $\Sigma(f_0, f_0)$. An additional assumption is needed:

A4 (second order term)
There exists a real-valued random variable $\gamma_n$ converging in probability to some $\gamma \neq 1$ and such that $\gamma_n(\psi_n^* - \psi_0) + R(P_n^*, P_0) = o_P(1/\sqrt{n})$. Moreover, one knows an estimator $\Gamma_n$ such that $\Gamma_n - \gamma_n = o_P(1)$.

We can now state our main theorem.

Theorem 1.
Assume that A1, A2, A3 and A4 are met. Then $(1 - \gamma_n)\sqrt{n}(\psi_n^* - \psi_0)$ converges in law to the centered Gaussian distribution with variance $\sigma_0^2$. Consequently, for any $\alpha \in (0,1)$,

$$\left[\psi_n^* \pm \xi_{1-\alpha/2}\frac{\sqrt{\Sigma_n}}{(1 - \Gamma_n)\sqrt{n}}\right]$$

is a confidence interval for $\psi_0$ with asymptotic coverage no less than $(1-\alpha)$, where $\xi_{1-\alpha/2}$ denotes the $(1-\alpha/2)$-quantile of the standard Gaussian distribution.
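Once $\psi_n^*$, $\Sigma_n$ and $\Gamma_n$ are available, the interval of Theorem 1 costs nothing to evaluate; a minimal sketch (the names are ours; take $\Gamma_n \equiv 0$ in the simplest case where $\gamma_n \equiv 0$):

```python
from scipy.stats import norm

def targeted_ci(psi_star, Sigma_n, Gamma_n, n, alpha=0.05):
    """CI of Theorem 1: psi*_n +/- xi_{1-alpha/2} sqrt(Sigma_n) / ((1 - Gamma_n) sqrt(n))."""
    xi = norm.ppf(1 - alpha / 2)                          # (1 - alpha/2)-quantile of N(0, 1)
    half = xi * Sigma_n ** 0.5 / (abs(1 - Gamma_n) * n ** 0.5)
    return psi_star - half, psi_star + half
```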
Comments.

Assumption A1 is typical in semiparametric inference, and should be interpreted as a constraint on the complexity of $\mathcal{F}$. Theorem 1 relies on the convergence of an empirical process, see Theorem 2. The proof of Theorem 2 uses a chaining argument, and A2 allows us to upper-bound the resulting random term $J(\delta, \mathcal{F}, \rho_N)$ by a deterministic term $J(\delta, \mathcal{F}, \|\cdot\|_{2,P_0})$. We say that a class $\mathcal{C}$ has a finite uniform entropy integral if it admits an envelope function $F$ and

$$\int_0^\infty \sup_\rho \sqrt{\log N(\epsilon\|F\|_{2,\rho}, \mathcal{C}, \|\cdot\|_{2,\rho})}\, d\epsilon < \infty,$$

where the supremum is taken over all probability measures $\rho$ on $\mathcal{O}$ such that $\|F\|_{2,\rho} > 0$. Assumption A2 can be replaced by the alternative

A2$^*$ The class $\mathcal{F}$ has a finite uniform entropy integral.

VC-classes of uniformly bounded functions satisfy A2$^*$ [25, Section 2.6]. Finally, A3 and A4 are technical conditions required by the TMLE procedure. The former is not as mild as one may think at first sight, because the conservative estimation of $\sigma_0^2$ is not trivial. For instance, it is not guaranteed in general that the substitution estimator

$$\Sigma_n \equiv P_N^R D(P_n^*)^2 h^{-1} \quad (7)$$

estimates $\sigma_0^2$ conservatively. Relying on the non-parametric bootstrap is not a solution either, in general.

We argued that $R(P_n^*, P_0)$ should be interpreted as a second-order term. In the simplest examples, this is literally the case: assuming $R(P_n^*, P_0) = o_P(1/\sqrt{n})$ is natural and A4 is met with $\gamma_n \equiv 0$, see for instance Section 3.1. Sometimes, $R(P_n^*, P_0)$ must be corrected by adding $\gamma_n(\psi_n^* - \psi_0)$, so that it becomes natural to assume that the corrected expression is $o_P(1/\sqrt{n})$, see for instance Section 3.2. Allowing $\gamma_n$ to differ from 0 gives more flexibility. In Section 3, we give additional conditions which imply A4.

Knowing the asymptotic variance of $(1 - \gamma_n)\sqrt{n}(\psi_n^* - \psi_0)$ allows us to discuss further the choice of $h$. Introduce

$$\tilde{f}_0(V) \equiv \sqrt{E_{P_0}[f_0^2(O)|V]}, \quad (8)$$

which satisfies $\sigma_0^2 = P_0 f_0^2 h^{-1} = P_0 \tilde{f}_0^2 h^{-1}$. The Cauchy-Schwarz inequality yields

$$(P_0 \tilde{f}_0)^2 \leq P_0 \tilde{f}_0^2 h^{-1} \times P_0 h = \sigma_0^2 \times P_0 h, \quad (9)$$

and equality occurs when $\tilde{f}_0$ and $h$ are linearly dependent. Moreover, it should hold that $P_0 h = 1$. In view of (9), the optimal $h$ is $\tilde{f}_0/P_0\tilde{f}_0$, assuming that $P_0\tilde{f}_0 > 0$ (otherwise $\sigma_0^2 = 0$). This argument neglects the second-order dependence of $\gamma_n$ on $h$.
In practice, we would first sample $n_0$ data using $h \equiv 1$, use them to estimate $\tilde{f}_0$ and $P_0\tilde{f}_0$ with $\tilde{f}_{0,n}$ and $Z_{0,n}$, then finally define $h \equiv \tilde{f}_{0,n}/Z_{0,n}$ and exclude the sampled data from $\{O_1, \ldots, O_N\}$.

The following expansion, taken from the proof of Theorem 1, partly explains why $\sigma_0^2$ is the asymptotic variance of $(1 - \gamma_n)\sqrt{n}(\psi_n^* - \psi_0)$: denoting by $P_0^\varepsilon$ the common joint distribution of $(O_1, \varepsilon_1), \ldots, (O_N, \varepsilon_N)$, it holds for any $f$ in $\mathcal{F}$ that

$$\mathrm{Var}_{P_0^\varepsilon}\left(\frac{1}{N}\sum_{i=1}^N \frac{f(O_i)\varepsilon_i}{p_i}\right) = \frac{1}{N}\mathrm{Var}_{P_0^\varepsilon}\left(\frac{f(O_1)\varepsilon_1}{p_1}\right) = \frac{1}{N}\left(E_{P_0}\left[f^2(O)\left(\frac{1}{p_1} - 1\right)\right] + \mathrm{Var}_{P_0}(f(O))\right).$$

If, contrary to fact, we could take $p_1 = \cdots = p_N \equiv 1$ (that is, $n = N$ and $h \equiv 1$), then the right-hand side would reduce to $N^{-1}\mathrm{Var}_{P_0}(f(O))$, the asymptotic variance one typically expects at the limit $f = f_0$. In Section 2.1, $p_1$ is chosen in such a way that $1/p_1$ is typically much larger than 1. Actually, the above right-hand side expression at $f \equiv f_0$ rewrites

$$\frac{1}{n}\left(P_0 f_0^2 h^{-1} - \frac{n}{N}(P_0 f_0)^2\right) = \frac{1}{n}\left(\sigma_0^2 + o(1)\right). \quad (10)$$

Note the absence of a centering term in $P_0 f_0^2 h^{-1}$.
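The pilot-based choice of $h$ described above reduces to group means in the common case where $V$ takes finitely many values; a sketch (names are ours), with f_values holding the evaluations of an estimate of $f_0$ on a pilot sub-sample drawn with $h \equiv 1$:

```python
import numpy as np

def optimal_h_from_pilot(f_values, V_pilot):
    """Estimate the optimal h = tilde_f0 / (P_0 tilde_f0) from a pilot sub-sample drawn with h = 1.

    f_values : evaluations of (an estimate of) f_0 on the pilot observations
    V_pilot  : their summary measures, assumed to take finitely many values
    Returns a dict mapping each observed level v to h(v).
    """
    levels = np.unique(V_pilot)
    # tilde_f0(v) = sqrt(E[f_0^2(O) | V = v]), estimated by a group mean
    tilde_f = {v: np.sqrt(np.mean(f_values[V_pilot == v] ** 2)) for v in levels}
    # normalize so that the empirical counterpart of P_0 tilde_f0 equals 1
    Z = np.mean([tilde_f[v] for v in V_pilot])
    return {v: tilde_f[v] / Z for v in levels}
```

In practice one would wrap the returned dictionary into a vectorized function before feeding it to the sampler.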
3 Two examples

We illustrate Theorem 1 with the inference of two variable importance measures of an exposure, either binary, in Section 3.1, or continuous, in Section 3.2.

In both examples, the $i$-th observation $O_i$ writes $(W_i, A_i, Y_i) \in \mathcal{O} \equiv \mathcal{W} \times \mathcal{A} \times [0,1]$, where $W_i \in \mathcal{W}$ is the $i$-th context, $A_i \in \mathcal{A}$ is the $i$-th exposure and $Y_i \in [0,1]$ is the $i$-th outcome. In the binary case, $\mathcal{A} \equiv \{0,1\}$. In the continuous case, $\mathcal{A}$ is a bounded subset of $\mathbb{R}$ containing 0, which serves as a reference level of exposure. Typically, in biostatistics or epidemiology, $W_i$ could be the baseline covariate describing the $i$-th subject, $A_i$ could describe her assignment (e.g., treatment or placebo when $\mathcal{A} = \{0,1\}$, or dose-level when $\mathcal{A} \subset \mathbb{R}$) or exposure (e.g., exposed or not when $\mathcal{A} = \{0,1\}$, or level of exposure when $\mathcal{A} \subset \mathbb{R}$), and $Y_i$ could quantify her biological response.

3.1 A binary exposure

In this section, $\mathcal{A} \equiv \{0,1\}$ and $\psi_0$ equals

$$\psi^b \equiv E_{P_0}[E_{P_0}[Y|A=1,W] - E_{P_0}[Y|A=0,W]] \quad (11)$$

(the superscript "b" stands for "binary"). Now, let $\mathcal{M}$ be the subset of the set of finite measures on $\mathcal{O} \equiv \mathcal{W} \times \{0,1\} \times [0,1]$ equipped with the Borel $\sigma$-field such that every $P \in \mathcal{M}$ puts mass on all events of the form $B_1 \times \{a\} \times B_2$ ($a = 0, 1$; $B_1$ and $B_2$ Borel sets of $\mathcal{W}$ and $[0,1]$), in such a way that the conditional distribution of $A$ given $W$ is not deterministic; in particular, $P_0 \in \mathcal{M}$. For each $P \in \mathcal{M}$, we denote $P_W$, $P_{A|W}$ and $P_{Y|A,W}$ the marginal measure of $W$ and the conditional measures of $A$ and $Y$ given $W$ and $(A,W)$, respectively. (The conditional measure $P_{A|W}$ is $P(\mathcal{O})$ times the conditional law of $A$ given $W$ under the probability distribution $P/P(\mathcal{O})$. The conditional measure $P_{Y|A,W}$ is defined analogously.)

We see $\psi^b$ as the value at $P_0$ of the functional $\Psi^b$ characterized over $\mathcal{M}$ by

$$\Psi^b(P) \equiv \int_{\mathcal{W}}\left(\int_{[0,1]} y\left(dP_{Y|A=1,W=w}(y) - dP_{Y|A=0,W=w}(y)\right)\right) dP_W(w). \quad (12)$$

In particular, if $P$ is a possible data-generating distribution for $O$ (i.e., if $P(\mathcal{O}) = 1$), then

$$\Psi^b(P) = E_P[E_P[Y|A=1,W] - E_P[Y|A=0,W]].$$

Moreover, under additional causal assumptions, $\Psi^b(P)$ can be interpreted as the additive causal effect of the exposure on the response, see [17, 22].

Two infinite-dimensional features of every $P \in \mathcal{M}$ will play an important role in the analysis. Namely, for each $P \in \mathcal{M}$ and $(w, a) \in \mathcal{W} \times \mathcal{A}$, we introduce and denote $g_P(0|w) \equiv P_{A|W=w}(\{0\})$, $g_P(1|w) \equiv P_{A|W=w}(\{1\})$, and $Q_P(a, w) \equiv \int_{[0,1]} y\, dP_{Y|A=a,W=w}(y)$. In particular, if $P(\mathcal{O}) = 1$, then $g_P(1|W) = P(A=1|W)$ is the conditional probability that the binary exposure equal one and $Q_P(A,W) = E_P[Y|A,W]$ is the conditional expectation of the response given exposure and context.

Pathwise differentiability.
The functional $\Psi^b$ is pathwise differentiable at each $P \in \mathcal{M}$ wrt the maximal tangent space $L_0^2(P)$ (the space of functions $s : \mathcal{O} \to \mathbb{R}$ such that $Ps = 0$ and $Ps^2 < \infty$) in the following sense [22, Chapter 5 and Section A.3]:

Lemma 1. Fix $P \in \mathcal{M}$ and introduce the influence curve $D^b(P) \in L_0^2(P)$ given by $D^b(P) \equiv D_1^b(P) + D_2^b(P)$ with

$$D_1^b(P)(O) \equiv Q_P(1,W) - Q_P(0,W) - \Psi^b(P),$$
$$D_2^b(P)(O) \equiv (Y - Q_P(A,W))\,\frac{2A - 1}{g_P(A|W)}.$$

For every uniformly bounded $s \in L_0^2(P)$ and every $t \in ]-\|s\|_\infty^{-1}, \|s\|_\infty^{-1}[$, define $P_{s,t} \in \mathcal{M}$ by setting $dP_{s,t}/dP = 1 + ts$. It holds that $t \mapsto \Psi^b(P_{s,t})$ is differentiable at 0 (as a function from $\mathbb{R}$ to $\mathbb{R}$) with a derivative at 0 equal to $P D^b(P)s$. The asymptotic variance of any regular estimator of $\Psi^b(P)$ is larger than the Cramér-Rao lower bound $P D^b(P)^2$. Moreover, for any $P, P' \in \mathcal{M}$,

$$P D^b(P') = \Psi^b(P) - \Psi^b(P') + P\left((2A-1)\,(Q_P - Q_{P'})\,\frac{g_P - g_{P'}}{g_P\, g_{P'}}\right). \quad (13)$$

Consequently, if $P D^b(P') = 0$, then $\Psi^b(P') = \Psi^b(P)$ whenever $g_{P'} = g_P$ or $Q_{P'} = Q_P$.

The last statement is called a "double-robustness" property. Let $R^b : \mathcal{M} \times \mathcal{M} \to \mathbb{R}$ be given by

$$R^b(P, P') \equiv \Psi^b(P') - \Psi^b(P) - (P' - P)D^b(P), \quad (14)$$

as in (4). In particular,

$$R^b(P, P_{s,t}) = \Psi^b(P_{s,t}) - \Psi^b(P) - (P_{s,t} - P)D^b(P) = \Psi^b(P_{s,t}) - \Psi^b(P) - tPD^b(P)s = o(t),$$

showing that (1) is met. Furthermore, (13) and $PD^b(P) = 0$ imply

$$R^b(P, P') = P'\left((2A-1)\,(Q_P - Q_{P'})\,\frac{g_{P'} - g_P}{g_{P'}\, g_P}\right).$$

In the context of this example, A4 is fulfilled with $\gamma_n \equiv 0$ (and $\gamma = 0$) when

$$R^b(P_n^*, P_0) = P_0\left((2A-1)\,(Q_{P_n^*} - Q_{P_0})\,\frac{g_{P_0} - g_{P_n^*}}{g_{P_0}\, g_{P_n^*}}\right) = o_P(1/\sqrt{n}). \quad (15)$$

Through the product, we will draw advantage from the synergistic convergences of $Q_{P_n^*}$ to $Q_{P_0}$ and $g_{P_n^*}$ to $g_{P_0}$ (by the Cauchy-Schwarz inequality, for example). Note that if $g_{P_0}$ is known, then we can impose that $g_{P_n^*} = g_{P_0}$, in which case $R^b(P_n^*, P_0) = 0$ exactly.

Construction of the targeted estimator.

Let $\mathcal{Q}^w$ and $\mathcal{G}^w$ be two user-supplied classes of functions mapping $\{0,1\} \times \mathcal{W}$ to $[0,1]$. We impose that the elements of $\mathcal{Q}^w$ are uniformly bounded away from 0 and 1. Similarly, we impose that the elements of $\mathcal{G}^w$ are uniformly bounded away from 0. Let $\ell$ be the logistic loss function given by

$$-\ell(u, v) \equiv u\log(v) + (1 - u)\log(1 - v)$$
(all $u, v \in [0,1]$, with the conventions $\log(0) = -\infty$ and $0\log(0) = 0$).

We first estimate $Q_{P_0}$ and $g_{P_0}$ with $Q_n$ and $g_n$ built upon $P_N^R$, $\mathcal{Q}^w$ and $\mathcal{G}^w$. For instance, one could simply minimize (weighted) empirical risks and define

$$Q_n \equiv \arg\min_{Q \in \mathcal{Q}^w} P_N^R\, \ell(Y, Q(A,W)) = \arg\min_{Q \in \mathcal{Q}^w} \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\ell(Y_i, Q(A_i, W_i)),$$
$$g_n \equiv \arg\min_{g \in \mathcal{G}^w} P_N^R\, \ell(A, g(1|W)) = \arg\min_{g \in \mathcal{G}^w} \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\ell(A_i, g(1|W_i))$$

(assuming that the argmins exist). Alternatively, one could prefer minimizing cross-validated (weighted) empirical risks. This is beyond the scope of this article but will be studied in future work. We also estimate the marginal distribution $P_{0,W}$ of $W$ under $P_0$ with

$$P_{N,W}^R \equiv \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\mathrm{Dirac}(W_i). \quad (16)$$

Let $P_n$ be a measure such that $Q_{P_n} = Q_n$ and $P_{n,W} = P_{N,W}^R$. Then

$$\Psi^b(P_n) = \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,(Q_n(1, W_i) - Q_n(0, W_i)) \quad (17)$$

is an estimator of $\psi^b$, whose construction is not tailored/targeted to $\psi^b$. It is now time to target the inference procedure.

Targeting the inference procedure consists in modifying $P_n$ in such a way that the resulting $P_n^*$ satisfies (3) with $D^b$ substituted for $D$. We first note that, by construction of $P_n$,

$$P_N^R D_1^b(P_n) = P_{N,W}^R D_1^b(P_n) = 0.$$

This equality is equivalent to (17). The construction of $P_n^*$ based on $P_n$ thus reduces to ensuring $P_N^R D_2^b(P_n^*) = o_P(1/\sqrt{n})$. We achieve this objective by fluctuating the conditional measure of $Y$ given $(A,W)$ only. For this, we introduce the one-dimensional parametric model $\{Q_n(t) : t \in \mathbb{R}\}$ given by

$$\mathrm{logit}\, Q_n(t)(A,W) = \mathrm{logit}\, Q_n(A,W) + t\,\frac{2A - 1}{g_n(A|W)}.$$

This parametric model fluctuates $Q_n$ in the direction of $(2A-1)/g_n(A|W)$ in the sense that $Q_n(0) = Q_n$ and

$$-\frac{d}{dt}\,\ell(Y, Q_n(t)(A,W)) = (Y - Q_n(t)(A,W))\,\frac{2A - 1}{g_n(A|W)} \quad (18)$$

for all $t \in \mathbb{R}$. The optimal move along the fluctuation is indexed by

$$t_n \equiv \arg\min_{t \in \mathbb{R}} P_N^R\, \ell(Y, Q_n(t)(A,W)) \quad (19)$$

(note that the random function $t \mapsto P_N^R \ell(Y, Q_n(t)(A,W))$ is strictly convex). Define $Q_n^* \equiv Q_n(t_n)$ and let $P_n^*$ be any element $P$ of $\mathcal{M}$ such that $Q_P = Q_n^*$, $g_P = g_n$ and $P_W = P_{n,W} = P_{N,W}^R$. Our final estimator is

$$\psi_n^* \equiv \Psi^b(P_n^*) = \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,(Q_n^*(1, W_i) - Q_n^*(0, W_i)).$$

By definition of $t_n$ and (18), we have $P_N^R D_1^b(P_n^*) = 0$ (just like $P_N^R D_1^b(P_n) = 0$) and

$$P_N^R\, \frac{d}{dt}\,\ell(Y, Q_n(t)(A,W))\Big|_{t = t_n} = -P_N^R D_2^b(P_n^*) = 0$$

(whereas it is very unlikely that $P_N^R D_2^b(P_n)$ be equal to zero). Consequently, (3) is met because $P_N^R D^b(P_n^*) = 0$ exactly. Theorem 1 is tailored to the present setting in Section 3.3.
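To fix ideas, here is a condensed sketch of the targeting step and final plug-in for this example, assuming $Q_n$ and $g_n$ have already been fitted by some weighted regression procedure; the array w holds the HT weights $\eta_i/(N p_i)$ of the selected observations and the one-dimensional optimization implements (19):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(x):
    return np.log(x / (1.0 - x))

def tmle_binary(W, A, Y, w, Q_n, g_n):
    """Targeting step (19) and plug-in for Psi^b on a rejective sub-sample.

    w   : HT weights eta_i / (N p_i) of the selected observations
    Q_n : function (a, W) -> initial estimate of E[Y | A = a, W], with values in (0, 1)
    g_n : function W -> estimate of g(1 | W) = P(A = 1 | W), bounded away from 0 and 1
    """
    g1 = g_n(W)
    H = (2 * A - 1) / np.where(A == 1, g1, 1.0 - g1)   # clever covariate (2A - 1) / g_n(A | W)
    QA = Q_n(A, W)

    def risk(t):                                       # weighted logistic risk P^R_N ell(Y, Q_n(t))
        Qt = expit(logit(QA) + t * H)
        return -np.sum(w * (Y * np.log(Qt) + (1 - Y) * np.log(1 - Qt)))

    t_n = minimize_scalar(risk).x                      # strictly convex in t
    Q1 = expit(logit(Q_n(1, W)) + t_n / g1)            # fluctuated Q*_n(1, .)
    Q0 = expit(logit(Q_n(0, W)) - t_n / (1.0 - g1))    # fluctuated Q*_n(0, .)
    return np.sum(w * (Q1 - Q0))                       # psi*_n; sum(w) is close to 1 by design
```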
3.2 A continuous exposure

In this section, $\mathcal{A} \subset \mathbb{R}$ is a bounded subset of $\mathbb{R}$ containing 0, which serves as a reference value. Moreover, we assume that $P_{0,A|W}(A = 0|W) > 0$ $P_{0,W}$-almost surely, and the existence of a constant $c(P_0) > 0$ such that $P_{0,A|W}(A = 0|W) \geq c(P_0)$ $P_{0,W}$-almost surely. Introduced in [12, 10], the true parameter of interest is

$$\psi^c \equiv \arg\min_{\beta \in \mathbb{R}} E_{P_0}\left[(Y - E_{P_0}[Y|A=0,W] - \beta A)^2\right] = \arg\min_{\beta \in \mathbb{R}} E_{P_0}\left[(E_{P_0}[Y|A,W] - E_{P_0}[Y|A=0,W] - \beta A)^2\right] \quad (20)$$

(the superscript "c" stands for "continuous").

Let $\mathcal{M}$ be the set of finite measures $P$ on $\mathcal{O} \equiv \mathcal{W} \times \mathcal{A} \times [0,1]$ equipped with the Borel $\sigma$-field such that there exists a constant $c(P) > 0$ for which the measure under $P_W$ of the set $\{w \in \mathcal{W} : P_{A|W=w}(\mathcal{A}\setminus\{0\}) > 0 \text{ and } P_{A|W=w}(\{0\}) \geq c(P)\}$ equals $P(\mathcal{O})$. In particular, $P_0 \in \mathcal{M}$ by the above assumption.

We see $\psi^c$ as the value at $P_0$ of the functional $\Psi^c$ characterized over $\mathcal{M}$ by

$$\Psi^c(P) \equiv \arg\min_{\beta \in \mathbb{R}} \int_{\mathcal{A}\times\mathcal{W}} (Q_P(a,w) - Q_P(0,w) - \beta a)^2\, dP_{A|W=w}(a)\, dP_W(w), \quad (21)$$

using the notation of Section 3.1. By Proposition 1 in [12], for each $P \in \mathcal{M}$,

$$\Psi^c(P) = \frac{\int_{\mathcal{A}\times\mathcal{W}} a\,(Q_P(a,w) - Q_P(0,w))\, dP_{A|W=w}(a)\, dP_W(w)}{\int_{\mathcal{A}\times\mathcal{W}} a^2\, dP_{A|W=w}(a)\, dP_W(w)}.$$

In particular, if $P$ is a distribution (i.e., if $P(\mathcal{O}) = 1$), then

$$\Psi^c(P) = \frac{E_P[A(Q_P(A,W) - Q_P(0,W))]}{E_P[A^2]}.$$

For clarity, we introduce some notation. For each $P \in \mathcal{M}$ and $w \in \mathcal{W}$, set $\mu_P(w) \equiv \int_{\mathcal{A}} a\, dP_{A|W=w}(a)$, $g_P(0|w) \equiv P_{A|W=w}(\{0\})$, and $\zeta(P) \equiv \int_{\mathcal{A}\times\mathcal{W}} a^2\, dP_{A|W=w}(a)\, dP_W(w)$. If $P(\mathcal{O}) = 1$, then $\mu_P(W) = E_P[A|W]$, $g_P(0|W) = P(A = 0|W)$, and $\zeta(P) = E_P[A^2]$.

Pathwise differentiability.

A result similar to Lemma 1 [see 12, Proposition 1] guarantees that $\Psi^c$ is pathwise differentiable like $\Psi^b$, with influence curves $D^c(P) \equiv D_1^c(P) + D_2^c(P) \in L_0^2(P)$ given by

$$\zeta(P)\, D_1^c(P)(O) \equiv A(Q_P(A,W) - Q_P(0,W)) - A^2\Psi^c(P),$$
$$\zeta(P)\, D_2^c(P)(O) \equiv (Y - Q_P(A,W))\left(A - \mu_P(W)\,\frac{\mathbf{1}\{A = 0\}}{g_P(0|W)}\right)$$

(all $P \in \mathcal{M}$). Let $R^c : \mathcal{M} \times \mathcal{M} \to \mathbb{R}$ be characterized by

$$R^c(P, P') \equiv \Psi^c(P') - \Psi^c(P) - (P' - P)D^c(P),$$

as in (4) and (14). As in the previous example, $R^c$ satisfies (1) and, for every $P, P' \in \mathcal{M}$,

$$R^c(P, P') = \left(1 - \frac{\zeta(P')}{\zeta(P)}\right)\left(\Psi^c(P') - \Psi^c(P)\right) + \frac{1}{\zeta(P)}\, P'\left((Q_{P'}(0,\cdot) - Q_P(0,\cdot))\left(\mu_{P'} - \mu_P\,\frac{g_{P'}(0|\cdot)}{g_P(0|\cdot)}\right)\right). \quad (22)$$

Introduce

$$\gamma_n \equiv 1 - \frac{\zeta(P_0)}{\zeta(P_n^*)} \quad \text{and} \quad \Gamma_n \equiv 1 - \frac{\zeta_n(P_0)}{\zeta_n(P_n^*)},$$

where $\zeta_n(P_0)$ and $\zeta_n(P_n^*)$ estimate $\zeta(P_0)$ and $\zeta(P_n^*)$. With these choices, (22) guarantees that A4 is fulfilled in the context of this example when $\zeta(P_n^*)$ converges in probability to a finite real number (so that $\gamma \neq 1$) and

$$\frac{1}{\zeta(P_n^*)}\, P_0\left((Q_{P_0}(0,\cdot) - Q_{P_n^*}(0,\cdot))\left(\mu_{P_0} - \mu_{P_n^*}\,\frac{g_{P_0}(0|\cdot)}{g_{P_n^*}(0|\cdot)}\right)\right) = o_P(1/\sqrt{n}).$$

Through the product, we will draw advantage from the synergistic convergences of $Q_{P_n^*}(0,\cdot)$ to $Q_{P_0}(0,\cdot)$ and $(\mu_{P_n^*}, g_{P_n^*})$ to $(\mu_{P_0}, g_{P_0})$ (by the Cauchy-Schwarz inequality, for example).

Construction of the targeted estimator.
Let $\mathcal{Q}^w$, $\mathcal{M}^w$ and $\mathcal{G}^w$ be three user-supplied classes of functions mapping $\mathcal{A}\times\mathcal{W}$, $\mathcal{W}$ and $\mathcal{W}$ to $[0,1]$. We first estimate $Q_{P_0}$, $\mu_{P_0}$ and $g_{P_0}$ with $Q_n$, $\mu_n$ and $g_n$ built upon $P_N^R$, $\mathcal{Q}^w$, $\mathcal{M}^w$ and $\mathcal{G}^w$. For instance, one could simply minimize (weighted) empirical risks and define

$$Q_n \equiv \arg\min_{Q \in \mathcal{Q}^w} P_N^R\, \ell(Y, Q(A,W)) = \arg\min_{Q \in \mathcal{Q}^w} \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\ell(Y_i, Q(A_i, W_i)),$$
$$\mu_n \equiv \arg\min_{\mu \in \mathcal{M}^w} P_N^R\, \ell(A, \mu(W)) = \arg\min_{\mu \in \mathcal{M}^w} \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\ell(A_i, \mu(W_i)),$$
$$g_n \equiv \arg\min_{g \in \mathcal{G}^w} P_N^R\, \ell(\mathbf{1}\{A = 0\}, g(0|W)) = \arg\min_{g \in \mathcal{G}^w} \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\ell(\mathbf{1}\{A_i = 0\}, g(0|W_i))$$

(assuming that the argmins exist). Alternatively, one could prefer minimizing cross-validated (weighted) empirical risks. We also estimate the marginal distribution $P_{0,W}$ of $W$ under $P_0$ with

$$P_{N,W}^R \equiv \frac{1}{N}\sum_{i=1}^N \frac{\eta_i}{p_i}\,\mathrm{Dirac}(W_i), \quad (23)$$

and the real-valued parameter $\zeta(P_0)$ with $\zeta(P_{N,A}^R)$, where $P_{N,A}^R$ is defined as in (23) with $A$ and $A_i$ substituted for $W$ and $W_i$.

Let $P_n$ be a measure such that $Q_{P_n} = Q_n$, $\mu_{P_n} = \mu_n$, $g_{P_n} = g_n$, $\zeta(P_n) = \zeta(P_{N,A}^R)$, $P_{n,W} = P_{N,W}^R$, and from which we can sample $A$ conditionally on $W$. Picking such a $P_n$ is an easy technical task, see [12, Lemma 5] for a computationally efficient choice. The initial estimator $\Psi^c(P_n)$ of $\psi^c$ can then be computed with high accuracy by Monte Carlo. It suffices to sample a large number $B$ of independent $(A^{(b)}, W^{(b)})$ by (i) sampling $W^{(b)}$ from $P_{n,W} = P_{N,W}^R$ then (ii) sampling $A^{(b)}$ from the conditional distribution of $A$ given $W = W^{(b)}$ under $P_n$, repeatedly for $b = 1, \ldots, B$, and to make the approximation

$$\Psi^c(P_n) \approx \frac{B^{-1}\sum_{b=1}^B A^{(b)}\left(Q_n(A^{(b)}, W^{(b)}) - Q_n(0, W^{(b)})\right)}{\zeta(P_n)}. \quad (24)$$

However, the construction $\Psi^c(P_n)$ is not tailored/targeted to $\psi^c$ yet.
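A sketch of the Monte Carlo approximation (24); the conditional sampler sample_A_given_W is a hypothetical stand-in for the one granted by [12, Lemma 5]:

```python
import numpy as np

def monte_carlo_psi_c(W_obs, w, Q_n, sample_A_given_W, zeta_n, B=10**5, rng=None):
    """Approximation (24): Psi^c(P_n) ~ mean_b A_b (Q_n(A_b, W_b) - Q_n(0, W_b)) / zeta_n.

    W_obs : contexts of the sub-sample; w : their HT weights eta_i / (N p_i)
    Q_n   : (a, W) -> estimate of E[Y | A = a, W]; zeta_n : estimate of zeta(P) = E[A^2]
    """
    rng = rng or np.random.default_rng()
    draw = rng.choice(len(W_obs), size=B, p=w / w.sum())  # (i) W^(b) from the normalized HT marginal
    Wb = W_obs[draw]
    Ab = sample_A_given_W(Wb, rng)                        # (ii) A^(b) | W^(b) under P_n (hypothetical)
    return np.mean(Ab * (Q_n(Ab, Wb) - Q_n(0, Wb))) / zeta_n
```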
It is now time to target the inference procedure. Targeting the inference procedure consists in modifying $P_n$ in such a way that the resulting $P_n^*$ satisfies (3) with $D^c$ substituted for $D$. We proceed iteratively. Suppose that $P_n^k$ has been constructed for some $k \geq 0$. We fluctuate $P_n^k$ with the one-dimensional parametric model $\{P_n^k(t) : t \in \mathbb{R}, |t| \leq c(P_n^k)/\|D^c(P_n^k)\|_\infty\}$ characterized by

$$\frac{dP_n^k(t)}{dP_n^k} = 1 + t\, D^c(P_n^k).$$

Lemma 1 in [12] shows how $Q_{P_n^k(t)}$, $\mu_{P_n^k(t)}$, $g_{P_n^k(t)}$, $\zeta(P_n^k(t))$ and $P_{n,W}^k(t)$ depart from their counterparts at $t = 0$. The optimal move along the fluctuation is indexed by

$$t_n^k \equiv \arg\max_t P_N^R \log\left(1 + t\, D^c(P_n^k)\right),$$

i.e., the maximum likelihood estimator of $t$ (note that the random function $t \mapsto P_N^R \log(1 + t D^c(P_n^k))$ is strictly concave). It results in the $(k+1)$-th update of $P_n$, $P_n^{k+1} \equiv P_n^k(t_n^k)$.

Contrary to what happened in the first example, see Section 3.1, there is no guarantee that $P_n^{k+1}$ will coincide with its predecessor $P_n^k$. In this light, the updating procedure of Section 3.1 converged in one single step. Here, we assume that the iterative updating procedure converges (in $k$) in the sense that, for $k_n$ large enough, $P_N^R D^c(P_n^{k_n}) = o_P(1/\sqrt{n})$. We set $P_n^* \equiv P_n^{k_n}$.
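The multi-step targeting can be organized as the following loop; a sketch in which update_influence_curve is a hypothetical helper that re-evaluates $D^c(P_n^{k+1})$ on the sub-sample after each move (the computationally demanding part), and the stopping threshold is a crude stand-in for the $o_P(1/\sqrt{n})$ requirement:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def target_iteratively(D_values, w, update_influence_curve, n, max_iter=10):
    """Iterative targeting: repeatedly fit t by weighted MLE along dP(t) = (1 + t D) dP.

    D_values : evaluations of D^c(P^0_n) on the sub-sample; w : HT weights
    update_influence_curve : (t, D_values) -> evaluations of D^c(P^{k+1}_n)
    Stops once the weighted empirical mean P^R_N D^c(P^k_n) is negligible.
    """
    for _ in range(max_iter):
        if abs(np.sum(w * D_values)) < 1.0 / n:           # stopping criterion
            break
        bound = 1.0 / (np.max(np.abs(D_values)) + 1e-12)  # keep 1 + t D positive
        t_k = minimize_scalar(
            lambda t: -np.sum(w * np.log1p(t * D_values)),  # minus weighted log-likelihood
            bounds=(-0.99 * bound, 0.99 * bound), method="bounded").x
        D_values = update_influence_curve(t_k, D_values)  # influence curve under P^{k+1}_n
    return D_values
```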
It is actually possible to come up with a one-step updating procedure (i.e., an updating procedure such that $P_n^k = P_n^{k+1}$ for all $k \geq 1$) in this example too, by relying on so-called universally least favorable models [20]. We adopt the multi-step updating procedure for simplicity.

We can assume without loss of generality that we can sample $A$ conditionally on $W$ from $P_n^*$. The final estimator is computed with high accuracy like $\Psi^c(P_n)$ previously: with $Q_n^* \equiv Q_{P_n^*}$, we sample $B$ independent $(A^{(b)}, W^{(b)})$ by (i) sampling $W^{(b)}$ from $P_{n,W}^*$ then (ii) sampling $A^{(b)}$ from the conditional distribution of $A$ given $W = W^{(b)}$ under $P_n^*$, repeatedly for $b = 1, \ldots, B$, and make the approximation

$$\psi_n^* \equiv \Psi^c(P_n^*) \approx \frac{B^{-1}\sum_{b=1}^B A^{(b)}\left(Q_n^*(A^{(b)}, W^{(b)}) - Q_n^*(0, W^{(b)})\right)}{\zeta(P_n^*)}. \quad (25)$$

Theorem 1 is tailored to the present setting in Section 3.3.

3.3 Tailoring Theorem 1 to the two examples

Consider the following assumptions for the study of $\psi_n^*$ in the setting of Section 3.1:

A1$^b$ The classes $\mathcal{Q}^w$ and $\mathcal{G}^w$ are separable, $P_0 Q^2 h^{-1} < \infty$ and $P_0 g^2 h^{-1} < \infty$ for all $(Q, g) \in \mathcal{Q}^w \times \mathcal{G}^w$, and $J(1, \mathcal{Q}^w, \|\cdot\|_{2,P_0}) < \infty$, $J(1, \mathcal{G}^w, \|\cdot\|_{2,P_0}) < \infty$. Moreover, A2$^*$ is met by $\mathcal{Q}^w$ and $\mathcal{G}^w$.

A2$^b$ There exists $\bar{P} \in \mathcal{M}$ such that $\|D^b(P_n^*) - D^b(\bar{P})\|_{2,P_0} = o_P(1)$. Moreover, $\|Q_n^* - Q_{\bar{P}}\|_{2,P_0} \times \|g_n - g_{\bar{P}}\|_{2,P_0} = o_P(1/\sqrt{n})$ and one knows a conservative estimator $\Sigma_n$ of $P_0 D^b(\bar{P})^2 h^{-1}$.

The assumptions required for the study of $\psi_n^*$ in the setting of Section 3.2 are very similar:

A1$^c$ There exists a set
$\mathcal{F} \subset \{D^c(P) : P \in \mathcal{M}\}$ such that A1 and A2 are verified.

A2$^c$ There exist $\zeta_- > 0$ and $\bar{P} \in \mathcal{M}$ with $\zeta(\bar{P}) \geq \zeta_- > 0$ such that

$$\zeta(P_n^*) = \zeta(\bar{P}) + O_P(1/\sqrt{n}), \quad \|D^c(P_n^*) - D^c(\bar{P})\|_{2,P_0} = o_P(1),$$
$$\|Q_n^* - Q_{\bar{P}}\|_{2,P_0} \times \left(\|\mu_n^* - \mu_{\bar{P}}\|_{2,P_0} + \|g_n - g_{\bar{P}}\|_{2,P_0}\right) = o_P(1/\sqrt{n}).$$

Moreover, $\Gamma_n - \gamma_n = o_P(1)$ and one knows a conservative estimator $\Sigma_n$ of $P_0 D^c(\bar{P})^2 h^{-1}$.

In A2$^b$, $Q_{\bar{P}}$ and $g_{\bar{P}}$ should be interpreted as the limits of $Q_{P_n^*}$ and $g_{P_n^*}$. Likewise, $Q_{\bar{P}}$, $\mu_{\bar{P}}$ and $g_{\bar{P}}$ in A2$^c$ should be interpreted as the limits of $Q_{P_n^*}$, $\mu_{P_n^*}$ and $g_{P_n^*}$.

Corollary 1.
Set $\alpha \in (0,1)$. In the setting of Section 3.1 and under A1$^b$, A2$^b$,

$$\left[\psi_n^* \pm \xi_{1-\alpha/2}\frac{\sqrt{\Sigma_n}}{\sqrt{n}}\right]$$

is a confidence interval for $\psi^b$ with asymptotic coverage no less than $(1-\alpha)$. In the setting of Section 3.2 and under A1$^c$, A2$^c$,

$$\left[\psi_n^* \pm \xi_{1-\alpha/2}\frac{\sqrt{\Sigma_n}}{(1-\Gamma_n)\sqrt{n}}\right]$$

is a confidence interval for $\psi^c$ with asymptotic coverage no less than $(1-\alpha)$.

4 Simulation study

We illustrate the methodology with the inference of the variable importance measure of a continuous exposure presented in Section 3.2. We consider three data-generating distributions $P_{0,1}$, $P_{0,2}$ and $P_{0,3}$ of a data-structure $O = (W, A, Y)$. The three distributions differ only in terms of the conditional variance of $Y$ given $(A, W)$, but do so drastically. Specifically, $O = (W, A, Y)$ drawn from $P_{0,j}$
($j = 1, 2, 3$) is such that:

• $W \equiv (V, W_1, W_2)$, where $V$ takes its values in $\{1, 2, 3\}$ and, conditionally on $V$, $(W_1, W_2)$ is a Gaussian random vector whose mean and covariance matrix depend on $V$;
• conditionally on $W$, $A = 0$ with a probability depending on $(W_1, W_2)$ (equal to 80% on part of its range); when $A \neq 0$, $A$ is drawn from a non-central $\chi^2$-distribution with 1 degree of freedom and a non-centrality parameter that is a function of $(W_1, W_2)$;
• conditionally on $(W, A)$, $Y$ is a Gaussian random variable whose mean $E_{P_{0,j}}[Y|A,W]$ is a fixed function of $(A, W_1, W_2)$, common to $j = 1, 2, 3$, and whose standard deviation equals
  - 1.5 (if $V = 1$), 1 (if $V = 2$) and 0.5 (if $V = 3$) for $j = 1$;
  - 1 (if $V = 1$), 5 (if $V = 2$) and 10 (if $V = 3$) for $j = 2$;
  - 50 (if $V = 1$), 10 (if $V = 2$) and 1 (if $V = 3$) for $j = 3$.

The unique true parameter is $\psi^c = \Psi^c(P_{0,1}) = \Psi^c(P_{0,2}) = \Psi^c(P_{0,3})$. It equals approximately 0.1204. For each $j = 1, 2,
3$, we repeat independently $B$ times the following steps:

1. simulate a data set of $N$ independent observations drawn from $P_{0,j}$;
2. extract $n_0$ observations from the data set by survey sampling with $h \equiv 1$ and, based on these observations:
   (a) apply the procedure described in Section 3.2 and retrieve $D^c(P_n^{k_n})$;
   (b) set $f_{n,0} \equiv D^c(P_n^{k_n})$, regress $f_{n,0}^2(O)$ on $V$, and call $\tilde{f}_{n,0}$ the square root of the resulting conditional expectation, see (8);
   (c) estimate the marginal distribution of $V$, estimate $P_0\tilde{f}_{n,0}$ with $\pi_{n,0}$, and set $h \equiv \tilde{f}_{n,0}/\pi_{n,0}$;
3. for each $n$ in a grid of five sub-sample sizes, successively, extract by survey sampling with $h$ a sub-sample of $n$ observations from the data set (deprived of the observations extracted in step 2) and, based on these observations, apply the procedure described in Section 3.2 (one repetition is sketched right after this list). We use $\Sigma_n$ given in (7) to estimate $\sigma_0^2$, although we are not sure in advance that it is a conservative estimator.

We thus obtain $15 \times B$ estimates of $\psi^c$ and their respective confidence intervals.
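Schematically, one repetition of the protocol can be organized as follows, reusing rejective_sample from the sketch of Section 2.1; simulate, run_tmle_npvi and estimate_h are hypothetical helpers, and the sample sizes are mere placeholders:

```python
import numpy as np

def one_repetition(simulate, run_tmle_npvi, estimate_h, rng,
                   N=10**6, n0=10**3, n_grid=(10**3, 5 * 10**3, 10**4, 5 * 10**4, 10**5)):
    """One repetition of the protocol; N, n0 and n_grid are placeholders, not the study's values."""
    data = simulate(N, rng)                               # step 1: N i.i.d. draws from P_{0,j}
    ones = lambda v: np.ones_like(v, dtype=float)
    pilot, p = rejective_sample(data.V, ones, n0, rng)    # step 2: pilot sub-sample with h = 1
    fit0 = run_tmle_npvi(data, pilot, p)                  # steps 2(a)-(b)
    h = estimate_h(fit0.influence_curve, data.V[pilot])   # step 2(c): vectorized h = f~ / pi
    rest = np.setdiff1d(np.arange(N), pilot)              # exclude the pilot observations
    fits = []
    for n in n_grid:                                      # step 3
        idx, p = rejective_sample(data.V[rest], h, n, rng)
        fits.append(run_tmle_npvi(data, rest[idx], p))    # TMLE + CI on each sub-sample
    return fits
```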
To give an idea of what the optimal $h$ is in each case, we save the result of step 2 in the above list in the first of the $B$ simulations under $P_{0,1}$, $P_{0,2}$ and $P_{0,3}$: the (approximately) optimal $h$ equals
- $h_1$, given by $(h_1(1), h_1(2), h_1(3))$, under $P_{0,1}$;
- $h_2$, given by $(h_2(1), h_2(2), h_2(3))$, under $P_{0,2}$;
- $h_3$, given by $(h_3(1), h_3(2), h_3(3))$, under $P_{0,3}$.
Note how different $h_1$, $h_2$ and $h_3$ are (to facilitate the comparisons, $h_1$, $h_2$ and $h_3$ are renormalized to satisfy $P_{0,j} h_j = 1$ for $j = 1, 2, 3$).

The TMLE procedure is implemented in the R package called tmle.npvi [11, 10]. Note, however, that it is necessary to compute $\Gamma_n$ and $\Sigma_n$ in addition to what the package provides. Specifically, we fine-tune the TMLE procedure by setting iter (the maximum number of iterations of the targeting step) to 7 and stoppingCriteria to list(mic=0.01, div=0.01, psi=0.05). Moreover, we use the default flavor called "learning", thus notably relying on parametric linear models for the estimation of the infinite-dimensional parameters $Q_{P_0}$, $\mu_{P_0}$ and $g_{P_0}$ and their fluctuation. We refer the interested reader to the package's manual and vignette for details.

Sampford's sampling method [18] implements the survey sampling described in Section 2.1. However, when the ratio $n/N$ is close to 0 or 1, this acceptance-rejection algorithm typically takes too much time to succeed. In our setting, this is the case for most of the sub-sample sizes that we consider. To circumvent that issue, we approximate the survey sampling described in Section 2.1 with a Pareto sampling [see Algorithm 2 in 4, Section 5].

Table 1: Simulation results under $P_{0,1}$, $P_{0,2}$ and $P_{0,3}$, for both the optimal $h$ and $h \equiv 1$, and for every sub-sample size $n$. Each block reports the empirical bias of the estimators (b., $B^{-1}\sum_{b=1}^B |\psi_{n,b}^* - \psi^c|$), the $p$-value of a Shapiro-Wilk test of normality ($p$-val.), the empirical coverage of the confidence intervals (c., $B^{-1}\sum_{b=1}^B \mathbf{1}\{\psi^c \in I_{n,b}\}$), $n$ times the empirical variance of the estimators (v.) and the empirical mean of $n$ times the estimated variance of the estimators (e. v., $B^{-1}\sum_{b=1}^B \Sigma_{n,b}$).

The results are summarized in Table 1. We first focus on the empirical bias of the TMLE and the $p$-values of the Shapiro-Wilk test of normality of its distribution. In all settings, the empirical bias decreases as $n$ grows (under $P_{0,1}$, the empirical biases for the two largest sub-sample sizes equal 0.0044 and 0.0036, whether relying on $h_1$ or $h \equiv 1$). Under each $P_{0,j}$ and for every sub-sample size, the empirical bias is smaller when relying on $h_j$ than on $h \equiv 1$, approximately twice smaller under $P_{0,1}$. As expected, due to our choices of conditional standard deviations of $Y$ given $(A, W)$, the empirical bias is larger under $P_{0,3}$ than under $P_{0,2}$, and larger under $P_{0,2}$ than under $P_{0,1}$. Except under $P_{0,1}$ when relying on $h_1$, for every $n$ beyond the smallest sub-sample size, the $p$-values of the Shapiro-Wilk test of normality are coherent with the convergence in law of the TMLE to a Gaussian distribution. Under $P_{0,1}$ and when relying on $h_1$, there is more evidence of a departure from a Gaussian distribution. Inspecting the results of the simulation studies reveals that this is mostly due to slightly too heavy tails.

We now focus on the empirical coverage, the empirical variance and the mean of the estimated variance of the TMLE. Consider the table about the simulation under $P_{0,1}$ first. For the three smallest sub-sample sizes,
the empirical coverage is satisfying when relying on both $h_1$ and $h \equiv 1$. At each of these sub-sample sizes, it does seem that we achieve the conservative estimation of $\sigma_0^2$. However, the empirical coverage deteriorates sharply for the two largest sub-sample sizes. It appears that, concomitantly, the empirical variance of the estimators increases strongly. This may be due to the fact that, there, $n$ is not that small compared to $N$, so that neglecting the second term in the left-hand side of (10) is inadequate and $\sigma_0^2$ is not the limiting variance. In conclusion, note that resorting to the optimal $h$ does not yield much gain in terms of empirical variance of the estimators.

We now turn to the two remaining tables. The first striking feature is that the empirical coverage largely exceeds the nominal coverage of 95%. The comparison of the empirical variance with the mean of the estimated variance reveals that we do achieve the conservative estimation of $\sigma_0^2$. The second striking feature is that the empirical variance stabilizes for the larger sub-sample sizes, contrary to what happens under $P_{0,1}$. It still holds that $n$ may not be small compared to $N$. Perhaps this is counterbalanced by the fact that, by increasing starkly the conditional variance of $Y$ given $(A, W)$ under $P_{0,2}$ and $P_{0,3}$ relative to $P_{0,1}$, we make $P_0 f_0^2 h^{-1}$, the first term in the left-hand side of (10), much larger than the second term $n(P_0 f_0)^2/N$. Finally, resorting to the optimal $h$ yields, both under $P_{0,2}$ and $P_{0,3}$, considerable gains in terms of empirical variance of the estimators and in terms of the width of the resulting confidence intervals.
Acknowledgements. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-13-BS01-0005 (project SPADRO).
A Proof of Theorem 1
Throughout the proofs, "$a \lesssim b$" means that there exists a universal constant $L > 0$ such that $a \leq Lb$.

We start with a central limit theorem for the empirical process $(\sqrt{n}(P_N^R - P_0)f)_{f\in\mathcal{F}}$. Its proof is given at the end of this section. Recall that a random process $G$ in $\ell^\infty(\mathcal{F})$ is $\|\cdot\|_{2,P_0}$-equicontinuous if, for each $\xi > 0$, there exists $\delta > 0$ such that, for all $f, f' \in \mathcal{F}$, $\|f - f'\|_{2,P_0} \leq \delta$ implies $E(|G(f - f')|) \leq \xi$.

Theorem 2.
Under A1 and A2, there exists a $\|\cdot\|_{2,P_0}$-equicontinuous Gaussian process $G^h \in \ell^\infty(\mathcal{F})$ with covariance operator $\Sigma$ such that $(\sqrt{n}(P_N^R - P_0)f)_{f\in\mathcal{F}}$ converges weakly in $\ell^\infty(\mathcal{F})$ towards $G^h$. The same result holds with $\mathcal{F}$ replaced by $\{f - f_0 : f \in \mathcal{F}\}$.

We now turn to the proof of Theorem 1. Since $P_n^* D(P_n^*) = 0$ (by definition, the influence function $D(P_n^*)$ is centered under $P_n^*$), A4 rewrites

$$R(P_n^*, P_0) = \psi_0 - \psi_n^* - P_0 D(P_n^*) = -\gamma_n(\psi_n^* - \psi_0) + o_P(1/\sqrt{n}),$$

hence

$$(1 - \gamma_n)\sqrt{n}(\psi_n^* - \psi_0) = -\sqrt{n}\, P_0 D(P_n^*) + o_P(1).$$

Moreover, (3) implies that the above equality also yields

$$(1 - \gamma_n)\sqrt{n}(\psi_n^* - \psi_0) = \sqrt{n}(P_N^R - P_0)D(P_n^*) + o_P(1) = \sqrt{n}(P_N^R - P_0)f_0 + \sqrt{n}(P_N^R - P_0)(D(P_n^*) - f_0) + o_P(1).$$

Theorem 2 implies in particular that $\sqrt{n}(P_N^R - P_0)f_0$ converges in law to the centered Gaussian distribution with variance $\Sigma(f_0, f_0)$.

Let us now prove that $\sqrt{n}(P_N^R - P_0)(D(P_n^*) - f_0) = o_P(1)$. This is a consequence of Theorem 2 and the concentration inequality of [25, Corollary 2.2.8]. Let $\|\cdot\|_{2,\Sigma}$ be the norm on $\mathcal{F}$ given by $\|f\|_{2,\Sigma}^2 \equiv \Sigma(f, f)$. For every $\delta > 0$, introduce $\mathcal{F}_\delta \equiv \{f \in \mathcal{F} : P_0(f - f_0)^2 \leq \delta^2\} \subset \mathcal{F}$. The diameter of $\mathcal{F}_\delta$ wrt $\|\cdot\|_{2,\Sigma}$ is at most $2\delta/\sqrt{c(h)}$. By [25, Corollary 2.2.8],

$$E\left[\sup_{f\in\mathcal{F}_\delta} G^h(f - f_0)\right] \lesssim \int_0^{2\delta/\sqrt{c(h)}} \sqrt{\log N(\epsilon, \mathcal{F}_\delta, \|\cdot\|_{2,\Sigma})}\, d\epsilon \lesssim \int_0^{2\delta/\sqrt{c(h)}} \sqrt{\log N(\epsilon, \mathcal{F}, \|\cdot\|_{2,\Sigma})}\, d\epsilon. \quad (26)$$

Set arbitrarily $\alpha, \beta > 0$, and choose $\delta > 0$ such that

$$\int_0^{2\delta/\sqrt{c(h)}} \sqrt{\log N(\epsilon, \mathcal{F}, \|\cdot\|_{2,\Sigma})}\, d\epsilon \leq \alpha\beta.$$

By Markov's inequality, (26) and the choice of $\delta$, it holds that

$$P\left(\sup_{f\in\mathcal{F}_\delta} G^h(f - f_0) \geq \alpha\right) \leq \alpha^{-1} E\left[\sup_{f\in\mathcal{F}_\delta} G^h(f - f_0)\right] \lesssim \beta.$$

Hence, Theorem 2 implies that, for $n$ large enough,

$$P\left(\sup_{f\in\mathcal{F}_\delta} \sqrt{n}(P_N^R - P_0)(f - f_0) \geq \alpha\right) \leq \beta. \quad (27)$$

Furthermore, by A3, $P(D(P_n^*) \notin \mathcal{F}_\delta) \leq \beta$ for $n$ large enough. Combining this inequality and (27) finally yields

$$P\left(\sqrt{n}(P_N^R - P_0)(D(P_n^*) - f_0) \geq \alpha\right) \leq P\left(\sqrt{n}(P_N^R - P_0)(D(P_n^*) - f_0) \geq \alpha,\ D(P_n^*) \in \mathcal{F}_\delta\right) + P(D(P_n^*) \notin \mathcal{F}_\delta)$$
$$\leq P\left(\sup_{f\in\mathcal{F}_\delta} \sqrt{n}(P_N^R - P_0)(f - f_0) \geq \alpha\right) + P(D(P_n^*) \notin \mathcal{F}_\delta) \leq 2\beta$$

for $n$ large enough. Consequently, $(1 - \gamma_n)\sqrt{n}(\psi_n^* - \psi_0)$ converges in law to the centered Gaussian distribution with variance $\Sigma(f_0, f_0)$. Applying Slutsky's lemma completes the proof.

Proof of Theorem 2. The proof relies on results from [14, 1]. For each $f \in \mathcal{F}$, define $Z_N(f) \equiv P_N^R f$ and

$$G_n^h(f) \equiv \sqrt{n}(P_N^R - P_0)f = \sqrt{n}(Z_N(f) - P_0 f).$$

We first state and prove the following lemma, by using [14, Lemma 4.3 and Theorem 7.1]:
Lemma 2.
For every (measurable) real-valued function $f$ on $\mathcal{O}$ such that $P_0 f^2/h$ is finite, $G_n^h(f)$ converges in law to the centered Gaussian distribution with variance $\sigma^2(f) \equiv E_{P_0}[f^2(O) h(V)^{-1}]$.

Proof of Lemma 2. This is a three-step proof.
Step 1: preliminary.
Set arbitrarily a measurable function $f : \mathcal{O} \to \mathbb{R}$ such that $P_0 f^2/h$ is finite and define

$$T_N(f) \equiv \frac{1}{N}\sum_{i=1}^N \frac{f(O_i)\varepsilon_i}{p_i}.$$

The only difference between $T_N(f)$ and $Z_N(f)$ is the substitution of $(\varepsilon_1, \ldots, \varepsilon_N)$ for $(\eta_1, \ldots, \eta_N)$. Since $(O_1, \varepsilon_1), \ldots, (O_N, \varepsilon_N)$ are independently sampled (from $P_0^\varepsilon$), it holds that $E_{P_0^\varepsilon}[T_N(f)] = P_0 f$ and

$$\mathrm{Var}_{P_0^\varepsilon}(T_N(f)) = \frac{1}{N}\mathrm{Var}_{P_0^\varepsilon}\left(\frac{f(O_1)\varepsilon_1}{p_1}\right) = \frac{1}{n} E_{P_0}\left[\frac{f^2(O)}{h(V)}\right] - \frac{1}{N}(P_0 f)^2 = \frac{\sigma^2(f)}{n} + o(1/n). \quad (28)$$

Thus, $\sqrt{n}(T_N(f) - P_0 f)$ converges in law to the centered Gaussian distribution with variance $\sigma^2(f)$. The challenge is now to derive another central limit theorem, for $Z_N(f)$, from this convergence in law.

Step 2: coupling.
The rest of the proof mainly hinges on coupling. We may assume without loss of generality that there exist $U_1, \ldots, U_N$, independently drawn from the uniform distribution on $[0,1]$ and independent of $(O_1, \ldots, O_N)$, such that, for each $1 \leq i \leq N$, $\varepsilon_i = \mathbf{1}\{U_i \leq p_i\}$. We now define $\ell_N \equiv n/\sum_{i=1}^N p_i$ and, for each $1 \leq i \leq N$, $\varepsilon_i(\ell_N) \equiv \mathbf{1}\{U_i \leq \ell_N p_i\}$. This is the first coupling used in the proof.

The second coupling is more elaborate. Due to Hajek, it gives rise to two random subsets $s_K$ and $s_n$ of $\{1, \ldots, N\}$ that we characterize now, in three successive steps. In the rest of this step of the proof, we work conditionally on $O_1, \ldots, O_N$.

1. Drawing $s_n \subset \{1, \ldots, N\}$:
(a) sample $(\eta'_1, \ldots, \eta'_N)$ from the conditional distribution of $(\varepsilon'_1, \ldots, \varepsilon'_N)$ given $\sum_{i=1}^N \varepsilon'_i = n$, when $\varepsilon'_1, \ldots, \varepsilon'_N$ are independently drawn from the Bernoulli distributions with parameters $\ell_N p_1, \ldots, \ell_N p_N$, respectively;
(b) define $s_n \equiv \{1 \leq i \leq N : \eta'_i = 1\}$ and $D_n \equiv \sum_{i \in s_n}(1 - \ell_N p_i)$ for future use.
We say simply that $s_n$ is drawn from the rejective sampling scheme on $\{1, \ldots, N\}$ with parameter $(\ell_N p_i : i \in \{1, \ldots, N\})$ (see Section 2).

2. Drawing $K \in \{0, \ldots, N\}$:
(a) sample $\varepsilon''_1, \ldots, \varepsilon''_N$ independently from the Bernoulli distributions with parameters $\ell_N p_1, \ldots, \ell_N p_N$, respectively;
(b) define $K \equiv \sum_{i=1}^N \varepsilon''_i$.

3. Drawing $s_K$:
(a) if $K = n$, then set $s_K \equiv s_n$;
(b) if $K > n$, then draw $s_{K-n}$ from the rejective sampling scheme on $\{1, \ldots, N\}\setminus s_n$ with parameter $((K-n)\ell_N p_i/D_n : i \in \{1, \ldots, N\}\setminus s_n)$ and set $s_K \equiv s_n \cup s_{K-n}$;
(c) if $K < n$, then draw $s_{n-K}$ from the rejective sampling scheme on $s_n$ with parameter $((n-K)\ell_N p_i/D_n : i \in s_n)$ and set $s_K \equiv s_n \setminus s_{n-K}$.

We denote by $S$ the joint law of $(s_K, s_n)$. Obviously, $S$ is such that $s_K \subset s_n$ or $s_n \subset s_K$, $S$-almost surely. We denote by $\mathcal{P}$ the law of the Poisson sampling scheme, i.e., the law of $\{1 \leq i \leq N : \varepsilon'_i = 1\}$ from the description of how $s_n$ is drawn. The law $S$ is a coupling of the rejective sampling scheme and an approximation to the Poisson sampling scheme $\mathcal{P}$ in the sense of the following corollary of [14, Lemma 4.3].

Proposition 1 (Hajek). If $d_N \equiv \sum_{i=1}^N p_i(1 - p_i)$ goes to infinity as $N$ goes to infinity, then the marginal distribution of $s_K$ when $(s_K, s_n)$ is drawn from $S$ converges to $\mathcal{P}$ in total variation.

The condition on $d_N$ is met for our choice of $(p_1, \ldots, p_N)$.

Step 3: concluding.
Introduce

$$T_N^{\ell_N}(f) \equiv \frac{1}{N}\sum_{i=1}^N \frac{f(O_i)\varepsilon_i(\ell_N)}{\ell_N p_i}, \quad T_N^{s_K}(f) \equiv \frac{1}{N}\sum_{i \in s_K} \frac{f(O_i)}{p_i}, \quad T_N^{s_n}(f) \equiv \frac{1}{N}\sum_{i \in s_n} \frac{f(O_i)}{p_i}.$$

The random variables $Z_N(f)$, $T_N^{\ell_N}(f)$, $T_N^{s_K}(f)$ and $T_N^{s_n}(f)$ satisfy the following properties.

• $Z_N(f)$ and $T_N^{s_n}(f)$ share a common law. This is a straightforward consequence of Proposition 1.
• $\sqrt{n}(T_N^{s_n}(f) - T_N^{s_K}(f)) = o_P(1)$. Indeed, it is shown in the proof of [14, Theorem 7.1] that the convergence of $d_N$ (defined in Proposition 1) to infinity implies, conditionally on $O_1, \ldots, O_N$, $\sqrt{n}(T_N^{s_K}(f) - T_N^{s_n}(f)) = o_P(1)$. The unconditional result readily follows.
• $T_N^{s_K}(f)$ and $T_N^{\ell_N}(f)$ have asymptotically the same law, in the sense that the total variation distance between their laws goes to 0 as $N$ goes to infinity. This is a consequence of Proposition 1.
• $\sqrt{n}(T_N^{\ell_N}(f) - T_N(f)) = o_P(1)$. It suffices to show that $E_{P_0^\varepsilon}[(T_N(f) - T_N^{\ell_N}(f))^2] = o(1/n)$. Observe now that

$$n E_{P_0^\varepsilon}\left[\left(T_N^{\ell_N}(f) - T_N(f)\right)^2\right] = \frac{n}{N} E_{P_0^\varepsilon}\left[\left(\frac{\varepsilon_1(\ell_N)}{\ell_N} - \varepsilon_1\right)^2 \frac{f^2(O_1)}{p_1^2}\right] = E_{P_0^\varepsilon}\left[\left(\frac{1}{\ell_N} - \frac{2(1 \wedge \ell_N)}{\ell_N} + 1\right)\frac{f^2(O_1)}{h(V_1)}\right].$$

The strong law of large numbers yields that $\ell_N$ converges to 1 almost surely, which implies that $1/\ell_N - 2(1\wedge\ell_N)/\ell_N + 1$ converges to 0 almost surely, hence the result by the dominated convergence theorem.

Consequently, $G_n^h(f) \equiv \sqrt{n}(Z_N(f) - P_0 f)$ and $\sqrt{n}(T_N(f) - P_0 f)$ have asymptotically the same law. The same arguments are valid when $\{f - f_0 : f \in \mathcal{F}\}$ is substituted for $\mathcal{F}$. Thus, the proof is complete.

We can now prove Theorem 2. We first note that Lemma 2 implies the asymptotic tightness of the real-valued random variable $G_n^h(f)$ for all $f \in \mathcal{F}$. Moreover, Lemma 2 and the Cramér-Wold device yield the convergence in law of $(G_n^h f_1, \ldots, G_n^h f_M)$ to $(G^h f_1, \ldots, G^h f_M)$ for all $(f_1, \ldots, f_M) \in \mathcal{F}^M$. Indeed, for each $(f_1, \ldots, f_M) \in \mathcal{F}^M$ and any $(\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^M$, $\bar{f} \equiv \sum_{m=1}^M \lambda_m f_m$ is measurable and $P_0 \bar{f}^2/h$ is finite hence, by Lemma 2, $\sum_{m=1}^M \lambda_m G_n^h f_m = G_n^h(\bar{f})$ converges in law to $G^h(\bar{f}) = \sum_{m=1}^M \lambda_m G^h f_m$. In addition, A1 implies that the diameter of $\mathcal{F}$ wrt $\|\cdot\|_{2,P_0}$ is finite. Therefore, by [25, Theorems 1.5.4 and 1.5.7], if for all $\alpha, \beta >$
0, there exists $\delta > 0$ such that

$$\limsup_{N\to\infty} P\left(\sup_{f, f' : \|f - f'\|_{2,P_0} < \delta} \left|G_n^h f - G_n^h f'\right| > \alpha\right) \leq \beta, \quad (29)$$

then Theorem 2 is valid.

Set arbitrarily $\alpha, \beta, \delta > 0$ and define $\mathcal{F}_\delta \equiv \{f - f' : f, f' \in \mathcal{F}, \|f - f'\|_{2,P_0} \leq \delta\}$. It is shown in [16] (see also [1]) that $\eta_1, \ldots, \eta_N$ are negatively associated in the following sense. For each $A_1, A_2 \subset \{1, \ldots, N\}$ with $A_1 \cap A_2 = \emptyset$ and all (measurable) $f_1 : \mathbb{R}^{d_1} \to \mathbb{R}$ and $f_2 : \mathbb{R}^{d_2} \to \mathbb{R}$ ($d_1 \equiv \mathrm{card}(A_1)$ and $d_2 \equiv \mathrm{card}(A_2)$), if $f_1$ and $f_2$ are increasing in every coordinate, then

$$\mathrm{cov}\left(f_1(\eta_i : i \in A_1), f_2(\eta_i : i \in A_2)\right) \leq 0.$$

This negative association implies a Hoeffding-type bound: conditionally on $O_1, \ldots, O_N$, for all $t > 0$ and $f \in \mathcal{F}_\delta$,

$$P\left(|G_n^h(f)| > t \mid O_1, \ldots, O_N\right) \leq 2\exp\left(-\frac{t^2}{2\rho_N^2(f)}\right),$$

where $\rho_N(f)$ is shorthand for $\rho_N(f, 0)$. Therefore, a classical chaining argument [25, Corollary 2.2.8, for instance] yields

$$E\left[\sup_{f, f' \in \mathcal{F}_\delta} |G_n^h(f) - G_n^h(f')| \,\Big|\, O_1, \ldots, O_N\right] \lesssim \int_0^\infty \sqrt{\log N(\epsilon, \mathcal{F}_\delta, \rho_N)}\, d\epsilon. \quad (30)$$

By A2, there exists a deterministic sequence $\{a_N\}_{N\geq 1}$ tending to 0 such that, for all $f, g \in \mathcal{F}$, $\rho_N(f, g) \leq (1 + a_N)\|f - g\|_{2,P_0}$, $P_0$-almost surely. Consequently, for every $\epsilon > 0$, it holds $P_0$-almost surely that

$$N(\epsilon, \mathcal{F}_\delta, \rho_N) \leq N(\epsilon/(1 + a_N), \mathcal{F}_\delta, \|\cdot\|_{2,P_0}).$$

Plugging the previous upper bound into (30), taking the expectation, using Markov's inequality and letting $N$ go to infinity then give

$$\limsup_{N\to\infty} P\left(\sup_{f, f' \in \mathcal{F}_\delta}\left|G_n^h(f) - G_n^h(f')\right| > \alpha\right) \lesssim \alpha^{-1}\int_0^\infty \sqrt{\log N(\epsilon, \mathcal{F}_\delta, \|\cdot\|_{2,P_0})}\, d\epsilon \lesssim \alpha^{-1} J(\delta, \mathcal{F}, \|\cdot\|_{2,P_0}).$$

By A1, it is possible to choose $\delta > 0$ so small that the right-hand side is smaller than $\beta$, hence (29) holds.

It only remains to determine the covariance of $G^h$. By adapting the proof of Lemma 2, it appears that $\mathrm{cov}(G^h(f), G^h(f')) = P_0 f f'/h = \Sigma(f, f')$ for all $f, f' \in \mathcal{F}$.
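Lemma 2 is easy to corroborate numerically. Since its proof reduces the rejective scheme to Poisson sampling, the empirical variance of $\sqrt{n}(T_N(f) - P_0 f)$ under plain Poisson sampling should already approach $\sigma^2(f) = E_{P_0}[f^2(O)/h(V)]$; a toy check (all choices below are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, reps = 200_000, 2_000, 500
stats = []
for _ in range(reps):
    V = rng.integers(1, 4, size=N)              # summary measures in {1, 2, 3}
    O = rng.normal(loc=V, scale=1.0)            # observations, correlated with V
    h = 0.5 + 0.25 * V                          # h(V) in {0.75, 1.0, 1.25}, E[h(V)] = 1
    p = n * h / N                               # inclusion probabilities p_i = n h(V_i) / N
    eps = rng.random(N) < p                     # Poisson sampling (no conditioning on the total)
    f = O - 1.0                                 # any f with P_0 f^2 / h finite
    T = np.mean(f * eps / p)                    # HT estimator T_N(f)
    stats.append(np.sqrt(n) * (T - np.mean(f))) # sqrt(n) (T_N(f) - P_N f), with P_N f ~ P_0 f
print(np.var(stats))                            # should approach sigma^2(f) = E[f^2(O) / h(V)]
```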
B Tailoring the main theorem in the setting of Section 3.1

Let us show that A1$^b$ and A2$^b$ imply A1–A4 in the setting of Section 3.1. Since $\mathcal{Q}^w$ and $\mathcal{G}^w$ are uniformly bounded away from 0 and 1, $t_n$ (19) necessarily belongs to a deterministic, compact subset $\mathcal{T}$ of $\mathbb{R}$. Define

$$\tilde{\mathcal{Q}}^w \equiv \left\{\mathrm{expit}\left(\mathrm{logit}\, Q + t\,\frac{2A - 1}{g(A|W)}\right) : Q \in \mathcal{Q}^w, g \in \mathcal{G}^w, t \in \mathcal{T}\right\}$$

then

$$\mathcal{F} \equiv \{D^b(P) : P \in \mathcal{M} \text{ s.t. } Q_P \in \tilde{\mathcal{Q}}^w, g_P \in \mathcal{G}^w\}.$$

Obviously, $D^b(P_n^*) \in \mathcal{F}$ and $\sup_{f\in\mathcal{F}}\|f\|_\infty$ is finite. Furthermore, because expit is 1-Lipschitz and logit is Lipschitz on any compact subset of $(0,1)$, two elements $\tilde{Q}, \tilde{Q}' \in \tilde{\mathcal{Q}}^w$ respectively parametrized by $(Q, g, t)$ and $(Q', g', t')$ satisfy

$$\|\tilde{Q} - \tilde{Q}'\|_{2,P_0} \lesssim \|Q - Q'\|_{2,P_0} + \|g - g'\|_{2,P_0} + |t - t'|.$$

Therefore, the finiteness of $J(1, \mathcal{Q}^w, \|\cdot\|_{2,P_0})$, $J(1, \mathcal{G}^w, \|\cdot\|_{2,P_0})$ and $J(1, \mathcal{T}, |\cdot|)$ implies the finiteness of $J(1, \tilde{\mathcal{Q}}^w, \|\cdot\|_{2,P_0})$. Moreover, the separability of $\mathcal{Q}^w$ and $\mathcal{G}^w$ yields that $\tilde{\mathcal{Q}}^w$ is also separable. Furthermore, for every $P, P' \in \mathcal{M}$ such that $D^b(P), D^b(P') \in \mathcal{F}$, it holds that

$$\|D^b(P) - D^b(P')\|_{2,P_0} \lesssim \|Q_P - Q_{P'}\|_{2,P_0} + \|g_P - g_{P'}\|_{2,P_0} + |\Psi^b(P) - \Psi^b(P')|. \quad (31)$$

We will prove this at the end of the section. By (31), the separability of $\tilde{\mathcal{Q}}^w$ and $\mathcal{G}^w$ implies that of $\mathcal{F}$. In addition, the finiteness of $J(1, \tilde{\mathcal{Q}}^w, \|\cdot\|_{2,P_0})$, $J(1, \mathcal{G}^w, \|\cdot\|_{2,P_0})$, $J(1, [0,1], |\cdot|)$ and (31) imply that $J(1, \mathcal{F}, \|\cdot\|_{2,P_0})$ is finite. We prove likewise, based on (31), that $\mathcal{F}$ has a finite uniform entropy integral because $\mathcal{Q}^w$ and $\mathcal{G}^w$ do. Finally, (15) and A2$^b$ imply A3 (by the Cauchy-Schwarz inequality) and A4.

Proof of (31). For any $P \in \mathcal{M}$, denote $q_P(W) \equiv Q_P(1,W) - Q_P(0,W)$. Set $P, P' \in \mathcal{M}$ such that $D^b(P), D^b(P') \in \mathcal{F}$. It holds that

$$\|D_1^b(P) - D_1^b(P')\|_{2,P_0} \leq \|q_P - q_{P'}\|_{2,P_0} + |\Psi^b(P) - \Psi^b(P')|.$$

Moreover,

$$\|D_2^b(P) - D_2^b(P')\|_{2,P_0} = \left\|(Y - Q_P(A,W))\,\frac{2A-1}{g_P(A|W)} - (Y - Q_{P'}(A,W))\,\frac{2A-1}{g_{P'}(A|W)}\right\|_{2,P_0}$$
$$\leq \left\|(Y - Q_P(A,W))(2A-1)\left(\frac{1}{g_P(A|W)} - \frac{1}{g_{P'}(A|W)}\right)\right\|_{2,P_0} + \left\|(Q_{P'} - Q_P)(A,W)\,\frac{2A-1}{g_{P'}(A|W)}\right\|_{2,P_0}$$
$$\lesssim \|g_P - g_{P'}\|_{2,P_0} + \|Q_P - Q_{P'}\|_{2,P_0},$$

where the last inequality relies on the uniform boundedness of $(Y - Q_P(A,W))(2A - 1)$
and $g_{P'}^{-1}$. The result follows since $\|q_P - q_{P'}\|_{2,P_0} \lesssim \|Q_P - Q_{P'}\|_{2,P_0}$.

The same kind of arguments allow one to verify that A1$^c$ and A2$^c$ also imply A1–A4.

References
[1] A. D. Barbour. Poisson approximation and the Chen-Stein method. Statistical Science, 5(4):425–427, 1990.
[2] Y. Berger. Rate of convergence to normal distribution for the Horvitz-Thompson estimator. Journal of Statistical Planning and Inference, 67(2):209–226, 1998.
[3] P. Bertail, E. Chautru, and S. Clémençon. Empirical processes in survey sampling. Scandinavian Journal of Statistics, October 2016. To appear.
[4] L. Bondesson, I. Traat, and A. Lundqvist. Pareto sampling versus Sampford and conditional Poisson sampling. Scandinavian Journal of Statistics. Theory and Applications, 33(4):699–720, 2006.
[5] N. E. Breslow and J. A. Wellner. Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scandinavian Journal of Statistics, 34(1):86–102, 2007.
[6] N. E. Breslow and J. A. Wellner. A Z-theorem with estimated nuisance parameters and correction note for "Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression". Scandinavian Journal of Statistics, 35, 2008.
[7] K. R. W. Brewer and M. E. Donadio. The high entropy variance of the Horvitz-Thompson estimator. Survey Methodology, 29(2):189–196, 2003.
[8] H. Cardot, D. Degras, and E. Josserand. Confidence bands for Horvitz-Thompson estimators using sampled noisy functional data. Bernoulli, 19(5A):2067–2097, 2013.
[9] H. Cardot, A. Dessertaine, C. Goga, E. Josserand, and P. Lardin. Comparison of different sample designs and construction of confidence bands to estimate the mean of functional data: An illustration on electricity consumption. Survey Methodology/Techniques d'enquêtes, 39:283–301, 2013.
[10] A. Chambaz and P. Neuvial. tmle.npvi: targeted, integrative search of associations between DNA copy number and gene expression, accounting for DNA methylation. Bioinformatics, 31(18):3054–3056, 2015.
[11] A. Chambaz and P. Neuvial. Targeted Learning of a Non-Parametric Variable Importance Measure of a Continuous Exposure, 2016. URL http://CRAN.R-project.org/package=tmle.npvi. R package version 0.10.0.
[12] A. Chambaz, P. Neuvial, and M. J. van der Laan. Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic Journal of Statistics, 6:1059–1099, 2012.
[13] A. Grafström. Entropy of unequal probability sampling designs. Statistical Methodology, 7(2):84–97, 2010.
[14] J. Hajek. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35(4):1491–1523, 1964.
[15] M. Hanif and K. R. W. Brewer. Sampling with unequal probabilities without replacement: a review. International Statistical Review/Revue Internationale de Statistique, pages 317–335, 1980.
[16] K. Joag-Dev and F. Proschan. Negative association of random variables with applications. The Annals of Statistics, 11(1):286–295, 1983.
[17] J. Pearl. Causality: models, reasoning and inference, volume 29. Cambridge University Press, 2000.
[18] M. R. Sampford. On sampling without replacement with unequal probabilities of selection. Biometrika, 54(3-4):499–513, 1967.
[19] M. J. van der Laan. Statistical inference for variable importance. International Journal of Biostatistics, 2, 2006.
[20] M. J. van der Laan. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. International Journal of Biostatistics, 2016. To appear.
[21] M. J. van der Laan and S. D. Lendle. Online targeted learning. Technical Report 330, U.C. Berkeley Division of Biostatistics, 2014. URL http://biostats.bepress.com/ucbbiostat/paper330.
[22] M. J. van der Laan and S. Rose. Targeted Learning. Springer, 2011. ISBN 978-1-4419-9781-4.
[23] M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. International Journal of Biostatistics, 2:Art. 11, 40, 2006. doi: 10.2202/1557-4679.1043.
[24] A. W. van der Vaart. Asymptotic Statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
[25] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
[26] J. C. Wang. Sample distribution function based goodness-of-fit test for complex surveys.