Efficient Estimation of Pathwise Differentiable Target Parameters with the Undersmoothed Highly Adaptive Lasso
Mark J. van der Laan*, David Benkeser, and Weixin Cai
Division of Biostatistics, University of California, Berkeley
Department of Biostatistics and Bioinformatics, Emory University
August 16, 2019
Abstract
We consider estimation of a functional parameter of a realistically modeled data distribution based on observing independent and identically distributed observations. We define an $m$-th order Spline Highly Adaptive Lasso Minimum Loss Estimator (Spline-HAL-MLE) of a functional parameter, obtained by minimizing the empirical risk over an $m$-th order smoothness class of functions. We show that this $m$-th order smoothness class consists of all functions that can be represented as an infinitesimal linear combination of tensor products of $\leq m$-th order spline basis functions, and that membership involves assuming $m$ derivatives in each coordinate. By selecting $m$ with cross-validation we obtain a Spline-HAL-MLE that is able to adapt to the underlying unknown smoothness of the true function, while guaranteeing a rate of convergence faster than $n^{-1/4}$, as long as the true function is cadlag (right-continuous with left-hand limits) and has finite sectional variation norm. The $m=0$ smoothness class consists of all cadlag functions with finite sectional variation norm and corresponds with the original HAL-MLE defined in van der Laan (2015).

In this article we establish that this Spline-HAL-MLE yields an asymptotically efficient estimator of any smooth feature of the functional parameter under an easily verifiable global undersmoothing condition. A sufficient condition for the latter is that the minimum of the empirical mean of the selected basis functions is smaller than a constant times $n^{-1/4}$, a condition that is not parameter specific. The undersmoothing condition therefore enforces the selection of the $L_1$-norm in the lasso to be large enough that the fit includes sparsely supported basis functions. We demonstrate our general result for the $m=0$ HAL-MLE of the average treatment effect and of the integral of the square of the data density. We also present simulations for these two examples confirming the theory.

Key words:
Asymptotically efficient estimator, cadlag, canonical gradient, cross-validation, efficient influence curve, Highly Adaptive Lasso MLE, loss function, pathwise differentiable parameter, risk, sectional variation norm, splines, undersmoothing.

*email: [email protected]

1 Introduction
We consider the estimation problem in which we observe $n$ independent and identically distributed copies of a random variable whose probability distribution is known to be an element of an infinite-dimensional statistical model, and the goal is to estimate a particular smooth functional of the data distribution. It is assumed that the target parameter is a pathwise differentiable functional of the data distribution, so that its derivative is characterized by its so-called canonical gradient.

A regular asymptotically linear estimator is asymptotically efficient if and only if it is asymptotically linear with influence curve equal to the canonical gradient (Bickel et al., 1997), and a number of general methods for efficient estimation have been developed in the literature. If the model is not too large, then a regularized or sieve maximum likelihood estimator or minimum loss estimator (MLE) generally results in an efficient substitution estimator (Newey, 2014; van der Laan, 2006; van der Vaart, 1998). For a general theory on sieve estimation that also demonstrates sieve-based maximum likelihood estimators that are asymptotically efficient in large models, we refer to Shen (1997, 2007). These results generally require a sieve-based MLE that overfits the data (or, equivalently, undersmooths the estimated functional parameter) and are only applicable for certain types of sieves.

An alternative to undersmoothing is to use a targeted estimator based on the canonical gradient, such as: the one-step estimator, which adds to an initial plug-in estimator the empirical mean of the canonical gradient at the estimated data distribution (Bickel et al., 1997); an estimating-equations-based estimator, which defines the estimator of the target parameter as the solution of an estimating equation with the estimated canonical gradient as estimating function (Robins and Rotnitzky, 1992; van der Laan and Robins, 2003); and targeted minimum loss estimation, which updates an initial estimator of the data distribution with an MLE of a least favorable parametric submodel through the initial estimator (van der Laan and Rubin, 2006; van der Laan, 2008; van der Laan and Rose, 2011; van der Laan and Gruber, 2015). By using an initial estimator of the relevant parts of the data distribution that converges to the truth w.r.t. an $L_2$-type norm at a rate faster than $n^{-1/4}$, such as achieved with the HAL-MLE (van der Laan, 2015; Benkeser and van der Laan, 2016), in great generality these three general procedures result in an efficient estimator.

In this article we focus on a particular sieve MLE, which we call the HAL-MLE. The HAL-MLE is defined as the minimizer of the empirical mean of a loss function (e.g., log-likelihood loss) over a class of functions that can be arbitrarily well approximated by linear combinations of tensor products of univariate spline basis functions, where the $L_1$-norm of the coefficient vector is constrained. The target parameter is defined as a particular smooth real- or Euclidean-valued function of the functional parameter estimated by the HAL-MLE, so that the HAL-MLE yields a plug-in estimator of the target parameter. In this case the sieve is indexed by a bound on the $L_1$-norm. By increasing this bound up to a large, finite value, the sieve approximates the total parameter space for the true functional parameter. If the goal is to estimate the functional parameter itself, then the constraint on the $L_1$-norm is optimally chosen with cross-validation.
In particular, the HAL-MLE described in van der Laan (2015) and Benkeser and van der Laan (2016) selects the tuning parameter that minimizes the cross-validated empirical mean of the loss function over the class of cadlag functions with finite sectional variation norm, a class that can be approximated by an infinite linear combination of tensor products of indicator basis functions (i.e., a 0-th order spline basis). In this case the $L_1$-norm of the coefficients equals the sectional variation norm of the function (Gill et al., 1995; van der Laan, 2015).

The contributions of this article are two-fold. First, we generalize the 0-th order HAL-MLE to an $m$-th order Spline-HAL-MLE over the class of $m$-times differentiable functions that can be approximated as a linear combination of tensor products of $\leq m$-th order spline basis functions with a finite $L_1$-norm of the coefficient vector. In this case, we refer to the $L_1$-norm of the coefficients as an $m$-th order sectional variation norm. The algorithms for implementing these $m$-th order Spline-HAL-MLEs are identical across $m$ (just different basis functions) and can be based on implementations of the lasso in the machine learning literature. One can now select both the bound on the $L_1$-norm and the smoothness degree $m$ with cross-validation, resulting in an estimator we call the smoothness-adaptive Spline-HAL-MLE of the functional parameter.

Second, we investigate whether and how an appropriately undersmoothed $m$-th order Spline-HAL-MLE can be used to produce an efficient plug-in estimator of smooth functions of the functional parameter. There are essentially three key ingredients to establishing efficiency of a plug-in estimator: negligibility of the empirical mean of the canonical gradient, control of the second-order remainder, and asymptotic equicontinuity. For the first ingredient, we argue that, since the canonical gradient is a score, we essentially require that the HAL-MLE solves a particular score equation. Because the HAL-MLE is an MLE, it solves a large class of score equations, and we investigate whether these score equations might also approximate the particular score equation implied by the canonical gradient. In particular, we find that the larger the $L_1$-norm of the HAL-MLE, the more such score equations are generated and solved by the HAL-MLE. Therefore, one expects that by increasing the $L_1$-norm of the HAL-MLE, the linear span of score equations solved by the HAL-MLE will approximate, in first order, the canonical gradient score equation. However, another crucial condition for efficiency of a plug-in estimator is that a second-order remainder is $o_P(n^{-1/2})$, and we want to preserve the $n^{-1/4}$-rate of convergence achieved by the HAL-MLE when the $L_1$-norm is selected with cross-validation. Fortunately, the rate of the HAL-MLE is not affected by the size of the $L_1$-norm as long as it remains bounded and, for $n$ large enough, exceeds the $m$-th order sectional variation norm of the true function. Similarly, the asymptotic equicontinuity condition for efficiency of a plug-in estimator will also be satisfied for any bounded $L_1$-norm, since the class of cadlag functions with a finite 0-th order sectional variation norm is a Donsker class. In fact, one can prove that this $L_1$-norm is allowed to converge slowly to infinity as sample size increases without affecting the asymptotic equicontinuity condition and the $n^{-1/4}$-rate of convergence of the HAL-MLE.
Taken together, our analysis highlights that, when selecting the level of undersmoothing of a HAL-MLE, one wants to undersmooth enough to solve the efficient score equation up to an appropriate level of approximation, but, in order to retain reasonable finite-sample performance, one should not undersmooth beyond that level.

This discussion highlights the need to establish an empirical criterion by which the level of undersmoothing may be chosen so as to satisfy the conditions required of an efficient plug-in estimator. In particular, we present an easily verifiable global undersmoothing condition, which is satisfied, for example, if the minimum of the empirical mean of the basis functions that receive a non-zero coefficient is smaller than a constant times $n^{-1/4}$. This condition essentially enforces the selection of the $L_1$-norm in the lasso to be large enough that the fit includes sparsely supported basis functions. We also discuss alternative practical criteria for selecting the level of undersmoothing. We demonstrate our result in practice for the $m=0$ HAL-MLE of the average treatment effect in a nonparametric model, and for estimation of the integral of the square of the data density.

This article is organized as follows. In Section 2 we define the $m$-th order HAL-MLE; a formal proof of the representation theorem is provided in the Appendix. In Section 3 we establish our main theorem providing the undersmoothing conditions under which the $m$-th order Spline-HAL-MLE is asymptotically efficient for any pathwise differentiable parameter. In Section 4 we apply our theorem to the ATE example, providing a theorem for this particular nonparametric estimation problem. In Section 5 we apply our theorem to a nonparametric estimation problem with target parameter the integral of the square of the data density. In Section 6 we present a simulation study for both examples, providing a practical verification of our theoretical results.

2 The $m$-th order Spline-HAL-MLE

Suppose we observe $O_1, \ldots, O_n \sim_{iid} P_0 \in \mathcal{M}$, where $O$ is a Euclidean random variable of dimension $k$ with support contained in $[0, \tau_o] \subset \mathbb{R}^k$. Let $Q: \mathcal{M} \to Q(\mathcal{M}) = \{Q(P): P \in \mathcal{M}\}$ be a functional parameter. It is assumed that there exists a loss function $L(Q)$ so that $P_0 L(Q(P_0)) = \min_{P \in \mathcal{M}} P_0 L(Q(P))$, where we use the notation $Pf \equiv \int f(o) \, dP(o)$. Thus, $Q_0 = Q(P_0)$ can be defined as the minimizer of the risk function $Q \to P_0 L(Q)$ over all $Q$ in the parameter space. Let $d_0(Q, Q_0) \equiv P_0 L(Q) - P_0 L(Q_0)$ be the loss-based dissimilarity. We assume that
$$M_2 \equiv \sup_{P \in \mathcal{M}} P_0\{L(Q(P)) - L(Q_0)\}^2 / d_0(Q(P), Q_0) < \infty,$$
and
$$M_1 \equiv \sup_{o, P \in \mathcal{M}} | L(Q(P))(o) | < \infty,$$
thereby guaranteeing good behavior of the cross-validation selector (van der Laan and Dudoit, 2003; van der Vaart et al., 2006; van der Laan et al., 2006, 2007; Polley et al., 2011).

Parameter space for the functional parameter $Q$: cadlag with a uniform bound on the sectional variation norm. We assume that the parameter space $Q(\mathcal{M})$ is a collection of multivariate real-valued cadlag functions on a cube $[0,\tau] \subset \mathbb{R}^k$ with finite sectional variation norm $\|Q(P)\|_v^* < C^u$ for some $C^u < \infty$. That is, for all $P$, $Q(P)$ is a $k$-variate real-valued cadlag function on $[0,\tau] \subset \mathbb{R}^k_{\geq 0}$ with $\|Q(P)\|_v^* < C^u$, where the sectional variation norm is defined by
$$\|Q\|_v^* \equiv |Q(0)| + \sum_{s \subset \{1,\ldots,k\}} \int_{(0_s, \tau_s]} | dQ_s(u_s) |.$$
For a given subset $s \subset \{1,\ldots,k\}$, $Q_s: (0_s, \tau_s] \to \mathbb{R}$ is defined by $Q_s(x_s) = Q(x_s, 0_{-s})$.
That is, $Q_s$ is the $s$-specific section of $Q$, which sets the coordinates in the complement of the subset $s \subset \{1,\ldots,k\}$ equal to 0. Since $Q_s$ is right-continuous with left-hand limits and has finite variation norm over $(0_s, \tau_s]$, it generates a finite measure, so that the integrals w.r.t. $Q_s$ are indeed well defined. For a given vector $x \in [0,\tau]$, we define $x_s = (x(j): j \in s)$. Sometimes we will also use the notation $x(s)$ for $x_s$.

Note also that $[0,\tau] = \{0\} \cup (\cup_s (0_s, \tau_s])$ is partitioned into the singleton $\{0\}$, the $s$-specific left edges $(0_s, \tau_s] \times \{0_{-s}\}$ of the cube $[0,\tau]$, and, in particular, the full-dimensional inner set $(0, \tau]$ (corresponding with $s = \{1,\ldots,k\}$). Therefore, the above sectional variation norm equals the sum over all subsets $s$ of the variation norm of the $s$-specific section over its $s$-specific edge. It is also important to note that any cadlag function $Q$ with finite sectional variation norm can be represented as
$$Q(x) = Q(0) + \sum_{s \subset \{1,\ldots,k\}} \int_{(0_s, x_s]} dQ_s(u_s).$$
That is, $Q(x)$ is a sum of integrals up to $x_s$ over all the $s$-specific edges, w.r.t. the measure generated by the corresponding $s$-specific section $Q_s$. We will refer to $Q_s$ both as a cadlag function and as a measure. This representation expresses $Q$ as an infinitesimal linear combination of indicator basis functions $x \to \phi_{s,u_s}(x) \equiv I(x_s \geq u_s)$, indexed by knot point $u_s$, with coefficients $dQ_s(u_s)$:
$$Q(x) = Q(0) + \sum_{s \subset \{1,\ldots,k\}} \int \phi_{s,u_s}(x) \, dQ_s(u_s). \qquad (1)$$
Note that the $L_1$-norm of the coefficients in this representation is precisely the sectional variation norm $\|Q\|_v^*$.

The $m$-th order spline smoothness class and its $m$-th order spline representation

Iterative definition of the relevant $m$-th order derivatives of $Q$: Our $m$-th order smoothness class relies on the existence of certain $m$-th order derivatives, which we now define through recursion. For a function $Q$, we define the $s$-specific section $Q_s(x_s) = Q(x_s, 0_{-s})$. The first-order derivative $Q^1_s$ of $Q_s$ is defined as a density of $Q_s$ w.r.t. Lebesgue measure, so that $dQ_s(u_s) = Q^1_s(u_s) \, du_s$. Given the set of first-order derivatives $\{Q^1_s: s \subset \{1,\ldots,k\}\}$, indexed by all subsets $s \subset \{1,\ldots,k\}$, we now define the set of second-order derivatives $\{Q^2_{s,s_1}: s, s_1 \subset s\}$, indexed by all subsets $s \subset \{1,\ldots,k\}$ and all their subsets $s_1 \subset s$. Given the function $Q^1_s$ and $s_1 \subset s$, we define its $s_1$-specific section $Q^1_{s,s_1}(x_{s_1}) = Q^1_s(x_{s_1}, 0_{s/s_1})$. The second-order derivative $Q^2_{s,s_1}$ is defined as the density of $Q^1_{s,s_1}$ w.r.t. Lebesgue measure, so that $Q^1_{s,s_1}(du_{s_1}) = Q^2_{s,s_1}(u_{s_1}) \, du_{s_1}$. This defines $\{Q^2_{\bar{s}(1)}: \bar{s}(1)\}$, where $\bar{s}(1) = (s, s_1)$ varies over all $s \subset \{1,\ldots,k\}$ and subsets $s_1 \subset s$.

Let $m \geq 2$. Given the set of $m$-th order derivatives $\{Q^m_{\bar{s}(m-1)}: \bar{s}(m-1)\}$, we now define $\{Q^{m+1}_{\bar{s}(m)}: \bar{s}(m)\}$. We recall that $\bar{s}(m) = (s, s_1, \ldots, s_m)$ is a sequence of nested subsets, $s_m \subset s_{m-1} \subset \ldots \subset s$. We also note that $Q^m_{\bar{s}(m-1)}(x_{s_{m-1}})$ is only a function of the coordinates in $s_{m-1}$, since all other coordinates have been set to zero through the earlier sections implied by $s_{m-2}, \ldots, s$. Given the function $Q^m_{\bar{s}(m-1)}$, we define its $s_m$-specific section $Q^m_{\bar{s}(m)}(x_{s_m}) = Q^m_{\bar{s}(m-1)}(x_{s_m}, 0_{s_{m-1}/s_m})$, which sets the coordinates in $s_{m-1}/s_m$ equal to zero.
The $(m+1)$-th order derivative $Q^{m+1}_{\bar{s}(m)}$ is defined as the density of $Q^m_{\bar{s}(m)}$ w.r.t. Lebesgue measure, so that $Q^m_{\bar{s}(m)}(du_{s_m}) = Q^{m+1}_{\bar{s}(m)}(u_{s_m}) \, du_{s_m}$.

$m$-th order sectional variation norm: The $m$-th order sectional variation norm is defined as
$$\|Q\|_v^{*,m} \equiv |Q(0)| + \sum_{j=1}^{m-1} \sum_{\bar{s}(j)} |Q^{j+1}_{\bar{s}(j)}(0_{s_j})| + \sum_{\bar{s}(m)} \int_{(0_{s_m}, \tau_{s_m}]} |dQ^m_{\bar{s}(m)}(z_{s_m})|.$$

Iterative definition of the relevant $m$-th order spline basis functions: We first describe an iterative procedure that defines the relevant $m$-th order spline basis functions whose linear span generates the $m$-th order smoothness class defined below. For $s \subset \{1,\ldots,k\}$, we define the 0-order spline basis functions $\phi_{s,x_s}(u_s) = I(x_s \geq u_s)$, indexed by knot point $u_s$. For $s_1 \subset s$, we define the first-order spline basis functions
$$\phi^1_{s,s_1,x_s}(u_{s_1}) = \prod_{j \in s_1} (x(j) - u(j)) I(u(j) \leq x(j)) \prod_{j \in s/s_1} x_j,$$
indexed by knot point $u_{s_1}$. We also define
$$\phi^1_{s,\emptyset,x_s}(0_{s_1}) = \prod_{j \in s} x_j,$$
which corresponds with setting $s_1$ equal to the empty set in the definition of $\phi^1_{s,s_1,x_s}(u_{s_1})$.

For a given $\bar{s}(m)$ and corresponding basis function $\phi_{\bar{s}(m),x_s}(u_{s_m}) = \prod_{j \in s} \phi_{j,\bar{s}(m),x_j}(u_j)$, we note that it is a tensor product of univariate basis functions $\phi_{j,\bar{s}(m),x_j}(u_j)$ over the components $j \in s$. Let $m \geq 1$. Given $\bar{s}(m)$, and given $s_{m+1} \subset s_m$, we define the $(m+1)$-th order spline basis functions as
$$\phi_{\bar{s}(m+1),x_s}(z_{s_{m+1}}) \equiv \prod_{j \in s_{m+1}} \int_{(z_j, x_j]} \phi_{j,\bar{s}(m),x_j}(y_j) \, dy_j \prod_{j \in s_m/s_{m+1}} \int_{(0, x_j]} \phi_{j,\bar{s}(m),x_j}(y_j) \, dy_j \prod_{j \in s/s_m} \phi_{j,\bar{s}(m),x_j}(0).$$
We also define
$$\phi_{\bar{s}(m),\emptyset,x_s}(0_{s_{m+1}}) = \prod_{j \in s_m} \int_{(0, x_j]} \phi_{j,\bar{s}(m),x_j}(y_j) \, dy_j \prod_{j \in s/s_m} \phi_{j,\bar{s}(m),x_j}(0)$$
by setting $s_{m+1}$ equal to the empty set and knot point $z_{s_{m+1}} = 0_{s_{m+1}}$ in the definition of $\phi_{\bar{s}(m+1),x_s}(z_{s_{m+1}})$. Note that, for each $j \in s_{m+1}$, the previous $m$-th order basis function is smoothed by integrating it from a knot point $z_j$ to $x_j$; for $j \in s_m/s_{m+1}$ the previous $m$-th order basis function is smoothed by integrating it from 0 to $x_j$; and, finally, for $j \in s/s_m$, the $m$-th order basis function is untouched.

$m$-th order spline smoothness class: Let $D^{(m)}[0,\tau]$ be the space of cadlag functions $f: [0,\tau] \to \mathbb{R}$ for which the $m$-th order derivatives $\{f^m_{\bar{s}(m)}: \bar{s}(m)\}$ exist and the $m$-th order sectional variation norm is bounded, $m = 0, 1, \ldots$. Let $D^{(m)}_C[0,\tau] = \{f \in D^{(m)}[0,\tau]: \|f\|_v^{*,m} < C\}$.

Suppose now that we select $m$ with the cross-validation selector $m_n$. In addition, assume that each smoothness class allows for a unique rate of convergence of the corresponding Spline-HAL-MLE w.r.t. the loss-based dissimilarity. Due to the asymptotic equivalence of the cross-validation selector with the oracle selector, it then follows that $P(m_n = m_0) \to 1$, where $m_0$ is the unknown true maximal smoothness of the true $Q_0$; see Appendix B for a formal statement. As a consequence, this smoothness-adaptive Spline-HAL-MLE achieves the rate of convergence of the $m_0$-th smoothness class (i.e., it is minimax adaptive). The asymptotic efficiency of a smoothness-adaptive Spline-HAL-MLE (using undersmoothing of each $m$-th order Spline-HAL-MLE) follows from the asymptotic efficiency of the $m$-th order Spline-HAL-MLE as established in the next section.
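Before turning to estimation, it may help to see the basis construction in code. The sketch below (Python; all function names are ours, and for $m \geq 1$ it keeps only tensor products of truncated power functions with a common knot vector, a simplification of the full $\bar{s}(m)$ bookkeeping above) builds the design matrix on which the lasso is run; the $L_1$-norm of a coefficient vector on these columns plays the role of the ($m$-th order) sectional variation norm bound.

```python
import numpy as np
from math import factorial
from itertools import combinations

def phi_m(x, u, m):
    """Univariate m-th order spline basis with knot u: integrating the
    zero-order indicator 1{x >= u} a total of m times yields the
    truncated power function (x - u)_+^m / m!."""
    if m == 0:
        return (x >= u).astype(float)
    return np.maximum(x - u, 0.0) ** m / factorial(m)

def hal_design(X, m=0):
    """Design matrix with one column per nonempty subset s of {1,...,k}
    and per observed knot point (exponential in k, so intended for small
    k).  A real implementation would also drop duplicate columns."""
    n, k = X.shape
    cols, labels = [], []
    for r in range(1, k + 1):
        for s in combinations(range(k), r):
            for i in range(n):  # knots placed at the observations
                col = np.ones(n)
                for j in s:
                    col = col * phi_m(X[:, j], X[i, j], m)
                cols.append(col)
                labels.append((s, i))
    return np.column_stack(cols), labels
```

Running the lasso on this matrix with an $L_1$-bound $C$ then yields the $m$-th order Spline-HAL-MLE fit used throughout the remainder of the article.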
3 The $m$-th order Spline-HAL-MLE for pathwise differentiable target parameters

Let $\Psi: \mathcal{M} \to \mathbb{R}^d$ be the $d$-dimensional statistical target parameter of interest of the data distribution. We assume that $\Psi$ is pathwise differentiable at any $P \in \mathcal{M}$ with canonical gradient $D^*(P)$. For a pair $P, P_0 \in \mathcal{M}$, the exact second-order remainder is defined by
$$R_2(P, P_0) \equiv \Psi(P) - \Psi(P_0) + P_0 D^*(P).$$

Relevant functional parameter and its loss function: Let $Q: \mathcal{M} \to Q(\mathcal{M}) = \{Q(P): P \in \mathcal{M}\}$ be a functional parameter such that $\Psi(P) = \Psi_1(Q(P))$ for some $\Psi_1$. It is assumed that $Q$ is a functional parameter with parameter space $Q(\mathcal{M}) \subset Q(C^u) = D^{(0)}_{C^u}[0,\tau]$ as defined above in Section 2. Note that the model $\mathcal{M}$ makes no smoothness assumptions on $Q_0$ beyond that it is a cadlag function with sectional variation norm bounded by $C^u$. In particular, we have an $m$-th order Spline-HAL-MLE with established rate of convergence $d_0(Q^m_n, Q_0) = o_P(n^{-1/2})$ for $m \geq m_0$, where $m_0$ is the smoothness degree of $Q_0$. In addition, due to the asymptotic equivalence of the cross-validation selector with the oracle selector, for the smoothness-adaptive $Q^{m_n}_n$, where $m_n$ is the cross-validation selector of $m$, we have $P(m_n = m_0) \to 1$ and $d_0(Q^{m_n}_n, Q_0) = o_P(n^{-1/2})$.

Nuisance parameter for the canonical gradient: Let $G: \mathcal{M} \to \mathcal{G}$ be a functional nuisance parameter so that $D^*(P)$ depends on $P$ only through $(Q(P), G(P))$, and the remainder $R_2(P, P_0)$ only involves differences between $(Q, G)$ and $(Q_0, G_0)$: $D^*(P) = D^*(Q(P), G(P))$, while $R_2(P, P_0) = R_2((Q, G), (Q_0, G_0))$. Here $R_2$ may have some remaining dependence on $P$ and $P_0$, and $\mathcal{G} = G(\mathcal{M})$ is the parameter space for $G$.

Canonical gradient of the target parameter in the tangent space of the loss function: We also assume that the loss function $L(Q)$ is such that there exists a class of submodels $\{Q^h_\epsilon: \epsilon\} \subset Q(\mathcal{M})$, indexed by a choice $h \in \mathcal{H}$, through $Q$ at $\epsilon = 0$, so that for any $G \in \mathcal{G}$ one of these directions $h$ generates a score that equals the canonical gradient $D^*(Q, G)$ at $(Q, G)$:
$$\frac{d}{d\epsilon} L(Q^h_\epsilon) \Big|_{\epsilon=0} = D^*(Q, G).$$
Since the canonical gradient is an element of the tangent space, and thereby typically a score of a submodel, this generally holds for $Q$ defined as the density of $P$ and the log-likelihood loss $L(Q) = -\log Q$. However, for any $Q$ there are typically more direct loss functions $L(Q)$, for which the loss-based dissimilarity $d_0(Q, Q_0) = P_0 L(Q) - P_0 L(Q_0)$ directly measures a dissimilarity between $Q$ and $Q_0$, and for which this condition holds as well.

$m$-th order HAL-MLE: Let $m \in \{0, 1, \ldots\}$ be fixed. In this section we analyze the plug-in estimator $\Psi(Q^m_n)$ of $\Psi(Q_0)$, where $Q^m_n$ is the $C^m_n$-tuned $m$-th order Spline-HAL-MLE $\hat{Q}^m(P_n) = \hat{Q}^m_{C^m_n}(P_n)$, which minimizes the empirical risk over $Q^m(C^m_n)$. We assume here that $m \geq m_0$, so that $Q_0 \in Q^m(C^{u,m})$, even though that is not an assumption in our model $\mathcal{M}$. Recall our assumptions on the parameter space $Q = Q(C^u)$. We assume that $Q$ is defined such that $Q^m_n$ is in the interior of the model-based parameter space $Q$ (so that there are submodels through $Q^m_n$ that generate the tangent space and the canonical gradient), even though $Q^m_n$ is typically on the edge of the parameter subspace $Q^m(C^m_n) \subset Q = Q(\mathcal{M})$ over which the estimator minimizes the empirical risk. It is understood that verification of our conditions might require using a $C^m_n$ different from the cross-validation selector.
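For orientation, it is worth recording the standard expansion that underlies the analysis. The display below is a direct consequence of the definition of $R_2$ and of $P_0 D^*(Q_0, G_0) = 0$; it is the decomposition behind the three ingredients (score equation, second-order remainder, equicontinuity) discussed in the introduction:

\begin{align*}
\Psi(Q_n) - \Psi(Q_0)
  &= -P_0 D^*(Q_n, G) + R_2((Q_n, G), (Q_0, G_0)) \\
  &= P_n D^*(Q_0, G_0)                                  && \text{(efficient linear term)} \\
  &\quad + (P_n - P_0)\{D^*(Q_n, G) - D^*(Q_0, G_0)\}   && \text{(equicontinuity term)} \\
  &\quad - P_n D^*(Q_n, G)                              && \text{(score-equation term)} \\
  &\quad + R_2((Q_n, G), (Q_0, G_0)).                   && \text{(second-order remainder)}
\end{align*}

Efficiency of the plug-in estimator $\Psi(Q_n)$ follows once the last three terms are $o_P(n^{-1/2})$; the undersmoothing condition developed below targets the score-equation term.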
Remark: the target parameter could be a component of the real target parameter. In many situations the real target parameter is $P \to \Psi(Q_1(P), Q_2(P))$ for two (or more) functional parameters $Q_1$ and $Q_2$. One could apply our efficiency theorem below to the target parameters $\Psi_{Q_{20}}(Q_1) = \Psi(Q_1, Q_{20})$ and $\Psi_{Q_{10}}(Q_2) = \Psi(Q_{10}, Q_2)$, treating the indices $Q_{20}$ and $Q_{10}$ as known, with $m$-th order Spline-HAL-MLEs $Q_{1n}$ and $Q_{2n}$ of $Q_{10}$ and $Q_{20}$, respectively. Application of our theorem to these two cases then proves that $\Psi(Q_{10}, Q_{2n})$ and $\Psi(Q_{1n}, Q_{20})$ are both asymptotically efficient, if both HAL-MLEs are appropriately tuned w.r.t. the $m$-th order sectional variation norm bound. Since
$$\Psi(Q_{1n}, Q_{2n}) - \Psi(Q_{10}, Q_{20}) = \Psi(Q_{1n}, Q_{2n}) - \Psi(Q_{10}, Q_{2n}) + \Psi(Q_{10}, Q_{2n}) - \Psi(Q_{10}, Q_{20}),$$
this then also establishes asymptotic efficiency of $\Psi(Q_{1n}, Q_{2n})$ as an estimator of $\Psi(Q_{10}, Q_{20})$, under the condition that
$$\Psi(Q_{1n}, Q_{2n}) - \Psi(Q_{10}, Q_{2n}) - \{\Psi(Q_{1n}, Q_{20}) - \Psi(Q_{10}, Q_{20})\} = o_P(n^{-1/2}).$$
This latter term can be viewed as a second-order difference of $(Q_{1n}, Q_{2n})$ and $(Q_{10}, Q_{20})$, so that the latter condition will generally hold by using the already established rates of convergence $o_P(n^{-1/2})$ w.r.t. risk-based dissimilarity for $Q_{1n}$ and $Q_{2n}$. The above immediately generalizes to the case that the target parameter is a function of more than two $Q$-components.

Let $Q_n = \arg\min_{Q \in Q^m(C_n)} P_n L(Q)$ be the $m$-th order Spline-HAL-MLE, suppressing the dependence of $Q_n$ and $C_n$ on $m$. The following theorem establishes that, if $m \geq m_0$, then $\Psi(Q_n)$ is asymptotically efficient for $\Psi(Q_0)$ for large enough $C_n$, under some weak conditions specific to the target parameter. It relies on the following definitions, which also provide the basis of the proof of the theorem.

Definitions:

• Recall that we can represent $Q_n = \arg\min_{Q \in Q^m(C_n)} P_n L(Q)$ as follows:
$$Q_n(x) = I(Q_n)(x) + \sum_{\bar{s}(m)} \int_{(0_{s_m}, x_{s_m}]} \phi_{\bar{s}(m),x_s}(u_{s_m}) \, dQ^m_{n,\bar{s}(m)}(u_{s_m}),$$
where $I(Q_n)(x) = Q_n(0) + \sum_{j=0}^{m-1} \sum_{\bar{s}(j)} Q^{j+1}_{n,\bar{s}(j)}(0_{s_j}) \phi_{\bar{s}(j),\emptyset,x_s}(0_{s_j})$.

• Consider the family of paths $\{Q^h_{n,\epsilon}: \epsilon \in (-\delta, \delta)\}$ through $Q_n$ at $\epsilon = 0$ for arbitrarily small $\delta > 0$, indexed by any uniformly bounded $h$, defined by
$$Q^h_{n,\epsilon}(x) = I(Q^h_{n,\epsilon})(x) + \sum_{\bar{s}(m)} \int_{(0_{s_m}, x_{s_m}]} \phi_{\bar{s}(m),x_s}(u_{s_m}) (1 + \epsilon h(\bar{s}(m), u_{s_m})) \, dQ^m_{n,\bar{s}(m)}(u_{s_m}), \qquad (3)$$
where
$$I(Q^h_{n,\epsilon})(x) = (1 + \epsilon h(0)) Q_n(0) + \sum_{j=0}^{m-1} \sum_{\bar{s}(j)} \phi_{\bar{s}(j),\emptyset,x_s}(0_{s_j}) (1 + \epsilon h(\bar{s}(j), 0_{s_j})) Q^{j+1}_{n,\bar{s}(j)}(0_{s_j}).$$

• Let
$$r(h, Q_n) \equiv I(h, Q_n) + \sum_{\bar{s}(m)} \int_{(0_{s_m}, \tau_{s_m}]} h(\bar{s}(m), u_{s_m}) \, | dQ^m_{n,\bar{s}(m)}(u_{s_m}) |,$$
where $I(h, Q_n) = h(0) |Q_n(0)| + \sum_{j=0}^{m-1} \sum_{\bar{s}(j)} h(\bar{s}(j), 0_{s_j}) |Q^{j+1}_{n,\bar{s}(j)}(0_{s_j})|$.

• For any uniformly bounded $h$ with $r(h, Q_n) = 0$ we have, for small enough $\delta > 0$, that $\{Q^h_{n,\epsilon}: \epsilon \in (-\delta, \delta)\} \subset Q^m(C_n)$.

• Let $S_h(Q_n) = \frac{d}{d\epsilon} L(Q^h_{n,\epsilon}) \big|_{\epsilon=0}$ be the score of this $h$-specific submodel.
• Consider the set of scores
$$\mathcal{S}(Q_n) = \Big\{ S_h(Q_n) = \frac{d}{dQ_n} L(Q_n)(f(h, Q_n)): \|h\|_\infty < \infty \Big\}, \qquad (4)$$
where
$$f(h, Q_n)(x) \equiv \frac{d}{d\epsilon} Q^h_{n,\epsilon}(x) \Big|_{\epsilon=0} = f_1(h, Q_n)(x) + \sum_{\bar{s}(m)} \int_{(0_{s_m}, x_{s_m}]} \phi_{\bar{s}(m),x_s}(u_{s_m}) h(\bar{s}(m), u_{s_m}) \, dQ^m_{n,\bar{s}(m)}(u_{s_m}),$$
and
$$f_1(h, Q_n)(x) = \frac{d}{d\epsilon} I(Q^h_{n,\epsilon})(x) \Big|_{\epsilon=0} = h(0) Q_n(0) + \sum_{j=0}^{m-1} \sum_{\bar{s}(j)} \phi_{\bar{s}(j),\emptyset,x_s}(0_{s_j}) h(\bar{s}(j), 0_{s_j}) Q^{j+1}_{n,\bar{s}(j)}(0_{s_j}).$$
This is the set of scores generated by the above class of paths if we do not enforce the constraint $r(h, Q_n) = 0$.

• We have that $Q_n$ solves the score equations $P_n S_h(Q_n) = 0$ for any uniformly bounded $h$ satisfying $r(h, Q_n) = 0$.

• Let $D^*_n(Q_n, G) \in \mathcal{S}(Q_n)$ be an approximation of $D^*(Q_n, G)$ that is contained in this set of scores $\mathcal{S}(Q_n)$.

• We also consider a special case in which $D^*_n(Q_n, G) = D^*(Q_n, G_n)$ for an approximation $G_n \in \mathcal{G}$ of $G$. Let $\mathcal{G}_n = \{G \in \mathcal{G}: D^*(Q_n, G) \in \mathcal{S}(Q_n)\}$ be the set of $G$'s for which $D^*(Q_n, G)$ equals a score $S_h(Q_n)$ for some uniformly bounded $h$. One can then define $G_n \in \mathcal{G}_n$ as an approximation of $G$.

• Let $h^*(Q_n, G)$ be the index so that $D^*_n(Q_n, G) = S_{h^*(Q_n, G)}(Q_n)$.

Remark: understanding $\mathcal{G}_n$. It might appear that the class of paths $\{Q^h_{n,\epsilon}: \epsilon\}$ for any bounded $h$ above is rich enough to generate the full tangent space at $Q_n$, and thereby $D^*(Q_n, G)$, even for finite $n$. However, a special property of this class of paths is that it is contained in the linear span of the (order $n$) basis functions that have non-zero coefficients in $Q_n$. On the other hand, as $n$ increases, and thereby the number of basis functions converges to infinity, this class of paths will indeed be able to approximate any function in the tangent space. Since the true $G$, or the relevant function of $G$, is generally not contained in this linear span of basis functions that make up $Q_n$, $D^*(Q_n, G) \not\in \mathcal{S}(Q_n)$. For example, in the average treatment effect example, we would need that $1/\bar{G}(W)$ is approximated by the linear span of spline basis functions that are present in the fit $Q_n$. Nonetheless, there will be $G \in \mathcal{G}$ whose shape is such that $1/G(W)$ is in this linear span, which can then be used to define a $G_n$ so that $D^*(Q_n, G_n) \in \mathcal{S}(Q_n)$. Alternatively, one directly approximates $1/\bar{G}(W)$ with an element of this linear span, without being concerned whether it results in a representation $1/\bar{G}_n$, thereby determining an approximation $D^*_n(Q_n, G)$. Since in this example $\bar{G}$ can be any function of $W$ with values in $(0, 1)$, if $1/\bar{G}$ is approximated by $\sum_j \alpha_j \phi_j$, then we can solve for $\bar{G}_n$ by setting $1/\bar{G}_n = \sum_j \alpha_j \phi_j$, giving $\bar{G}_n = 1/\sum_j \alpha_j \phi_j$. This explains why this set $\mathcal{G}_n$ will approximate $\mathcal{G}$ as $n$ converges to infinity, so that $G_n$ will approximate $G$, presumably as fast as $Q_n$ approximates $Q_0$.
By increasing $C_n$, the number of selected basis functions in $Q_n$ will increase, thereby making the approximation $G_n$ better and better. As is evident from Theorem 2, this approximation $G_n$ should aim to approximate $G$ in the sense that $R_2((Q_n, G_n), (Q_0, G)) = o_P(n^{-1/2})$, while also arranging that $P_0\{D^*(Q_n, G_n) - D^*(Q_0, G)\}^2 \to_p 0$.

Convenient notation for the representation of $Q_n$: Due to the finite support condition $Q^m \ll^* \mu_n$ in the definition of the $m$-th order Spline-HAL-MLE, we have
$$Q_n(x) = \beta_n(0) + \sum_{j=0}^{m-1} \sum_{\bar{s}(j)} \beta_n(\bar{s}(j)) \phi_{\bar{s}(j),\emptyset,x_s}(0) + \sum_{\bar{s}(m)} \sum_j \beta_n(\bar{s}(m), u_{s_m,j}) \phi_{\bar{s}(m),x_s}(u_{s_m,j}). \qquad (5)$$
Recall our notation $\phi_{s,j}$ with $s \subset \{1,\ldots,k\}$ and $j \in \mathcal{J}^m_n(s)$, where $\mathcal{J}^m_n(s)$ is the finite subset of $\mathcal{J}^m(s)$ implied by the support points $u_{s_m,j}$. Let $x_{s,j}$ be the vector of knot points (one for each component in $s$) corresponding with the basis function $\phi_{s,j}$, and note that $\phi_{s,j}(X) = I(X(s) \geq x_{s,j}) \phi_{s,j}(X)$: i.e., the support of $\phi_{s,j}$ is limited to all $x$-values for which $x(s) \geq x_{s,j}$ (and it depends on $x$ only through $x(s)$). Analogous to (1), we have the following representation for the $m$-th order HAL-MLE:
$$Q_n = \sum_{s, j \in \mathcal{J}_n(s)} \beta_n(s, j) \phi_{s,j}, \qquad (6)$$
where we know that $\phi_{s,j}$ has support $\{x(s): x(s) \geq x_{s,j}\}$ for knot point $x_{s,j}$.

The following theorem establishes an undersmoothing condition (7) on $C_n$ that guarantees $P_n D^*_n(Q_n, G) = o_P(n^{-1/2})$.

Theorem 1 Consider an approximation $D^*_n(Q_n, G) \in \mathcal{S}(Q_n)$ (i.e., scores of submodels not enforcing the $L_1$-norm constraint of the HAL-MLE) of $D^*(Q_n, G)$ as defined above, and let $h^*_n$ be such that $D^*_n(Q_n, G) = S_{h^*_n}(Q_n)$. Consider the representation (6) of $Q_n$. Note that $\beta_n$ minimizes $\beta \to P_n L\big(\sum_{s,j \in \mathcal{J}_n(s)} \beta(s, j) \phi_{s,j}\big)$ over all $\beta = (\beta(s,j): s, j \in \mathcal{J}_n(s))$ with $\sum_{s,j \in \mathcal{J}_n(s)} |\beta(s,j)| \leq C_n$. This theorem applies to any $Q_n = \sum_{s,j \in \mathcal{J}_n(s)} \beta_n(s,j) \phi_{s,j}$ with $\beta_n$ a minimizer of the latter empirical risk.

Assume $\|h^*_n\|_\infty = O_P(1)$ and
$$\min_{s, j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} \Big\| P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s,j}) \Big\| = o_P(n^{-1/2}). \qquad (7)$$
Then $P_n D^*_n(Q_n, G) = o_P(n^{-1/2})$.

Let $(s^*, j^*) = \arg\min_{s, j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_0 \phi_{s,j}$. We can replace (7) by the following: suppose that $P_0 S_{s^*,j^*}(Q_n)^2 \to_p 0$ (which will generally hold whenever $P_0 \phi_{s^*,j^*} = o_P(1)$); $\{S_{s,j}(Q): Q \in \mathcal{Q}, (s,j)\}$ is contained in a Donsker class (e.g., the class of cadlag functions with uniformly bounded sectional variation norm);
$$\Big\| P_0 \Big\{ \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}) - \frac{d}{dQ_0} L(Q_0)(\phi_{s^*,j^*}) \Big\} \Big\| = o_P(n^{-1/2}); \qquad (8)$$
and $P_0 \{\frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*})\} \to_p 0$.

If we have
$$\Big\| P_0 \Big\{ \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}) - \frac{d}{dQ_0} L(Q_0)(\phi_{s^*,j^*}) \Big\} \Big\| = O_P\big( P_0^{1/2} \phi_{s^*,j^*} \, d_0^{1/2}(Q_n, Q_0) \big);$$
$P_0 \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}) = O_P(P_0 \phi_{s^*,j^*})$; and $d_0(Q_n, Q_0) = O_P(n^{-1/2 - \alpha(k)})$ (e.g., as we showed for the HAL-MLE with $\alpha(k) \equiv 1/\{4(k+2)\}$), then (7) can be replaced by
$$\min_{s, j \in \mathcal{J}_n(s)} P_0 \phi_{s,j} = o_P(n^{-1/2 + \alpha(k)}). \qquad (9)$$

Condition (7) is directly verifiable from the data and can thus be used to select the $m$-th order sectional variation norm bound $C_n$ for the $m$-th order Spline-HAL-MLE.
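To illustrate how (7) can be checked in software, the following sketch (Python, using scikit-learn's lasso as a stand-in under squared-error loss; all names are ours, not part of an existing HAL implementation) computes the left-hand side of (7) for a lasso fit on a HAL design matrix and decreases the penalty until the criterion drops below a cutoff of the form $a/(\sqrt{n}\log n)$ discussed next.

```python
import numpy as np
from sklearn.linear_model import Lasso

def criterion(Phi, y, fit):
    """Left-hand side of (7) for squared-error loss L(Q) = (y - Q(x))^2,
    for which (d/dQ)L(Q)(phi) = -2 (y - Q(x)) phi(x): the minimum over
    basis functions with non-zero coefficients of |P_n phi_{s,j}(y - Q_n)|."""
    resid = y - fit.predict(Phi)
    active = np.flatnonzero(fit.coef_)
    if active.size == 0:
        return np.inf
    return 2.0 * np.min(np.abs(Phi[:, active].T @ resid)) / len(y)

def undersmooth(Phi, y, alphas, a=1.0):
    """Decrease the lasso penalty (i.e., increase the L1-norm of the fit)
    until the criterion falls below a/(sqrt(n) log n); the constant a is
    a user-chosen tuning constant, as discussed in the text."""
    n = len(y)
    cutoff = a / (np.sqrt(n) * np.log(n))
    for alpha in sorted(alphas, reverse=True):
        fit = Lasso(alpha=alpha, max_iter=10000).fit(Phi, y)
        if criterion(Phi, y, fit) < cutoff:
            break
    return fit  # if no alpha passes, the least-penalized fit is returned
```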
For example, one could select $C_n$ as the smallest value (larger than the cross-validation selector) for which the left-hand side of (7) is smaller than $a/(\sqrt{n} \log n)$ for some constant $a$.

The sufficient condition (8) provides understanding of what (7) requires in terms of $Q_n$ and $P_0$. We note that $P_0\{\frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*})\} \to_p 0$ follows from $P_0 \phi_{s^*,j^*}$ converging to zero, and is thereby a non-condition, given our undersmoothing condition (8). The latter translates into the following important special case. In this lemma we also demonstrate that, if we know that $Q_n - Q_0$ converges to zero in supremum norm at a particular rate, then the support condition (9) can be significantly weakened. The analogue of this could have been presented in the general theorem above as well.

Lemma 1 Consider the special case that $O = (Z, X)$, $L(Q)(O)$ depends on $Q$ through $Q(X)$ only, and $\frac{d}{dQ} L(Q)(\phi) = \frac{d}{dQ} L(Q) \times \phi$; i.e., the directional derivative $\frac{d}{d\epsilon} L(Q + \epsilon \phi) \big|_{\epsilon=0}$ of $L(\cdot)$ at $Q$ in the direction $\phi$ is just multiplication of a function $\frac{d}{dQ} L(Q)$ of $O$ with $\phi(X)$. Assume $\limsup_n \| \frac{d}{dQ_n} L(Q_n) \|_\infty < \infty$. Let $(s^*, j^*) = \arg\min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_0 \phi_{s,j}$, and assume $P_0 \phi_{s^*,j^*} = o_P(1)$. Then a sufficient condition for $P_n D^*_n(Q_n, G) = o_P(n^{-1/2})$ is given by (8).

Assume $\| \frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0) \|_\infty = O(\| Q_n - Q_0 \|_\infty)$. Then $P_n D^*_n(Q_n, G) = o_P(n^{-1/2})$ if
$$\| Q_n - Q_0 \|_\infty \min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_0 \phi_{s,j} = o_P(n^{-1/2}). \qquad (10)$$
The condition (10) can be replaced by
$$\| Q_n - Q_0 \|_\infty \min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_n \phi_{s,j} = o_P(n^{-1/2}).$$
Here $P_0 \phi_{s,j}$ and $P_n \phi_{s,j}$ can be bounded by $P_0(X(s) \geq x_{s,j})$ and $P_n(X(s) \geq x_{s,j})$, respectively.

In (van der Laan and Bibaut) we proved that $\| Q_n - Q_0 \|_\infty \to_p 0$, so that (10) holds if $\min_{(s,j) \in \mathcal{J}_n, \beta_n(s,j) \neq 0} P_n \phi_{s,j} = O_P(n^{-1/2})$. However, we expect (if $m \geq 1$) the rate of convergence w.r.t. supremum norm to be $n^{-1/4}$, as achieved w.r.t. $d_0^{1/2}(Q_n, Q_0)$, in which case this only requires that $\min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_n \phi_{s,j} = O_P(n^{-1/4})$. The above lemma can also be straightforwardly tailored to exploiting convergence of $d_0(Q_n, Q_0)$, as in Theorem 1, instead of this supremum norm convergence.

3.3 Efficiency of the Spline-HAL-MLE by including sparse basis functions

The typical general efficiency proof used to analyze the TMLE (e.g., van der Laan (2015)) is easily generalized to the condition that $P_n D^*_n(Q_n, G) = o_P(n^{-1/2})$ for some approximation $D^*_n(Q, G)$ of the actual canonical gradient $D^*(Q, G)$. This results in the following theorem.

Theorem 2 Assume condition (7), so that $P_n D^*_n(Q_n, G) = o_P(n^{-1/2})$. Assume $M_1, M_2 < \infty$. We have $d_0(Q_n, Q_0) = O_P(n^{-1/2 - \alpha(k)})$.

If $D^*_n(Q_n, G) = D^*(Q_n, G_n)$, then we assume

• $R_2((Q_n, G_n), (Q_0, G)) = o_P(n^{-1/2})$ and $P_0 \{D^*(Q_n, G_n) - D^*(Q_0, G)\}^2 \to_p 0$;

• $\{D^*(Q, G): Q \in \mathcal{Q}, G \in \mathcal{G}\}$ is contained in the class of $k$-variate cadlag functions on a cube $[0, \tau_o] \subset \mathbb{R}^k$ in a Euclidean space, and $\sup_{Q \in \mathcal{Q}, G \in \mathcal{G}} \| D^*(Q, G) \|_v^* < \infty$.

Otherwise, we assume

• $R_2((Q_n, G), (Q_0, G)) = o_P(n^{-1/2})$, $P_0 \{D^*_n(Q_n, G) - D^*(Q_n, G)\} = o_P(n^{-1/2})$, and $P_0 \{D^*_n(Q_n, G) - D^*(Q_0, G)\}^2 \to_p 0$;
• $\{D^*_n(Q, G), D^*(Q, G): Q \in \mathcal{Q}\}$ is contained in the class of $k$-variate cadlag functions on a cube $[0, \tau_o] \subset \mathbb{R}^k$ in a Euclidean space, and $\sup_{Q \in \mathcal{Q}} \max(\| D^*(Q, G) \|_v^*, \| D^*_n(Q, G) \|_v^*) < \infty$.

Then $\Psi(Q_n)$ is asymptotically efficient.

The proof is straightforward, analogous to the typical efficiency proof for TMLE, and is presented in the Appendix. Regarding the condition $P_0\{D^*_n(Q_n, G) - D^*(Q_n, G)\} = o_P(n^{-1/2})$, we note the following. Since, for a typical choice $D^*_n(Q_n, G)$ in the set of scores $\mathcal{S}(Q_n)$, we have $P_0 D^*_n(Q_0, G) = 0$, it follows that
$$P_0 \{D^*_n(Q_n, G) - D^*(Q_n, G)\} = P_0 \{D^*_n(Q_n, G) - D^*_n(Q_0, G)\} - P_0 \{D^*(Q_n, G) - D^*(Q_0, G)\}$$
is indeed a second-order remainder involving a product of differences $Q_n - Q_0$ and $D^*_n - D^*$.

4 Example: HAL-MLE of the average treatment effect

Data and statistical model: Let $O = (W, A, Y) \sim P_0$, where $Y \in \{0,1\}$ and $A \in \{0,1\}$ are binary random variables. Let $(A, W)$ have support $[0, \tau] \subset \mathbb{R}^k$, where $A \in [0,1]$ has support only on the edges $\{0, 1\}$. Similarly, certain components of $W$ might be discrete, so that they only have a finite set of support points in their intervals. Note that $O \in [0, \tau_o] = [0, \tau] \times [0, 1]$ is a cube in a Euclidean space of the same dimension as $(W, A, Y)$. Let $\bar{G}(W) = E_P(A \mid W)$ and $\bar{Q}(W) = E_P(Y \mid A = 1, W)$. Assume the positivity assumption $\bar{G}(W) > \delta_1 > 0$ for some $\delta_1 > 0$; that $\bar{Q}, \bar{G}$ are cadlag functions with $\|\bar{Q}\|_v^* \leq C^u_1$ and $\|\bar{G}\|_v^* \leq C^u_2$ for some finite constants $C^u_1, C^u_2$; and that $\delta_2 < \bar{Q} < 1 - \delta_2$ for some $\delta_2 > 0$. This defines the statistical model $\mathcal{M}$ for $P_0$.

Target parameter, canonical gradient, and exact second-order remainder: Let $\Psi: \mathcal{M} \to \mathbb{R}$ be defined by $\Psi(P) = E_P E_P(Y \mid W, A = 1)$. Let $\tilde{Q} = (Q_W, \bar{Q})$, where $Q_W$ is the probability distribution of $W$. Note that $\Psi(P) = \Psi(\tilde{Q}) = Q_W \bar{Q}(1, \cdot)$. $\Psi$ is pathwise differentiable at $P$ with canonical gradient given by $D^*(\tilde{Q}, G) = A/\bar{G}(W) (Y - \bar{Q}(W, A)) + \bar{Q}(1, W) - \Psi(\tilde{Q})$. Let $L(\bar{Q})(O) = -\{Y \log \bar{Q}(W, A) + (1 - Y) \log(1 - \bar{Q}(W, A))\}$ be the log-likelihood loss for $\bar{Q}$, and note that, by the above bounding assumptions on $\bar{Q}$, this loss function has finite universal bounds $M_1 < \infty$ and $M_2 < \infty$. Let $D^*_1(\bar{Q}, \bar{G}) = A/\bar{G} (Y - \bar{Q})$ be the $\bar{Q}$-component of the canonical gradient and $D^*_W(\tilde{Q}) = \bar{Q}(1, W) - \Psi(\tilde{Q})$ the $Q_W$-component, and note that $D^*(\tilde{Q}, G) = D^*_1(\bar{Q}, G) + D^*_W(\tilde{Q})$. We have $\Psi(\tilde{Q}) - \Psi(\tilde{Q}_0) = -P_0 D^*(\tilde{Q}, G) + R_2(\bar{Q}, \bar{G}, \bar{Q}_0, \bar{G}_0)$, where
$$R_2(\bar{Q}, \bar{G}, \bar{Q}_0, \bar{G}_0) = P_0 \frac{\bar{G} - \bar{G}_0}{\bar{G}} (\bar{Q} - \bar{Q}_0).$$

Bounds on the sectional variation norm and the exact second-order remainder: We have $\sup_{P \in \mathcal{M}} \| D^*(\tilde{Q}(P), G(P)) \|_v^* < C(C^u_1, C^u_2)$ for some finite constant $C$ implied by the universal bounds $(C^u_1, C^u_2)$ on the sectional variation norms of $\bar{Q}, \bar{G}$. We also note that, using the Cauchy-Schwarz inequality, $R_2(\bar{Q}, \bar{G}, \bar{Q}_0, \bar{G}_0) \leq \delta_1^{-1} \| \bar{Q} - \bar{Q}_0 \|_{P_0} \| \bar{G} - \bar{G}_0 \|_{P_0}$, where $\| f \|^2_{P_0} = \int f^2(o) \, dP_0(o)$.

HAL-MLE: Let $Q = \mathrm{Logit}\, \bar{Q}$, and let $L(Q)(O) = -A \{Y \log \bar{Q}(W) + (1 - Y) \log(1 - \bar{Q}(W))\}$ be the log-likelihood loss restricted to the observations with $A = 1$. Let $Q_{C,n} = \arg\min_{Q: \|Q\|_v^* \leq C} P_n L(Q)$ be the corresponding ($m = 0$) HAL-MLE with sectional variation norm bound $C$.

Let $r(h, Q_n) \equiv h(0) |Q_n(0)| + \sum_{s \subset \{1,\ldots,k\}} \int_{(0_s, \tau_s]} h(s, u_s) \, |dQ_{n,s}(u_s)|$. The HAL-MLE solves $P_n S_h(Q_n) = 0$ for all $h$ with $r(h, Q_n) = 0$.
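For concreteness, a minimal sketch of the resulting plug-in estimator and its canonical gradient (Python; `Qbar1` and `Gbar` stand for fitted values of $\bar{Q}$ and $\bar{G}$ at the observations, obtained e.g. from an undersmoothed HAL fit, and are assumed inputs of this sketch):

```python
import numpy as np

def plug_in_and_eif(Y, A, Qbar1, Gbar):
    """Plug-in estimator Psi(Q_n) = P_n Qbar_n(1, W) and the canonical
    gradient D*(O) = A/Gbar(W) (Y - Qbar(W)) + Qbar(1, W) - Psi evaluated
    at the estimates.  Undersmoothing is meant to make the empirical mean
    of D* of order o_P(n^{-1/2}); the variance of D* estimates the
    efficiency bound and yields a Wald-type confidence interval."""
    psi = np.mean(Qbar1)                        # plug-in estimate
    eif = A / Gbar * (Y - Qbar1) + Qbar1 - psi  # estimated canonical gradient
    se = np.sqrt(np.var(eif) / len(Y))
    return psi, np.mean(eif), (psi - 1.96 * se, psi + 1.96 * se)
```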
$\mathcal{G}_n$: We define $\mathcal{G}_n \equiv \{\bar{G} \in \mathcal{G}: \bar{G} \ll^* \bar{Q}_n\}$. We note that if $\bar{G}_s \ll \bar{Q}_{n,s}$, then we also have $1/\bar{G}_s \ll \bar{Q}_{n,s}$. Here we use that if $g(x) = 1/f(x)$, then $g_s(dx_s) = -1/f_s^2(x_s) \, f_s(dx_s)$. Therefore, if $\bar{G} \ll^* \bar{Q}_n$, then we can find an $h$ so that $f(h, Q_n)(A, W) = A/\bar{G}(W)$, and thereby that $D^*_1(Q_n, \bar{G}) = S_h(Q_n)$. Let
$$\bar{G}_n = \arg\min_{\bar{G} \in \mathcal{G}_n} \| \bar{G} - \bar{G}_0 \|_{P_0},$$
where $\| \bar{G} - \bar{G}_0 \|_{P_0}$ is the $L_2(P_0)$-norm of $\bar{G} - \bar{G}_0$. Then $D^*_1(Q_n, \bar{G}_n) \in \{S_h(Q_n): h\}$, so that we can find an $h^*(Q_n, \bar{G}_n)$ with $D^*_1(Q_n, \bar{G}_n) = S_{h^*(Q_n, \bar{G}_n)}(Q_n)$.

We need to assume $R_2((Q_n, G_n), (Q_0, G_0)) = o_P(n^{-1/2})$ and $P_0\{D^*(Q_n, G_n) - D^*(Q_0, G_0)\}^2 \to_p 0$. The latter already holds if $\|\bar{G}_n - \bar{G}_0\|_{P_0} \to_p 0$. However, the first condition relies on a rate of convergence. For example, it will hold if $\|\bar{G}_n - \bar{G}_0\|_{P_0} = O_P(n^{-1/4})$. This appears to be a reasonable condition, since $\bar{G}_n$ is the $L_2(P_0)$-projection of $\bar{G}_0$ onto $\mathcal{G}_n$, so that the only concern would be that the set $\mathcal{G}_n$ does not approximate $\mathcal{G}$ fast enough as $n$ converges to infinity. However, if the set of basis functions is rich enough for $\bar{Q}_n$ to converge to $\bar{Q}_0$ at a rate faster than $n^{-1/4}$ (not allowing the coefficients to be chosen based on $P_0$), then the resulting linear combination of indicator basis functions should generally also be rich enough to approximate the true $\bar{G}_0$ at a rate $n^{-1/4}$ (now allowing the coefficients of the basis functions to be selected in terms of $\bar{G}_0$).

Verification of condition (7) of Theorem 2: Condition (7) states that
$$\min_{(s,j) \in \mathcal{J}_n} P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s,j}) = \min_{(s,j)} \frac{2}{n} \sum_i \phi_{s,j}(1, W_i) I(A_i = 1) I(W_{s,i} \geq w_{s,j}) (Y_i - \bar{Q}_n(1, W_i)) = o_P(n^{-1/2}).$$
We apply the last part of Theorem 1. Since $\frac{d}{dQ} L(Q)(\phi) = \phi(A, W)(Y - \bar{Q}(A, W))$, it follows that
$$\Big\| \frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0) \Big\|_{P_0} = O(\| Q_n - Q_0 \|_{P_0}). \qquad (11)$$
Given that $d_0(Q_n, Q_0) = O_P(n^{-1/2 - \alpha(k)})$, it follows that the remaining condition is (9), or, equivalently,
$$\min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_n \phi_{s,j} = O_P(n^{-1/2 + \alpha(k)}).$$
This reduces to the assumption that $\min_{\{s,j \in \mathcal{J}_n(s): \beta_n(s,j) \neq 0\}} P_n(W(s) \geq w_{s,j}) = O_P(n^{-1/2 + \alpha(k)})$. We arrange for this assumption to hold by selecting $C_n$ accordingly. Similarly, we can apply Lemma 1, but now expressing the latter condition in terms of $\| Q_n - Q_0 \|_\infty$.

This proves the following efficiency theorem for the HAL-MLE in this particular estimation problem.

Theorem 3 Consider the formulation above of the statistical estimation problem. Let $\mathcal{G}_n = \{\bar{G} \in \mathcal{G}: \bar{G} \ll^* \bar{Q}_n\}$ and $\bar{G}_n = \arg\min_{\bar{G} \in \mathcal{G}_n} \| \bar{G} - \bar{G}_0 \|_{P_0}$. Assumptions:

• $\| \bar{G}_n - \bar{G}_0 \|_{P_0} = O_P(n^{-1/4})$, where we can use that $\| Q_n - Q_0 \|_{P_0} = o_P(n^{-1/4})$.

• Given the fit $Q_n = \sum_{s,j \in \mathcal{J}_n(s)} \beta_n(s,j) \phi_{s,j}$, with support points the observations $\{W_j(s): j = 1, \ldots, n, s\}$ and indicator basis functions $\phi_{s,j}(W) = I(W(s) \geq W_j(s))$, we assume that $C_n < C^u$ for some finite $C^u$ is chosen so that either
$$\| Q_n - Q_0 \|_\infty \min_{\{s,j \in \mathcal{J}_n(s): \beta_n(s,j) \neq 0\}} P_n(W(s) \geq W_j(s)) = o_P(n^{-1/2}),$$
or
$$\min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_n \phi_{s,j} = O_P(n^{-1/2 + \alpha(k)}).$$
Then $\Psi(Q_n)$ is an asymptotically efficient estimator of $\Psi(Q_0)$.

5 Example: HAL-MLE for the integral of the square of the data density

Let $O \sim P_0$ be a $k$-variate random variable with Lebesgue density $p_0$ that is assumed to be bounded from below by a $\delta > 0$ and from above by an $M < \infty$. Let $\{P_Q: Q \in \mathcal{Q}\}$ be a parametrization of the probability measure of $O$ in terms of a functional parameter $Q$ that varies over a class of multivariate real-valued cadlag functions on $[0, \tau]$ with finite sectional variation norm. Below we focus on the particular parametrization given by $p_Q = c(Q)\{\delta + (M - \delta)\,\mathrm{expit}(Q)\}$, where $\mathrm{expit}(x) = 1/(1 + \exp(-x))$ and $c(Q)$ is the normalizing constant defined by $\int p_Q \, do = 1$. Note that in this parametrization $Q$ can be any cadlag function with finite sectional variation norm, thereby allowing the densities $p_Q$ to be discontinuous (but cadlag). Another possible parametrization is obtained through the following steps: 1) model the density $p(x)$ as a product $\prod_{j=1}^k p_j(x_j \mid \bar{x}(j-1))$ of conditional densities of $x_j$ given $\bar{x}(j-1)$; 2) model each $p_j$ in terms of its univariate conditional hazard $\lambda_j$; 3) model this hazard as a function of an unspecified $Q_j(x_j, \bar{x}(j-1))$; and 4) set $Q = (Q_1, \ldots, Q_k)$. With this latter parametrization each $Q_j$ varies over a parameter space of cadlag functions with finite sectional variation norm.

Let the statistical model $\mathcal{M} = \{P_Q: Q \in \mathcal{Q}(C^u)\}$ for $P_0$ be nonparametric beyond the requirements that each probability distribution is dominated by the Lebesgue measure and that $Q$ varies over cadlag functions with sectional variation norm bounded by $C^u$. The statistical target parameter $\Psi: \mathcal{M} \to \mathbb{R}$ is defined by $\Psi(P) = \int p^2(o) \, do$. The canonical gradient of $\Psi$ at $P$ is given by $D^*(P)(O) = 2(p(O) - \Psi(P))$, and the exact second-order remainder $R_2(P, P_0) = \Psi(P) - \Psi(P_0) + P_0 D^*(P)$ is given by $R_2(P, P_0) = -\int (p - p_0)^2(o) \, do$.

Let $L(Q) = -\log p_Q$ be the log-likelihood loss function for $Q$. Let $Q_n$ be an $m = 0$-order HAL-MLE bounding the sectional variation norm by a $C_n < C^u$. We wish to establish conditions on $C_n$ so that $\Psi(Q_n) = \int p^2_{Q_n} \, do$ is an asymptotically efficient estimator of $\Psi(Q_0) = \int p^2_{Q_0} \, do$. We assume this HAL-MLE is discrete, so that we can use the finite-dimensional representation $Q_n = \sum_{s,j \in \mathcal{J}_n(s)} \beta_n(s,j) \phi_{s,j}$ with $\| \beta_n \|_{L_1} \leq C_n$, as in our general presentation. Let $Q^h_{n,\epsilon} = Q_n(0)(1 + \epsilon h(0)) + \sum_s \int_{(0_s, x_s]} (1 + \epsilon h(s, u_s)) \, dQ_{n,s}(u_s)$, indexed by any bounded function $h$, be the paths as defined in our general presentation (and the previous section). Let $S_h(Q_n) = \frac{d}{d\epsilon} L(Q^h_{n,\epsilon}) \big|_{\epsilon=0}$ be the score of this path under the log-likelihood loss. These scores are given by
$$S_h(Q_n) = \frac{d}{dQ_n} L(Q_n)(f(h, Q_n)),$$
where $f(h, Q_n) = Q_n(0) h(0) + \sum_s \int_{(0_s, x_s]} h(s, u_s) \, dQ_{n,s}(u_s)$. Let $\mathcal{S}(Q_n) = \{S_h(Q_n): h\}$ be the collection of scores. In order to apply Theorem 2, we need to determine an approximation $D^*_n(Q_n) \in \mathcal{S}(Q_n)$ of the canonical gradient $D^*(Q_n) = 2(p_{Q_n} - \Psi(Q_n))$. We have
$$S_h(Q) = A(f(h, Q))/C_1(Q) + G_1(Q) f(h, Q),$$
where $A(f) = \int (M - \delta) \frac{\exp(Q)}{(1 + \exp(Q))^2} f \, do$ and $C_1(Q) = \int \big(\delta + (M - \delta)\,\mathrm{expit}(Q)\big) \, do$,
and $G_1(Q) = -\frac{(M - \delta)\exp(Q)}{(1 + \exp(Q))(\delta + M \exp(Q))}$, so that the equation $S_h(Q) = D^*(Q)$ corresponds with $G_1(Q) f(h, Q) + C_1(Q)^{-1} A(f(h, Q)) = D^*(Q)$, which can be rewritten as $f(h, Q) + G_2(Q) A(f(h, Q)) = D^*(Q)/G_1(Q)$, where $G_2(Q) = 1/(C_1(Q) G_1(Q))$. Let $D_1(Q) = D^*(Q)/G_1(Q)$, so that the equation becomes $f + G_2(Q) A(f) = D_1(Q)$. Once we have solved for $f$, whose solution we denote by $f(Q)$, it remains to solve for $h$ in $f(h, Q) = f(Q)$, or to find a closest solution. It is important to note that $f \to A(f)$ is a linear real-valued operator. Applying this operator to both sides yields $A(f) + A(f) A(G_2(Q)) = A(D_1(Q))$, so that we obtain the solution
$$A(f) = \frac{A(D_1(Q))}{1 + A(G_2(Q))}.$$
Plugging this back into the equation, we obtain
$$f(Q) \equiv D_1(Q) - G_2(Q) \frac{A(D_1(Q))}{1 + A(G_2(Q))}.$$
Thus we have shown that, if we can set $f(h, Q_n) = f(Q_n)$, then $S_h(Q_n) = D^*(Q_n)$. It remains to determine a choice $h(Q_n)$ so that $f(h, Q_n) \approx f(Q_n)$. The space $\{f(h, Q_n): h\}$ equals $\{\sum_{s,j \in \mathcal{J}_n(s)} \alpha(s,j) \phi_{s,j}: \alpha\}$, the linear span of the basis functions $\{\phi_{s,j}: s, j \in \mathcal{J}_n(s)\}$. Let $f_n(Q_n)$ be the projection of $f(Q_n)$ onto this linear space, for example w.r.t. the $L_2(P_0)$-norm. Let $h_n(Q_n)$ be the solution of $f(h, Q_n) = f_n(Q_n)$, and let $D^*_n(Q_n) = S_{h_n(Q_n)}(Q_n)$ be our desired approximation of $D^*(Q_n)$, which is an element of the set of scores $\{S_h(Q_n): h\}$. We note that
$$\begin{aligned}
D^*_n(Q_n) - D^*(Q_n) &= S_{h_n(Q_n)}(Q_n) - D^*(Q_n) \\
&= G_1(Q_n) f_n(Q_n) + C_1(Q_n)^{-1} A(f_n(Q_n)) - \big\{ G_1(Q_n) f(Q_n) + C_1(Q_n)^{-1} A(f(Q_n)) \big\} \\
&= G_1(Q_n)(f_n(Q_n) - f(Q_n)) + C_1(Q_n)^{-1} A(f_n(Q_n) - f(Q_n)).
\end{aligned}$$
We will assume that $\| f_n(Q_n) - f(Q_n) \|_{P_0} = o_P(n^{-1/4})$. The main condition beyond (7) of Theorem 2 is that $P_0\{D^*_n(Q_n) - D^*(Q_n)\} = o_P(n^{-1/2})$. Note that $P_0 D^*_n(Q_0) = 0 = P_0 D^*(Q_0)$. Therefore,
$$\begin{aligned}
P_0\{D^*_n(Q_n) - D^*(Q_n)\} &= P_0\{D^*_n(Q_n) - D^*_n(Q_0)\} - P_0\{D^*(Q_n) - D^*(Q_0)\} \\
&= P_0\{G_1(Q_n)(f_n(Q_n) - f(Q_n))\} + P_0\{C_1(Q_n)^{-1} A(f_n(Q_n) - f(Q_n))\} \\
&\quad - P_0\{G_1(Q_0)(f_n(Q_0) - f(Q_0)) + C_1(Q_0)^{-1} A(f_n(Q_0) - f(Q_0))\}.
\end{aligned}$$
Let $\Pi_n$ be the projection operator onto the linear span generated by the basis functions of $Q_n$, which is of the same dimension as the number of basis functions. The latter difference can also be represented as
$$P_0\{D^*(Q_n) - D^*(Q_0) - \Pi_n(D^*(Q_n) - D^*(Q_0))\},$$
or, if we define $\Pi^\perp_n = (I - \Pi_n)$ as the projection operator onto the orthogonal complement of the linear space spanned by the basis functions in $Q_n$, then this term can be written as
$$P_0\{\Pi^\perp_n(D^*(Q_n) - D^*(Q_0))\}, \qquad (12)$$
which can, in particular, be bounded by the operator norm $\|\Pi^\perp_n\|$ of $\Pi^\perp_n$ times the $L_2(P_0)$-norm of $D^*(Q_n) - D^*(Q_0)$. Thus, if we assume that $\|\Pi^\perp_n\| = O_P(n^{-1/4})$, then it follows that this term is $o_P(n^{-1/2})$. We will simply assume (12) to be $o_P(n^{-1/2})$. The other conditions, beyond (7) of Theorem 2, hold by the facts that $\| Q_n - Q_0 \|_{P_0} = o_P(n^{-1/4})$ and that $D^*(Q_n), D^*_n(Q_n)$ fall in a $P_0$-Donsker class of cadlag functions with a universal bound on the sectional variation norm.
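To fix ideas, here is a small sketch of the resulting plug-in estimator on a binned density estimate (Python; the piecewise-constant `density` would come from, e.g., a HAL-based hazard fit as in the simulations below, and is an assumed input of this sketch):

```python
import numpy as np

def integral_squared_density(bin_edges, density, obs):
    """Plug-in Psi(p_n) = int p_n(o)^2 do for a piecewise-constant density
    on the given bins, together with the empirical mean and variance of
    the canonical gradient D*(O) = 2(p_n(O) - Psi); undersmoothing is
    meant to drive the empirical mean to o_P(n^{-1/2})."""
    widths = np.diff(bin_edges)
    psi = np.sum(density ** 2 * widths)
    # density value at each observation (clip keeps boundary points in range)
    idx = np.clip(np.searchsorted(bin_edges, obs, side="right") - 1,
                  0, len(density) - 1)
    eif = 2.0 * (density[idx] - psi)
    return psi, eif.mean(), eif.var()
```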
Verification of condition (7) of Theorem 2: Condition (7) states that
$$\min_{s,j \in \mathcal{J}_n(s)} P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s,j}) = o_P(n^{-1/2}).$$
We apply the last part of Theorem 1. We have
$$\Big\| \frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0) \Big\|_{P_0} = O(\| Q_n - Q_0 \|_{P_0}). \qquad (13)$$
Given that $d_0(Q_n, Q_0) = O_P(n^{-1/2 - \alpha(k)})$, it follows that the remaining condition is (9), or, equivalently,
$$\min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_n \phi_{s,j} = O_P(n^{-1/2 + \alpha(k)}).$$
This reduces to the assumption that $\min_{\{s,j \in \mathcal{J}_n(s): \beta_n(s,j) \neq 0\}} P_n(O(s) \geq O_{s,j}) = O_P(n^{-1/2 + \alpha(k)})$. We arrange for this assumption to hold by selecting $C_n$ accordingly.

This proves the following efficiency theorem for the HAL-MLE in this particular estimation problem.

Theorem 4 Let $O \sim P_0$ be a $k$-variate random variable with Lebesgue density $p_0$ that is assumed to be bounded from below by a $\delta > 0$ and from above by an $M < \infty$. Let $p_Q = c(Q)\{\delta + (M - \delta)\,\mathrm{expit}(Q)\}$, where $\mathrm{expit}(x) = 1/(1 + \exp(-x))$ and $c(Q)$ is the normalizing constant defined by $\int p_Q \, do = 1$, and where $Q \in \mathcal{Q}(C^u)$ can be any cadlag function with sectional variation norm bounded by $C^u$. Let the statistical model $\mathcal{M} = \{P_Q: Q \in \mathcal{Q}(C^u)\}$ for $P_0$ be nonparametric beyond the requirements that each probability distribution is dominated by the Lebesgue measure and that $Q$ varies over cadlag functions with sectional variation norm bounded by $C^u$. The statistical target parameter $\Psi: \mathcal{M} \to \mathbb{R}$ is defined by $\Psi(P) = \int p^2(o) \, do$, which we also denote by $\Psi(Q)$. The canonical gradient of $\Psi$ at $P$ is given by $D^*(P)(O) = 2(p(O) - \Psi(P))$, and the exact second-order remainder $R_2(P, P_0) = \Psi(P) - \Psi(P_0) + P_0 D^*(P)$ is given by $R_2(P, P_0) = -\int (p - p_0)^2 \, do$.

Consider the formulation above of the statistical estimation problem. Assumptions:

• $\| Q_n - Q_0 \|_{P_0} = o_P(n^{-1/4})$.

• Given the fit $Q_n = \sum_{s,j \in \mathcal{J}_n(s)} \beta_n(s,j) \phi_{s,j}$, with support points the observations $\{O_j(s): j = 1, \ldots, n, s\}$ and indicator basis functions $\phi_{s,j}(O) = I(O(s) \geq O_j(s))$, we assume that $C_n < C^u$ for some finite $C^u$ is chosen so that either
$$\| Q_n - Q_0 \|_\infty \min_{\{s,j \in \mathcal{J}_n(s): \beta_n(s,j) \neq 0\}} P_n(O(s) \geq O_j(s)) = o_P(n^{-1/2}),$$
or
$$\min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} P_n \phi_{s,j} = O_P(n^{-1/2 + \alpha(k)}).$$

• Let $\Pi^\perp_n$ be the projection operator in $L_2(P_0)$ onto the orthogonal complement of the linear span of the basis functions $\{\phi_{s,j}: s, j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0\}$ in the fit of $Q_n$. Assume
$$P_0\{\Pi^\perp_n(D^*(Q_n) - D^*(Q_0))\} = o_P(n^{-1/2}). \qquad (14)$$
A sufficient condition is that the operator norm $\|\Pi^\perp_n\|$ of $\Pi^\perp_n$ is $O_P(n^{-1/4})$.

Then $\Psi(Q_n)$ is an asymptotically efficient estimator of $\Psi(Q_0)$.

6 Simulations

Our global undersmoothing condition only specifies a sufficient rate at which the empirical mean of the sparsest selected basis function should converge to zero; it does not provide a constant in front of this rate. Thus it does not immediately yield a practical method for tuning the level of undersmoothing. In our simulation studies, we investigate a targeted $L_1$-norm selector, chosen so that the empirical mean of the canonical gradient at the HAL-MLE (indexed by the $L_1$-norm), possibly combined with a HAL-MLE of the nuisance parameter in the canonical gradient, is $o_P(n^{-1/2})$.
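A sketch of this targeted selector for the ATE example (Python; `fit_hal(C)` is a hypothetical user-supplied function returning fitted values of $\bar{Q}$ at $L_1$-bound $C$, and `Gbar_n` a cross-validated HAL estimate of $\bar{G}$; the cutoff $\sigma_n \log(n)/\sqrt{n}$ matches the criterion used in the next subsection):

```python
import numpy as np

def targeted_C(Y, A, C_grid, fit_hal, Gbar_n, C_cv):
    """Smallest L1-bound C >= C_cv whose HAL fit solves the canonical
    gradient equation up to sigma_n * log(n)/sqrt(n).  C_grid is assumed
    to contain at least one value >= the cross-validated bound C_cv."""
    n = len(Y)
    for C in sorted(C_grid):
        if C < C_cv:
            continue                      # never undersmooth less than CV
        Qbar1 = fit_hal(C)                # fitted E(Y | A=1, W)
        psi = Qbar1.mean()
        eif = A / Gbar_n * (Y - Qbar1) + Qbar1 - psi
        if abs(eif.mean()) <= eif.std() * np.log(n) / np.sqrt(n):
            return C, psi
    return C, psi                         # fall back to the largest bound
```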
In extensive simulations, this targeted method appears to give better practical results than several direct implementations of our global undersmoothing criterion (i.e., the choice of constant matters for practical performance). More research is needed to investigate whether one can construct a global undersmoothing selector (conforming to our theorem) that results in well-behaved efficient plug-in estimators across a large class of target parameters. Our simulations also demonstrate that our targeted selection method for undersmoothing controls the sectional variation norm of the fit, which is a crucial part of the Donsker class, or asymptotic equicontinuity, condition.

6.1 Average treatment effect

We simulated a vector $W = (W_1, W_2)$. $W_1$ was simulated by drawing $Z$ from a Beta distribution and rescaling it linearly; $W_2$ was drawn independently from a Bernoulli(0.5) distribution. Given $W = w$, a binary random variable $A$ was drawn with probability of $A = 1$ equal to $\bar{G}_0(w) = \mathrm{logit}^{-1}$ of a linear combination of $w_1$, $w_2$, and their interaction. Given $W = w$, we set $Y = \bar{Q}_0(w) + \epsilon$, where $\bar{Q}_0(w) = \mathrm{logit}^{-1}$ of a linear combination of $w_1$ and $w_2$, and $\epsilon$ was drawn from a mean-zero normal distribution. (A code sketch of a data-generating process of this form follows at the end of this subsection.)

We estimated $\Psi(P_0)$ as follows. We estimate $\bar{Q}_0$ using a HAL regression estimator and select the regularization of the estimator by choosing the smallest value of the $L_1$-norm bound $C$ such that
$$| P_n D^*(Q_{C,n}, \bar{G}_n) | \leq \sigma_n \frac{\log(n)}{n^{1/2}}, \quad \text{where } \sigma_n^2 = P_n \{D^*(Q_{C,n}, \bar{G}_n)\}^2,$$
and $\bar{G}_n$ is the HAL-MLE estimate of $\bar{G}_0$ (i.e., a HAL regression that uses the cross-validated choice of $C$). We then computed the plug-in estimator as described in Section 4.

We generated 3,000 data sets in this way and computed the undersmoothed HAL estimate. We report the estimator's bias (scaled by $n^{1/2}$), variance (scaled by $n$), mean squared error (scaled by $n$), and the sampling distribution of $n^{1/2}\{\Psi(\tilde{Q}_n) - \Psi(P_0)\}$. We additionally report on the behavior of $n^{1/2} P_n D^*(Q_{C_n,n}, \bar{G}_0)$ and
$$n^{1/2} \Big\{ \min_{s,j \in \mathcal{J}_n(s), \beta_n(s,j) \neq 0} \big\| P_n \tfrac{d}{dQ_n} L(Q_n)(\phi_{s,j}) \big\| \Big\}.$$

Figure 1: Left column, top to bottom: bias, variance, and mean squared error (all scaled as described in the text) of the undersmoothed HAL-MLE. Right column, top to bottom: scaled empirical average of the canonical gradient, empirical average of the quantity given in equation (7), and sampling distribution of the scaled and centered estimator. The dashed lines in the variance and mean squared error plots denote the efficiency bound. The reference sampling distribution for the estimators is a mean-zero normal distribution with this variance.

As predicted by theory, the bias of the estimator diminishes faster than $n^{-1/2}$ and the variance of the estimator approaches the efficiency bound in larger samples (Figure 1). The empirical average of the canonical gradient is appropriately controlled (top right), and our selection criterion for the HAL tuning parameter appears to also satisfy the global criterion stipulated by equation (7). At all sample sizes, the sampling distribution of the scaled and centered estimator is well approximated by the efficient asymptotic distribution.
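The following sketch reproduces the shape of this data-generating process (Python); the Beta parameters and all numeric coefficients are illustrative placeholders, not the exact values used in our simulations:

```python
import numpy as np

def simulate(n, seed=0):
    """Draw (W1, W2, A, Y) with the same structure as the ATE simulation:
    a rescaled Beta covariate, a Bernoulli covariate, a logistic treatment
    mechanism with an interaction, and a logistic-mean outcome plus normal
    noise.  All numeric constants are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))
    Z = rng.beta(2.0, 2.0, size=n)
    W1 = 4.0 * Z - 2.0                      # linear rescaling of Z
    W2 = rng.binomial(1, 0.5, size=n)
    Gbar = expit(0.5 * W1 - 0.5 * W1 * W2)  # P(A = 1 | W), bounded away from 0/1
    A = rng.binomial(1, Gbar)
    Qbar = expit(W1 - 0.5 * W2)             # E(Y | W) = E(Y | A = 1, W)
    Y = Qbar + rng.normal(0.0, 0.25, size=n)
    return W1, W2, A, Y, Qbar.mean()        # last entry approximates true Psi
```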
We simulated a univariate random variable O from a fixed normal distribution and evaluated the performance of undersmoothed HAL for estimating the integral of the square of the density of O (Section 5). We implemented a HAL-based estimator of the density using an approach similar to the one described in Munoz and van der Laan (2011). This approach entails estimating a discrete hazard function with HAL over a pre-specified binning of the real line. For this simulation we used 320 equidistant bins; the HAL density estimator is robust to this choice, so long as a sufficiently large number of bins is chosen. We sampled 1,000 data sets for each of several sample sizes ranging from n = 100 to n = 100,000. We compare the results for undersmoothed HAL to those obtained with a typical implementation of HAL that selects the level of smoothing by cross-validation, using the same criteria described in the previous subsection.

The simulation results reflect what is expected based on the theory. In particular, the undersmoothed HAL achieves the efficiency bound in large samples, and the scaled and centered sampling distribution of the estimator is well approximated by the efficient asymptotic distribution. We found that our selection criterion for the level of undersmoothing based on the EIF led to control of the variation norm of the resulting fit. On the other hand, the results for the HAL estimator with the level of smoothing selected by cross-validation demonstrate that this estimator does not have bias decreasing faster than n^{-1/2}; accordingly, it performs worse in terms of all criteria that we considered.

Figure 2: Simulation results for the average density value parameter: (a) √n times absolute bias, (b) variance, (c) MSE, (d) histogram of √n(Ψ_n − Ψ_0), for undersmoothed HAL and HAL(CV) at different sample sizes.

Figure 3: Summary features for the average density value simulation: (a) √n P_n(D*), (b) √n(P_n − P_0)(D_n − D_0), (c) ‖f_n‖_v, (d) √n min_{s,j∈J_n} P_n φ_{s,j}, for undersmoothed HAL and HAL(CV) at different sample sizes.
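A minimal sketch of the binned discrete-hazard construction (following Munoz and van der Laan, 2011) is given below. A plain L1-penalized logistic regression stands in for the HAL binary regression, so the code illustrates the data structure rather than an actual HAL fit; the default bin count is kept small for readability, whereas the simulation used 320 bins.

```python
# Sketch of density estimation via a discrete hazard over bins; LogisticRegression
# with an L1 penalty is a stand-in for the HAL binary regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def binned_hazard_density(o, n_bins=40):
    edges = np.linspace(o.min(), o.max(), n_bins + 1)
    width = edges[1] - edges[0]
    bins = np.clip(np.digitize(o, edges) - 1, 0, n_bins - 1)
    # Long format: an observation in bin k contributes one at-risk record per bin
    # j <= k, with outcome 1 only at j == k (the discrete-hazard encoding).
    rows = np.concatenate([np.arange(k + 1) for k in bins])
    outcomes = np.concatenate([np.eye(1, k + 1, k).ravel() for k in bins])
    # Zero-order HAL-type basis in the bin index: indicators I(bin >= t).
    knots = np.arange(1, n_bins)
    design = (rows[:, None] >= knots[None, :]).astype(float)
    fit = LogisticRegression(penalty="l1", solver="liblinear").fit(design, outcomes)
    grid = (np.arange(n_bins)[:, None] >= knots[None, :]).astype(float)
    hazard = fit.predict_proba(grid)[:, 1]                    # discrete hazard per bin
    survival = np.concatenate([[1.0], np.cumprod(1.0 - hazard)[:-1]])
    return edges, hazard * survival / width                   # density value per bin
```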
Discussion

In this article we established that, for realistic nonparametric statistical models, an overfitted Spline HAL-MLE of a functional parameter of the data distribution yields efficient plug-in estimators of pathwise differentiable functionals of this functional parameter. The statistical model can be any model for which the parameter space of the functional is a (cartesian product of a) subset of the set of multivariate cadlag functions with a universal bound on the sectional variation norm. The undersmoothing condition requires that one chooses the L1-norm in the HAL-MLE large enough that the set of basis functions with non-zero coefficients includes "sparse enough" basis functions, where "sparse enough" corresponds with the proportion of non-zero elements of the basis function (among its n observed values) converging to zero faster than n^{-(1/2-α(k))}, a rate slower than n^{-1/2}. This rate could be relaxed to be as slow as n^{-1/4} if one were able to establish that the HAL-MLE converges in supremum norm at a rate faster than n^{-1/4}, as it does in L²(P_0)-norm and loss-based dissimilarity; but, either way, the rate can be set at the level n^{-(1/2-α(k))}, where n^{-(1/2+α(k))} is the rate of convergence of the HAL-MLE w.r.t. the loss-based dissimilarity. The undersmoothing condition represents a rate that is not parameter specific, so that such an undersmoothed HAL-MLE will be efficient for any of its smooth functionals.

In addition, undersmoothing the HAL-MLE does not change its rate of convergence relative to the HAL-MLE optimally tuned with cross-validation, suggesting that it remains a good estimator of the true functional parameter.

On the other hand, a typical TMLE targeting one particular target parameter will generally only be asymptotically efficient for that particular target parameter, and not even asymptotically linear for other smooth functionals, even if it uses as initial estimator the HAL-MLE tuned with cross-validation. It therefore appears to be an interesting topic to better understand the sampling distribution of the undersmoothed HAL-MLE, in an asymptotic sense and in relation to the sampling distribution of a TMLE that uses an optimally smoothed (i.e., cross-validated) HAL-MLE as initial estimator. Note, however, that if the TMLE uses an undersmoothed HAL-MLE as initial estimator, then the TMLE step should result in only small changes, thereby mostly preserving the behavior of the undersmoothed HAL-MLE.

It is also of interest to observe that the second order remainder of the HAL-MLE for a pathwise differentiable functional appears to be driven either by the square of the L²(P_0)-norm of the difference between the HAL-MLE and the true functional parameter, or, in the case that the efficient influence curve involves a nuisance parameter G, possibly (or only) by a product of the difference of the HAL-MLE Q_n from its true counterpart Q_0 and the difference of G_0 from its projection G_{0,n} onto the linear span of the basis functions selected by the undersmoothed HAL-MLE Q_n. Since G_{0,n} is a type of oracle estimator of G_0, this suggests that, in a model in which our knowledge of G_0 is not any better than our knowledge of Q_0, the HAL-MLE has a good second order remainder that might generally be smaller than it would be for a TMLE that estimates G_0 with an actual estimator such as the HAL-MLE.

On the other hand, if the statistical model encodes particularly strong knowledge of the nuisance parameter G_0, then a TMLE can fully utilize this model for G_0 and thereby obtain a better behaved second order remainder than the one for the overfitted HAL-MLE. One also suspects that a TMLE will be more sensitive to lack of positivity than the undersmoothed HAL-MLE. We therefore conjecture that an undersmoothed HAL-MLE might be the preferred estimator in models in which estimation of G_0 is as hard as estimation of Q_0, and when lack of positivity is a serious issue, while a HAL-TMLE might be preferred when estimation of G_0 is easier than estimation of Q_0. These are not formal statements, but they indicate a qualitative comparison between the undersmoothed HAL-MLE and a HAL-TMLE using an estimator (HAL-MLE) G_n of G_0.

In future research we hope to address the comparison between the undersmoothed HAL-MLE and the HAL-TMLE in realistic simulations, and possibly through a formal comparison of their second order remainders.
In a subsequent article we will marry the TMLE with the HAL-MLE by defining a targeted HAL-MLE that minimizes the empirical risk over the linear span of basis functions (approximating the true cadlag function with finite sectional variation norm) under the L1-constraint and under the constraint that the Euclidean norm of the empirical mean of the efficient influence curve at the HAL-MLE (as well as at an estimator G_n) is o_P(n^{-1/2}). We will show that undersmoothing this targeted HAL-MLE results in an estimator that is still efficient across all smooth functionals, while it is able to fully exploit all knowledge of G_0 for the sake of the specific target parameter.

A key advantage of a TMLE is that it can utilize any super-learner, so that its library can include many powerful machine learning algorithms beyond m-th order Spline HAL-MLEs. In this manner, a TMLE using a powerful super-learner might compensate for the favorable property of an undersmoothed HAL-MLE w.r.t. the size of the second order remainder. In another future article we will provide a method that marries a powerful super-learner with the HAL-MLE, by using the super-learner as a dimension reduction and applying the HAL-MLE as the meta-learning step in an ensemble learner. We will show that an undersmoothed HAL-MLE in this meta-learning step again results in an estimator that is efficient for any of its smooth functionals. By using a targeted HAL-MLE as the meta-learning step, we might end up with an estimator that is able to fully exploit the strengths of super-learning, the undersmoothed HAL-MLE, and TMLE using a good estimator of G_0, combined in one method.

References

D. Benkeser and M.J. van der Laan. The highly adaptive lasso estimator. Proceedings of the IEEE Conference on Data Science and Advanced Analytics, 2016. To appear.

P.J. Bickel, C.A.J. Klaassen, Y. Ritov, and J. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Springer, Berlin Heidelberg New York, 1997.

R.D. Gill, M.J. van der Laan, and J.A. Wellner. Inefficient estimators of the bivariate survival function for three models. Annales de l'Institut Henri Poincaré, 31(3):545–597, 1995.

I. Diaz Munoz and M.J. van der Laan. Super learner based conditional density estimation with application to marginal structural models. The International Journal of Biostatistics, 7(1):1–20, 2011.

W. Newey. The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349–1382, 2014.

E.C. Polley, S. Rose, and M.J. van der Laan. Super Learner. In M.J. van der Laan and S. Rose, editors, Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York Dordrecht Heidelberg London, 2011.

J.M. Robins and A. Rotnitzky. Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology. Birkhäuser, Basel, 1992.

X. Shen. On methods of sieves and penalization. Annals of Statistics, 25(6):2555–2591, 1997.

X. Shen. Large sample sieve estimation of semiparametric models. Chapter 76 in Handbook of Econometrics, 2007.

M.J. van der Laan. Causal effect models for intention to treat and realistic individualized treatment rules. Technical Report 203, Division of Biostatistics, University of California, Berkeley, 2006.

M.J. van der Laan. Estimation based on case-control designs with known prevalence probability. The International Journal of Biostatistics, 4(1):Article 17, 2008.

M.J. van der Laan. A generally efficient targeted minimum loss-based estimator. Technical Report 300, Division of Biostatistics, University of California, Berkeley, 2015. http://biostats.bepress.com/ucbbiostat/paper343; to appear in The International Journal of Biostatistics, 2017.
M.J. van der Laan and A. Bibaut. Technical report, Division of Biostatistics, University of California, Berkeley. https://arxiv.org/abs/1709.06256.

M.J. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical Report 130, Division of Biostatistics, University of California, Berkeley, 2003.

M.J. van der Laan and S. Gruber. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. To appear in The International Journal of Biostatistics, 2015.

M.J. van der Laan and J.M. Robins. Unified Methods for Censored Longitudinal Data and Causality. Springer, Berlin Heidelberg New York, 2003.

M.J. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, Berlin Heidelberg New York, 2011.

M.J. van der Laan and D.B. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1):Article 11, 2006.

M.J. van der Laan, S. Dudoit, and A.W. van der Vaart. The cross-validated adaptive epsilon-net estimator. Statistics & Decisions, 24(3):373–395, 2006.

M.J. van der Laan, E.C. Polley, and A.E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):Article 25, 2007.

A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, New York, 1998.

A.W. van der Vaart, S. Dudoit, and M.J. van der Laan. Oracle inequalities for multi-fold cross-validation. Statistics & Decisions, 24(3):351–371, 2006.

Appendix

A Representing a function as an infinitesimal linear combination of spline-basis functions

We have the following representation theorem for the smoothness class D^{(m)}[0,τ], consisting of k-variate real-valued cadlag functions whose m-th order sectional variation norm is bounded (and which are thereby, in particular, m-times differentiable), as defined in Section 2.

Theorem 5 (m-th order spline representation of a function Q ∈ D^{(m)}[0,τ]) For any function Q ∈ D^{(m)}[0,τ] (i.e., with finite m-th order sectional variation norm), we have
\[
Q(x) = Q(0) + \sum_{j=0}^{m-1} \sum_{\bar s(j)} Q^{j+1}_{\bar s(j)}(0_{s_j})\, \phi_{\bar s(j),\emptyset,x_s}(0) + \sum_{\bar s(m)} \int \phi_{\bar s(m),x_s}(z_{s_m})\, dQ^m_{\bar s(m)}(z_{s_m}).
\]
Thus, if Q ∈ D^{(m+1)}[0,τ], then we have
\[
Q(x) = Q(0) + \sum_{j=0}^{m-1} \sum_{\bar s(j)} Q^{j+1}_{\bar s(j)}(0_{s_j})\, \phi_{\bar s(j),\emptyset,x_s}(0) + \sum_{\bar s(m)} Q^{m+1}_{\bar s(m)}(0_{s_m})\, \phi_{\bar s(m),\emptyset,x_s}(0) + \sum_{\bar s(m+1)} \int \phi_{\bar s(m+1),x_s}(z_{s_{m+1}})\, dQ^{m+1}_{\bar s(m+1)}(z_{s_{m+1}}).
\]

Proof of Theorem: We have already expressed Q in terms of integrals w.r.t. the measures generated by its s-specific sections Q_s(x) = Q(x_s, 0_{-s}) for s ⊂ {1,...,k}. Suppose that, for each subset s, Q_s is absolutely continuous w.r.t. Lebesgue measure, so that dQ_s(u_s) = Q¹_s(u_s) du_s. Suppose now that Q¹_s is a cadlag function with finite sectional variation norm, so that we can apply the same representation to Q¹_s:
\[
Q^1_s(u_s) = Q^1_s(0_s) + \sum_{s_1 \subset s} \int_{(0_{s_1},u_{s_1}]} dQ^1_{s,s_1}(y_{s_1}).
\]
In this manner we obtain a representation of Q(x) in terms of integrals w.r.t. Q¹_{s,s₁}, across all subsets s, s₁ with s₁ ⊂ s.
Specifically, we have
\[
\begin{aligned}
Q(x) &= Q(0) + \sum_{s \subset \{1,...,k\}} \int_{(0_s,x_s]} Q^1_s(u_s)\, du_s \\
&= Q(0) + \sum_s \int_{(0_s,x_s]} Q^1_s(0_s)\, du_s + \sum_s \int_{(0_s,x_s]} \sum_{s_1 \subset s} \int_{(0_{s_1},u_{s_1}]} dQ^1_{s,s_1}(y_{s_1})\, du_s \\
&= Q(0) + \sum_s Q^1_s(0_s) \prod_{j \in s} x_j + \sum_{s, s_1 \subset s} \int_{y_{s_1}} \Big\{ \int_{u_s} I(u_s \le x_s)\, I(y_{s_1} \le u_{s_1})\, du_s \Big\}\, dQ^1_{s,s_1}(y_{s_1}),
\end{aligned}
\]
where we used Fubini's theorem to exchange the order of integration. The product of the indicators can be written as I(y_{s₁} ≤ u_{s₁} ≤ x_{s₁}) I(y_{s₁} ≤ x_{s₁}) ∏_{j∈s, j∉s₁} I(u_j ≤ x_j). The inner integral thus maps the original indicator basis function x_s → I(u_s ≤ x_s) to a new basis function
\[
x_s \mapsto \phi_{s,s_1,x_s}(y_{s_1}) \equiv \int_{u_s} I(u_s \le x_s)\, I(y_{s_1} \le u_{s_1})\, du_s = \prod_{j \in s_1} (x(j) - y(j))\, I(y(j) \le x(j)) \prod_{j \in s/s_1} x_j.
\]
This is a tensor product of spline basis functions across the components in s. Specifically, for j ∈ s₁ it is a first order spline basis function (a line with slope 1 starting at knot y_j), while for j ∈ s/s₁ it is a first order spline at knot point 0. We also define
\[
\phi_{s,\emptyset,x_s}(0_s) = \prod_{j \in s} x_j,
\]
which corresponds with setting s₁ equal to the empty set in the definition of φ_{s,s₁,x_s}(y_{s₁}) and selecting knot y_{s₁} = 0_{s₁}. Thus, we have obtained:
\[
Q(x) = Q(0) + \sum_s Q^1_s(0_s)\, \phi_{s,\emptyset,x_s}(0_s) + \sum_{s,s_1} \int_{y_{s_1}} \phi_{s,s_1,x_s}(y_{s_1})\, dQ^1_{s,s_1}(y_{s_1}).
\]
This proves the representation for functions in D^{(m)}[0,τ] with m = 1. The representation shows that this class of functions can be represented as an infinitesimal linear combination of first order spline basis functions for which the L1-norm of the coefficients equals the first order sectional variation norm of the function.

Let us now use the same method to derive a representation under one more degree of differentiability, thereby establishing the general recursion. The last expression expresses Q in terms of integrals w.r.t. Q¹_{s,s₁}. Suppose now that Q¹_{s,s₁} is absolutely continuous w.r.t. Lebesgue measure, so that dQ¹_{s,s₁}(y_{s₁}) = Q²_{s,s₁}(y_{s₁}) dy_{s₁} for a second order derivative Q²_{s,s₁}. Let us also assume that Q²_{s,s₁} is cadlag with finite sectional variation norm, so that
\[
Q^2_{s,s_1}(y_{s_1}) = Q^2_{s,s_1}(0_{s_1}) + \sum_{s_2 \subset s_1} \int_{(0_{s_2},y_{s_2}]} dQ^2_{s,s_1,s_2}(z_{s_2}).
\]
Substituting this into the last expression for Q, combined with the derivation above involving the change of order of integration, yields the following:
\[
Q(x) = Q(0) + \sum_s Q^1_s(0_s)\, \phi_{s,\emptyset,x_s}(0_s) + \sum_{s,s_1} Q^2_{s,s_1}(0_{s_1})\, \phi_{s,s_1,\emptyset,x_s}(0) + \sum_{s,s_1,s_2} \int_{z_{s_2}} \phi_{s,s_1,s_2,x_s}(z_{s_2})\, dQ^2_{s,s_1,s_2}(z_{s_2}).
\]
This representation shows that such a function is represented as an infinitesimal linear combination of (up to) second order spline basis functions for which the L1-norm of the coefficients equals the second order sectional variation norm of the function.
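To make the recursion concrete, consider the univariate case k = 1 on [0, τ] (our own illustration, using the notation above with s = {1}). If Q is absolutely continuous with cadlag derivative Q^{(1)} of finite variation, one application of Fubini's theorem gives
\[
Q(x) = Q(0) + \int_{(0,x]} Q^{(1)}(u)\, du = Q(0) + Q^{(1)}(0)\, x + \int_{(0,x]} (x-y)\, dQ^{(1)}(y),
\]
and if, in addition, dQ^{(1)}(y) = Q^{(2)}(y)\, dy for a cadlag Q^{(2)} of finite variation, a second application gives
\[
Q(x) = Q(0) + Q^{(1)}(0)\, x + Q^{(2)}(0)\, \frac{x^2}{2} + \int_{(0,x]} \frac{(x-z)^2}{2}\, dQ^{(2)}(z).
\]
The knot-z basis function thus upgrades from the zero-order indicator I(z ≤ x), to the first order spline (x − z) I(z ≤ x), to the second order spline (x − z)²/2 · I(z ≤ x), matching the general recursion for φ in the proof.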
Some of the spline basis functions in these tensor products are of lower order, but only at knot points equal to 0.

Similarly, the third order representation follows from the above second order representation:
\[
Q(x) = Q(0) + \sum_s Q^1_s(0_s)\, \phi_{s,\emptyset,x_s}(0_s) + \sum_{s,s_1} Q^2_{s,s_1}(0_{s_1})\, \phi_{s,s_1,\emptyset,x_s}(0) + \sum_{s,s_1,s_2} Q^3_{s,s_1,s_2}(0_{s_2})\, \phi_{s,s_1,s_2,\emptyset,x_s}(0) + \sum_{s,s_1,s_2,s_3} \int_{z_{s_3}} \phi_{s,s_1,s_2,s_3,x_s}(z_{s_3})\, dQ^3_{s,s_1,s_2,s_3}(z_{s_3}),
\]
where the second order basis function x_s → φ_{s,s₁,s₂,x_s}(y_{s₂}) indexed by knots y is mapped to the new third order basis function
\[
\phi_{\bar s(3),x_s}(z_{s_3}) \equiv \prod_{j \in s_3} \int_{(z_j,x_j]} \phi_{j,\bar s(2),x_j}(y_j)\, dy_j \prod_{j \in s_2/s_3} \int_{(0,x_j]} \phi_{j,\bar s(2),x_j}(y_j)\, dy_j \prod_{j \in s/s_2} \phi_{j,\bar s(2),x_j}(0).
\]

In general, the m-th order spline representation of Q is given by
\[
Q(x) = Q(0) + \sum_{j=0}^{m-1} \sum_{\bar s(j)} Q^{j+1}_{\bar s(j)}(0_{s_j})\, \phi_{\bar s(j),\emptyset,x_s}(0) + \sum_{\bar s(m)} \int \phi_{\bar s(m),x_s}(z_{s_m})\, dQ^m_{\bar s(m)}(z_{s_m}),
\]
and the (m+1)-th order spline representation is derived from it as follows:
\[
Q(x) = Q(0) + \sum_{j=0}^{m-1} \sum_{\bar s(j)} Q^{j+1}_{\bar s(j)}(0_{s_j})\, \phi_{\bar s(j),\emptyset,x_s}(0) + \sum_{\bar s(m)} Q^{m+1}_{\bar s(m)}(0_{s_m})\, \phi_{\bar s(m),\emptyset,x_s}(0) + \sum_{\bar s(m+1)} \int \phi_{\bar s(m+1),x_s}(z_{s_{m+1}})\, dQ^{m+1}_{\bar s(m+1)}(z_{s_{m+1}}),
\]
where the m-th order basis function x_s → φ_{\bar s(m),x_s}(y_{s_m}) indexed by knots y_{s_m} is mapped to the new (m+1)-th order basis function
\[
\phi_{\bar s(m+1),x_s}(z_{s_{m+1}}) \equiv \prod_{j \in s_{m+1}} \int_{(z_j,x_j]} \phi_{j,\bar s(m),x_j}(y_j)\, dy_j \prod_{j \in s_m/s_{m+1}} \int_{(0,x_j]} \phi_{j,\bar s(m),x_j}(y_j)\, dy_j \prod_{j \in s/s_m} \phi_{j,\bar s(m),x_j}(0).
\]
Note that the last term in the m-th order representation is replaced by the two new last terms to obtain the (m+1)-th order representation. This completes the proof. □

B Rate of convergence of the m-th order Spline HAL-MLE, and the smoothness-adaptive Spline HAL-MLE

Let Q^m = D^{(m)}[0,τ]. Under the assumption that M₁ < ∞ and M₂ < ∞, we have r_n(m) ≡ d(Q^m_n, Q^m) = o_P(n^{-1/2}), where Q^m = arg min_{Q ∈ Q^m} P_0 L(Q). Let m₀ be the smallest integer m for which Q_0 ∈ Q^m. Then, if m ≤ m₀, we have Q^m = Q_0 and thus d(Q^m_n, Q_0) = r_n(m) = o_P(n^{-1/2}). In general, the rate of convergence r_n(m) will be unique for each m, so that the best rate of convergence is achieved by the m₀-th order Spline HAL-MLE Q^{m₀}_n, and this rate is also achieved by selecting m with cross-validation (due to the asymptotic equivalence of the cross-validation selector with the oracle selector). Thus, if
\[
m_n = \arg\min_m E_{B_n} P^1_{n,B_n} L\big( \hat Q^m(P^0_{n,B_n}) \big)
\]
is the cross-validation selector, then, by the oracle inequality for the cross-validation selector, we have
\[
\frac{E_{B_n}\, d\big( \hat Q^{m_n}(P^0_{n,B_n}), Q_0 \big)}{\min_m E_{B_n}\, d\big( \hat Q^m(P^0_{n,B_n}), Q_0 \big)} \to_p 1. \tag{15}
\]
We refer to Q^{m_n}_n = \hat Q^{m_n}(P_n) as the smoothness-adaptive Spline HAL-MLE.
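A minimal sketch of this cross-validated selection of m is given below; `fit_spline_hal` and `empirical_risk` are hypothetical stand-ins for an m-th order Spline HAL-MLE implementation and its loss, not a packaged API.

```python
# Sketch of the cross-validated smoothness selector m_n.
import numpy as np
from sklearn.model_selection import KFold

def select_m(x, y, fit_spline_hal, empirical_risk, m_grid=(0, 1, 2, 3), v=10):
    cv_risk = np.zeros(len(m_grid))
    for train, valid in KFold(n_splits=v, shuffle=True, random_state=1).split(x):
        for i, m in enumerate(m_grid):
            fit = fit_spline_hal(x[train], y[train], m=m)      # train on P^0_{n,B_n}
            cv_risk[i] += empirical_risk(fit, x[valid], y[valid]) / v  # risk on P^1_{n,B_n}
    return m_grid[int(np.argmin(cv_risk))]
```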
B.1 Asymptotic equivalence of the smoothness-adaptive Spline HAL-MLE with the oracle Spline HAL-MLE

The following lemma states that if the rates of convergence of the m-th order Spline HAL-MLEs are unique across m, and the conditions hold under which the cross-validation selector m_n is asymptotically equivalent with the oracle selector (15), then it follows that P(m_n = m₀) → 1. In that case, establishing that the plug-in m₀-th order Spline HAL-MLE is asymptotically efficient also implies that the plug-in of the smoothness-adaptive Spline HAL-MLE is asymptotically efficient. Thus, under this condition, our asymptotic efficiency results for the m₀-th order Spline HAL-MLE also imply our desired result for the asymptotic efficiency of the smoothness-adaptive Spline HAL-MLE.

Lemma 2 Let r_m(n) ≡ E_{B_n} d(\hat Q^m(P^0_{n,B_n}), Q_0), and let m_n ∈ {0,...,K_n}. Assume that the m-specific rates r_m(n) are unique, in the sense that
\[
\frac{r_{m_0}(n)}{\min_{m > m_0} r_m(n)} \to_p 0; \tag{16}
\]
assume
\[
\frac{\log K_n / n}{\min_m r_m(n)} \to_p 0; \tag{17}
\]
and assume M₁ < ∞, M₂ < ∞. Then (15) holds, and P(m_n = m₀) → 1 as n → ∞.

Proof of Lemma: Due to the asymptotic equivalence of the cross-validation selector m_n with the oracle selector under the above conditions, it follows that P(m_n < m₀) → 0. Suppose now that, along a subsequence, P(m_n > m₀) > δ for some δ > 0. Let A = {m : m > m₀}, so that P(m_n ∈ A) > δ > 0. For m_n ∈ A we have
\[
r_{m_0}(n)/r_{m_n}(n) \le r_{m_0}(n) / \min_{m > m_0} r_m(n),
\]
but the latter upper bound converges to zero in probability by assumption (16). Thus, in that case r_{m_0}(n)/r_{m_n}(n) does not converge to 1 in probability. But min_m r_m(n)/r_{m_n}(n) ≤ r_{m_0}(n)/r_{m_n}(n), so this also implies that min_m r_m(n)/r_{m_n}(n) does not converge to 1 in probability. However, this contradicts (15), which states that min_m r_m(n)/r_{m_n}(n) →_p 1. Thus P(m_n = m₀) → 1 as n → ∞. □

C Proof of Theorem 1 and Lemma 1

The HAL-MLE has the form Q_n = Σ_{s, j ∈ J_n(s)} β_n(s,j) φ_{s,j} for a finite collection of basis functions, where a basis function φ_{s,j}(x) has support {x : x_s ≥ x_{s,j}} for a knot point x_{s,j} and a subset s of {1,...,k}. We also know that Σ_{s,j} |β_n(s,j)| ≤ C_n for the selected L1-bound C_n (typically the attained L1-norm will equal C_n). We have
\[
\beta_n = \arg\min_{\beta : \sum_{s,j} | \beta(s,j)| \le C_n} P_n L\Big( \sum_{s, j \in \mathcal{J}_n(s)} \beta(s,j)\, \phi_{s,j} \Big).
\]
Consider paths {(1 + ε h(s,j)) β_n(s,j) : ε} for a bounded vector h, which yield a collection of scores
\[
S_h(Q_n) = \frac{d}{dQ_n} L(Q_n)\Big( \sum_{s, j \in \mathcal{J}_n(s)} h(s,j)\, \beta_n(s,j)\, \phi_{s,j} \Big).
\]
Define r(h, Q_n) ≡ Σ_{s, j ∈ J_n(s)} h(s,j) |β_n(s,j)|. If r(h, Q_n) = 0, then for ε small enough,
\[
\sum_{s,j \in \mathcal{J}_n(s)} | (1+\epsilon h(s,j))\, \beta_n(s,j)| = \sum_{s,j \in \mathcal{J}_n(s)} (1+\epsilon h(s,j))\, | \beta_n(s,j)| = \sum_{s,j \in \mathcal{J}_n(s)} | \beta_n(s,j)| + \epsilon\, r(h,Q_n) = \sum_{s,j \in \mathcal{J}_n(s)} | \beta_n(s,j)|,
\]
so the path respects the L1-constraint. Thus, by β_n being an MLE, P_n S_h(Q_n) = 0 for any h satisfying r(h, Q_n) = 0.

Let h* = h*_n be chosen so that P_n S_{h*_n}(Q_n) = P_n D*_n(Q_n, G), for the approximation D*_n(Q_n, G) of D*(Q_n, G) specified in the theorem. We want to show that P_n D*_n(Q_n, G) = o_P(n^{-1/2}), i.e., P_n S_{h*_n}(Q_n) = o_P(n^{-1/2}). Let (s*, j*) be a particular index in our finite index set J_n satisfying β_n(s*, j*) ≠ 0, which we can specify later so as to minimize the bound. Let ˜h be defined by ˜h(s,j) = h*(s,j) for (s,j) ≠ (s*,j*), while ˜h(s*,j*) is defined by r(˜h, Q_n) = Σ_{s, j ∈ J_n(s)} ˜h(s,j) |β_n(s,j)| = 0, so that we know P_n S_{˜h}(Q_n) = 0. Thus,
\[
\sum_{(s,j) \ne (s^*,j^*)} h^*(s,j)\, | \beta_n(s,j)| + \tilde h(s^*,j^*)\, | \beta_n(s^*,j^*)| = 0.
\]
This gives
\[
\tilde h(s^*,j^*) = - \frac{\sum_{(s,j) \ne (s^*,j^*)} h^*(s,j)\, | \beta_n(s,j)|}{| \beta_n(s^*,j^*)|}.
\]
So
\[
\sum_{s,j} (\tilde h - h^*)(s,j)\, \beta_n(s,j)\, \phi_{s,j} = (\tilde h - h^*)(s^*,j^*)\, \beta_n(s^*,j^*)\, \phi_{s^*,j^*} \equiv c_n(s^*,j^*)\, \phi_{s^*,j^*},
\]
where
\[
c_n(s^*,j^*) = - \frac{\sum_{(s,j) \ne (s^*,j^*)} h^*(s,j)\, | \beta_n(s,j)|}{| \beta_n(s^*,j^*)|}\, \beta_n(s^*,j^*) - h^*(s^*,j^*)\, \beta_n(s^*,j^*).
\]
We note that |c_n(s*,j*)| is bounded by 2 Σ_{s,j} |h*(s,j)||β_n(s,j)|, so that we can bound it by 2‖h*‖_∞ C_n. Thus, under the assumption that ‖h*_n‖_∞ = O_P(1), we have c_n(s*,j*) = O_P(1).

For this choice ˜h, let us compute P_n S_{˜h}(Q_n) − P_n S_{h*}(Q_n) (which equals −P_n S_{h*}(Q_n), since P_n S_{˜h}(Q_n) = 0):
\[
\begin{aligned}
P_n S_{\tilde h}(Q_n) - P_n S_{h^*}(Q_n) &= P_n \frac{d}{dQ_n} L(Q_n)\Big( \sum_{s,j} (\tilde h - h^*)(s,j)\, \beta_n(s,j)\, \phi_{s,j} \Big) \\
&= c_n(s^*,j^*)\, P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}) \\
&= O_P\Big( P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}) \Big).
\end{aligned}
\]
Since (s*,j*) with β_n(s*,j*) ≠ 0 was arbitrary, this motivates the condition
\[
\min_{s, j \in \mathcal{J}_n(s): \beta_n(s,j) \ne 0} \Big\| P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s,j}) \Big\| = o_P(n^{-1/2}). \tag{18}
\]
Under this condition we have P_n S_{˜h}(Q_n) − P_n D*_n(Q_n, G) = o_P(n^{-1/2}); but, since P_n S_{˜h}(Q_n) = 0, this implies the desired conclusion P_n D*_n(Q_n, G) = o_P(n^{-1/2}). This proves the first statement of Theorem 1.

Let now (s*,j*) = arg min_{s, j ∈ J_n(s): β_n(s,j) ≠ 0} P_0 φ_{s,j}. To understand P_n (d/dQ_n)L(Q_n)(φ_{s*,j*}), we proceed as follows:
\[
P_n \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}) = (P_n - P_0) \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}) + P_0 \frac{d}{dQ_n} L(Q_n)(\phi_{s^*,j^*}).
\]
Let S_{s,j}(Q_n) ≡ (d/dQ_n)L(Q_n)(φ_{s,j}). Suppose that P_0 S_{s*,j*}(Q_n)² →_p 0, which will generally hold whenever P_0 φ_{s*,j*} = o_P(1). We also have that {S_{s,j}(Q) : Q ∈ Q, (s,j)} is contained in the class of cadlag functions with uniformly bounded sectional variation norm, which is a Donsker class. Thereby, by asymptotic equicontinuity of the empirical process indexed by a Donsker class, we have (P_n − P_0) S_{s*,j*}(Q_n) = o_P(n^{-1/2}). Thus, it remains to show that P_0 S_{s*,j*}(Q_n) = o_P(n^{-1/2}). We now note that
\[
P_0 S_{s^*,j^*}(Q_n) = P_0\{ S_{s^*,j^*}(Q_n) - S_{s^*,j^*}(Q_0)\} + P_0 S_{s^*,j^*}(Q_0),
\]
and that P_0 S_{s,j}(Q_0) = 0 for all (s,j), since Q_0 = arg min_Q P_0 L(Q). Therefore, P_n (d/dQ_n)L(Q_n)(φ_{s*,j*}) = o_P(n^{-1/2}) if
\[
P_0 \{ S_{s^*,j^*}(Q_n) - S_{s^*,j^*}(Q_0)\} = o_P(n^{-1/2}). \tag{19}
\]
This proves the second statement of Theorem 1. The third statement is a trivial implication, which completes the proof of Theorem 1. □

Proof of Lemma 1: Consider the special case in which O = (Z, X), L(Q)(O) depends on Q only through Q(X), and (d/dQ)L(Q)(φ) = (d/dQ)L(Q) × φ; i.e., the directional derivative (d/dε)L(Q + εφ)|_{ε=0} of L(·) at Q in the direction φ is simply multiplication of a function (d/dQ)L(Q) of O with φ(X). In that case, (19) reduces to
\[
P_0 \Big\{ \frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0) \Big\}\, \phi_{s^*,j^*} = o_P(n^{-1/2}). \tag{20}
\]
We assume
\[
\Big\| \frac{d}{dQ_n} L(Q_n) - \frac{d}{dQ_0} L(Q_0) \Big\|_\infty = O\big( \| Q_n - Q_0 \|_\infty \big).
\]
Then (20) reduces to ‖Q_n − Q_0‖_∞ P_0 φ_{s*,j*} = o_P(n^{-1/2}). This teaches us that, when ‖Q_n − Q_0‖_∞ = o_P(1), the critical condition (18) holds if
\[
\min_{s, j \in \mathcal{J}_n(s): \beta_n(s,j) \ne 0} P_0 \phi_{s,j} = O_P(n^{-1/2}),
\]
and that for this choice (s*,j*) we also have P_0 S_{s*,j*}(Q_n)² →_p 0. The latter holds if min_{s,j} P_0 φ_{s,j} = o_P(1), since (d/dQ_n)L(Q_n) is uniformly bounded. Finally, since P_0 φ_{s,j} = O(P_0(X(s) ≥ x_{s,j})) and sup_{s,j} |(P_n − P_0) φ_{s,j}| = O_P(n^{-1/2}), we can replace P_0 φ_{s,j} by P_n φ_{s,j} in the condition. This proves Lemma 1. □

D Proof of Theorem 2

Let G_n be an approximation of G, and let D*(Q_n, G_n) be the approximation of D*(Q_n, G) in the space of scores S(Q_n). We have the following general theorem, which proves the first part of Theorem 2.

Theorem 6 Consider the HAL-MLE Q_n with C = C^u or C = C_n. Assume M₁, M₂ < ∞, so that d(Q_n, Q_0) = O_P(n^{-1/2-α(k)}). Assume also that, for a given approximation G_n ∈ G of G, we have
\[
P_n D^*(Q_n, G_n) = o_P(n^{-1/2}). \tag{21}
\]
Assume in addition:
• R((Q_n, G_n), (Q_0, G_0)) = o_P(n^{-1/2}) and P_0{D*(Q_n, G_n) − D*(Q_0, G_0)}² →_p 0.
• {D*(Q, G) : Q ∈ Q, G ∈ G} is contained in the class of k-variate cadlag functions on a cube [0, τ_o] ⊂ ℝ^k in a Euclidean space, and sup_{Q∈Q, G∈G} ‖D*(Q, G)‖*_v < ∞.
Then Ψ(Q_n) is asymptotically efficient at P_0.

Proof: The exact second order expansion at G_n of the target parameter Ψ yields
\[
\Psi(Q_n) - \Psi(Q_0) = (P_n - P_0) D^*(Q_n, G_n) - P_n D^*(Q_n, G_n) + R((Q_n, G_n), (Q_0, G_0)).
\]
Given that d(Q_n, Q_0) = O_P(n^{-1/2-α(k)}), and that G_n is presumably at least as good an approximation of G, it is reasonable to assume R((Q_n, G_n), (Q_0, G_0)) = o_P(n^{-1/2}) and P_0{D*(Q_n, G_n) − D*(Q_0, G_0)}² →_p 0. We also assumed that {D*(Q, G) : Q ∈ Q, G ∈ G} is contained in the class of k-variate cadlag functions on a cube [0, τ_o] ⊂ ℝ^k with sup_{Q,G} ‖D*(Q, G)‖*_v < ∞; this essentially states that the sectional variation norm of D*(Q, G) can be bounded in terms of the sectional variation norms of Q and G, which will naturally hold under a strong positivity assumption bounding denominators away from zero. Since the class of cadlag functions on [0, τ_o] with sectional variation norm bounded by a universal constant is a Donsker class, empirical process theory yields
\[
\Psi(Q_n) - \Psi(Q_0) = (P_n - P_0) D^*(Q_0, G_0) - P_n D^*(Q_n, G_n) + o_P(n^{-1/2}),
\]
and by (21) the middle term is o_P(n^{-1/2}). □

This theorem is easily generalized to a more general approximation D*_n(Q_n, G) ∈ S(Q_n) of D*(Q_n, G) (not necessarily of the form D*_n(Q_n, G) = D*(Q_n, G_n) for some G_n).

Theorem 7 Consider the HAL-MLE Q_n with C = C^u or C = C_n. Assume M₁, M₂ < ∞, so that d(Q_n, Q_0) = O_P(n^{-1/2-α(k)}). Assume also that, for a given approximation D*_n(Q_n, G), we have
\[
P_n D^*_n(Q_n, G) = o_P(n^{-1/2}).
\]
In addition, assume:
• R((Q_n, G), (Q_0, G)) = o_P(n^{-1/2}), P_0{D*_n(Q_n, G) − D*(Q_n, G)} = o_P(n^{-1/2}), and P_0{D*_n(Q_n, G) − D*(Q_0, G)}² →_p 0.
• {D*_n(Q, G), D*(Q, G) : Q ∈ Q} is contained in the class of k-variate cadlag functions on a cube [0, τ_o] ⊂ ℝ^k in a Euclidean space, and sup_{Q∈Q} max(‖D*(Q, G)‖*_v, ‖D*_n(Q, G)‖*_v) < ∞.
Then Ψ(Q_n) is asymptotically efficient at P_0.
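The score-equation identity that drives Appendix C is just the lasso KKT condition, and it can be checked numerically. The sketch below is our own toy setup with squared-error loss and generic binary covariates, not a HAL fit: for sklearn's lasso objective (1/(2n))‖y − Xβ‖² + α‖β‖₁, every active coordinate satisfies (1/n) X_jᵀ(y − Xβ) = α sign(β_j), so the empirical score along any direction h with Σ_j h(j)|β_n(j)| = 0 vanishes.

```python
# Numerical check that the lasso KKT conditions imply P_n S_h(Q_n) = 0 whenever
# r(h, Q_n) = sum_j h(j) |beta_n(j)| = 0; toy data, squared-error loss.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 2000, 20
X = (rng.uniform(size=(n, p)) > 0.5).astype(float)   # generic binary "basis functions"
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

alpha = 0.02
beta = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
active = np.flatnonzero(beta)
resid = y - X @ beta

# KKT on the active set: (1/n) X_j' resid = alpha * sign(beta_j).
kkt = X[:, active].T @ resid / n
print(np.max(np.abs(kkt - alpha * np.sign(beta[active]))))   # ~ solver tolerance

# A direction h on the active set with sum_j h_j |beta_j| = 0:
h = rng.normal(size=active.size)
v = np.abs(beta[active])
h -= (h @ v) / (v @ v) * v                                   # project out the L1 direction
score = (X[:, active] @ (h * beta[active])) @ resid / n
print(score)  # equals alpha * sum_j h_j |beta_j| = 0, up to solver tolerance
```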
Returning to Theorem 2: in order to prove it, it remains to establish the condition under which (21) holds, which was proven in the previous appendix.

D.1 General proof of the efficient score equation condition at G

This subsection can be skipped for the purpose of proving Theorem 2, but the following result fits here.

Lemma 3 Under the conditions of Theorem 6, if P_n D*(Q_n, G_n) = o_P(n^{-1/2}), then also P_n D*(Q_n, G) = o_P(n^{-1/2}). Under the conditions of Theorem 7, if P_n D*_n(Q_n, G) = o_P(n^{-1/2}), then also P_n D*(Q_n, G) = o_P(n^{-1/2}).

Proof: Firstly, we have
\[
P_n D^*(Q_n, G) = P_n D^*_n(Q_n, G) + P_n\{ D^*(Q_n, G) - D^*_n(Q_n, G)\} = P_n\{ D^*(Q_n, G) - D^*_n(Q_n, G)\} + o_P(n^{-1/2}).
\]
In addition, we have
\[
P_n\{ D^*(Q_n, G) - D^*_n(Q_n, G)\} = (P_n - P_0)\{ D^*(Q_n, G) - D^*_n(Q_n, G)\} + P_0\{ D^*(Q_n, G) - D^*_n(Q_n, G)\} = o_P(n^{-1/2}) + P_0\{ D^*(Q_n, G) - D^*_n(Q_n, G)\},
\]
since sup_{Q∈Q(M)} max(‖D*(Q, G)‖*_v, ‖D*_n(Q, G)‖*_v) < ∞ and P_0{D*_n(Q_n, G) − D*(Q_n, G)}² →_p 0. If D*_n(Q_n, G) = D*(Q_n, G_n), then the first condition holds if sup_{P∈M} ‖D*(P)‖*_v < ∞.

To understand the last term, consider the case D*_n(Q_n, G) = D*(Q_n, G_n). By the exact second order expansion Ψ(Q_n) − Ψ(Q_0) = −P_0 D*(Q_n, G₁) + R((Q_n, G₁), (Q_0, G_0)), valid for all G₁, we have
\[
P_0\{ D^*(Q_n, G) - D^*(Q_n, G_n)\} = R((Q_n, G), (Q_0, G_0)) - R((Q_n, G_n), (Q_0, G_0)).
\]
In our general Theorem 6 we assumed R((Q_n, G_n), (Q_0, G_0)) = o_P(n^{-1/2}); the analogous term R((Q_n, G), (Q_0, G_0)) at the true G is then equally reasonable to assume o_P(n^{-1/2}) (and it actually equals zero in double robust problems). This then establishes P_n D*(Q_n, G) = o_P(n^{-1/2}). For a general D*_n(Q_n, G), Theorem 7 simply assumed P_0{D*(Q_n, G) − D*_n(Q_n, G)} = o_P(n^{-1/2}). □

E Efficiency of the HAL-MLE for general non-linear risk functions

Formulation of the statistical estimation problem: Consider a smooth parameter Ψ : M → ℝ^d, where Ψ(P) = Ψ(Q(P)) for a Q(P) that is identified as the minimizer of a risk function R(Q, P) over all Q ∈ Q(M): Q_0 = arg min_{Q ∈ Q(M)} R(Q, P_0). As above, we assume that Ψ is pathwise differentiable at P with canonical gradient D*(Q(P), G(P)) for a nuisance parameter G(P), and that, for each Q ∈ Q(M), P → R_Q(P) ≡ R(Q, P) is pathwise differentiable at P with canonical gradient D*_Q(P). Let R(P, P_0) = Ψ(P) − Ψ(P_0) + P_0 D*(P) and R_Q(P, P_0) = R_Q(P) − R_Q(P_0) + P_0 D*_Q(P) be the exact second order remainders for these two pathwise differentiable target parameters.

Defining the general HAL-MLE for non-linear risk functions: Let
\[
Q_n = \arg\min_{Q \in Q(\mathcal{M}),\, \| Q \|^*_v \le C_n} R(Q, P^*_{n,Q}).
\]
We will write R(Q_n, P*_n) for the plug-in estimator of R(Q_n, P_0), and R(Q_0, P*_{0,n}) for the plug-in estimator of R(Q_0, P_0) (treating Q_0 as given), satisfying
\[
P_n D^*_{Q_n}(P^*_n) = o_P(n^{-1/2}) \quad \text{and} \quad P_n D^*_{Q_0}(P^*_{0,n}) = o_P(n^{-1/2}),
\]
respectively. (We could have that P*_n and P*_{0,n} are TMLEs targeting R(Q_n, P_0) and R(Q_0, P_0), respectively, but we could also have that P*_n = P*_{0,n} is an undersmoothed HAL-MLE that solves both efficient influence curve equations.)

E.1 Rate of convergence of the general HAL-MLE

We make the following assumptions:
• R(Q_0, P*_n) − R(Q_0, P*_{0,n}) = o_P(n^{-1/2}).
• {D*_Q(P) : Q ∈ Q(M), P ∈ M} is a P_0-Donsker class.
• P_0{D*_{Q_n}(P*_n) − D*_{Q_0}(P*_{0,n})}² →_p 0, where we may use that d(Q_n, Q_0) →_p 0.
• R_{Q_n}(P*_n, P_0) − R_{Q_0}(P*_{0,n}, P_0) = o_P(n^{-1/2}).

Lemma 4 Under the above assumptions, d(Q_n, Q_0) = o_P(n^{-1/2}).

Proof: Note that d(Q_n, Q_0) = R(Q_n, P_0) − R(Q_0, P_0). We have
\[
\begin{aligned}
0 \le d(Q_n, Q_0) &= R(Q_n, P_0) - R(Q_0, P_0) \\
&= R(Q_n, P_0) - R(Q_n, P^*_n) - \{ R(Q_0, P_0) - R(Q_0, P^*_{0,n})\} + R(Q_n, P^*_n) - R(Q_0, P^*_n) + R(Q_0, P^*_n) - R(Q_0, P^*_{0,n}) \\
&\le R(Q_n, P_0) - R(Q_n, P^*_n) - R(Q_0, P_0) + R(Q_0, P^*_{0,n}) + R(Q_0, P^*_n) - R(Q_0, P^*_{0,n}) \\
&= R(Q_n, P_0) - R(Q_n, P^*_n) - R(Q_0, P_0) + R(Q_0, P^*_{0,n}) + o_P(n^{-1/2}) \\
&= P_0 D^*_{Q_n}(P^*_n) - R_{Q_n}(P^*_n, P_0) - P_0 D^*_{Q_0}(P^*_{0,n}) + R_{Q_0}(P^*_{0,n}, P_0) + o_P(n^{-1/2}) \\
&= -(P_n - P_0)\{ D^*_{Q_n}(P^*_n) - D^*_{Q_0}(P^*_{0,n})\} - \{ R_{Q_n}(P^*_n, P_0) - R_{Q_0}(P^*_{0,n}, P_0)\} + o_P(n^{-1/2}),
\end{aligned}
\]
where the inequality used R(Q_n, P*_n) − R(Q_0, P*_n) ≤ 0, by Q_n minimizing Q → R(Q, P*_n). At the last equality we used that P_n D*_{Q_n}(P*_n) = P_n D*_{Q_0}(P*_{0,n}) = o_P(n^{-1/2}). By the Donsker class condition, the leading empirical process term is O_P(n^{-1/2}), and, by the last assumption, this shows that d(Q_n, Q_0) = O_P(n^{-1/2}). By the assumption P_0{D*_{Q_n}(P*_n) − D*_{Q_0}(P*_{0,n})}² →_p 0 (where we may now use that d(Q_n, Q_0) →_p 0) and the asymptotic equicontinuity of the empirical process indexed by a Donsker class, the leading empirical process term is in fact o_P(n^{-1/2}). This completes the proof. □

Regarding the first assumption we make the following remarks. Firstly, if P → R(Q, P) were a compactly differentiable functional, so that R(Q, P_n) is an asymptotically linear estimator of R(Q, P_0) for all Q, then the above lemma can be generalized to allow setting P*_n and P*_{0,n} equal to the empirical probability measure P_n; in that case R(Q_0, P*_n) − R(Q_0, P*_{0,n}) = 0, obviously. In the case that P → R(Q, P) is non-smooth, one could set P*_n and P*_{0,n} equal to an undersmoothed HAL-MLE, so that again P*_n = P*_{0,n} and thereby the first assumption holds. Finally, in the case that P*_n and P*_{0,n} are TMLEs targeting R(Q_n, P_0) and R(Q_0, P_0), respectively, one can still show that R(Q_0, P*_n) − R(Q_0, P*_{0,n}) is a second order term, so that assuming it is o_P(n^{-1/2}) is a reasonable assumption.

E.2 Efficiency theorem for the general HAL-MLE that allows for non-linear risk functions

Consider again the general HAL-MLE w.r.t. a potentially non-linear risk function P → R(Q, P). We want to investigate conditions under which Ψ(Q_n) is also asymptotically efficient for Ψ(Q_0), thereby generalizing our Theorem 2 for the loss-based HAL-MLE. Let L₁(Q)(O) = R(Q, P_0) + D*_Q(P_0)(O), and note that L₁(Q) is a valid loss function (when treating its dependence on P_0 as known) in the sense that Q_0 = arg min_{Q ∈ Q(M)} P_0 L₁(Q).

The following lemma provides sufficient conditions under which R(Q, P*_n) = P_n L₁(Q) + r_n(Q), with a second order remainder r_n(Q) that we can control uniformly in Q ∈ Q(M).

Lemma 5 Recall that P*_n = P*_{n,Q} is such that P_n D*_Q(P*_{n,Q}) = o_P(n^{-1/2}) for all Q. Assume
\[
\begin{aligned}
\sup_{Q \in Q(\mathcal{M})} P_n D^*_Q(P^*_{n,Q}) &= o_P(n^{-1/2}), \\
\sup_{Q \in Q(\mathcal{M})} | R_Q(P^*_{n,Q}, P_0)| &= o_P(n^{-1/2}), \\
\sup_{Q \in Q(\mathcal{M})} | (P_n - P_0)\{ D^*_Q(P^*_{n,Q}) - D^*_Q(P_0)\}| &= o_P(n^{-1/2}).
\end{aligned}
\]
Then R(Q, P*_n) = R(Q, P_0) + P_n D*_Q(P_0) + r_n(Q), where sup_{Q ∈ Q(M)} |r_n(Q)| = o_P(n^{-1/2}). Here
\[
r_n(Q) = (P_n - P_0)\{ D^*_Q(P^*_n) - D^*_Q(P_0)\} + R_Q(P^*_n, P_0).
\]
Proof: Since P*_n = P*_{n,Q} is a plug-in estimator satisfying P_n D*_Q(P*_n) = o_P(n^{-1/2}), we have
\[
R(Q, P^*_n) - R(Q, P_0) = (P_n - P_0) D^*_Q(P^*_n) + R_Q(P^*_n, P_0) + o_P(n^{-1/2}),
\]
uniformly in Q. By assumption, R_Q(P*_n, P_0) is o_P(n^{-1/2}) uniformly in all Q. Also by assumption, the empirical process term equals P_n D*_Q(P_0) plus a remainder that is o_P(n^{-1/2}) uniformly in Q (recall P_0 D*_Q(P_0) = 0). Thus, we have
\[
R(Q, P^*_n) - R(Q, P_0) = P_n D^*_Q(P_0) + r_n(Q),
\]
where sup_{Q ∈ Q(M)} |r_n(Q)| = o_P(n^{-1/2}). □

This result suggests that minimizing Q → R(Q, P*_n) is approximately the same as minimizing Q → P_n L₁(Q), where L₁(Q) = R(Q, P_0) + D*_Q(P_0) is an unknown loss function (i.e., a loss indexed by a nuisance parameter). To further formalize this, we want to show that the score equations (d/dε) R(Q^h_{n,ε}, P*_{n,ε})|_{ε=0} = 0 for Q_n equal the score equations S_{1,h}(Q_n) ≡ (d/dε) P_n L₁(Q^h_{n,ε})|_{ε=0}, up to an o_P(n^{-1/2}) approximation. With this result in hand, we can then simply apply Theorem 1 and Theorem 2 with the loss function L(Q) replaced by L₁(Q), treating this loss function as given, and with the exact score equations P_n S_h(Q_n) = 0 replaced by P_n S_{1,h}(Q_n) = o_P(n^{-1/2}), uniformly in h with r(h, Q_n) = 0. This then provides us with the conditions under which Ψ(Q_n) is asymptotically efficient.

Application of Lemma 5 to Q_{n,ε} along a path {Q_{n,ε} : ε} yields
\[
R(Q_{n,\epsilon}, P^*_{n,\epsilon}) = R(Q_{n,\epsilon}, P_0) + P_n D^*_{Q_{n,\epsilon}}(P_0) + r_n(Q_{n,\epsilon}).
\]
It is reasonable to assume that (d/dε) r_n(Q_{n,ε})|_{ε=0} = o_P(n^{-1/2}), for two reasons. Firstly, r_n(Q_{n,ε}) represents a second order remainder in the difference P*_{n,ε} − P_0, where P*_{n,ε} is just a TMLE of P_0 targeting a particular target parameter and will thus converge as fast as the initial estimator used in the TMLE. Secondly, r_n(Q_{n,ε}) is a second order remainder indexed by Q_{n,ε}, so that taking a derivative w.r.t. ε does not make this remainder worse. This yields the following lemma.

Lemma 6 Recall the definition of r_n(Q) in Lemma 5, and define
\[
\tilde r_n(Q, h) \equiv \frac{d}{d\epsilon} r_n(Q^h_\epsilon) \Big|_{\epsilon=0}.
\]
Assume that, for some upper bound C < ∞,
\[
\sup_{Q \in Q(\mathcal{M}),\, \| h \|_\infty < C} | \tilde r_n(Q, h)| = o_P(n^{-1/2}).
\]
Then, uniformly in h with ‖h‖_∞ < C,
\[
\frac{d}{d\epsilon} R(Q^h_{n,\epsilon}, P^*_{n,\epsilon}) \Big|_{\epsilon=0} = P_n \frac{d}{d\epsilon} L_1(Q^h_{n,\epsilon}) \Big|_{\epsilon=0} + o_P(n^{-1/2}).
\]

Corollary 1 Assume the conditions of the previous lemma. Suppose that (d/dε) R(Q^h_{n,ε}, P*_{n,ε})|_{ε=0} = 0 for a set of paths indexed by h ∈ H. Then
\[
P_n \frac{d}{d\epsilon} L_1(Q^h_{n,\epsilon}) \Big|_{\epsilon=0} = o_P(n^{-1/2}),
\]
uniformly in all h ∈ H.

So we have now established conditions under which the general HAL-MLE is equivalent with a loss-based HAL-MLE using the loss function L₁(Q), in the sense that it solves the same score equations as an HAL-MLE defined by minimizing P_n L₁(Q) over Q. This allows us to apply Theorem 1 and Theorem 2 with this choice of loss function, resulting in our next Theorem 8.

Theorem 8 relies on the following definitions, analogous to those of Theorem 2.

Definitions:
• Recall that we can represent Q_n = arg min_{Q ∈ Q(C_n)} R(Q, P*_{n,Q}) as follows:
\[
Q_n(x) = I(Q_n)(x) + \sum_{\bar s(m)} \int_{(0_{s_m}, x_{s_m}]} \phi_{\bar s(m),x_s}(u_{s_m})\, dQ^m_{n,\bar s(m)}(u_{s_m}),
\]
where I(Q_n)(x) = Q_n(0) + Σ_{j=0}^{m−1} Σ_{\bar s(j)} Q^{j+1}_{n,\bar s(j)}(0_{s_j}) φ_{\bar s(j),∅,x_s}(0_s).
• Consider the family of paths {Q^h_{n,ε} : ε ∈ (−δ, δ)} through Q_n at ε = 0, for arbitrarily small δ > 0, indexed by any uniformly bounded h, defined by
\[
Q^h_{n,\epsilon}(x) = I(Q^h_{n,\epsilon})(x) + \sum_{\bar s(m)} \int_{(0_{s_m}, x_{s_m}]} \phi_{\bar s(m),x_s}(u_{s_m})\, (1 + \epsilon h(\bar s(m), u_{s_m}))\, dQ^m_{n,\bar s(m)}(u_{s_m}), \tag{22}
\]
where
\[
I(Q^h_{n,\epsilon})(x) = (1 + \epsilon h(0))\, Q_n(0) + \sum_{j=0}^{m-1} \sum_{\bar s(j)} \phi_{\bar s(j),\emptyset,x_s}(0_s)\, (1 + \epsilon h(\bar s(j), 0_{s_j}))\, Q^{j+1}_{n,\bar s(j)}(0_{s_j}).
\]
• Let
\[
r(h, Q_n) \equiv I(h, Q_n) + \sum_{\bar s(m)} \int_{(0_{s_m}, \tau_{s_m}]} h(\bar s(m), u_{s_m})\, | dQ^m_{n,\bar s(m)}(u_{s_m})|,
\]
where I(h, Q_n) = h(0)|Q_n(0)| + Σ_{j=0}^{m−1} Σ_{\bar s(j)} h(\bar s(j), 0_{s_j}) |Q^{j+1}_{n,\bar s(j)}(0_{s_j})|.
• For any uniformly bounded h with r(h, Q_n) = 0, we have that, for a small enough δ > 0, {Q^h_{n,ε} : ε ∈ (−δ, δ)} ⊂ Q^m(C_n).
• Let S_{1,h}(Q_n) = (d/dε) L₁(Q^h_{n,ε})|_{ε=0} be the L₁(Q)-score of this h-specific submodel.
• Consider the resulting set of scores
\[
\mathcal{S}_1(Q_n) = \Big\{ S_{1,h}(Q_n) = \frac{d}{dQ_n} L_1(Q_n)(f(h, Q_n)) : \| h \|_\infty < \infty \Big\}, \tag{23}
\]
where
\[
f(h, Q_n) \equiv h(0)\, Q_n(0) + \sum_{s \subset \{1,...,k\}} \int_{(0_s, x_s]} I((s, u_s) \in \mathcal{A})\, h(s, u_s)\, dQ_{n,s}(u_s).
\]
This is the set of scores generated by the above class of paths if we do not enforce the constraint r(h, Q_n) = 0.
• Define
\[
r_n(Q) \equiv (P_n - P_0)\{ D^*_Q(P^*_n) - D^*_Q(P_0)\} + R_Q(P^*_n, P_0), \qquad \tilde r_n(Q, h) \equiv \frac{d}{d\epsilon} r_n(Q^h_\epsilon) \Big|_{\epsilon=0}.
\]
Under assumption (24) below, the general HAL-MLE Q_n = arg min_{Q ∈ Q(M), ‖Q‖*_v ≤ C_n} R(Q, P*_{n,Q}) satisfies, by Corollary 1, P_n S_{1,h}(Q_n) = o_P(n^{-1/2}), uniformly in bounded h with r(h, Q_n) = 0.

Theorem 8 Consider the above defined generalized HAL-MLE Q_n = arg min_{Q ∈ Q(C_n)} R(Q, P*_{n,Q}) for some C_n ≤ C^u < ∞ with probability tending to 1. Consider also the above presented pathwise differentiable target parameter Ψ : Q(M) → ℝ^d, with canonical gradient D*(Q(P), G(P)) at P ∈ M and exact second order remainder R((Q, G), (Q_0, G_0)) = Ψ(Q) − Ψ(Q_0) + P_0 D*(Q, G).

Assumptions:
• sup_{Q ∈ Q(M)} |R_Q(P*_{n,Q}, P_0)| = o_P(n^{-1/2}) and sup_{Q ∈ Q(M)} |(P_n − P_0){D*_Q(P*_{n,Q}) − D*_Q(P_0)}| = o_P(n^{-1/2}).
• For some upper bound C < ∞,
\[
\sup_{Q \in Q(\mathcal{M}),\, \| h \|_\infty < C} | \tilde r_n(Q, h)| = o_P(n^{-1/2}). \tag{24}
\]