Canonical Least Favorable Submodels: A New TMLE Procedure for Multidimensional Parameters
arXiv preprint [stat.ME]
Jonathan Levy

April 12, 2019

Abstract
This paper is a fundamental addition to the world of targeted maximum likelihood estimation (TMLE) (van der Laan and Rubin 2006), or likewise targeted minimum loss-based estimation, for simultaneous estimation of multidimensional parameters of interest. TMLE, as part of the targeted learning framework (van der Laan and Gruber 2016), offers a crucial step in constructing efficient plug-in estimators for nonparametric or semiparametric models. The so-called targeting step of targeted learning involves fluctuating the initial fit of the model in a way that maximally adjusts the plug-in estimate per change in the log-likelihood (van der Laan and Gruber 2016). Previously, for multidimensional parameters of interest, iterative TMLEs were constructed using locally least favorable submodels as defined in van der Laan and Gruber, 2016, which are indexed by a multidimensional fluctuation parameter. In this paper we define a canonical least favorable submodel in terms of a single-dimensional epsilon for a $d$-dimensional parameter of interest. One can view the clfm as the iterative analog to the one-step TMLE as constructed in van der Laan and Gruber, 2016. It is currently implemented in several software packages we list in the last section. Using a single epsilon for the targeting step in TMLE could be useful for high-dimensional parameters, where using a fluctuation parameter of the same dimension as the parameter of interest could suffer the consequences of the curse of dimensionality. The clfm also enables placing the so-called clever covariate denominator as an inverse weight in an offset intercept model. It has been shown that such weighting mitigates the effect of large inverse weights sometimes caused by near positivity violations (Robins et al. 2007).

Introduction

We offer a new way to construct a targeted maximum likelihood estimator for multidimensional parameters via defining the canonical least favorable submodel (clfm).
TMLE is a plug-in estimator, so we might prefer to use the same model estimate for all dimensions of a parameter of interest. The obvious example is a survival curve, where doing so ensures monotonicity of the estimates in time. The clfm leads naturally to the construction of the one-step TMLE (van der Laan and Gruber 2016). The resulting TMLE algorithm can be seen as an iterative version of the one-step TMLE in that both TMLEs use a single-dimensional submodel in their construction.

The TMLE defined herein can converge much faster than its one-step recursive counterpart when evaluating the efficient influence curve has a cost, because it requires relatively few logistic regression fits as compared to very small recursions. The procedure also enables placing the denominator of the clever covariate as an inverse weight in an offset intercept model, shown to stabilize large weights caused by near positivity violations. In addition, like the one-step TMLE, the TMLE based on a clfm involves the use of a one-dimensional submodel, which avoids high-dimensional regressions in the targeting step of the algorithm.

In this paper we will first review the TMLE basics and then construct the TMLE based on the clfm, giving an algorithm for its implementation, currently available in several R packages where simultaneous estimation is an option.
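The heart of the construction can be previewed in a few lines: the $d$ clever covariates are collapsed into a single scalar covariate per observation by taking an inner product with the normalized empirical mean of the efficient influence curve, so that one epsilon suffices. A minimal sketch, where the function name and array layout are hypothetical:

```python
import numpy as np

def scalar_clever_covariate(H, eic):
    """Collapse d clever covariates into one scalar covariate per observation.

    H   : (n, d) array; column j is the clever covariate for EIC component j
    eic : (n, d) array; column j holds the evaluations D*_j(P_n)(O_i)
    Returns the (n,) covariate <H_i, P_n D*(P_n) / ||P_n D*(P_n)||> that is
    paired with a single epsilon in the targeting regression.
    """
    PnD = eic.mean(axis=0)                  # empirical mean EIC, P_n D*(P_n)
    direction = PnD / np.linalg.norm(PnD)   # unit vector in R^d
    return H @ direction                    # scalar covariate per observation
```

Using this covariate in a one-parameter offset logistic regression solves the score equation along the mean-EIC direction, which is what lets a single epsilon serve all $d$ components at once.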
TMLE Review

We refer the reader to the Targeted Learning Appendix (van der Laan and Rose 2011) as well as (van der Laan 2016; van der Laan and Gruber 2016; van der Laan and Rubin 2006) for a more detailed look at the theory of TMLE. Here we review the basics for the convenience of the reader.

Consider observed data, $O \sim P_0 \in \mathcal{M}$, non-parametric, and a $d$-dimensional pathwise differentiable (van der Vaart 2000) parameter mapping, $\Psi : \mathcal{M} \longrightarrow \mathbb{R}^d$. Consider our sample to consist of iid copies drawn from $P_0$. The efficient influence curve or canonical gradient,
$$D^\star_\Psi(P)(O) = \left( D^\star_{\Psi_1}(P)(O), \ldots, D^\star_{\Psi_d}(P)(O) \right),$$
is a $d$-dimensional function of the observed data $O$, defined in terms of the distribution, $P$. Its variance gives the generalized Cramer-Rao lower bound for the variance of any regular asymptotically linear estimator of $\Psi(P)$ (van der Vaart 2000).

We will employ the notation $P_n f(O)$ for the empirical average of a function, $f(\cdot)$, and $P f(O)$ for $E_P f(O)$. Define a loss function, $L(P)(O)$, which is a function of the observed data, $O$, and indexed by the distribution on which it is defined, $P$, such that $E_{P_0} L(P)(O)$ is minimized at the true observed data distribution, $P = P_0$. The TMLE procedure maps an initial estimate, $P_n \in \mathcal{M}$, of the true data generating distribution to $P^\star_n \in \mathcal{M}$ such that $P_n L(P^\star_n) \leq P_n L(P_n)$ and such that $P_n D^\star(P^\star_n) = 0_{d \times 1}$. $P^\star_n$ is called the TMLE of the initial estimate $P_n$. We can then write a second order expansion, $\Psi(P^\star_n) - \Psi(P_0) = (P_n - P_0) D^\star(P^\star_n) + R_2(P^\star_n, P_0)$.

Define the norm $\|f\|_{L^2(P)} = \sqrt{P f^2}$. Assume the following TMLE conditions:

1. $D^\star_{\Psi_j}(P^\star_n)$ is in a $P_0$-Donsker class for all $j$. This condition can be dropped in the case of using CV-TMLE (Zheng and van der Laan 2010).
2. $R_{2,j}(P^\star_n, P_0)$ is $o_P(1/\sqrt{n})$ for all $j$.
3.
$D^\star_{\Psi_j}(P^\star_n) \xrightarrow{L^2(P_0)} D^\star_{\Psi_j}(P_0)$ for all $j$.

Then
$$\sqrt{n}\left( \Psi(P^\star_n) - \Psi(P_0) \right) \overset{D}{\Longrightarrow} N\left[ 0_{d \times 1}, \operatorname{cov}_{P_0}\!\left( D^\star_\Psi(P_0) \right) \right],$$
where $\operatorname{cov}_{P_0}(D^\star_\Psi(P_0)(O))$ is a $d \times d$ matrix with $(i, j)$ entry given by $E_{P_0} D^\star_{\Psi_i}(P_0)(O) D^\star_{\Psi_j}(P_0)(O)$. The $i$th diagonal entry of $\operatorname{cov}_{P_0}(D^\star_\Psi(P_0)(O))$ is the variance of $D^\star_{\Psi_i}(P_0)$ and the limiting variance of $\sqrt{n}(\Psi_i(P^\star_n) - \Psi_i(P_0))$ under the TMLE conditions. Thus, our plug-in TMLE estimates and confidence intervals, given by
$$\Psi_j(P^\star_n) \pm z_\alpha \frac{\hat\sigma_n(D^\star_j(P^\star_n))}{\sqrt{n}},$$
will be as small as possible for any regular asymptotically linear estimator at level $1 - \alpha$, where $Pr(|Z| \leq z_\alpha) = 1 - \alpha$ for $Z$ standard normal and $\hat\sigma_n(D^\star_j(P^\star_n))$ is the sample standard deviation of $\{ D^\star_j(P^\star_n)(O_i) \mid i \in 1, \ldots, n \}$ (van der Laan and Rubin 2006).

$P_n$ to $P^\star_n$: The Targeting Step

The preceding section sketched the framework by which TMLE provides asymptotically efficient estimators for nonparametric models. Here we will explain how TMLE maps an initial estimate $P_n$ to $P^\star_n$, otherwise known as the targeting step. $P_n$ is considered to be the initial estimate of the true distribution, $P_0$.

Definition 2.1.
We can define a canonical 1-dimensional locally least favorable submodel (clfm) of an estimate, $P_n$, of the true distribution as
$$\left\{ P_{n,\epsilon} \ \text{s.t.}\ \frac{d}{d\epsilon} P_n L(P_{n,\epsilon}) \Big|_{\epsilon = 0} = -\left\| P_n D^\star(P_n) \right\|,\ \epsilon \in [-\delta, \delta] \right\} \quad (1)$$
where $P_{n,0} = P_n$ and $\| \cdot \|$ is the Euclidean norm. We consider a $d$-dimensional parameter mapping $\Psi : \mathcal{M} \longrightarrow \mathbb{R}^d$.

This definition only slightly differs from the locally least favorable submodel (lfm) defined by Mark van der Laan (van der Laan and Gruber 2016) in that we can define a clfm with only a single epsilon and in so far as the lfm is defined so that the score with respect to the loss spans the efficient influence curve.

Definition 2.2.
A universal least favorable submodel (ulfm) of $P_n$ satisfies
$$\frac{d}{d\epsilon} P_n L(P^\epsilon_n) = -\left\| P_n D^\star(P^\epsilon_n) \right\| \quad \forall\, \epsilon \in (-\delta, \delta)$$
and, naturally, $P^{\epsilon = 0}_n = P_n$.

We can construct the ulfm in terms of the clfm if we use the difference equation $P_n(L(P_{n,dt}) - L(P_n)) \approx -\|P_n D^\star(P_n)\|\, dt$, where $P^{dt}_n = P_{n,dt}$ is an element of the clfm of $P_n$. More generally, we can map any partition $t = m \times dt$, for an arbitrarily small $dt$, to an equation $P_n(L(P^{t+dt}_n) - L(P^t_n)) \approx -\|P_n D^\star(P^t_n)\|\, dt$, where $P^{t+dt}_n$ is an element of the clfm of $P^t_n$. We can therefore recursively define the integral equation $P_n(L(P^\epsilon_n) - L(P_n)) = -\int_0^\epsilon \|P_n D^\star(P^t_n)\|\, dt$, and $P^\epsilon_n$ will thusly be an element of the ulfm of $P_n$. For log-likelihood loss, which is valid both for a continuous outcome scaled between 0 and 1 and for a binary outcome, an analytic formula for a ulfm of a distribution with density, $p$, is therefore defined by the density
$$p_\epsilon = p \times \exp\left( \int_0^\epsilon \left\langle \frac{P_n D^\star(P^t)}{\|P_n D^\star(P^t)\|},\ D^\star(P^t) \right\rangle dt \right)$$
(van der Laan and Gruber 2016), where $P^{t+dt}$ is an element of the clfm of $P^t$.

In applying the one-step TMLE, when the empirical loss is minimized at a given $\epsilon$, we will have solved $\|P_n D^\star(P^\epsilon_n)\| = 0$. Therefore, the loss is decreased and all influence curve equations are solved simultaneously with a single $\epsilon$ in one step. Specifically, $P_n D^\star_j(P^\star_n) = 0$ for all $j$. Thus $P^\star_n = P^\epsilon_n$ and we have defined the required TMLE mapping.

With an iterative approach, we first find $P_{n,\epsilon_1} = P^1_n$, an element of the clfm of $P_n$, such that
$$\frac{d}{d\epsilon} P_n L(P_{n,\epsilon}) \Big|_{\epsilon = \epsilon_1} = 0 \quad (2)$$
This initializes an iterative process whereby
$$\frac{d}{d\epsilon} P_n L(P^{j-1}_{n,\epsilon}) \Big|_{\epsilon = \epsilon_j} = 0, \quad (3)$$
where $P^{j-1}_{n,\epsilon}$ is an element of the clfm of $P^{j-1}_n$. When $\epsilon_j = 0$, we stop the process and our TMLE is $P^\star_n = P^{j-1}_n$.
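To make the iteration concrete, the following is a minimal, self-contained numerical sketch for data $O = (W, A, Y)$ and the two-dimensional parameter $(E\,\bar{Q}_0(1, W),\ E\,\bar{Q}_0(0, W))$, the treatment-specific means, for which only the outcome regression is fluctuated. The simulation, initial fits, and function names are all hypothetical, and the one-dimensional $\epsilon$ is fit by Newton's method in place of a call to a logistic regression routine:

```python
import numpy as np

rng = np.random.default_rng(0)
expit = lambda x: 1.0 / (1.0 + np.exp(-np.clip(x, -40, 40)))
logit = lambda p: np.log(p / (1.0 - p))

# --- hypothetical simulated data: confounder W, binary A and Y ---
n = 5000
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.4 * W))
Y = rng.binomial(1, expit(-0.5 + W + 0.8 * A))

# deliberately crude initial fits (stand-ins for real machine-learning fits)
Qn = {a: np.clip(expit(0.2 * W + 0.3 * a), 1e-6, 1 - 1e-6) for a in (0, 1)}
gn = np.clip(expit(0.5 * W), 0.025, 0.975)   # estimate of P(A = 1 | W)

def eic(Qn):
    """Efficient influence curve of (E Q(1,W), E Q(0,W)), one row per obs."""
    psi = np.array([Qn[1].mean(), Qn[0].mean()])
    QA = np.where(A == 1, Qn[1], Qn[0])
    H1 = np.column_stack([A / gn, (1 - A) / (1 - gn)])  # clever covariates
    return H1 * (Y - QA)[:, None] + np.column_stack([Qn[1], Qn[0]]) - psi

def fit_eps(offset, C, y, iters=25):
    """One-dimensional logistic MLE with offset, via Newton's method."""
    eps = 0.0
    for _ in range(iters):
        p = expit(offset + eps * C)
        eps += C @ (y - p) / (C @ (C * p * (1 - p)) + 1e-12)
    return eps

for _ in range(50):                          # iterative clfm targeting
    D = eic(Qn)
    PnD = D.mean(axis=0)
    if np.all(np.abs(PnD) < D.std(axis=0) / n):  # stop: bias second order
        break
    u = PnD / np.linalg.norm(PnD)            # direction P_n D* / ||P_n D*||
    C = {1: u[0] / gn, 0: u[1] / (1 - gn)}   # scalar covariate <H_1(a,W), u>
    eps = fit_eps(logit(np.where(A == 1, Qn[1], Qn[0])),
                  np.where(A == 1, C[1], C[0]), Y)
    Qn = {a: np.clip(expit(logit(Qn[a]) + eps * C[a]), 1e-6, 1 - 1e-6)
          for a in (0, 1)}

psi_star = np.array([Qn[1].mean(), Qn[0].mean()])  # TMLE of the two means
```

Each pass solves the score along the direction $u = P_n D^\star / \|P_n D^\star\|$; since $\langle P_n D^\star, u \rangle = \|P_n D^\star\|$, a fixed point with $\epsilon = 0$ forces every component of $P_n D^\star$ to zero simultaneously.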
A clfm for a Common Form of Influence Curve

Assume we have a parameter mapping as defined in the previous section, where the data is of the form $O = (W, A, Y) \sim P_0$, where $Y$ and $A$ are binary and $W$ is a vector of confounders. We consider the likelihood factored according to
$$p(w, a, y) = \bar{Q}(a, w)^y (1 - \bar{Q}(a, w))^{1 - y}\, g(a \mid w)\, q_W(w).$$
We also assume the efficient influence curve for the $j$th component of the parameter is of the form
$$D^\star_j(P)(O) = H_{1,j}(p)(A, W)\left( Y - \bar{Q}(A, W) \right) + H_{2,j}(p)(A, W)\left( A - g(1 \mid W) \right) + H_{3,j}(A, W)\left( f(P)_j(A, W) - \Psi_j(P) \right)$$
where $\Psi_j(P) = E[H_{3,j}(O_i) f(P)_j(O_i)]$ and $E[H_{3,j}(O_i)] = 1$ for a fixed function $H_{3,j}$. Also note the dependence of the functions $H_{1,j}(p)$ and $H_{2,j}(p)$ on the distribution.

Now assume we have an initial estimate, $P_n$, of $P_0$, via an estimate, $p_n$, of the density $p_0$. We define $p_n$ by estimates of the factors of the likelihood: $\bar{Q}_n \approx \bar{Q}_0$, $g_n \approx g_0$, and $Q_{W,n}$, which places a weight of $q_{W,n} = 1/n$ on every observation and is used to approximate the true distribution of $W$, $Q_{W,0}$. A clfm of $P_n$ is defined by leaving $Q_{W,n}$ fixed and defining
$$\bar{Q}_{n,\epsilon}(A, W) = \operatorname{expit}\left( \operatorname{logit}\left( \bar{Q}_n(A, W) \right) + \epsilon \left\langle H_1(P_n)(A, W),\ \frac{P_n D^\star(P_n)}{\|P_n D^\star(P_n)\|} \right\rangle \right)$$
and
$$g_{n,\epsilon}(A \mid W) = \operatorname{expit}\left( \operatorname{logit}\left( g_n(A \mid W) \right) + \epsilon \left\langle H_2(P_n)(A, W),\ \frac{P_n D^\star(P_n)}{\|P_n D^\star(P_n)\|} \right\rangle \right)$$
where $\| \cdot \|$ is the Euclidean norm induced by the dot product, $\langle \cdot, \cdot \rangle$. In the usual case we have $P_n H_{3,j}(f(P_n)_j - \Psi_j(P_n)) = 0$ and therefore $p_{n,\epsilon}$ defines an element, $P_{n,\epsilon}$, of a clfm of $P_n$.

Initialization
We start the iterative process with our initial estimate, $p_n$, as defined in the previous subsection:
$$P_n L(P_n) = -\frac{1}{n} \sum_{i=1}^n \left[ Y_i \log \bar{Q}_n(A_i, W_i) + (1 - Y_i) \log\left( 1 - \bar{Q}_n(A_i, W_i) \right) \right] - \frac{1}{n} \sum_{i=1}^n \left[ A_i \log g_n(1 \mid W_i) + (1 - A_i) \log\left( 1 - g_n(1 \mid W_i) \right) \right] = L_0,$$
our starting loss.

The Targeting Step
Starting with $m = 0$:

step 2: Compute $H_1(P^m_n)(A, W)$, $H_2(P^m_n)(A, W)$ and $H_3(A, W)$ over the data and then check the following: if $|P_n D^\star_j(P^m_n)| < \hat\sigma(D^\star_j(P^m_n))/n$ for all $j$, then $P^\star_n = P^m_n$; go to step 4. This ensures that we stop the process once the bias is second order. Note, $\hat\sigma(\cdot)$ refers to the sample standard deviation operator. Otherwise set $m = m + 1$ and go to step 3.

step 3: We perform a pooled logistic regression with, for $Y$ as the outcome, offset $\operatorname{logit}(\bar{Q}^{m-1}_n)(A, W)$ and so-called clever covariate
$$\left\langle H_1(P^{m-1}_n)(A, W),\ \frac{P_n D^\star(P^{m-1}_n)}{\|P_n D^\star(P^{m-1}_n)\|} \right\rangle,$$
and, for $A$ as the outcome, offset $\operatorname{logit}(g^{m-1}_n)(A \mid W)$ and so-called clever covariate
$$\left\langle H_2(P^{m-1}_n)(A, W),\ \frac{P_n D^\star(P^{m-1}_n)}{\|P_n D^\star(P^{m-1}_n)\|} \right\rangle.$$
Let $\epsilon_j$ be the coefficient computed from the above pooled regression. We then update the models as per below, using Euclidean inner product notation, $\langle \cdot, \cdot \rangle$:
$$\bar{Q}^m_n = \operatorname{expit}\left( \operatorname{logit}\left( \bar{Q}^{m-1}_n \right) + \epsilon_j \left\langle H_1(P^{m-1}_n)(A, W),\ \frac{P_n D^\star(P^{m-1}_n)}{\|P_n D^\star(P^{m-1}_n)\|} \right\rangle \right) \quad (4)$$
and
$$g^m_n(A \mid W) = \operatorname{expit}\left( \operatorname{logit}\left( g^{m-1}_n(A \mid W) \right) + \epsilon_j \left\langle H_2(P^{m-1}_n)(A, W),\ \frac{P_n D^\star(P^{m-1}_n)}{\|P_n D^\star(P^{m-1}_n)\|} \right\rangle \right) \quad (5)$$

Possible alternative targeting step to ameliorate near positivity violations
We can alternatively perform a pooled logistic regression as follows. For all observations we use $Y$ as the outcome with offset $\operatorname{logit}(\bar{Q}^{m-1}_n)(A, W)$. We denote the denominator of $H_{1,j}(P^{m-1}_n)$ as $g_j(P^{m-1}_n)$, which in some cases is a fixed propensity score, $g(P^{m-1}_n)$. We can use its inverse as a weight in a logistic regression model with covariate
$$g(P^{m-1}_n)(A \mid W) \left\langle H_1(P^{m-1}_n)(A, W),\ \frac{P_n D^\star(P^{m-1}_n)}{\|P_n D^\star(P^{m-1}_n)\|} \right\rangle.$$
We then stack all observations using $A$ as the outcome, offset $\operatorname{logit}(g^{m-1}_n)(A \mid W)$ and so-called clever covariate
$$\left\langle H_2(P^{m-1}_n)(A, W),\ \frac{P_n D^\star(P^{m-1}_n)}{\|P_n D^\star(P^{m-1}_n)\|} \right\rangle.$$
We use a weight of 1 when $A$ is the outcome because $H_2(P^{m-1}_n)(A, W)$ generally does not have large values. We then update the models similarly as before upon solving for the coefficient $\epsilon_j$. With either regression scheme we solve the same score equation, so either is appropriate for the targeting step.

Once we are done with the targeting step, we define the distribution, $P^m_n$, via the factors of the density for the outcome model and propensity score, i.e., $\bar{Q}^m_n(A, W)$ and $g^m_n(A \mid W)$, while placing a weight of $1/n$ on each observation as an estimate of the true distribution of $W$. Return to step 2.

step 4: Our estimate is $\hat\Psi(P_n) = \Psi(P^\star_n)$, which depends only on $\bar{Q}^\star_n$ and the empirical distribution.

Software

Currently there are three packages which employ the iterative TMLE as presented in this paper for parameters with influence curves of the form given above. Note to the reader: we have yet to implement the weighted intercept targeting scheme discussed in step 3 of the algorithm in section 4.

• tmle3, https://github.com/tlverse/tmle3 (Coyle, Malenica, et al. 2018). There are various parameters for which one can perform a TMLE estimator, including a variable importance measure for continuous variables (Chambaz et al.
2012), the treatment effect among the treated, the causal risk difference, the treatment-specific mean and more.

• gentmle2, https://github.com/jeremyrcoyle/gentmle2 (Coyle and Levy 2018). The reader may note this clfm is what is employed in this R package when specifying the approach as "line". An lfm with epsilon of the same dimension as the parameter is employed with the "full" option. Other than the causal risk difference and treatment-specific mean, there is also the variance of the treatment effect (catesurvival) as well as the mean under a stochastic intervention (Díaz Muñoz and van der Laan 2012).

• cateSurvival, https://github.com/jlstiles/cateSurvival (Levy 2018). This package implements a TMLE estimator for
$$\Psi_{k,t}(P) = \int k\!\left( \frac{x - t}{h} \right) E_P I(B(W) > x)\, dx,$$
which is a kernel-smoothed version of the non-pathwise differentiable parameter, $E_P I(B(W) > t)$, where $B(W)$ is the treatment effect function or TE function, defined by $E_P[Y \mid A = 1, W] - E_P[Y \mid A = 0, W]$. The non-pathwise differentiable parameter gives the probability a subject selected at random will have a treatment effect beyond the level $t$. It can be thought of as a "survival" function of the treatment effect because it is monotonically decreasing. It is, more familiarly, 1 − CDF of the random variable that gives the treatment effect for a subject drawn at random. The user can select the kernel according to its support and its order.

References

Chambaz, Antoine, Pierre Neuvial, and Mark J. van der Laan (2012). "Estimation of a non-parametric variable importance measure of a continuous exposure". In:
Electronic Journal of Statistics 6, pp. 1059–1099.

Coyle, Jeremy and Jonathan Levy (2018). gentmle2. URL: https://github.com/jeremyrcoyle/gentmle2.

Coyle, Jeremy, Ivana Malenica, et al. (2018). tmle3. URL: https://github.com/tlverse/tmle3.

Díaz Muñoz, Iván and Mark van der Laan (2012). "Population Intervention Causal Effects Based on Stochastic Interventions". In: Biometrics
Journal of the American Statistical Association
U.C. Berkeley Division of Biostatistics Working Paper Series. URL: http://biostats.bepress.com/ucbbiostat/paper343.

van der Laan, Mark and Susan Gruber (2016). "One-Step Targeted Minimum Loss-based Estimation Based on Universal Least Favorable One-Dimensional Submodels". In: The International Journal of Biostatistics
van der Laan, Mark and Sherri Rose (2011). Targeted Learning. New York: Springer.

van der Laan, Mark and Daniel Rubin (2006). "Targeted Maximum Likelihood Learning". In:
U.C. Berkeley Division of Biostatistics Working Paper Series. URL: http://biostats.bepress.com/ucbbiostat/paper213.

Levy, Jonathan (2018). cateSurvival. URL: https://github.com/jlstiles/cateSurvival.

Robins, James et al. (2007). "Comment: Performance of Double-Robust Estimators When "Inverse Probability" Weights Are Highly Variable". In: Statistical Science
van der Vaart, Aad and Jon A. Wellner (1996). Weak Convergence and Empirical Processes. New York: Springer-Verlag.

van der Vaart, Aad (2000).
Asymptotic Statistics, Chapter 25. Cambridge, UK: Cambridge University Press.

Zheng, Wenjing and Mark van der Laan (2010). "Asymptotic Theory for Cross-validated Targeted Maximum Likelihood Estimation". In: U.C. Berkeley Division of Biostatistics Working Paper Series.