A Note on Debiased/Double Machine Learning Logistic Partially Linear Model
Molei Liu
Department of Biostatistics, Harvard Chan School of Public Health

Abstract
It is of particular interest in many application fields to draw doubly robust inference for a logistic partially linear model, with the predictor specified as the combination of a targeted low dimensional linear parametric function and a nuisance nonparametric function. Recently, Tan (2019) proposed a simple and flexible doubly robust estimator for this purpose. That work introduced two nuisance models, namely the nonparametric component in the logistic model and the conditional mean of the exposure covariate given the other covariates and a fixed response, and specified both as fixed dimensional parametric models. This framework could potentially be extended to the machine learning or high dimensional nuisance modelling exploited in recent work, e.g. Chernozhukov et al. (2018a,b), Smucler et al. (2019) and Tan (2020). Motivated by this, we derive the debiased/double machine learning logistic partially linear model in this note. For the construction of the nuisance models, we separately consider the use of high dimensional sparse parametric models and of general machine learning methods. By deriving certain moment equations to calibrate the first order bias of the nuisance models, we preserve a model double robustness property under high dimensional ultra-sparse nuisance models. We also discuss and compare the underlying assumptions of our method with those of debiased LASSO (Van de Geer et al., 2014). To implement the machine learning proposal, we design a “full model refitting” procedure that allows the use of any blackbox conditional mean estimation method in our framework. Under the machine learning setting, our method is rate doubly robust in a similar sense as Chernozhukov et al. (2018a).
Keywords: Logistic partially linear model, Double machine learning, Debiased inference.
Consider a logistic partially linear model. Let $\{(Y_i, A_i, X_i): i = 1, 2, \ldots, n\}$ be independent and identically distributed samples of $Y \in \{0, 1\}$, $A \in \mathbb{R}$ and $X \in \mathbb{R}^p$. Assume that
$$
\mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + r(X)\}, \qquad (1)
$$
where $\mathrm{expit}(\cdot) = \mathrm{logit}^{-1}(\cdot)$, $\mathrm{logit}(a) = \log\{a/(1-a)\}$, and $r(\cdot)$ is an unknown nuisance function of $X$. In a causal scenario with $A$ taken as the exposure (treatment) of interest, $Y$ the observed outcome and $X$ representing all the confounding variables, the parameter $\beta_0$ is of particular interest in that it measures the causal effect of $A$ on the outcome on the scale of the logarithmic odds ratio. As the most common and natural way to characterize a causal model for a binary outcome, model (1) is used in a wide range of application fields such as medical science, economics and political science. For observational studies with unobserved confounding variables, such as electronic health record (EHR) studies, model (1) also plays an important role in studying the association between $Y$ and a key feature $A$, e.g. the diagnosis code for $Y$, conditional on patient profiles $X$.

Our goal is to estimate and infer $\beta_0$ with asymptotic normality at the rate $n^{-1/2}$. It has been shown that if $X$ is a scalar and $r(\cdot)$ is smooth, semiparametric kernel or sieve regression (Severini and Staniswalis, 1994; Lin and Carroll, 2006) works well for this purpose. However, when $X$ is of relatively high dimensionality, these classic approaches perform poorly due to the curse of dimensionality, and one would instead specify $r(x)$ in a parametric form: $r(x) = x^{\mathsf{T}}\gamma$. To enhance robustness against potential misspecification of $r(x)$, Tan (2019) proposed a doubly robust estimating equation for $\beta_0$ that includes a parametric model $m(x) = g(x^{\mathsf{T}}\alpha)$, with a known link function $g(\cdot)$, for the conditional mean $m_0(x) = \mathrm{E}(A \mid Y = 0, X = x)$:
$$
\frac{1}{n}\sum_{i=1}^n \widehat{\phi}(X_i)\big\{Y_i e^{-\beta A_i - X_i^{\mathsf{T}}\widehat{\gamma}} - (1 - Y_i)\big\}\big\{A_i - g(X_i^{\mathsf{T}}\widehat{\alpha})\big\} = 0, \qquad (2)
$$
where $\widehat{\phi}(x)$ is an estimator of some scalar nuisance function $\phi(x)$ affecting the asymptotic efficiency of the estimator, and $\widehat{\alpha}$ and $\widehat{\gamma}$ are two fixed dimensional nuisance model estimators. As demonstrated in Tan (2019), the $\widehat{\beta}$ solved from (2) is doubly robust in the sense that it is valid when either $r(x) = x^{\mathsf{T}}\gamma$ is correctly specified for the nonparametric component of the logistic partially linear model, or $m(x) = g(x^{\mathsf{T}}\alpha)$ is correct for the conditional mean model $m_0(x) = \mathrm{E}(A \mid Y = 0, X = x)$. This is a novel double robustness property because, prior to this, doubly robust semiparametric estimation of the odds ratio was built upon $p(A \mid X, Y = 0)$, the conditional density of $A$ given $X$ and $Y = 0$ (e.g. Chen, 2007; Tchetgen Tchetgen et al., 2010), which requires a stronger model assumption than (2) for continuous $A$.

Nevertheless, Tan (2019) focuses on fixed dimensional parametric nuisance models that remain prone to misspecification in practice, and the proposed framework is not readily applicable to the high dimensional (Athey et al., 2016; Chernozhukov et al., 2018b; Smucler et al., 2019; Tan, 2020) or general machine learning (Chernozhukov et al., 2018a) nuisance models frequently exploited in recent years. This is because, for such nuisance models of higher complexity, simply using them to replace the fixed dimensional parametric models in (2) incurs excessive fitting bias and does not guarantee the desirable properties of $\widehat{\beta}$.
In addition, estimating $r(x)$ with arbitrary machine learning algorithms (for conditional means) is not straightforward because $r(x)$ is linked to the response through a nonlinear logit function. In this note, we address these challenges and fill the gap by deriving extensions of (2) that accommodate high dimensional sparse nuisance models and general machine learning nuisance models, treated separately.

For the high dimensional sparse model setting, i.e. $p \gg n$ with the two nuisance components specified as parametric models with sparse coefficients, we achieve bias reduction with respect to the regularization errors of the nuisance estimators through certain Dantzig moment equations in $X$. Under an ultra-sparsity assumption on the nuisance models, our estimator preserves the same model double robustness property as in the fixed ($p \ll n$) dimensional case. Compared with the debiased (desparsified) LASSO estimator for the logistic model (Van de Geer et al., 2014; Janková and Van De Geer, 2016), we find our model sparsity assumption more reasonable and interpretable, whereas debiased LASSO has been criticized for requiring the inverse information matrix to be sparse, a generally unverifiable technical condition (Xia et al., 2020).

Under the general machine learning framework, our approach allows the use of any blackbox learning algorithm for conditional mean estimation, as in Chernozhukov et al. (2018a). Unlike the partially linear model considered in their paper, this generality is not readily achievable for the logistic model due to its nonlinear link function. We propose an easy-to-implement “full model refitting” procedure to handle this problem and make the implementation of learning algorithms flexible in our framework. Similar to Chernozhukov et al. (2018a), we discuss the rate double robustness property of the proposed estimator under the assumption that the machine learning estimators of the two nuisance models approach the true models at certain geometric rates.

Before introducing the specific methods in Section 3, we first present a (simplified) generalization of the doubly robust estimating equation (2) and derive its first and second order error decomposition, which plays a central role in motivating and guiding our method construction and theoretical analysis. Suppose the nuisance models $r(x)$ and $m_0(x)$ are estimated by $\widehat{r}(x)$ and $\widehat{m}(x)$, which approach some limiting functions $\bar{r}(x)$ and $\bar{m}(x)$. Motivated by (2), we consider
$$
\frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\widehat{r}(X_i)}\big\}\big\{A_i - \widehat{m}(X_i)\big\} = 0, \qquad (3)
$$
and denote its solution by $\widehat{\beta}$. Compared with (2), we omit here a multiplicative factor $e^{-\widehat{r}(X_i)}\widehat{\phi}(X_i)$ that only affects the asymptotic variance of the estimator, to simplify the formulation so that attention is not distracted from our main idea.
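To make the construction concrete, the following is a minimal numerical sketch, not part of the original development: it simulates data from model (1) with a sparse linear $r(x)$ and then solves the simplified estimating equation (3) for $\beta$ by one-dimensional root finding, plugging in placeholder nuisance values in place of $\widehat{r}$ and $\widehat{m}$. All dimensions, coefficient values and the Gaussian design are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

rng = np.random.default_rng(0)
n, p = 2000, 10
beta0 = 0.5                                            # target log odds ratio
gamma0 = np.r_[1.0, -0.5, 0.5, np.zeros(p - 3)]        # sparse r(x) = x' gamma0
alpha0 = np.r_[0.8, 0.3, -0.4, np.zeros(p - 3)]

X = rng.normal(size=(n, p))
A = X @ alpha0 + rng.normal(size=n)                    # continuous exposure
Y = rng.binomial(1, expit(beta0 * A + X @ gamma0))     # outcome generated from model (1)

# Placeholder nuisance values; in practice these come from the methods in Section 3.
r_hat = X @ gamma0          # stands in for an estimate of r(X_i)
m_hat = X @ alpha0          # crude stand-in for an estimate of E(A | Y = 0, X_i)

def ee(beta):
    """Empirical version of the simplified estimating equation (3)."""
    return np.mean((Y * np.exp(-beta * A) - (1 - Y) * np.exp(r_hat)) * (A - m_hat))

beta_hat = brentq(ee, -5.0, 5.0)    # assumes the bracket [-5, 5] contains a sign change
print(round(beta_hat, 3))
```

Because $r(x)$ above is the true nuisance function, the double robustness discussed next implies that the crude stand-in for $\widehat{m}$ does not invalidate $\widehat{\beta}$; the point of the sketch is only the mechanics of (3).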
We comment on the incorporation of this nuisance function into our framework in Section 3.3.

Concerning the error induced by $\widehat{r}(\cdot)$ and $\widehat{m}(\cdot)$, we decompose equation (3) as follows:
$$
\begin{aligned}
&\frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\widehat{r}(X_i)}\big\}\big\{A_i - \widehat{m}(X_i)\big\} \\
&\quad= \frac{1}{n}\sum_{i=1}^n h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot))
- \frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\big\}\big\{\widehat{m}(X_i) - \bar{m}(X_i)\big\} \\
&\qquad- \frac{1}{n}\sum_{i=1}^n (1 - Y_i) e^{\bar{r}(X_i)}\big\{\widehat{r}(X_i) - \bar{r}(X_i)\big\}\big\{A_i - \bar{m}(X_i)\big\} \\
&\qquad+ O_p\Big(\big\|\widehat{r}(X) - \bar{r}(X)\big\|_{P,2}^2 + \big\|\widehat{m}(X) - \bar{m}(X)\big\|_{P,2}^2\Big) + o_p(1/\sqrt{n}), 
\end{aligned}
\qquad (4)
$$
where we denote $h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot)) = \{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\}\{A_i - \bar{m}(X_i)\}$, define $\|f(X)\|_{P,2}^2 = \mathrm{E} f^2(X)$, and collect the second order terms (and beyond) into $\|\widehat{r}(X) - \bar{r}(X)\|_{P,2}^2 + \|\widehat{m}(X) - \bar{m}(X)\|_{P,2}^2$ under certain mild regularity conditions. When at least one nuisance model is correctly specified, i.e. $\bar{r}(\cdot) = r(\cdot)$ or $\bar{m}(\cdot) = m_0(\cdot)$ holds, we have
$$
\mathrm{E}\big[(1 - Y)\{e^{\bar{r}(X)} - e^{r(X)}\}\{A - \bar{m}(X)\}\big]
= \mathrm{E}\Big[\mathrm{P}(Y = 0 \mid X)\,\{e^{\bar{r}(X)} - e^{r(X)}\}\,\mathrm{E}\big\{A - \bar{m}(X) \,\big|\, Y = 0, X\big\}\Big] = 0,
$$
leading to $\mathrm{E}\, h(Y, A, X; \bar{r}(\cdot), \bar{m}(\cdot)) = \mathrm{E}\, h(Y, A, X; r(\cdot), \bar{m}(\cdot))$, so that $\beta_0$ solves $\mathrm{E}\, h(Y, A, X; \bar{r}(\cdot), \bar{m}(\cdot)) = 0$. Similar to various existing works such as Chernozhukov et al. (2018a, 2016, 2018b) and Tan (2020), the root mean squared errors (rMSEs) of the high dimensional parametric and machine learning estimators, $\|\widehat{r}(X) - \bar{r}(X)\|_{P,2}$ and $\|\widehat{m}(X) - \bar{m}(X)\|_{P,2}$, are assumed to be $o_p(n^{-1/4})$, so that their second order impact is asymptotically negligible. Thus, it remains to remove the first order bias terms:
$$
\begin{aligned}
\Delta_m &= \frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\big\}\big\{\widehat{m}(X_i) - \bar{m}(X_i)\big\}; \\
\Delta_r &= \frac{1}{n}\sum_{i=1}^n (1 - Y_i) e^{\bar{r}(X_i)}\big\{\widehat{r}(X_i) - \bar{r}(X_i)\big\}\big\{A_i - \bar{m}(X_i)\big\}.
\end{aligned}
\qquad (5)
$$
In the low dimensional parametric case, these first order terms do not affect the asymptotic normality of $n^{1/2}(\widehat{\beta} - \beta_0)$ because the nuisance estimators themselves are asymptotically normal at rate $n^{-1/2}$. For high dimensional and machine learning nuisance models, however, their removal is nontrivial due to the excessive fitting error of the nuisance estimators, and the resulting non-negligible bias is known as the over-fitting (or first order) bias (Chernozhukov et al., 2018a). In Section 3 we derive construction procedures for complex nuisance models that remove $\Delta_m$ and $\Delta_r$ properly.

Consider the setting with $p \gg n$, $r(x) = x^{\mathsf{T}}\gamma$ and $m(x) = g(x^{\mathsf{T}}\alpha)$, where $g(\cdot)$ is a monotone link function with derivative $g'(\cdot)$. We derive a construction procedure for high dimensional sparse nuisance models that preserves a model double robustness property similar to Tan (2019).

First, we obtain $\widetilde{\gamma}$ as an initial estimator of $\gamma$. The estimation procedure for $\widetilde{\gamma}$ is quite flexible, as it only needs to satisfy that $\widetilde{\gamma}$ converges to some sparse limiting parameter $\gamma^*$ that equals the true model parameter when the nuisance model $r(x) = x^{\mathsf{T}}\gamma$ is correct. Motivated by Section 2, we propose to obtain $\widehat{\alpha}$ by solving the Dantzig moment equation:
$$
\min_{\alpha \in \mathbb{R}^p} \|\alpha\|_1 \quad \mathrm{s.t.} \quad
\bigg\| \frac{1}{n}\sum_{i=1}^n (1 - Y_i)\, e^{\widetilde{\gamma}^{\mathsf{T}} X_i}\big\{A_i - g(X_i^{\mathsf{T}}\alpha)\big\} X_i \bigg\|_{\infty} \leq \lambda_{\alpha}, \qquad (6)
$$
where $\lambda_{\alpha}$ is a tuning parameter controlling the regularization bias. Then we solve for the nuisance $\gamma$ and the target parameter $\beta$ jointly from:
$$
\begin{aligned}
\min_{\beta \in \mathbb{R},\, \gamma \in \mathbb{R}^p} \|\gamma\|_1 \quad \mathrm{s.t.} \quad
&\bigg\| \frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{X_i^{\mathsf{T}}\gamma}\big\}\, g'(X_i^{\mathsf{T}}\widehat{\alpha})\, X_i \bigg\|_{\infty} \leq \lambda_{\gamma}; \\
&\frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{X_i^{\mathsf{T}}\gamma}\big\}\big\{A_i - g(X_i^{\mathsf{T}}\widehat{\alpha})\big\} = 0.
\end{aligned}
\qquad (7)
$$
Denote the solution of (7) by $\widehat{\beta}$ and $\widehat{\gamma}$. We demonstrate below that $\widehat{\beta}$ converges to $\beta_0$ at the parametric rate when at least one nuisance model is correct and both are ultra-sparse. Similar to Tan (2020), the maximum-norm constraints (i.e. the Karush–Kuhn–Tucker conditions) in (6) and (7) impose certain moment conditions on the nuisance parameters under potential model misspecification. We outline below how these conditions help calibrate the first order bias terms in (5). For simplicity, we omit technical assumptions and analytical details that can be found in the existing literature on high dimensional estimation and semiparametric inference; see Candes and Tao (2007), Bickel et al. (2009), Bühlmann and Van De Geer (2011) and Negahban et al. (2012) for general theory of high dimensional regularized estimation, and Bradic et al. (2019), Smucler et al. (2019) and Tan (2020) for the theoretical framework for analyzing doubly robust estimators of the average treatment effect with high dimensional sparse nuisance models.

Let $\bar{\alpha}$ and $\{\bar{\gamma}, \bar{\beta}\}$ represent the limiting values of the solutions to (6) and (7) respectively, and let $s$ be the maximum sparsity level of $\bar{\alpha}$, $\bar{\gamma}$ and $\gamma^*$. Following the high dimensional statistics literature (Candes and Tao, 2007; Bickel et al., 2009; Bühlmann and Van De Geer, 2011), we assume that $X_i$ is sub-gaussian with $O(1)$ scale. Then $\lambda_{\alpha}$ and $\lambda_{\gamma}$ are picked at the rate $\lambda = (\log p/n)^{1/2}$, under which standard analysis yields
$$
\begin{aligned}
\xi_1 &= \|\widetilde{\gamma} - \gamma^*\|_1 + |\widehat{\beta} - \bar{\beta}| + \|\widehat{\gamma} - \bar{\gamma}\|_1 + \|\widehat{\alpha} - \bar{\alpha}\|_1 = O_p(s\lambda); \\
\xi_2 &= \|X^{\mathsf{T}}(\widetilde{\gamma} - \gamma^*)\|_{P,2}^2 + \|A(\widehat{\beta} - \bar{\beta})\|_{P,2}^2 + \|X^{\mathsf{T}}(\widehat{\gamma} - \bar{\gamma})\|_{P,2}^2 + \|X^{\mathsf{T}}(\widehat{\alpha} - \bar{\alpha})\|_{P,2}^2 = O_p(s\lambda^2).
\end{aligned}
\qquad (8)
$$
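For a concrete sense of how (6) can be computed, here is a small sketch continuing the simulation above, for the special case of an identity link $g(u) = u$, in which the constraint is linear in $\alpha$ and the problem reduces to a linear program. The choice of $\lambda_{\alpha}$, the use of the true $\gamma$ in place of an initial estimator $\widetilde{\gamma}$, and the solver settings are all illustrative assumptions.

```python
from scipy.optimize import linprog

def dantzig_alpha(X, A, Y, gamma_tilde, lam):
    """Solve (6) with identity link g(u) = u via linear programming.

    min ||alpha||_1  s.t.  || n^{-1} sum_i w_i (A_i - X_i'alpha) X_i ||_inf <= lam,
    with weights w_i = (1 - Y_i) exp(gamma_tilde' X_i).
    """
    n, p = X.shape
    w = (1 - Y) * np.exp(X @ gamma_tilde)
    c_vec = (X * (w * A)[:, None]).mean(axis=0)        # n^{-1} sum_i w_i A_i X_i
    G = (X * w[:, None]).T @ X / n                     # n^{-1} sum_i w_i X_i X_i'
    # Write alpha = u - v with u, v >= 0 and minimize 1'(u + v).
    obj = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([G, -G]),              #  G(u - v) <= lam + c
                      np.hstack([-G, G])])             # -G(u - v) <= lam - c
    b_ub = np.concatenate([lam + c_vec, lam - c_vec])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v

lam_alpha = np.sqrt(np.log(p) / n)                      # rate (log p / n)^{1/2}; constant not tuned
alpha_hat = dantzig_alpha(X, A, Y, gamma0, lam_alpha)   # gamma0 stands in for the initial estimator
```

For a general monotone link $g$ the constraint is no longer linear, so the LP form above is only an identity-link illustration; a different convex or iterative solver would be needed in that case.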
Remark 1. Note that (6) involves the initial estimator $\widetilde{\gamma}$ and (7) involves the estimator $\widehat{\alpha}$ obtained beforehand, which requires some additional effort to remove their fitting errors when analyzing $\widehat{\gamma}$ and $\widehat{\alpha}$, compared with the standard analysis of the Dantzig selector. See Bradic et al. (2019), Smucler et al. (2019) and Tan (2020) for a similar issue and for the relevant technical details used for this purpose.

Now consider the case when at least one nuisance model is correctly specified. Define
$$
\begin{aligned}
\varphi(Y_i, A_i, X_i; \beta, \gamma, \alpha) &= \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{X_i^{\mathsf{T}}\gamma}\big\}\, g'(X_i^{\mathsf{T}}\alpha)\, X_i; \\
\psi(Y_i, A_i, X_i; \gamma, \alpha) &= (1 - Y_i)\, e^{\gamma^{\mathsf{T}} X_i}\big\{A_i - g(X_i^{\mathsf{T}}\alpha)\big\}\, X_i.
\end{aligned}
$$
When $r(x)$ is correctly specified, i.e. $r(x) = x^{\mathsf{T}}\gamma_0$ for some $\gamma_0$, we have $\gamma^* = \bar{\gamma} = \gamma_0$. Then $\mathrm{E}\,\varphi(Y, A, X; \bar{\beta}, \bar{\gamma}, \bar{\alpha}) = \mathbf{0}$ by the correctness of $r(x)$, and $\mathrm{E}\,\psi(Y, A, X; \bar{\gamma}, \bar{\alpha}) = \mathrm{E}\,\psi(Y, A, X; \gamma^*, \bar{\alpha}) = \mathbf{0}$ by the moment condition in (6). When $m(x)$ is correct, i.e. $m_0(x) = g(x^{\mathsf{T}}\alpha_0)$ and $\bar{\alpha} = \alpha_0$, we have $\mathrm{E}\,\psi(Y, A, X; \bar{\gamma}, \bar{\alpha}) = \mathbf{0}$, and, corresponding to the constraint in the first row of (7), it holds that $\mathrm{E}\,\varphi(Y, A, X; \bar{\beta}, \bar{\gamma}, \bar{\alpha}) = \mathbf{0}$. In addition, $\bar{\beta} = \beta_0$ in both cases, by the moment equation in the second row of (7) and the discussion in Section 2. Combining these with the sub-gaussianity of the covariates, the two bias terms in (5) can be controlled through
$$
\begin{aligned}
\Delta_m &\leq \bigg\| \frac{1}{n}\sum_{i=1}^n \varphi(Y_i, A_i, X_i; \bar{\beta}, \bar{\gamma}, \bar{\alpha}) \bigg\|_{\infty} \|\widehat{\alpha} - \bar{\alpha}\|_1 + O_p(\xi_2) + o_p(1/\sqrt{n}) \\
&= O_p\{(\log p/n)^{1/2}\}\, O_p(s\lambda) + O_p(s\lambda^2) + o_p(1/\sqrt{n}) = O_p(s\log p/n) + o_p(1/\sqrt{n}); \\
\Delta_r &\leq \bigg\| \frac{1}{n}\sum_{i=1}^n \psi(Y_i, A_i, X_i; \gamma^*, \bar{\alpha}) \bigg\|_{\infty} \|\widehat{\gamma} - \bar{\gamma}\|_1 + O_p(\xi_2) + o_p(1/\sqrt{n}) = O_p(s\log p/n) + o_p(1/\sqrt{n}),
\end{aligned}
$$
where the second order error terms are again collected into $O_p(\xi_2) + o_p(1/\sqrt{n})$. Thus, when $s = o(n^{1/2}/\log p)$, $\Delta_m$ and $\Delta_r$ are below the parametric rate and, consequently, equation (4) is asymptotically equivalent to $n^{-1}\sum_{i=1}^n h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot)) = 0$, so that $n^{1/2}(\widehat{\beta} - \beta_0)$ converges weakly to a normal distribution with mean 0 under some mild regularity conditions.

Though extensively studied and used in recent years, debiased LASSO (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Van de Geer et al., 2014) has been criticized because, in the generalized linear model setting, the sparsity condition it imposes on the inverse of the information matrix is neither interpretable nor easily justifiable, leading to subpar performance in practice (Xia et al., 2020). Interestingly, we find the model and sparsity assumptions of our method more reasonable than those of debiased LASSO and present a simple comparison of the two approaches in Remark 2.
Remark 2. Assume the logistic model $\mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + X^{\mathsf{T}}\gamma_0\}$ is correctly specified. Let its expected information matrix be $\Sigma_{\beta_0,\gamma_0} = \mathrm{E}[\mathrm{expit}'\{\beta_0 A + X^{\mathsf{T}}\gamma_0\}(A, X^{\mathsf{T}})^{\mathsf{T}}(A, X^{\mathsf{T}})]$, let $\Theta_{\beta_0,\gamma_0} = \Sigma_{\beta_0,\gamma_0}^{-1}$ and let $\theta_{\beta}$ be the first column of $\Theta_{\beta_0,\gamma_0}$. In Van de Geer et al. (2014) and Janková and Van De Geer (2016), asymptotic normality of the debiased logistic LASSO estimator of $\beta_0$ is established under the sparsity assumption $\|\theta_{\beta}\|_0 = o(\{n/\log p\}^{1/2})$. Using cross-fitting to estimate $\Sigma_{\beta_0,\gamma_0}$, recent work such as Ma et al. (2020) and Liu et al. (2020) has relaxed this condition to $\|\theta_{\beta}\|_0 = o(n^{1/2}/\log p)$, or replaced it with approximate sparsity assumptions such as $\|\theta_{\beta}\|_1 = O(1)$. However, in the presence of the weight $\mathrm{expit}'\{\beta_0 A + X^{\mathsf{T}}\gamma_0\}$ in $\Sigma_{\beta_0,\gamma_0}$, neither of these assumptions is interpretable, nor are they justifiable even for the most common gaussian design (Xia et al., 2020).

In comparison, we require that $\mathrm{E}(A \mid Y = 0, X = x) = g(x^{\mathsf{T}}\alpha_0)$ with $\|\alpha_0\|_0 = o(n^{1/2}/\log p)$. This assumption has two advantages. First, it accommodates a nonlinear link function $g(\cdot)$, which can make the assumption more reasonable for a categorical $A$. Second, our assumption is imposed directly on a conditional model, so it is more interpretable than that of debiased LASSO. As a simple example, consider a practically useful conditional gaussian model: $(A, X^{\mathsf{T}})^{\mathsf{T}} \mid \{Y = j\} \sim N(\mu_j, \Sigma)$ for $j = 0, 1$. Then $r(x) = x^{\mathsf{T}}\gamma_0$ with $\gamma_0$ determined by $\Sigma^{-1}(\mu_1 - \mu_0)$, and $A \mid X, Y = 0$ follows a gaussian linear model $m_0(x) = x^{\mathsf{T}}\alpha_0$ with $\alpha_0$ determined by $\Sigma^{-1}$. Thus, our model sparsity assumptions on the nuisance coefficients $\alpha_0$ and $\gamma_0$ are effectively imposed on the data generating parameters $\Sigma^{-1}$ and $\mu_1 - \mu_0$, which are more interpretable and verifiable in practice.

We now turn to a more general machine learning setting in which any learning algorithm for conditional means can potentially be applied to estimate $r(\cdot)$ and $m_0(\cdot)$. Similar to Chernozhukov et al. (2018a), we randomly split the $n$ samples into $K = O(1)$ folds $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_K$ of equal size to help remove the over-fitting bias. Denote $\mathcal{I}_{-k} = \{1, \ldots, n\} \setminus \mathcal{I}_k$ and replace estimating equation (3) by the cross-fitted
$$
\frac{1}{n}\sum_{k=1}^K \sum_{i \in \mathcal{I}_k} \big\{Y_i e^{-\beta A_i} - (1 - Y_i)\, e^{\widehat{r}^{[-k]}(X_i)}\big\}\big\{A_i - \widehat{m}^{[-k]}(X_i)\big\} = 0, \qquad (9)
$$
where $\widehat{r}^{[-k]}(\cdot)$ and $\widehat{m}^{[-k]}(\cdot)$ are two machine learning estimators obtained from the samples in $\mathcal{I}_{-k}$ and converging to the true models $r(\cdot)$ and $m_0(\cdot)$. We outline below a strategy that utilizes an arbitrary (supervised) learning algorithm to estimate the nuisance models.

Suppose there is a blackbox procedure $\mathcal{L}(R_i, C_i; \mathcal{I})$ that takes samples from $\mathcal{I} \subseteq \{1, 2, \ldots, n\}$ with some response $R_i$ and covariates $C_i$ and outputs an estimator of $\mathrm{E}[R_i \mid C_i]$ based on $\{i \in \mathcal{I}\}$. (Without purposeful modification, the natural form of most contemporary supervised learning methods, e.g. random forests, support vector machines and neural networks, can be conceptualized in this way, since their goal is prediction for a continuous response or classification for a categorical response.) The estimator of $m_0(\cdot)$ can be obtained as $\widehat{m}^{[-k]}(\cdot) = \mathcal{L}(A_i, X_i; \mathcal{I}_{-k} \cap \{i: Y_i = 0\})$. Unlike the partially linear setting in Chernozhukov et al. (2018a), estimating $r(\cdot)$ with $\mathcal{L}$ is more sophisticated since $r(\cdot)$ is defined through the unextractable semiparametric form $\mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + r(X)\}$. Certainly, one could modify some existing machine learning approaches, such as a neural network, to accommodate this form, e.g. by setting the last layer of the network to be the combination of a complex network in $X$ and a linear function of $A$, linked to the outcome through an expit link. However, such modification is not readily available in general and, even if there were some way to adapt $\mathcal{L}$ to this semiparametric form, it would typically require additional human effort in implementation and validation. Thus, we introduce a “full model refitting” procedure based on an arbitrary algorithm $\mathcal{L}$ to estimate $r(\cdot)$.
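As an illustration of the blackbox interface $\mathcal{L}$ and the cross-fitted construction of $\widehat{m}^{[-k]}$, here is a short sketch continuing the simulation above, with a random forest standing in for $\mathcal{L}$; the number of folds, the choice of learner and its settings are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

K = 5
m_hat_cf = np.empty(n)      # cross-fitted values m-hat^{[-k]}(X_i) for i in I_k
folds = list(KFold(n_splits=K, shuffle=True, random_state=1).split(X))
for train_idx, test_idx in folds:
    # I_{-k} is the training part; fit the blackbox L on the controls (Y = 0) only.
    ctrl = train_idx[Y[train_idx] == 0]
    learner = RandomForestRegressor(n_estimators=200, random_state=1)
    learner.fit(X[ctrl], A[ctrl])
    m_hat_cf[test_idx] = learner.predict(X[test_idx])
```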
Our method is motivated by the following simple proposition.
Proposition 1. Let the full model be $M(A, X) = \mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + r(X)\}$. One has
$$
\beta_0 = \arg\min_{\beta \in \mathbb{R}} \mathrm{E}\big[\mathrm{logit}\{M(A, X)\} - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2.
$$

Proof. For any $\beta \in \mathbb{R}$, we have
$$
\begin{aligned}
\mathrm{E}\big[\mathrm{logit}\{M(A, X)\} - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2
&= \mathrm{E}\big[\beta_0 A + r(X) - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2 \\
&= \mathrm{E}\big[(\beta_0 - \beta)\{A - \mathrm{E}(A \mid X)\} + \eta(X)\big]^2 \\
&= (\beta_0 - \beta)^2\, \mathrm{E}\{A - \mathrm{E}(A \mid X)\}^2 + \mathrm{E}\{\eta^2(X)\},
\end{aligned}
$$
where $\eta(X) = r(X) + \beta_0 \mathrm{E}(A \mid X)$ and the cross term vanishes because $\mathrm{E}\{A - \mathrm{E}(A \mid X) \mid X\} = 0$. Thus, $\beta_0$ minimizes $\mathrm{E}\big[\mathrm{logit}\{M(A, X)\} - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2$.

Randomly split each $\mathcal{I}_{-k}$ into $K$ folds $\mathcal{I}_{-k,1}, \ldots, \mathcal{I}_{-k,K}$ of equal size. Motivated by Proposition 1, we first estimate the “full” model $M(A, X)$ leaving out each $\mathcal{I}_{-k,j}$ within $\mathcal{I}_{-k}$:
$$
\widehat{M}^{[-k,-j]}(\cdot) = \mathcal{L}\big(Y_i, (A_i, X_i^{\mathsf{T}})^{\mathsf{T}}; \mathcal{I}_{-k,-j}\big),
$$
and learn $a(X) = \mathrm{E}(A \mid X)$ as $\widehat{a}^{[-k,-j]}(\cdot) = \mathcal{L}(A_i, X_i; \mathcal{I}_{-k,-j})$. Then we fit the (cross-fitted) least squares regression to obtain
$$
\breve{\beta}^{[-k]} = \arg\min_{\beta \in \mathbb{R}} \frac{1}{|\mathcal{I}_{-k}|}\sum_{j=1}^K \sum_{i \in \mathcal{I}_{-k,j}} \Big[\mathrm{logit}\big\{\widehat{M}^{[-k,-j]}(A_i, X_i)\big\} - \beta\big\{A_i - \widehat{a}^{[-k,-j]}(X_i)\big\}\Big]^2. \qquad (10)
$$
We use cross-fitting in (10) to avoid the over-fitting of $\widehat{M}^{[-k,-j]}(\cdot)$ and $\widehat{a}^{[-k,-j]}(\cdot)$. An estimator of $r(X_i)$ could then be given by $\mathrm{logit}\{\widehat{M}(A_i, X_i)\} - \breve{\beta} A_i$. Note, however, that this empirical version of $\mathrm{logit}\{M(A_i, X_i)\} - \beta_0 A_i$ typically still involves $A_i$, due to the discrepancy between the true $\beta_0$ and $M(\cdot)$ and their empirical estimates, which can impede the removal of the over-fitting bias terms in (5) since $A - m_0(X)$ is not orthogonal to an error that depends on $A$. We therefore instead use the conditional mean of $\mathrm{logit}\{M(A, X)\} - \beta_0 A$ given $X$ to estimate $r(\cdot)$. Let $W_i = \mathrm{logit}\{\widehat{M}^{[-k,-j]}(A_i, X_i)\}$ for each $i \in \mathcal{I}_{-k,j}$ and obtain $\widehat{t}^{[-k]}(\cdot) = \mathcal{L}(W_i, X_i; \mathcal{I}_{-k})$ as a “refitting” estimator of $t(x) := \mathrm{E}[\mathrm{logit}\{M(A, X)\} \mid X = x]$. Then the estimator of $r(\cdot)$ is given by
$$
\widehat{r}^{[-k]}(x) = \widehat{t}^{[-k]}(x) - \breve{\beta}^{[-k]}\, \widehat{a}^{[-k]}(x), \quad \text{where } \widehat{a}^{[-k]}(x) = \frac{1}{K}\sum_{j=1}^K \widehat{a}^{[-k,-j]}(x).
$$
Alternatively, one can also refit $r(\cdot)$ through
$$
\widehat{r}^{[-k]}(\cdot) = \log\left(\frac{\mathcal{L}\big(e^{-\breve{\beta}^{[-k]} A_i}, X_i; \mathcal{I}_{-k} \cap \{i: Y_i = 1\}\big)}{\mathcal{L}\big(1 - Y_i, X_i; \mathcal{I}_{-k}\big)}\right),
$$
inspired by the moment condition sufficient to identify $r(x)$:
$$
\mathrm{E}\big[Y e^{-\beta_0 A} - (1 - Y) e^{r(X)} \,\big|\, X\big]
= \mathrm{E}\big[e^{-\beta_0 A} \,\big|\, X, Y = 1\big]\,\mathrm{P}(Y = 1 \mid X) - e^{r(X)}\, \mathrm{E}\big[(1 - Y) \,\big|\, X\big] = 0.
$$

At last, we outline the theoretical analysis of the estimator $\widehat{\beta}$ finally solved from (9). Similar to Chernozhukov et al. (2018a), the analysis relies on certain mild regularity conditions and the assumptions that (i) $\mathcal{L}$ outputs uniformly consistent estimators of the conditional mean models in all learning tasks it is applied to; and (ii) the rMSEs of the estimators output by $\mathcal{L}$ are controlled at $o_p(n^{-1/4})$.
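The following sketch, continuing the running example, walks through the “full model refitting” steps on one training set $\mathcal{I}_{-k}$, again with random forests standing in for $\mathcal{L}$: inner cross-fitting of the full model and of $a(X)$, the least squares fit (10) for $\breve{\beta}^{[-k]}$, and the refit of $t(x)$ to form $\widehat{r}^{[-k]}$. The probability clipping constant, forest sizes and fold counts are illustrative assumptions.

```python
from scipy.special import logit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def refit_r(X_tr, A_tr, Y_tr, K_inner=5, seed=2):
    """Sketch of full model refitting on one I_{-k}: returns (beta_breve, r-hat function)."""
    AX = np.column_stack([A_tr, X_tr])
    W = np.empty(len(Y_tr))            # cross-fitted logit{M-hat(A_i, X_i)}
    a_fit = np.empty(len(Y_tr))        # cross-fitted a-hat(X_i) = E-hat(A | X_i)
    a_models = []
    for tr, te in KFold(n_splits=K_inner, shuffle=True, random_state=seed).split(X_tr):
        full = RandomForestClassifier(n_estimators=200, random_state=seed).fit(AX[tr], Y_tr[tr])
        prob = np.clip(full.predict_proba(AX[te])[:, 1], 1e-3, 1 - 1e-3)   # avoid infinite logits
        W[te] = logit(prob)
        a_mod = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr[tr], A_tr[tr])
        a_fit[te] = a_mod.predict(X_tr[te])
        a_models.append(a_mod)
    resid = A_tr - a_fit
    beta_breve = np.sum(W * resid) / np.sum(resid ** 2)                    # least squares fit (10)
    t_model = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr, W)  # refit t(x)
    def r_hat(X_new):
        a_bar = np.mean([mod.predict(X_new) for mod in a_models], axis=0)  # a-hat^{[-k]}(x)
        return t_model.predict(X_new) - beta_breve * a_bar
    return beta_breve, r_hat
```

Applying refit_r to each $\mathcal{I}_{-k}$ and evaluating the returned function on the held-out fold yields the cross-fitted $\widehat{r}^{[-k]}(X_i)$ values needed in (9).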
Remark 3. Assumptions (i) and (ii) above imply that $\mathcal{L}$ should perform comparably well on learning tasks with the covariates set to $X$ or to $(A, X^{\mathsf{T}})^{\mathsf{T}}$. Classic nonparametric regression approaches such as kernel smoothing or sieves may not satisfy this, because including one more covariate $A$ in the model can have a substantial impact on their rates of convergence. We therefore recommend using modern learning approaches that are more robust to dimensionality, such as random forests and neural networks, in our framework. In contrast, the classic sieve or kernel construction for a one-dimensional $X$, in a “plug-in” type of model, has been well studied in existing work such as Severini and Staniswalis (1994) and Lin and Carroll (2006).

Based on assumption (ii), we have $\|\widehat{m}^{[-k]}(X) - m_0(X)\|_{P,2} = o_p(n^{-1/4})$,
$$
\breve{\beta}^{[-k]} = \frac{\sum_{i \in \mathcal{I}_{-k}} \mathrm{logit}\{M(A_i, X_i)\}\{A_i - a(X_i)\}}{\sum_{i \in \mathcal{I}_{-k}} \{A_i - a(X_i)\}^2} + o_p(n^{-1/4}) = \beta_0 + O_p(n^{-1/2}) + o_p(n^{-1/4}),
$$
and consequently $\|\widehat{r}^{[-k]}(X) - r(X)\|_{P,2} = o_p(n^{-1/4})$. Then the second order error in (the cross-fitted version of) (4) is $o_p(n^{-1/2})$. Also, we remove the first order (over-fitting) bias defined in (5) through assumption (i) and concentration, facilitated by the use of cross-fitting in (9), in the same spirit as Chernozhukov et al. (2018a). Combining these two results, (9) is asymptotically equivalent to $n^{-1}\sum_{i=1}^n h(Y_i, A_i, X_i; r(\cdot), m_0(\cdot)) = 0$, and thus $n^{1/2}(\widehat{\beta} - \beta_0)$ is asymptotically normal with mean 0 under mild regularity conditions.

We now turn back to the construction of Tan (2019):
$$
\frac{1}{n}\sum_{i=1}^n \widehat{\phi}(X_i)\, e^{-\widehat{r}(X_i)}\big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\widehat{r}(X_i)}\big\}\big\{A_i - \widehat{m}(X_i)\big\} = 0, \qquad (11)
$$
where $\widehat{\phi}(X_i)$ is an empirical estimator of a (typically positive) nuisance function $\phi(\cdot)$ that depends on the nuisance models $\bar{r}(\cdot)$ and $\bar{m}(\cdot)$ and affects the asymptotic variance of $\widehat{\beta}$. Tan (2019) proposed and studied two options for $\phi(\cdot)$:
$$
\phi_{\mathrm{opt}}(X) = \frac{\mathrm{E}\big[\{A - \bar{m}(X)\}^2 \,\big|\, X, Y = 0\big]}{\mathrm{E}\big[\{A - \bar{m}(X)\}^2 / \mathrm{expit}\{\beta_0 A + \bar{r}(X)\} \,\big|\, X, Y = 0\big]}; \qquad
\phi_{\mathrm{simp}}(X) = \mathrm{expit}\{\bar{r}(X)\}.
$$
It was shown that, when both nuisance models are correctly specified, the estimator solved with the weight $\phi_{\mathrm{opt}}(X)$ achieves the minimum asymptotic variance. Since the calculation of $\phi_{\mathrm{opt}}(X)$ involves numerical integration over the conditional distribution of $A$ given $X$ and $Y = 0$, it is sometimes inconvenient to implement. So Tan (2019) also provides the simplified choice $\phi_{\mathrm{simp}}(X)$, obtained by evaluating $\phi_{\mathrm{opt}}(X)$ at $\beta_0 = 0$.
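To illustrate the weighted construction (11) with the simplified weight $\phi_{\mathrm{simp}}$, the sketch below forms cross-fitted $\widehat{r}^{[-k]}$ values with refit_r from the previous sketch (reusing the same folds and the cross-fitted m_hat_cf defined earlier) and solves the weighted equation by root finding. This is only an illustrative assembly of the earlier pieces, not a prescribed implementation.

```python
from scipy.optimize import brentq
from scipy.special import expit

# Cross-fitted r-hat^{[-k]}(X_i), aligned with the folds used for m_hat_cf above.
r_hat_cf = np.empty(n)
for train_idx, test_idx in folds:
    _, r_fn = refit_r(X[train_idx], A[train_idx], Y[train_idx])
    r_hat_cf[test_idx] = r_fn(X[test_idx])

def ee_weighted(beta):
    """Empirical version of (11) with phi-hat = phi_simp = expit(r-hat)."""
    w = expit(r_hat_cf) * np.exp(-r_hat_cf)        # phi-hat(X_i) * exp(-r-hat(X_i))
    return np.mean(w * (Y * np.exp(-beta * A) - (1 - Y) * np.exp(r_hat_cf)) * (A - m_hat_cf))

beta_hat_w = brentq(ee_weighted, -5.0, 5.0)        # again assumes the bracket contains a root
```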
Inclusion of the nuisance estimator $\widehat{\phi}(X_i)\, e^{-\widehat{r}(X_i)}$ incurs two challenges. First, it introduces additional bias terms. Second, the form of the first order biases $\Delta_m$ and $\Delta_r$ in (5) changes. Correspondingly, we make some moderate modifications to the methods described in Sections 3.1 and 3.2. We again adopt a cross-fitting strategy (for both the high dimensional parametric and the machine learning settings) to obtain $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$ and plug it in for $i \in \mathcal{I}_k$, $k = 1, 2, \ldots, K$. Note that a function depending solely on $X_i$ is orthogonal to $h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot)) = \{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\}\{A_i - \bar{m}(X_i)\}$ when $\bar{r}(\cdot) = r(\cdot)$ or $\bar{m}(\cdot) = m_0(\cdot)$. Then, as long as $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$ is consistent and at least one nuisance model is correct, we can remove the (cross-fitted) bias term
$$
\frac{1}{n}\sum_{k=1}^K \sum_{i \in \mathcal{I}_k} \big\{\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)} - \phi(X_i)\, e^{-\bar{r}(X_i)}\big\}\, h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot))
$$
through concentration. Meanwhile, we note that under this efficiency-enhancing construction the first order bias terms $\Delta_m$ and $\Delta_r$ defined in (5) become weighted by $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$. We therefore also weight the moment equations (6) and (7) by $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$, as a modification of the high dimensional sparse modelling strategy. In the machine learning scenario, both nuisance estimators are supposed to approach the corresponding true models, i.e. $\bar{r}(\cdot) = r(\cdot)$ and $\bar{m}(\cdot) = m_0(\cdot)$, so there is no need to modify the way $\widehat{m}^{[-k]}(\cdot)$ and $\widehat{r}^{[-k]}(\cdot)$ are obtained in Section 3.2.

In this note, we extend the low dimensional parametric doubly robust approach for the logistic partially linear model of Tan (2019) to settings where the nuisance models are estimated by high dimensional sparse regression or by general machine learning methods. For the high dimensional setting, we derive certain moment equations for the nuisance models to remove the first order bias. We also find the sparsity assumption of our approach more interpretable and reasonable than the “sparse inverse information matrix” assumption used by debiased LASSO (Van de Geer et al., 2014; Janková and Van De Geer, 2016). For the general machine learning framework, we handle the nonlinearity and “unextractability” issue of the logistic partially linear model using a “full model refitting” procedure. This procedure is easy to implement and facilitates the use of arbitrary learning algorithms for the nuisance models in our framework. Meanwhile, it could potentially be extended to handle other similarly structured problems, such as the partially linear M-estimator.

We also outline the key theoretical analysis of our approaches and demonstrate the model double robustness of the high dimensional construction under ultra-sparsity assumptions, as well as the rate double robustness of the machine learning construction. For the high dimensional setting, we note that our ultra-sparsity assumption, i.e. $s = o(n^{1/2}/\log p)$ on both nuisance models, may be moderately relaxed through cross-fitting, as inspired by Smucler et al. (2019).

Acknowledgements
The author thanks his advisor, Tianxi Cai, and collaborator, Yi Zhang, for helpful discussion andcomments on this note.
References
Athey, S., Imbens, G. W., and Wager, S. (2016). Approximate residual balancing: Debiased inference of average treatment effects in high dimensions. arXiv preprint arXiv:1604.07125.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.

Bradic, J., Wager, S., and Zhu, Y. (2019). Sparsity double robust inference of average treatment effects. arXiv preprint arXiv:1905.00744.

Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.

Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2):413–421.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.

Chernozhukov, V., Newey, W. K., and Robins, J. (2018b). Double/debiased machine learning using regularized Riesz representers. Technical report, cemmap working paper.

Janková, J. and Van De Geer, S. (2016). Confidence regions for high-dimensional generalized linear models under sparsity. arXiv preprint arXiv:1610.01353.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909.

Lin, X. and Carroll, R. J. (2006). Semiparametric estimation in general repeated measures problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):69–88.

Liu, M., Xia, Y., Cho, K., and Cai, T. (2020). Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. arXiv preprint arXiv:2004.00816.

Ma, R., Tony Cai, T., and Li, H. (2020). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, pages 1–15.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.

Severini, T. A. and Staniswalis, J. G. (1994). Quasi-likelihood estimation in semiparametric models. Journal of the American Statistical Association, 89(426):501–511.

Smucler, E., Rotnitzky, A., and Robins, J. M. (2019). A unifying approach for doubly-robust ℓ1-regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737.

Tan, Z. (2019). On doubly robust estimation for logistic partially linear models. Statistics & Probability Letters, 155:108577.

Tan, Z. (2020). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Annals of Statistics, 48(2):811–837.

Tchetgen Tchetgen, E. J., Robins, J. M., and Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97(1):171–180.

Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.

Xia, L., Nan, B., and Li, Y. (2020). A revisit to debiased lasso for generalized linear models. arXiv preprint arXiv:2006.12778.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.