A Note on Debiased/Double Machine Learning Logistic Partially Linear Model
Molei Liu
Department of Biostatistics, Harvard Chan School of Public Health

Abstract
It is of particular interest in many application fields to draw doubly robust inference for a logistic partially linear model, with the predictor specified as the combination of a targeted low dimensional linear parametric function and a nuisance nonparametric function. Recently, Tan (2019) proposed a simple and flexible doubly robust estimator for this purpose. That work introduced two nuisance models, namely the nonparametric component in the logistic model and the conditional mean of the exposure covariate given the other covariates and a fixed response, and specified both as fixed dimensional parametric models. This framework could potentially be extended to the machine learning or high dimensional nuisance modelling exploited in recent work, e.g. Chernozhukov et al. (2018a,b), Smucler et al. (2019) and Tan (2020). Motivated by this, we derive the debiased/double machine learning logistic partially linear model in this note. For the construction of the nuisance models, we separately consider the use of high dimensional sparse parametric models and of general machine learning methods. By deriving certain moment equations to calibrate the first order bias of the nuisance models, we preserve a model double robustness property under high dimensional ultra-sparse nuisance models. We also discuss and compare the underlying assumptions of our method with those of debiased LASSO (Van de Geer et al., 2014). To implement the machine learning proposal, we design a “full model refitting” procedure that allows the use of any blackbox conditional mean estimation method in our framework. Under the machine learning setting, our method is rate doubly robust in a similar sense as Chernozhukov et al. (2018a).
Keywords: Logistic partially linear model, Double machine learning, Debiased inference.
Consider a logistic partially linear model. Let $\{(Y_i, A_i, X_i): i = 1, 2, \ldots, n\}$ be independent and identically distributed samples of $Y \in \{0, 1\}$, $A \in \mathbb{R}$ and $X \in \mathbb{R}^p$. Assume that
$$
\mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + r(X)\}, \qquad (1)
$$
where $\mathrm{expit}(\cdot) = \mathrm{logit}^{-1}(\cdot)$, $\mathrm{logit}(a) = \log\{a/(1-a)\}$, and $r(\cdot)$ is an unknown nuisance function of $X$. In a causal scenario with $A$ taken as the exposure (treatment) of interest, $Y$ the observed outcome and $X$ representing all the confounding variables, the parameter $\beta_0$ is of particular interest in that it measures the causal effect of $A$ on the outcome on the scale of the logarithmic odds ratio. As the most common and natural way to characterize a causal model for a binary outcome, model (1) is used in a wide range of application fields such as medical science, economics and political science. For observational studies with unobserved confounding variables, such as electronic health record (EHR) studies, model (1) also plays an important role in studying the association between $Y$ and a key feature $A$, e.g. the diagnosis code for $Y$, conditional on patient profiles $X$.

Our goal is to estimate and infer $\beta_0$ with asymptotic normality at the rate $n^{-1/2}$. It has been shown that if $X$ is a scalar and $r(\cdot)$ is smooth, semiparametric kernel or sieve regression (Severini and Staniswalis, 1994; Lin and Carroll, 2006) works well for this purpose. However, when $X$ is of relatively high dimensionality, these classic approaches perform poorly due to the curse of dimensionality, and one would instead specify $r(x)$ in a parametric form: $r(x) = x^{\mathsf{T}}\gamma$. To enhance robustness against potential misspecification of $r(x)$, Tan (2019) proposed a doubly robust estimating equation for $\beta_0$ that includes a parametric model $m(x) = g(x^{\mathsf{T}}\alpha)$, with a known link function $g(\cdot)$, for the conditional mean $m_0(x) = \mathrm{E}(A \mid Y = 0, X = x)$:
$$
\frac{1}{n}\sum_{i=1}^n \widehat{\phi}(X_i)\big\{Y_i e^{-\beta A_i - X_i^{\mathsf{T}}\widehat{\gamma}} - (1 - Y_i)\big\}\big\{A_i - g(X_i^{\mathsf{T}}\widehat{\alpha})\big\} = 0, \qquad (2)
$$
where $\widehat{\phi}(x)$ is an estimator of some scalar nuisance function $\phi(x)$ affecting the asymptotic efficiency of the estimator, and $\widehat{\alpha}$ and $\widehat{\gamma}$ are two fixed dimensional nuisance model estimators. As demonstrated in Tan (2019), the $\widehat{\beta}$ solved from (2) is doubly robust in the sense that it is valid when either $r(x) = x^{\mathsf{T}}\gamma$ is correctly specified for the nonparametric component of the logistic partially linear model, or $m(x) = g(x^{\mathsf{T}}\alpha)$ is correct for the conditional mean model $m_0(x) = \mathrm{E}(A \mid Y = 0, X = x)$. This is a novel double robustness property because, prior to this, doubly robust semiparametric estimation of the odds ratio was built upon $p(A \mid X, Y = 0)$, the conditional density of $A$ given $X$ and $Y = 0$ (e.g. Chen, 2007; Tchetgen Tchetgen et al., 2010), which requires a stronger model assumption than (2) for continuous $A$.

Nevertheless, Tan (2019) focuses on fixed dimensional parametric nuisance models that remain prone to misspecification in practice, and the proposed framework is not readily applicable to the high dimensional (Athey et al., 2016; Chernozhukov et al., 2018b; Smucler et al., 2019; Tan, 2020) or general machine learning (Chernozhukov et al., 2018a) nuisance models frequently exploited in recent years. This is because, for such nuisance models of higher complexity, simply using them to replace the fixed dimensional parametric models in (2) incurs excessive fitting bias and does not guarantee the desirable properties of $\widehat{\beta}$.
In addition, estimating $r(x)$ with arbitrary machine learning algorithms (for conditional means) is not straightforward because $r(x)$ is linked to the response through a nonlinear logit function. In this note, we address these challenges and fill the gap by deriving extensions of (2) that accommodate high dimensional sparse nuisance models and general machine learning nuisance models, treated separately.

For the high dimensional sparse model setting, i.e. $p \gg n$ with the two nuisance components specified as parametric models with sparse coefficients, we achieve bias reduction with respect to the regularization errors of the nuisance estimators through certain Dantzig moment equations in $X$. Under an ultra-sparsity assumption on the nuisance models, our estimator preserves the same model double robustness property as in the fixed ($p \ll n$) dimensional case. Compared with the debiased (desparsified) LASSO estimator for the logistic model (Van de Geer et al., 2014; Janková and Van De Geer, 2016), we find our model sparsity assumption more reasonable and interpretable, whereas debiased LASSO has been criticized for requiring the inverse information matrix to be sparse, a generally unverifiable technical condition (Xia et al., 2020).

Under the general machine learning framework, our approach allows the use of any blackbox learning algorithm for conditional mean estimation, as in Chernozhukov et al. (2018a). Unlike the partially linear model considered in their paper, this generality is not readily achievable for the logistic model due to its nonlinear link function. We propose an easy-to-implement “full model refitting” procedure to handle this problem and make the implementation of learning algorithms flexible in our framework. Similar to Chernozhukov et al. (2018a), we discuss the rate double robustness property of the proposed estimator under the assumption that the machine learning estimators of the two nuisance models approach the true models at certain geometric rates.

Before introducing the specific methods in Section 3, we first present a (simplified) generalization of the doubly robust estimating equation (2) and derive its first and second order error decomposition, which plays a central role in motivating and guiding our method construction and theoretical analysis. Suppose the nuisance models $r(x)$ and $m_0(x)$ are estimated by $\widehat{r}(x)$ and $\widehat{m}(x)$, which approach some limiting functions $\bar{r}(x)$ and $\bar{m}(x)$. Motivated by (2), we consider
$$
\frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\widehat{r}(X_i)}\big\}\big\{A_i - \widehat{m}(X_i)\big\} = 0, \qquad (3)
$$
and denote its solution by $\widehat{\beta}$. Compared with (2), we omit here a multiplicative factor $e^{-\widehat{r}(X_i)}\widehat{\phi}(X_i)$ that only affects the asymptotic variance of the estimator, to simplify the formulation so that attention is not distracted from our main idea.
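To make the construction concrete, the following is a minimal numerical sketch, not part of the original development: it simulates data from model (1) with a sparse linear $r(x)$ and then solves the simplified estimating equation (3) for $\beta$ by one-dimensional root finding, plugging in placeholder nuisance values in place of $\widehat{r}$ and $\widehat{m}$. All dimensions, coefficient values and the Gaussian design are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

rng = np.random.default_rng(0)
n, p = 2000, 10
beta0 = 0.5                                            # target log odds ratio
gamma0 = np.r_[1.0, -0.5, 0.5, np.zeros(p - 3)]        # sparse r(x) = x' gamma0
alpha0 = np.r_[0.8, 0.3, -0.4, np.zeros(p - 3)]

X = rng.normal(size=(n, p))
A = X @ alpha0 + rng.normal(size=n)                    # continuous exposure
Y = rng.binomial(1, expit(beta0 * A + X @ gamma0))     # outcome generated from model (1)

# Placeholder nuisance values; in practice these come from the methods in Section 3.
r_hat = X @ gamma0          # stands in for an estimate of r(X_i)
m_hat = X @ alpha0          # crude stand-in for an estimate of E(A | Y = 0, X_i)

def ee(beta):
    """Empirical version of the simplified estimating equation (3)."""
    return np.mean((Y * np.exp(-beta * A) - (1 - Y) * np.exp(r_hat)) * (A - m_hat))

beta_hat = brentq(ee, -5.0, 5.0)    # assumes the bracket [-5, 5] contains a sign change
print(round(beta_hat, 3))
```

Because $r(x)$ above is the true nuisance function, the double robustness discussed next implies that the crude stand-in for $\widehat{m}$ does not invalidate $\widehat{\beta}$; the point of the sketch is only the mechanics of (3).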
We comment on the incorporation of this nuisance function into our framework in Section 3.3.

Concerning the error induced by $\widehat{r}(\cdot)$ and $\widehat{m}(\cdot)$, we decompose equation (3) as follows:
$$
\begin{aligned}
&\frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\widehat{r}(X_i)}\big\}\big\{A_i - \widehat{m}(X_i)\big\} \\
&\quad= \frac{1}{n}\sum_{i=1}^n h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot))
- \frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\big\}\big\{\widehat{m}(X_i) - \bar{m}(X_i)\big\} \\
&\qquad- \frac{1}{n}\sum_{i=1}^n (1 - Y_i) e^{\bar{r}(X_i)}\big\{\widehat{r}(X_i) - \bar{r}(X_i)\big\}\big\{A_i - \bar{m}(X_i)\big\} \\
&\qquad+ O_p\Big(\big\|\widehat{r}(X) - \bar{r}(X)\big\|_{P,2}^2 + \big\|\widehat{m}(X) - \bar{m}(X)\big\|_{P,2}^2\Big) + o_p(1/\sqrt{n}), 
\end{aligned}
\qquad (4)
$$
where we denote $h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot)) = \{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\}\{A_i - \bar{m}(X_i)\}$, define $\|f(X)\|_{P,2}^2 = \mathrm{E} f^2(X)$, and collect the second order terms (and beyond) into $\|\widehat{r}(X) - \bar{r}(X)\|_{P,2}^2 + \|\widehat{m}(X) - \bar{m}(X)\|_{P,2}^2$ under certain mild regularity conditions. When at least one nuisance model is correctly specified, i.e. $\bar{r}(\cdot) = r(\cdot)$ or $\bar{m}(\cdot) = m_0(\cdot)$ holds, we have
$$
\mathrm{E}\big[(1 - Y)\{e^{\bar{r}(X)} - e^{r(X)}\}\{A - \bar{m}(X)\}\big]
= \mathrm{E}\Big[\mathrm{P}(Y = 0 \mid X)\,\{e^{\bar{r}(X)} - e^{r(X)}\}\,\mathrm{E}\big\{A - \bar{m}(X) \,\big|\, Y = 0, X\big\}\Big] = 0,
$$
leading to $\mathrm{E}\, h(Y, A, X; \bar{r}(\cdot), \bar{m}(\cdot)) = \mathrm{E}\, h(Y, A, X; r(\cdot), \bar{m}(\cdot))$, so that $\beta_0$ solves $\mathrm{E}\, h(Y, A, X; \bar{r}(\cdot), \bar{m}(\cdot)) = 0$. Similar to various existing works such as Chernozhukov et al. (2018a, 2016, 2018b) and Tan (2020), the root mean squared errors (rMSEs) of the high dimensional parametric and machine learning estimators, $\|\widehat{r}(X) - \bar{r}(X)\|_{P,2}$ and $\|\widehat{m}(X) - \bar{m}(X)\|_{P,2}$, are assumed to be $o_p(n^{-1/4})$, so that their second order impact is asymptotically negligible. Thus, it remains to remove the first order bias terms:
$$
\begin{aligned}
\Delta_m &= \frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\big\}\big\{\widehat{m}(X_i) - \bar{m}(X_i)\big\}; \\
\Delta_r &= \frac{1}{n}\sum_{i=1}^n (1 - Y_i) e^{\bar{r}(X_i)}\big\{\widehat{r}(X_i) - \bar{r}(X_i)\big\}\big\{A_i - \bar{m}(X_i)\big\}.
\end{aligned}
\qquad (5)
$$
In the low dimensional parametric case, these first order terms do not affect the asymptotic normality of $n^{1/2}(\widehat{\beta} - \beta_0)$ because the nuisance estimators themselves are asymptotically normal at rate $n^{-1/2}$. For high dimensional and machine learning nuisance models, however, their removal is nontrivial due to the excessive fitting error of the nuisance estimators, and the resulting non-negligible bias is known as the over-fitting (or first order) bias (Chernozhukov et al., 2018a). In Section 3 we derive construction procedures for complex nuisance models that remove $\Delta_m$ and $\Delta_r$ properly.

Consider the setting with $p \gg n$, $r(x) = x^{\mathsf{T}}\gamma$ and $m(x) = g(x^{\mathsf{T}}\alpha)$, where $g(\cdot)$ is a monotone link function with derivative $g'(\cdot)$. We derive a construction procedure for high dimensional sparse nuisance models that preserves a model double robustness property similar to Tan (2019).

First, we obtain $\widetilde{\gamma}$ as an initial estimator of $\gamma$. The estimation procedure for $\widetilde{\gamma}$ is quite flexible, as it only needs to satisfy that $\widetilde{\gamma}$ converges to some sparse limiting parameter $\gamma^*$ that equals the true model parameter when the nuisance model $r(x) = x^{\mathsf{T}}\gamma$ is correct. Motivated by Section 2, we propose to obtain $\widehat{\alpha}$ by solving the Dantzig moment equation:
$$
\min_{\alpha \in \mathbb{R}^p} \|\alpha\|_1 \quad \mathrm{s.t.} \quad
\bigg\| \frac{1}{n}\sum_{i=1}^n (1 - Y_i)\, e^{\widetilde{\gamma}^{\mathsf{T}} X_i}\big\{A_i - g(X_i^{\mathsf{T}}\alpha)\big\} X_i \bigg\|_{\infty} \leq \lambda_{\alpha}, \qquad (6)
$$
where $\lambda_{\alpha}$ is a tuning parameter controlling the regularization bias. Then we solve for the nuisance $\gamma$ and the target parameter $\beta$ jointly from:
$$
\begin{aligned}
\min_{\beta \in \mathbb{R},\, \gamma \in \mathbb{R}^p} \|\gamma\|_1 \quad \mathrm{s.t.} \quad
&\bigg\| \frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{X_i^{\mathsf{T}}\gamma}\big\}\, g'(X_i^{\mathsf{T}}\widehat{\alpha})\, X_i \bigg\|_{\infty} \leq \lambda_{\gamma}; \\
&\frac{1}{n}\sum_{i=1}^n \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{X_i^{\mathsf{T}}\gamma}\big\}\big\{A_i - g(X_i^{\mathsf{T}}\widehat{\alpha})\big\} = 0.
\end{aligned}
\qquad (7)
$$
Denote the solution of (7) by $\widehat{\beta}$ and $\widehat{\gamma}$. We demonstrate below that $\widehat{\beta}$ converges to $\beta_0$ at the parametric rate when at least one nuisance model is correct and both are ultra-sparse. Similar to Tan (2020), the maximum-norm constraints (i.e. the Karush–Kuhn–Tucker conditions) in (6) and (7) impose certain moment conditions on the nuisance parameters under potential model misspecification. We outline below how these conditions help calibrate the first order bias terms in (5). For simplicity, we omit technical assumptions and analytical details that can be found in the existing literature on high dimensional estimation and semiparametric inference; see Candes and Tao (2007), Bickel et al. (2009), Bühlmann and Van De Geer (2011) and Negahban et al. (2012) for general theory of high dimensional regularized estimation, and Bradic et al. (2019), Smucler et al. (2019) and Tan (2020) for the theoretical framework for analyzing doubly robust estimators of the average treatment effect with high dimensional sparse nuisance models.

Let $\bar{\alpha}$ and $\{\bar{\gamma}, \bar{\beta}\}$ represent the limiting values of the solutions to (6) and (7) respectively, and let $s$ be the maximum sparsity level of $\bar{\alpha}$, $\bar{\gamma}$ and $\gamma^*$. Following the high dimensional statistics literature (Candes and Tao, 2007; Bickel et al., 2009; Bühlmann and Van De Geer, 2011), we assume that $X_i$ is sub-gaussian with $O(1)$ scale. Then $\lambda_{\alpha}$ and $\lambda_{\gamma}$ are picked at the rate $\lambda = (\log p/n)^{1/2}$, under which standard analysis yields
$$
\begin{aligned}
\xi_1 &= \|\widetilde{\gamma} - \gamma^*\|_1 + |\widehat{\beta} - \bar{\beta}| + \|\widehat{\gamma} - \bar{\gamma}\|_1 + \|\widehat{\alpha} - \bar{\alpha}\|_1 = O_p(s\lambda); \\
\xi_2 &= \|X^{\mathsf{T}}(\widetilde{\gamma} - \gamma^*)\|_{P,2}^2 + \|A(\widehat{\beta} - \bar{\beta})\|_{P,2}^2 + \|X^{\mathsf{T}}(\widehat{\gamma} - \bar{\gamma})\|_{P,2}^2 + \|X^{\mathsf{T}}(\widehat{\alpha} - \bar{\alpha})\|_{P,2}^2 = O_p(s\lambda^2).
\end{aligned}
\qquad (8)
$$
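For a concrete sense of how (6) can be computed, here is a small sketch continuing the simulation above, for the special case of an identity link $g(u) = u$, in which the constraint is linear in $\alpha$ and the problem reduces to a linear program. The choice of $\lambda_{\alpha}$, the use of the true $\gamma$ in place of an initial estimator $\widetilde{\gamma}$, and the solver settings are all illustrative assumptions.

```python
from scipy.optimize import linprog

def dantzig_alpha(X, A, Y, gamma_tilde, lam):
    """Solve (6) with identity link g(u) = u via linear programming.

    min ||alpha||_1  s.t.  || n^{-1} sum_i w_i (A_i - X_i'alpha) X_i ||_inf <= lam,
    with weights w_i = (1 - Y_i) exp(gamma_tilde' X_i).
    """
    n, p = X.shape
    w = (1 - Y) * np.exp(X @ gamma_tilde)
    c_vec = (X * (w * A)[:, None]).mean(axis=0)        # n^{-1} sum_i w_i A_i X_i
    G = (X * w[:, None]).T @ X / n                     # n^{-1} sum_i w_i X_i X_i'
    # Write alpha = u - v with u, v >= 0 and minimize 1'(u + v).
    obj = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([G, -G]),              #  G(u - v) <= lam + c
                      np.hstack([-G, G])])             # -G(u - v) <= lam - c
    b_ub = np.concatenate([lam + c_vec, lam - c_vec])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v

lam_alpha = np.sqrt(np.log(p) / n)                      # rate (log p / n)^{1/2}; constant not tuned
alpha_hat = dantzig_alpha(X, A, Y, gamma0, lam_alpha)   # gamma0 stands in for the initial estimator
```

For a general monotone link $g$ the constraint is no longer linear, so the LP form above is only an identity-link illustration; a different convex or iterative solver would be needed in that case.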
Remark 1. Note that (6) involves the initial estimator $\widetilde{\gamma}$ and (7) involves the estimator $\widehat{\alpha}$ obtained beforehand, which requires some additional effort to remove their fitting errors when analyzing $\widehat{\gamma}$ and $\widehat{\alpha}$, compared with the standard analysis of the Dantzig selector. See Bradic et al. (2019), Smucler et al. (2019) and Tan (2020) for a similar issue and for the relevant technical details used for this purpose.

Now consider the case when at least one nuisance model is correctly specified. Define
$$
\begin{aligned}
\varphi(Y_i, A_i, X_i; \beta, \gamma, \alpha) &= \big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{X_i^{\mathsf{T}}\gamma}\big\}\, g'(X_i^{\mathsf{T}}\alpha)\, X_i; \\
\psi(Y_i, A_i, X_i; \gamma, \alpha) &= (1 - Y_i)\, e^{\gamma^{\mathsf{T}} X_i}\big\{A_i - g(X_i^{\mathsf{T}}\alpha)\big\}\, X_i.
\end{aligned}
$$
When $r(x)$ is correctly specified, i.e. $r(x) = x^{\mathsf{T}}\gamma_0$ for some $\gamma_0$, we have $\gamma^* = \bar{\gamma} = \gamma_0$. Then $\mathrm{E}\,\varphi(Y, A, X; \bar{\beta}, \bar{\gamma}, \bar{\alpha}) = \mathbf{0}$ by the correctness of $r(x)$, and $\mathrm{E}\,\psi(Y, A, X; \bar{\gamma}, \bar{\alpha}) = \mathrm{E}\,\psi(Y, A, X; \gamma^*, \bar{\alpha}) = \mathbf{0}$ by the moment condition in (6). When $m(x)$ is correct, i.e. $m_0(x) = g(x^{\mathsf{T}}\alpha_0)$ and $\bar{\alpha} = \alpha_0$, we have $\mathrm{E}\,\psi(Y, A, X; \bar{\gamma}, \bar{\alpha}) = \mathbf{0}$, and, corresponding to the constraint in the first row of (7), it holds that $\mathrm{E}\,\varphi(Y, A, X; \bar{\beta}, \bar{\gamma}, \bar{\alpha}) = \mathbf{0}$. In addition, $\bar{\beta} = \beta_0$ in both cases, by the moment equation in the second row of (7) and the discussion in Section 2. Combining these with the sub-gaussianity of the covariates, the two bias terms in (5) can be controlled through
$$
\begin{aligned}
\Delta_m &\leq \bigg\| \frac{1}{n}\sum_{i=1}^n \varphi(Y_i, A_i, X_i; \bar{\beta}, \bar{\gamma}, \bar{\alpha}) \bigg\|_{\infty} \|\widehat{\alpha} - \bar{\alpha}\|_1 + O_p(\xi_2) + o_p(1/\sqrt{n}) \\
&= O_p\{(\log p/n)^{1/2}\}\, O_p(s\lambda) + O_p(s\lambda^2) + o_p(1/\sqrt{n}) = O_p(s\log p/n) + o_p(1/\sqrt{n}); \\
\Delta_r &\leq \bigg\| \frac{1}{n}\sum_{i=1}^n \psi(Y_i, A_i, X_i; \gamma^*, \bar{\alpha}) \bigg\|_{\infty} \|\widehat{\gamma} - \bar{\gamma}\|_1 + O_p(\xi_2) + o_p(1/\sqrt{n}) = O_p(s\log p/n) + o_p(1/\sqrt{n}),
\end{aligned}
$$
where the second order error terms are again collected into $O_p(\xi_2) + o_p(1/\sqrt{n})$. Thus, when $s = o(n^{1/2}/\log p)$, $\Delta_m$ and $\Delta_r$ are below the parametric rate and, consequently, equation (4) is asymptotically equivalent to $n^{-1}\sum_{i=1}^n h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot)) = 0$, so that $n^{1/2}(\widehat{\beta} - \beta_0)$ converges weakly to a normal distribution with mean 0 under some mild regularity conditions.

Though extensively studied and used in recent years, debiased LASSO (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Van de Geer et al., 2014) has been criticized because, in the generalized linear model setting, the sparsity condition it imposes on the inverse of the information matrix is neither interpretable nor easily justifiable, leading to subpar performance in practice (Xia et al., 2020). Interestingly, we find the model and sparsity assumptions of our method more reasonable than those of debiased LASSO and present a simple comparison of the two approaches in Remark 2.
Remark 2. Assume the logistic model $\mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + X^{\mathsf{T}}\gamma_0\}$ is correctly specified. Let its expected information matrix be $\Sigma_{\beta_0,\gamma_0} = \mathrm{E}[\mathrm{expit}'\{\beta_0 A + X^{\mathsf{T}}\gamma_0\}(A, X^{\mathsf{T}})^{\mathsf{T}}(A, X^{\mathsf{T}})]$, let $\Theta_{\beta_0,\gamma_0} = \Sigma_{\beta_0,\gamma_0}^{-1}$ and let $\theta_{\beta}$ be the first column of $\Theta_{\beta_0,\gamma_0}$. In Van de Geer et al. (2014) and Janková and Van De Geer (2016), asymptotic normality of the debiased logistic LASSO estimator of $\beta_0$ is established under the sparsity assumption $\|\theta_{\beta}\|_0 = o(\{n/\log p\}^{1/2})$. Using cross-fitting to estimate $\Sigma_{\beta_0,\gamma_0}$, recent work such as Ma et al. (2020) and Liu et al. (2020) has relaxed this condition to $\|\theta_{\beta}\|_0 = o(n^{1/2}/\log p)$, or replaced it with approximate sparsity assumptions such as $\|\theta_{\beta}\|_1 = O(1)$. However, in the presence of the weight $\mathrm{expit}'\{\beta_0 A + X^{\mathsf{T}}\gamma_0\}$ in $\Sigma_{\beta_0,\gamma_0}$, neither of these assumptions is interpretable, nor are they justifiable even for the most common gaussian design (Xia et al., 2020).

In comparison, we require that $\mathrm{E}(A \mid Y = 0, X = x) = g(x^{\mathsf{T}}\alpha_0)$ with $\|\alpha_0\|_0 = o(n^{1/2}/\log p)$. This assumption has two advantages. First, it accommodates a nonlinear link function $g(\cdot)$, which can make the assumption more reasonable for a categorical $A$. Second, our assumption is imposed directly on a conditional model, so it is more interpretable than that of debiased LASSO. As a simple example, consider a practically useful conditional gaussian model: $(A, X^{\mathsf{T}})^{\mathsf{T}} \mid \{Y = j\} \sim N(\mu_j, \Sigma)$ for $j = 0, 1$. Then $r(x) = x^{\mathsf{T}}\gamma_0$ with $\gamma_0$ determined by $\Sigma^{-1}(\mu_1 - \mu_0)$, and $A \mid X, Y = 0$ follows a gaussian linear model $m_0(x) = x^{\mathsf{T}}\alpha_0$ with $\alpha_0$ determined by $\Sigma^{-1}$. Thus, our model sparsity assumptions on the nuisance coefficients $\alpha_0$ and $\gamma_0$ are effectively imposed on the data generating parameters $\Sigma^{-1}$ and $\mu_1 - \mu_0$, which are more interpretable and verifiable in practice.

We now turn to a more general machine learning setting in which any learning algorithm for conditional means can potentially be applied to estimate $r(\cdot)$ and $m_0(\cdot)$. Similar to Chernozhukov et al. (2018a), we randomly split the $n$ samples into $K = O(1)$ folds $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_K$ of equal size to help remove the over-fitting bias. Denote $\mathcal{I}_{-k} = \{1, \ldots, n\} \setminus \mathcal{I}_k$ and replace estimating equation (3) by the cross-fitted
$$
\frac{1}{n}\sum_{k=1}^K \sum_{i \in \mathcal{I}_k} \big\{Y_i e^{-\beta A_i} - (1 - Y_i)\, e^{\widehat{r}^{[-k]}(X_i)}\big\}\big\{A_i - \widehat{m}^{[-k]}(X_i)\big\} = 0, \qquad (9)
$$
where $\widehat{r}^{[-k]}(\cdot)$ and $\widehat{m}^{[-k]}(\cdot)$ are two machine learning estimators obtained from the samples in $\mathcal{I}_{-k}$ and converging to the true models $r(\cdot)$ and $m_0(\cdot)$. We outline below a strategy that utilizes an arbitrary (supervised) learning algorithm to estimate the nuisance models.

Suppose there is a blackbox procedure $\mathcal{L}(R_i, C_i; \mathcal{I})$ that takes samples from $\mathcal{I} \subseteq \{1, 2, \ldots, n\}$ with some response $R_i$ and covariates $C_i$ and outputs an estimator of $\mathrm{E}[R_i \mid C_i]$ based on $\{i \in \mathcal{I}\}$. (Without purposeful modification, the natural form of most contemporary supervised learning methods, e.g. random forests, support vector machines and neural networks, can be conceptualized in this way, since their goal is prediction for a continuous response or classification for a categorical response.) The estimator of $m_0(\cdot)$ can be obtained as $\widehat{m}^{[-k]}(\cdot) = \mathcal{L}(A_i, X_i; \mathcal{I}_{-k} \cap \{i: Y_i = 0\})$. Unlike the partially linear setting in Chernozhukov et al. (2018a), estimating $r(\cdot)$ with $\mathcal{L}$ is more sophisticated since $r(\cdot)$ is defined through the unextractable semiparametric form $\mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + r(X)\}$. Certainly, one could modify some existing machine learning approaches, such as a neural network, to accommodate this form, e.g. by setting the last layer of the network to be the combination of a complex network in $X$ and a linear function of $A$, linked to the outcome through an expit link. However, such modification is not readily available in general and, even if there were some way to adapt $\mathcal{L}$ to this semiparametric form, it would typically require additional human effort in implementation and validation. Thus, we introduce a “full model refitting” procedure based on an arbitrary algorithm $\mathcal{L}$ to estimate $r(\cdot)$.
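As an illustration of the blackbox interface $\mathcal{L}$ and the cross-fitted construction of $\widehat{m}^{[-k]}$, here is a short sketch continuing the simulation above, with a random forest standing in for $\mathcal{L}$; the number of folds, the choice of learner and its settings are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

K = 5
m_hat_cf = np.empty(n)      # cross-fitted values m-hat^{[-k]}(X_i) for i in I_k
folds = list(KFold(n_splits=K, shuffle=True, random_state=1).split(X))
for train_idx, test_idx in folds:
    # I_{-k} is the training part; fit the blackbox L on the controls (Y = 0) only.
    ctrl = train_idx[Y[train_idx] == 0]
    learner = RandomForestRegressor(n_estimators=200, random_state=1)
    learner.fit(X[ctrl], A[ctrl])
    m_hat_cf[test_idx] = learner.predict(X[test_idx])
```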
Our method is motivated by the following simple proposition.
Proposition 1. Let the full model be $M(A, X) = \mathrm{P}(Y = 1 \mid A, X) = \mathrm{expit}\{\beta_0 A + r(X)\}$. One has
$$
\beta_0 = \arg\min_{\beta \in \mathbb{R}} \mathrm{E}\big[\mathrm{logit}\{M(A, X)\} - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2.
$$

Proof. For any $\beta \in \mathbb{R}$, we have
$$
\begin{aligned}
\mathrm{E}\big[\mathrm{logit}\{M(A, X)\} - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2
&= \mathrm{E}\big[\beta_0 A + r(X) - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2 \\
&= \mathrm{E}\big[(\beta_0 - \beta)\{A - \mathrm{E}(A \mid X)\} + \eta(X)\big]^2 \\
&= (\beta_0 - \beta)^2\, \mathrm{E}\{A - \mathrm{E}(A \mid X)\}^2 + \mathrm{E}\{\eta^2(X)\},
\end{aligned}
$$
where $\eta(X) = r(X) + \beta_0 \mathrm{E}(A \mid X)$ and the cross term vanishes because $\mathrm{E}\{A - \mathrm{E}(A \mid X) \mid X\} = 0$. Thus, $\beta_0$ minimizes $\mathrm{E}\big[\mathrm{logit}\{M(A, X)\} - \beta\{A - \mathrm{E}(A \mid X)\}\big]^2$.

Randomly split each $\mathcal{I}_{-k}$ into $K$ folds $\mathcal{I}_{-k,1}, \ldots, \mathcal{I}_{-k,K}$ of equal size. Motivated by Proposition 1, we first estimate the “full” model $M(A, X)$ leaving out each $\mathcal{I}_{-k,j}$ within $\mathcal{I}_{-k}$:
$$
\widehat{M}^{[-k,-j]}(\cdot) = \mathcal{L}\big(Y_i, (A_i, X_i^{\mathsf{T}})^{\mathsf{T}}; \mathcal{I}_{-k,-j}\big),
$$
and learn $a(X) = \mathrm{E}(A \mid X)$ as $\widehat{a}^{[-k,-j]}(\cdot) = \mathcal{L}(A_i, X_i; \mathcal{I}_{-k,-j})$. Then we fit the (cross-fitted) least squares regression to obtain
$$
\breve{\beta}^{[-k]} = \arg\min_{\beta \in \mathbb{R}} \frac{1}{|\mathcal{I}_{-k}|}\sum_{j=1}^K \sum_{i \in \mathcal{I}_{-k,j}} \Big[\mathrm{logit}\big\{\widehat{M}^{[-k,-j]}(A_i, X_i)\big\} - \beta\big\{A_i - \widehat{a}^{[-k,-j]}(X_i)\big\}\Big]^2. \qquad (10)
$$
We use cross-fitting in (10) to avoid the over-fitting of $\widehat{M}^{[-k,-j]}(\cdot)$ and $\widehat{a}^{[-k,-j]}(\cdot)$. An estimator of $r(X_i)$ could then be given by $\mathrm{logit}\{\widehat{M}(A_i, X_i)\} - \breve{\beta} A_i$. Note, however, that this empirical version of $\mathrm{logit}\{M(A_i, X_i)\} - \beta_0 A_i$ typically still involves $A_i$, due to the discrepancy between the true $\beta_0$ and $M(\cdot)$ and their empirical estimates, which can impede the removal of the over-fitting bias terms in (5) since $A - m_0(X)$ is not orthogonal to an error that depends on $A$. We therefore instead use the conditional mean of $\mathrm{logit}\{M(A, X)\} - \beta_0 A$ given $X$ to estimate $r(\cdot)$. Let $W_i = \mathrm{logit}\{\widehat{M}^{[-k,-j]}(A_i, X_i)\}$ for each $i \in \mathcal{I}_{-k,j}$ and obtain $\widehat{t}^{[-k]}(\cdot) = \mathcal{L}(W_i, X_i; \mathcal{I}_{-k})$ as a “refitting” estimator of $t(x) := \mathrm{E}[\mathrm{logit}\{M(A, X)\} \mid X = x]$. Then the estimator of $r(\cdot)$ is given by
$$
\widehat{r}^{[-k]}(x) = \widehat{t}^{[-k]}(x) - \breve{\beta}^{[-k]}\, \widehat{a}^{[-k]}(x), \quad \text{where } \widehat{a}^{[-k]}(x) = \frac{1}{K}\sum_{j=1}^K \widehat{a}^{[-k,-j]}(x).
$$
Alternatively, one can also refit $r(\cdot)$ through
$$
\widehat{r}^{[-k]}(\cdot) = \log\left(\frac{\mathcal{L}\big(e^{-\breve{\beta}^{[-k]} A_i}, X_i; \mathcal{I}_{-k} \cap \{i: Y_i = 1\}\big)}{\mathcal{L}\big(1 - Y_i, X_i; \mathcal{I}_{-k}\big)}\right),
$$
inspired by the moment condition sufficient to identify $r(x)$:
$$
\mathrm{E}\big[Y e^{-\beta_0 A} - (1 - Y) e^{r(X)} \,\big|\, X\big]
= \mathrm{E}\big[e^{-\beta_0 A} \,\big|\, X, Y = 1\big]\,\mathrm{P}(Y = 1 \mid X) - e^{r(X)}\, \mathrm{E}\big[(1 - Y) \,\big|\, X\big] = 0.
$$

At last, we outline the theoretical analysis of the estimator $\widehat{\beta}$ finally solved from (9). Similar to Chernozhukov et al. (2018a), the analysis relies on certain mild regularity conditions and the assumptions that (i) $\mathcal{L}$ outputs uniformly consistent estimators of the conditional mean models in all learning tasks it is applied to; and (ii) the rMSEs of the estimators output by $\mathcal{L}$ are controlled at $o_p(n^{-1/4})$.
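The following sketch, continuing the running example, walks through the “full model refitting” steps on one training set $\mathcal{I}_{-k}$, again with random forests standing in for $\mathcal{L}$: inner cross-fitting of the full model and of $a(X)$, the least squares fit (10) for $\breve{\beta}^{[-k]}$, and the refit of $t(x)$ to form $\widehat{r}^{[-k]}$. The probability clipping constant, forest sizes and fold counts are illustrative assumptions.

```python
from scipy.special import logit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def refit_r(X_tr, A_tr, Y_tr, K_inner=5, seed=2):
    """Sketch of full model refitting on one I_{-k}: returns (beta_breve, r-hat function)."""
    AX = np.column_stack([A_tr, X_tr])
    W = np.empty(len(Y_tr))            # cross-fitted logit{M-hat(A_i, X_i)}
    a_fit = np.empty(len(Y_tr))        # cross-fitted a-hat(X_i) = E-hat(A | X_i)
    a_models = []
    for tr, te in KFold(n_splits=K_inner, shuffle=True, random_state=seed).split(X_tr):
        full = RandomForestClassifier(n_estimators=200, random_state=seed).fit(AX[tr], Y_tr[tr])
        prob = np.clip(full.predict_proba(AX[te])[:, 1], 1e-3, 1 - 1e-3)   # avoid infinite logits
        W[te] = logit(prob)
        a_mod = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr[tr], A_tr[tr])
        a_fit[te] = a_mod.predict(X_tr[te])
        a_models.append(a_mod)
    resid = A_tr - a_fit
    beta_breve = np.sum(W * resid) / np.sum(resid ** 2)                    # least squares fit (10)
    t_model = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr, W)  # refit t(x)
    def r_hat(X_new):
        a_bar = np.mean([mod.predict(X_new) for mod in a_models], axis=0)  # a-hat^{[-k]}(x)
        return t_model.predict(X_new) - beta_breve * a_bar
    return beta_breve, r_hat
```

Applying refit_r to each $\mathcal{I}_{-k}$ and evaluating the returned function on the held-out fold yields the cross-fitted $\widehat{r}^{[-k]}(X_i)$ values needed in (9).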
Remark 3. Assumptions (i) and (ii) above imply that $\mathcal{L}$ should perform comparably well on learning tasks with the covariates set to $X$ or to $(A, X^{\mathsf{T}})^{\mathsf{T}}$. Classic nonparametric regression approaches such as kernel smoothing or sieves may not satisfy this, because including one more covariate $A$ in the model can have a substantial impact on their rates of convergence. We therefore recommend using modern learning approaches that are more robust to dimensionality, such as random forests and neural networks, in our framework. In contrast, the classic sieve or kernel construction for a one-dimensional $X$, in a “plug-in” type of model, has been well studied in existing work such as Severini and Staniswalis (1994) and Lin and Carroll (2006).

Based on assumption (ii), we have $\|\widehat{m}^{[-k]}(X) - m_0(X)\|_{P,2} = o_p(n^{-1/4})$,
$$
\breve{\beta}^{[-k]} = \frac{\sum_{i \in \mathcal{I}_{-k}} \mathrm{logit}\{M(A_i, X_i)\}\{A_i - a(X_i)\}}{\sum_{i \in \mathcal{I}_{-k}} \{A_i - a(X_i)\}^2} + o_p(n^{-1/4}) = \beta_0 + O_p(n^{-1/2}) + o_p(n^{-1/4}),
$$
and consequently $\|\widehat{r}^{[-k]}(X) - r(X)\|_{P,2} = o_p(n^{-1/4})$. Then the second order error in (the cross-fitted version of) (4) is $o_p(n^{-1/2})$. Also, we remove the first order (over-fitting) bias defined in (5) through assumption (i) and concentration, facilitated by the use of cross-fitting in (9), in the same spirit as Chernozhukov et al. (2018a). Combining these two results, (9) is asymptotically equivalent to $n^{-1}\sum_{i=1}^n h(Y_i, A_i, X_i; r(\cdot), m_0(\cdot)) = 0$, and thus $n^{1/2}(\widehat{\beta} - \beta_0)$ is asymptotically normal with mean 0 under mild regularity conditions.

We now turn back to the construction of Tan (2019):
$$
\frac{1}{n}\sum_{i=1}^n \widehat{\phi}(X_i)\, e^{-\widehat{r}(X_i)}\big\{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\widehat{r}(X_i)}\big\}\big\{A_i - \widehat{m}(X_i)\big\} = 0, \qquad (11)
$$
where $\widehat{\phi}(X_i)$ is an empirical estimator of a (typically positive) nuisance function $\phi(\cdot)$ that depends on the nuisance models $\bar{r}(\cdot)$ and $\bar{m}(\cdot)$ and affects the asymptotic variance of $\widehat{\beta}$. Tan (2019) proposed and studied two options for $\phi(\cdot)$:
$$
\phi_{\mathrm{opt}}(X) = \frac{\mathrm{E}\big[\{A - \bar{m}(X)\}^2 \,\big|\, X, Y = 0\big]}{\mathrm{E}\big[\{A - \bar{m}(X)\}^2 / \mathrm{expit}\{\beta_0 A + \bar{r}(X)\} \,\big|\, X, Y = 0\big]}; \qquad
\phi_{\mathrm{simp}}(X) = \mathrm{expit}\{\bar{r}(X)\}.
$$
It was shown that, when both nuisance models are correctly specified, the estimator solved with the weight $\phi_{\mathrm{opt}}(X)$ achieves the minimum asymptotic variance. Since the calculation of $\phi_{\mathrm{opt}}(X)$ involves numerical integration over the conditional distribution of $A$ given $X$ and $Y = 0$, it is sometimes inconvenient to implement. So Tan (2019) also provides the simplified choice $\phi_{\mathrm{simp}}(X)$, obtained by evaluating $\phi_{\mathrm{opt}}(X)$ at $\beta_0 = 0$.
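To illustrate the weighted construction (11) with the simplified weight $\phi_{\mathrm{simp}}$, the sketch below forms cross-fitted $\widehat{r}^{[-k]}$ values with refit_r from the previous sketch (reusing the same folds and the cross-fitted m_hat_cf defined earlier) and solves the weighted equation by root finding. This is only an illustrative assembly of the earlier pieces, not a prescribed implementation.

```python
from scipy.optimize import brentq
from scipy.special import expit

# Cross-fitted r-hat^{[-k]}(X_i), aligned with the folds used for m_hat_cf above.
r_hat_cf = np.empty(n)
for train_idx, test_idx in folds:
    _, r_fn = refit_r(X[train_idx], A[train_idx], Y[train_idx])
    r_hat_cf[test_idx] = r_fn(X[test_idx])

def ee_weighted(beta):
    """Empirical version of (11) with phi-hat = phi_simp = expit(r-hat)."""
    w = expit(r_hat_cf) * np.exp(-r_hat_cf)        # phi-hat(X_i) * exp(-r-hat(X_i))
    return np.mean(w * (Y * np.exp(-beta * A) - (1 - Y) * np.exp(r_hat_cf)) * (A - m_hat_cf))

beta_hat_w = brentq(ee_weighted, -5.0, 5.0)        # again assumes the bracket contains a root
```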
Inclusion of the nuisance estimator $\widehat{\phi}(X_i)\, e^{-\widehat{r}(X_i)}$ incurs two challenges. First, it introduces additional bias terms. Second, the form of the first order biases $\Delta_m$ and $\Delta_r$ in (5) changes. Correspondingly, we make some moderate modifications to the methods described in Sections 3.1 and 3.2. We again adopt a cross-fitting strategy (for both the high dimensional parametric and the machine learning settings) to obtain $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$ and plug it in for $i \in \mathcal{I}_k$, $k = 1, 2, \ldots, K$. Note that a function depending solely on $X_i$ is orthogonal to $h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot)) = \{Y_i e^{-\beta A_i} - (1 - Y_i) e^{\bar{r}(X_i)}\}\{A_i - \bar{m}(X_i)\}$ when $\bar{r}(\cdot) = r(\cdot)$ or $\bar{m}(\cdot) = m_0(\cdot)$. Then, as long as $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$ is consistent and at least one nuisance model is correct, we can remove the (cross-fitted) bias term
$$
\frac{1}{n}\sum_{k=1}^K \sum_{i \in \mathcal{I}_k} \big\{\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)} - \phi(X_i)\, e^{-\bar{r}(X_i)}\big\}\, h(Y_i, A_i, X_i; \bar{r}(\cdot), \bar{m}(\cdot))
$$
through concentration. Meanwhile, we note that under this efficiency-enhancing construction the first order bias terms $\Delta_m$ and $\Delta_r$ defined in (5) become weighted by $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$. We therefore also weight the moment equations (6) and (7) by $\widehat{\phi}^{[-k]}(X_i)\, e^{-\widehat{r}^{[-k]}(X_i)}$, as a modification of the high dimensional sparse modelling strategy. In the machine learning scenario, both nuisance estimators are supposed to approach the corresponding true models, i.e. $\bar{r}(\cdot) = r(\cdot)$ and $\bar{m}(\cdot) = m_0(\cdot)$, so there is no need to modify the way $\widehat{m}^{[-k]}(\cdot)$ and $\widehat{r}^{[-k]}(\cdot)$ are obtained in Section 3.2.

In this note, we extend the low dimensional parametric doubly robust approach for the logistic partially linear model of Tan (2019) to settings where the nuisance models are estimated by high dimensional sparse regression or by general machine learning methods. For the high dimensional setting, we derive certain moment equations for the nuisance models to remove the first order bias. We also find the sparsity assumption of our approach more interpretable and reasonable than the “sparse inverse information matrix” assumption used by debiased LASSO (Van de Geer et al., 2014; Janková and Van De Geer, 2016). For the general machine learning framework, we handle the nonlinearity and “unextractability” issue of the logistic partially linear model using a “full model refitting” procedure. This procedure is easy to implement and facilitates the use of arbitrary learning algorithms for the nuisance models in our framework. Meanwhile, it could potentially be extended to handle other similarly structured problems, such as the partially linear M-estimator.

We also outline the key theoretical analysis of our approaches and demonstrate the model double robustness of the high dimensional construction under ultra-sparsity assumptions, as well as the rate double robustness of the machine learning construction. For the high dimensional setting, we note that our ultra-sparsity assumption, i.e. $s = o(n^{1/2}/\log p)$ on both nuisance models, may be moderately relaxed through cross-fitting, as inspired by Smucler et al. (2019).

Acknowledgements
The author thanks his advisor, Tianxi Cai, and collaborator, Yi Zhang, for helpful discussion andcomments on this note.
References
Athey, S., Imbens, G. W., and Wager, S. (2016). Approximate residual balancing: Debiased inference of average treatment effects in high dimensions. arXiv preprint arXiv:1604.07125.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.

Bradic, J., Wager, S., and Zhu, Y. (2019). Sparsity double robust inference of average treatment effects. arXiv preprint arXiv:1905.00744.

Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.

Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2):413–421.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.

Chernozhukov, V., Newey, W. K., and Robins, J. (2018b). Double/debiased machine learning using regularized Riesz representers. Technical report, cemmap working paper.

Janková, J. and Van De Geer, S. (2016). Confidence regions for high-dimensional generalized linear models under sparsity. arXiv preprint arXiv:1610.01353.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909.

Lin, X. and Carroll, R. J. (2006). Semiparametric estimation in general repeated measures problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):69–88.

Liu, M., Xia, Y., Cho, K., and Cai, T. (2020). Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. arXiv preprint arXiv:2004.00816.

Ma, R., Tony Cai, T., and Li, H. (2020). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, pages 1–15.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.

Severini, T. A. and Staniswalis, J. G. (1994). Quasi-likelihood estimation in semiparametric models. Journal of the American Statistical Association, 89(426):501–511.

Smucler, E., Rotnitzky, A., and Robins, J. M. (2019). A unifying approach for doubly-robust ℓ1-regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737.

Tan, Z. (2019). On doubly robust estimation for logistic partially linear models. Statistics & Probability Letters, 155:108577.

Tan, Z. (2020). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Annals of Statistics, 48(2):811–837.

Tchetgen Tchetgen, E. J., Robins, J. M., and Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97(1):171–180.

Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.

Xia, L., Nan, B., and Li, Y. (2020). A revisit to debiased lasso for generalized linear models. arXiv preprint arXiv:2006.12778.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.