Semiparametric Counterfactual Density Estimation
Edward H. Kennedy, Sivaraman Balakrishnan, Larry Wasserman
Department of Statistics & Data Science, Carnegie Mellon University
{edward, siva, larry}@stat.cmu.edu
Abstract
Causal effects are often characterized with averages, which can give an incomplete picture of the underlying counterfactual distributions. Here we consider estimating the entire counterfactual density and generic functionals thereof. We focus on two kinds of target parameters. The first is a density approximation, defined by a projection onto a finite-dimensional model using a generalized distance metric, which includes f-divergences as well as L_p norms. The second is the distance between counterfactual densities, which can be used as a more nuanced effect measure than the mean difference, and as a tool for model selection. We study nonparametric efficiency bounds for these targets, giving results for smooth but otherwise generic models and distances. Importantly, we show how these bounds connect to means of particular non-trivial functions of counterfactuals, linking the problems of density and mean estimation. We go on to propose doubly robust-style estimators for the density approximations and distances, and study their rates of convergence, showing they can be optimally efficient in large nonparametric models. We also give analogous methods for model selection and aggregation, when many models may be available and of interest. Our results all hold for generic models and distances, but throughout we highlight what happens for particular choices, such as L_2 projections on linear models, and KL projections on exponential families. Finally we illustrate by estimating the density of CD4 count among patients with HIV, had all been treated with combination therapy versus zidovudine alone, as well as a density effect. Our results suggest combination therapy may have increased CD4 count most for high-risk patients. Our methods are implemented in the freely available R package npcausal on GitHub.

Keywords: causal inference, density estimation, influence function, model misspecification, semiparametric theory.
1 Introduction

It is very common in causal inference to quantify causal effects with means. The classic average treatment effect (ATE) parameter, for instance, measures the difference in mean outcome had all versus none in a population been treated. This can certainly be a useful summary, but it can also miss potentially important differences in the distributions of the counterfactual outcomes, beyond a simple mean shift. To illustrate, consider the densities in Figure 1, which all have exactly the same mean and variance. These would be indistinguishable with the ATE, or any other measure that did not look past the first two moments.

Figure 1: Densities for six distributions, all with mean zero and variance four.
In general, it can be very practically useful to know the shape of the counterfactual density. If the counterfactual density differed at all under treatment versus control (or under any other generic interventions), this would imply treatment had some effect, even if the ATE were zero. The presence of skew would indicate that some subjects have relatively extreme responses to treatment; next steps could include trying to understand who these subjects are, and why their responses are unusual. Similarly, multimodal structure could point to the existence of underlying subgroups with differential responses to treatment, which could be important for optimizing treatment policies. Contrasting the shape of the density under different interventions could inform hypotheses about how treatment works, e.g., perhaps it works by reducing variance, or driving up negative outcomes. This could help enhance future versions of treatment, or motivate the development of new treatments altogether.

There is a large literature on distributional treatment effects defined in terms of quantiles, or cumulative distribution functions (CDFs), with a similar goal of moving beyond simple mean summaries to study the entire counterfactual distribution [Abadie, 2002, Chernozhukov and Hansen, 2005, Chernozhukov et al., 2013, Díaz, 2017, Firpo, 2007, Fortin et al., 2011, Frölich and Melly, 2013, Machado and Mata, 2005, Melly, 2005, Rothe, 2010, Wang et al., 2018, Wang and Qin, 2010, Zhang et al., 2012]. However, the challenges and methods are substantially different for density estimation. This is largely a result of the fact that the CDF at y is the mean of the thresholded outcome 1(Y ≤ y), so that counterfactual CDF estimation mostly reduces to counterfactual mean estimation, after replacing the outcome with an indicator.
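To make this reduction concrete, here is a minimal simulation sketch. The data-generating process, the logistic propensity score, and the classical inverse-probability-weighted (IPW) estimator below are our own illustrative assumptions, not constructs taken from the paper:

```python
import math
import random

random.seed(0)

# Illustrative data (our assumption): confounder X, binary treatment A, outcome Y.
n = 20000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    pi1 = 1 / (1 + math.exp(-x))           # propensity score P(A = 1 | X = x)
    a = 1 if random.random() < pi1 else 0
    y = x + a + random.gauss(0, 1)         # so Y^1 ~ N(1, sd = sqrt(2))
    data.append((x, a, y))

def ipw_cdf(y0, a=1):
    """IPW estimate of the counterfactual CDF P(Y^a <= y0), i.e., the
    counterfactual mean of the indicator 1(Y <= y0)."""
    pi = lambda x: 1 / (1 + math.exp(-x)) if a == 1 else 1 - 1 / (1 + math.exp(-x))
    return sum((ai == a) * (yi <= y0) / pi(x) for x, ai, yi in data) / n

# Here Y^1 ~ N(1, sqrt(2)), so the true counterfactual CDF at y0 = 1 is 0.5.
print(round(ipw_cdf(1.0), 2))
```

Replacing the indicator 1(Y ≤ y0) with Y itself would recover the usual IPW estimate of the counterfactual mean, which is exactly the sense in which CDF estimation reduces to mean estimation.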
A related difference is that the CDF is pathwise differentiable in a nonparametric model, whereas the density function is not [Bickel et al., 1993, van der Laan and Robins, 2003]. This is also true in the standard observational setup, where CDFs can be estimated at n^{-1/2} rates with sample averages, while density estimation requires more careful balancing of bias and variance, with slower rates arising depending on underlying smoothness [Tsybakov, 2009, Wasserman, 2006]. Beyond this issue of statistical complexity, there are other trade-offs in targeting CDFs versus densities. One is that, although CDFs are easier to estimate nonparametrically, densities are arguably more visually appealing and interpretable to practitioners. We view CDFs and densities as complementary pieces of the distributional puzzle.

Unlike distribution function estimation, the literature on counterfactual density estimation appears much more sparse. In what was perhaps the first study of the problem, DiNardo et al. [1996] used a reweighted kernel estimator to estimate effects of US labor market factors on wages. However, the statistical properties of this proposed approach were not examined. Robins and Rotnitzky [2001] proposed a doubly robust version of the reweighted kernel estimator, and conjectured it would achieve usual density estimation rates under smoothness and other conditions. van der Laan and Dudoit [2003] and Rubin and van der Laan [2006] studied general cross-validation-based approaches for model selection in the presence of nuisance functions, and suggested minimizing counterfactual KL or L_2 loss for density estimation, but did not detail the statistical properties. More recently, Westling and Carone [2020] tackled the related problem of density estimation for right-censored outcomes, proposing new estimators that can attain n^{-1/3} rates, but under an assumption that the density is monotone.
Kim et al. [2018] analyzed a version of the doubly robust estimator from Robins and Rotnitzky [2001], showing its conjectured oracle properties, and used it to estimate the (nonsmooth) L_1 distance between counterfactual densities.

Somewhat surprisingly, none of the work above on counterfactual density estimation considers a semiparametric approach, where the density is approximated with a finite-dimensional model. Our work aims to fill this gap in the literature, while also providing data-driven model selection and aggregation tools. A separate contribution is our study of generic density-based effects, which characterize the distance between counterfactual densities, using a generalized notion of distance that includes f-divergences as well as L_p norms.

The structure of our paper is as follows. After introducing some basics and causal assumptions in Section 2, in Section 3 we detail the different kinds of target parameters we consider. The first (described in Section 3.1) is an approximation of the density itself, defined by a projection onto a finite-dimensional model (3.1.1) using a generalized distance metric (3.1.2), which includes f-divergences as well as L_p norms. Importantly, we show in Section 3.1.3 that projection parameters for smooth models and distances can be framed as solutions to moment conditions, providing a link between counterfactual densities and means (of functions of counterfactuals). The second parameter we consider (described in Section 3.2) is the distance between counterfactual densities, which can be used as a new more nuanced effect measure, or as a tool for model selection (as in Section 3.3). In Section 4 we study nonparametric efficiency bounds, by characterizing the efficient influence functions of approximated density functions in Section 4.1, and density effects in Section 4.2.
These follow from a master lemma in Section 4, which gives a von Mises expansion for generic integral functionals of the counterfactual density, and so may be of independent interest. In Section 5 we propose doubly robust-style estimators for the density approximations and distances, and study their rates of convergence, showing for example that they can be n^{-1/2} consistent, asymptotically normal, and optimally efficient under weak high-level conditions on nuisance estimation error. All our results hold for smooth but otherwise arbitrary models and distances. However, in various corollaries, we also highlight specific expressions for typical choices of models and distances, such as L_2 projections on linear models, and KL projections on exponential families. Finally in Section 6 we use our proposed methods to estimate counterfactual densities and density effects of combination therapy (versus zidovudine alone) on CD4 count, among patients with HIV. Our results show treatment effects beyond a mean shift, suggesting that combination therapy may have increased CD4 count most for high-risk patients.

2 Setup
We assume access to an iid sample (Z_1, ..., Z_n) of Z = (X, A, Y) ∼ P where X ∈ R^d are covariates, A ∈ R is a treatment or exposure, and Y ∈ R is a continuous outcome. We let

π_a(x) = P(A = a | X = x)   (1)

∫_B η_a(y | x) dy = P(Y ∈ B | X = x, A = a) for measurable B   (2)

denote the propensity score (i.e., chance of being treated at level A = a given covariates) and conditional outcome density, respectively. In this work we focus on discrete treatments, but in a companion paper we consider the continuous case. We study “semiparametric” estimation of the covariate-adjusted marginal density

p_a(y) = ∫ η_a(y | x) dP(x)   (3)

i.e., the conditional outcome density averaged over the covariates, as well as functionals thereof.

Remark 1. Although we refer to our work in this paper as semiparametric, in reality it is all done within a fully nonparametric model. As described in more detail starting in Section 3.1, the models we consider are only ever used as tools for defining nonparametric approximations, and corresponding projection parameters, and are never assumed to be correct descriptions of the underlying true data-generating process. Further, our results on estimating counterfactual density functionals (e.g., Sections 3.2 and 4.2) do not require any approximating models, and so are nonparametric in the usual sense.

We note that the density (3) is different from the marginal density of Y, since the treatment is fixed at A = a in the conditioning; it is also not equal to the unadjusted conditional density p(y | a). Instead, (3) is the density p(y^a) of the counterfactual variable Y^a (i.e., the outcome that would have been observed if treatment were set to A = a), if the following assumptions hold:

Assumption 1 (Positivity). P{π_a(X) ≥ ε} = 1 for some ε > 0.

Assumption 2 (Consistency). Y = Y^a if A = a.

Assumption 3 (Exchangeability). A ⊥⊥ Y^a | X.

Positivity ensures all subjects have some chance at receiving treatment level A = a.
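As a purely illustrative sketch of the adjusted density (3): if the conditional outcome density η_a(y | x) were a known Gaussian (our own assumption here, chosen only so the answer is checkable), then p_a(y) is obtained by averaging it over draws from the covariate distribution:

```python
import math
import random

random.seed(1)

# Assumed (illustrative) model: given X = x and A = a, Y ~ N(x + a, 1), so
# eta_a(y | x) is a Gaussian density and (3) averages it over the covariates.
xs = [random.gauss(0, 1) for _ in range(5000)]  # draws from P(x)

def normal_pdf(y, mu, sd=1.0):
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def p_a(y, a):
    """Plug-in for p_a(y) = ∫ eta_a(y | x) dP(x): the conditional outcome
    density averaged over an (empirical) covariate distribution."""
    return sum(normal_pdf(y, x + a) for x in xs) / len(xs)

# Here p_1 approximates the N(1, sqrt(2)) density; as a sanity check it
# should integrate to roughly one over a wide grid.
grid = [-6 + 0.05 * i for i in range(280)]
print(round(sum(p_a(y, 1) for y in grid) * 0.05, 2))
```

In practice η_a and P(x) are of course unknown; the point of the sketch is only that (3) is a mixture of conditional densities, not the marginal density of Y.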
Consistency can be viewed as ruling out interference, for example, where a subject's counterfactual can depend not only on how they were treated, but how other subjects were treated as well. Exchangeability says the treatment is as good as randomized within levels of the observed covariates, and requires that sufficiently many relevant confounders are collected. Each of these assumptions can be weakened in various ways, at the expense of losing point identification of the marginal counterfactual distribution. Nonetheless, under only the positivity assumption, all our statistical results will hold relative to the observational quantity in (3), regardless of whether the causal Assumptions 2-3 are violated or not.

3 Target Parameters
In this section we detail the two kinds of quantities we consider estimating. The first is an approximation of the counterfactual density itself, defined via a projection in some distributional distance. The second is a distance measure, e.g., a density-based causal effect measuring the difference between counterfactual densities in terms of general f- or other divergences. The latter gives a more nuanced picture of how the counterfactual densities differ, compared to the usual ATE, for example. Finally in Section 3.3 we describe how these two kinds of target quantities can be adapted for the purposes of model selection and aggregation.

3.1 Density Approximations

First we consider approximations of the counterfactual density p_a(y) based on some specified model {g(y; β) : β ∈ R^d}. We mostly focus on the finite-dimensional parametric case with β ∈ R^d, but more generally one could take β to be infinite-dimensional in some L_p space, or to belong to a subset of R^d such as the standard simplex. Note that β(a) depends on a but for now we suppress this dependence in the notation and simply write β. Here are some examples.

Example 1a (Exponential family). Let b(y) = {b_1(y), ..., b_d(y)}^T denote a vector of known basis functions. Then we can project onto the exponential family

g(y; β) = exp{β^T b(y) − C(β)}   (4)

where C(β) = log ∫ exp{β^T b(y)} dy so that ∫ g(y; β) dy = 1. Typical exponential family notation fixes one additional coefficient at 1 and sets the corresponding basis function to log h(y) for some known base measure h.

Although we refer to Example 1a as an exponential family, it can just as well be viewed as a truncated series expansion used together with a log link function. In the next example we consider a truncated series with an identity link function.

Example 1b (Truncated series). Let b(y) = {b_1(y), ..., b_d(y)}^T denote a vector of known basis functions, and q(y) a known base density (e.g., uniform).
Then we can project onto the linear basis expansion

g(y; β) = q(y) + Σ_{j=1}^d β_j b_j(y)

where we can take ∫ b_j(y) dy = 0 so that the projection integrates to one. A natural choice when Y ∈ [0, 1] would be to take q(y) = 1 and b(y) the cosine basis

b_j(y) = √2 cos(πjy)   (5)

which satisfies ∫ b_j(y) dy = 0 and ∫ b_j(y) b_k(y) dy = 1(j = k) on the unit interval. One could alternatively take q(y) = 0 and let b_j(y) be (the linear span of) a collection of d candidate densities, in which case the above could be viewed as a linear aggregation [Rigollet and Tsybakov, 2007]. Another related option would be to use a linear approximation for the square root of the density, √g(y; β) = Σ_j β_j b_j(y), so that g(y; β) = Σ_j Σ_k β_j β_k b_j(y) b_k(y) [Chen et al., 2002, Pinheiro and Vidakovic, 1997]. Then the model would integrate to one if the basis functions were orthonormal (∫ b_j(y) b_k(y) dy = 0 for j ≠ k and ∫ b_j(y)^2 dy = 1) and Σ_j β_j^2 = 1.

Example 1c (Gaussian mixture model). Let (µ_1, ..., µ_k) denote a vector of means, (σ_1, ..., σ_k) a vector of positive standard deviations, (ϖ_1, ..., ϖ_k) positive mixing proportions with Σ_j ϖ_j = 1, and φ the standard normal density. Then the standard Gaussian mixture model is

g(y; β) = Σ_{j=1}^k ϖ_j (1/σ_j) φ((y − µ_j)/σ_j)

where β = {(ϖ_1, µ_1, σ_1), ..., (ϖ_k, µ_k, σ_k)}.

Now, based on the above approximations, a primary goal is to estimate the projection parameter

β = argmin_{β ∈ R^d} D_f(p_a(y), g(y; β))   (6)

where D_f is a distributional distance measure of the form

D_f(p, q) = ∫ f(p(y), q(y)) q(y) dy   (7)

for some given discrepancy function f : R^2 → R.

Remark 2.
In contrast to typical f-divergences [Ali and Silvey, 1966, Csiszár, 1967, Rényi et al., 1961, Sason and Verdú, 2016], we allow the function f to have two arguments, one for each distribution; this allows us to capture not only f-divergences but also other distances such as those based on L_p norms. (The usual f-divergence takes f(p, q) to be a function of the ratio p/q alone, and so only depends on the density ratio.) In a slight abuse of terminology we sometimes refer to (7) as a distance, even though for some of our choices of f it will be an asymmetric divergence not satisfying the triangle inequality.

Before giving examples of distances, we first discuss the interpretation of our projection parameter (6). Mathematically, β is the parameter of the best-fitting model of the form g(y; β), i.e., the parameter value that makes g(y; β) closest (in corresponding distance) to the true density p_a(y). If the model g is correctly specified, then D_f(p_a(y), g(y; β)) = 0 and so g(y; β) = p_a(y) is simply the true counterfactual density; however, the projection (6) remains well-defined even under model misspecification. This is akin to the well-known concept of a best linear predictor in standard linear regression [White, 1980]. The projection approach, where a model is not assumed correct but instead only used for defining approximations, has been used widely throughout statistics [Beran, 1977, Buja et al., 2019a,b, Huber, 1967, Rakhlin et al., 2017, Rinaldo and Wasserman, 2010, Tsybakov, 2003, Wasserman, 2006, White, 1982, 1996] as well as in causal inference [Chernozhukov et al., 2018b, Cuellar and Kennedy, 2020, Kennedy et al., 2019, Neugebauer and van der Laan, 2007, Semenova and Chernozhukov, 2020, van der Laan, 2006], though not in the counterfactual density estimation context.

Remark 3.
Since we only use models as tools to define approximations, all our results are formally nonparametric, as mentioned in Remark 1 and illustrated in subsequent theorems. This raises some interesting philosophical issues about the role of assumptions and corresponding bias-variance trade-offs. In particular, we can imagine a rough taxonomy of stances one might take in estimation problems like this one:

(i) model-ist: My finite-dimensional/parametric representation is the correct one.

(ii) model-agnostic: I may use a finite-dimensional model, but I do not know or require that it is a perfectly accurate picture of the truth.

(iii) anti-model-ist: No parametric model I can imagine contains the truth, and I do not care about approximations.

The model-agnostic view is often captured by the famous quotes "All models are wrong but some are useful" (George Box) and "Use models but don't believe them" (possibly due to John Tukey). Of course, in practice, how much one relies on models is a continuum, and so any particular approach may not fall entirely in one of the three camps above. Similarly, our taxonomy uses parametric models as a benchmark, but one could just as well replace them with a different assumption set (e.g., Hölder smoothness above versus below some index s). Nevertheless we find the above framing useful if imperfect. In this paper, we mostly take the stance of the model-agnostic, though we flirt with anti-model-ism in the data-driven model selection approaches of Sections 3.3 and 5.3 (and we are fully anti-model-ist in a companion paper). We also accept that each approach has advantages and disadvantages. The model-ist will do well when the model is correct, but could unknowingly suffer large bias otherwise.
The anti-model-ist is most free from the constraints of human imagination (as they do not need to posit a parametric model), but with a more ambitious target can also suffer larger errors. The model-agnostic has a bit of the best of both worlds: when the model is correct, they may hope to do nearly as well as the model-ist, and when the model is wrong, their inference can still be valid for a still well-defined approximation. Of course, if the model is very wrong, the approximation may not be practically useful, no matter how well-defined it is; thus there can be important challenges in defining a useful approximating model and distance.

Now we give some examples of the distances we focus on in this paper:
Example 2a (L_2). If f(p, q) = (p − q)^2/q then D_f(p, q) = ‖p − q‖^2 is the squared L_2 distance

‖p_a(y) − g(y; β)‖^2 = ∫ (p_a(y) − g(y; β))^2 dy.

Example 2b (Kullback-Leibler). If f(p, q) = (p/q) log(p/q) then D_f(p, q) = KL(p, q) is the Kullback-Leibler divergence

KL(p_a(y), g(y; β)) = ∫ log(p_a(y)/g(y; β)) p_a(y) dy.

Example 2c (χ^2). If f(p, q) = (p/q − 1)^2 then D_f(p, q) = χ^2(p, q) is the χ^2 divergence

χ^2(p_a(y), g(y; β)) = ∫ {p_a(y) − g(y; β)}^2 / g(y; β) dy.

Example 2d (Hellinger). If f(p, q) = (√(p/q) − 1)^2 then D_f(p, q) = H^2(p, q) is the squared Hellinger divergence

H^2(p_a(y), g(y; β)) = ∫ (√p_a(y) − √g(y; β))^2 dy.

Example 2e (Smoothed Total Variation). If f(p, q) = |p − q|/(2q) = {(p − q)/(2q)} sgn(p − q) then D_f(p, q) = TV(p, q) = ½‖p − q‖_1 is the total variation distance (and half the L_1 distance). Note f(p, q) is not differentiable at p/q = 1. Smooth versions can be obtained by approximating the absolute value or sign functions in f. For example, let ν_t(y) be an approximation of the absolute value function |y|, with parameter t controlling the approximation error. For example one could use ν_t(y) = y tanh(ty) or ν_t(y) = y erf(ty) or a best polynomial approximation of degree t. Then taking f(p, q) = f_t(p, q) = ν_t(p − q)/(2q) gives a smoothed total variation D_f(p, q) = TV*(p, q) with

TV*(p_a(y), g(y; β)) = ½ ∫ ν_t{p_a(y) − g(y; β)} dy.

There exist polynomial and rational approximations ν_t(y) of degree t ensuring that |TV(p, q) − TV*(p, q)| is of order t^{-1} and exp(−√t), respectively [Newman et al., 1964].
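The generalized distance (7) is straightforward to evaluate numerically for the discrepancy functions in Examples 2a-2d; the Riemann-sum integrator and the Gaussian test densities below are our own illustrative choices:

```python
import math

def D_f(f, p, q, lo=-10.0, hi=10.0, m=4000):
    """Midpoint-rule evaluation of the generalized distance (7),
    D_f(p, q) = ∫ f(p(y), q(y)) q(y) dy."""
    dy = (hi - lo) / m
    return sum(f(p(lo + (i + 0.5) * dy), q(lo + (i + 0.5) * dy))
               * q(lo + (i + 0.5) * dy) * dy for i in range(m))

f_l2 = lambda p, q: (p - q) ** 2 / q               # Example 2a: squared L2
f_kl = lambda p, q: (p / q) * math.log(p / q)      # Example 2b: Kullback-Leibler
f_chi2 = lambda p, q: (p / q - 1) ** 2             # Example 2c: chi-squared
f_hell = lambda p, q: (math.sqrt(p / q) - 1) ** 2  # Example 2d: squared Hellinger

def gauss(mu, sd=1.0):
    return lambda y: math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

p, q = gauss(1.0), gauss(0.0)
# For two unit-variance Gaussians one mean-shift apart, KL = 1/2 and chi2 = e - 1.
print(round(D_f(f_kl, p, q), 3), round(D_f(f_chi2, p, q), 3))
```

Comparing the numeric values against the Gaussian closed forms is a quick check that each f really induces the divergence claimed above.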
We also note that the Hellinger divergence is closely related to total variation in the sense that H^2(p, q)/2 ≤ TV(p, q) ≤ H(p, q) for any densities p, q.

Figure 2 shows a few projections of a true density onto a truncated trigonometric series with six terms, using four different distances (L_2, Kullback-Leibler, χ^2, and Hellinger). The projections are all very similar in both cases. However, we note that, as discussed for example in Beran [1977], Hellinger projections should be more stable and robust to outliers or contamination, compared to for example KL. The projections are closer to the true density for the first simpler Gaussian mixture, and are more of a rough approximation for the second more complex mixture.
Figure 2:
Projections of a truncated Gaussian mixture (left) and the Bart Simpson density (right) onto a trigonometric basis with six terms, using L_2 distance, along with Kullback-Leibler, χ^2, and Hellinger divergences.

3.1.3 Moment Conditions

The next proposition shows how, for smooth distances, the projection parameter β can be defined more explicitly than in equation (6), as a solution to a population moment condition, involving derivatives of the model g(y; β) and the function f. This links projection parameters to integral functionals of the counterfactual density (i.e., moments of transformations of counterfactuals), which is why our efficiency bounds and estimators in the next section resemble those for means of particular non-trivial functions of counterfactuals.

Proposition 1.
Assume g is differentiable in β, f is differentiable in its second argument with derivative f′(q_1, q_2) = (∂/∂q_2) f(q_1, q_2), and that the minimizer in (6) is unique. Then the projection parameter

β = argmin_{β ∈ R^d} D_f(p_a(y), g(y; β))

can be expressed as a solution to the moment condition m(β) = 0, where

m(β) ≡ ∫ (∂g(y; β)/∂β) {f(p_a(y), g(y; β)) + g(y; β) f′(p_a(y), g(y; β))} dy.   (8)

The proof of Proposition 1 follows from the chain rule; all subsequent proofs are given in Appendix B. Throughout we assume there is a unique solution to m(β) = 0. Next we show how the moment condition defining β simplifies for particular distances.

Corollary 1.
The quantity f(p_a(y), g(y; β)) + g(y; β) f′(p_a(y), g(y; β)) in the integrand of the moment (8) equals

2{g(y; β) − p_a(y)}   if D_f = L_2
−p_a(y)/g(y; β)   if D_f = KL
1 − {p_a(y)/g(y; β)}^2   if D_f = χ^2
1 − √(p_a(y)/g(y; β))   if D_f = H^2
−ν′_t{p_a(y) − g(y; β)}/2   if D_f = TV*.

Corollary 1 shows how the moment m(β) essentially reduces to functionals of the counterfactual density for particular distances: simple means for L_2 and KL, a quadratic functional for χ^2, and a square root functional for H^2. For the smoothed TV distance, it depends on the form of the absolute value approximation (e.g., for ν_t a degree-t polynomial approximation, the moment m(β) would be an integral of a degree-(t − 1) polynomial in the counterfactual density).

In the following corollaries we show how the form of the moment condition is particularly straightforward when based on L_2 or KL divergence with series models and exponential families, respectively.

Corollary 2. If D_f = L_2 then

m(β) = 2 ∫ (∂g(y; β)/∂β) {g(y; β) − p_a(y)} dy.

Therefore if the support of Y is [0, 1], and g(y; β) = 1 + β^T b(y) is the truncated series in Example 1b then

β = E{b(Y^a)}   (9)

when b(·) is an orthogonal series with ∫ b_j(y) dy = 0 and ∫ b_j(y) b_k(y) dy = 1(j = k).

Corollary 2 shows that when using orthogonal series with L_2 projections, there is a closed form for β, given by a simple mean of a known function of the counterfactual outcome. Estimation and inference for parameters like (9) is relatively well-understood [Robins et al., 2009, 2017], which allows exploiting existing theory and methods in the density estimation context.

Corollary 3.
If D_f = KL then

m(β) = −E{(∂/∂β) log g(Y^a; β)},

and so if g(y; β) = exp{β^T b(y) − C(β)} is the exponential family in Example 1a then

m(β) = (∂/∂β) C(β) − E{b(Y^a)} = ∫ b(y) [exp{β^T b(y) − C(β)} − p_a(y)] dy.   (10)

Similarly, for KL divergence, the moment m(β) is simply the expected score under counterfactual density g(y; β). Therefore, just as in the non-counterfactual setting, the parameter values that maximize a posited likelihood are also those that minimize KL divergence [Huber, 1967, White, 1982]. When one also uses an exponential family, the solution to m(β) = 0 corresponds to an intuitive "moment matching", i.e., finding the value of β that equates expectations of b(·) under g to those under the distribution of Y^a.

3.2 Density Effects

In addition to estimating projections of the counterfactual density onto a finite-dimensional model, in this section we also consider estimation of distributional distances themselves. The main focus is on density-based effects measuring the distance between counterfactual densities in terms of L_p^p distances and f-divergences. These effects can detect more nuanced distinctions between the distributions of Y^1 and Y^0, beyond simple differences-in-means captured by standard average treatment effects.

More specifically, we consider the distance between p_1 and p_0 given by

ψ_f = D_f(p_1(y), p_0(y)) = ∫ f(p_1(y), p_0(y)) p_0(y) dy   (11)

for discrepancy functions f as discussed in the previous subsection. In this setup we do not require approximating the densities p_a(y) with finite-dimensional models, and instead consider estimating ψ_f in a fully nonparametric model.

3.3 Model Selection & Aggregation

In practice one may not have an approximating model such as (4) available a priori. In these cases it would be natural to instead set up a sequence of models, and use the data to choose among them.
In standard regression and density estimation problems, simple cross-validation procedures are available for this task; however, because our goal is estimation of a more nuanced counterfactual density, these require some refinement, in the same spirit as van der Laan and Dudoit [2003]. Thus in this section we describe how the target quantities of Sections 3.1 and 3.2 can be adapted for the purposes of model selection and aggregation.

Specifically, for a set of estimators {ĝ_k(y) : k = 1, ..., K} of p_a(y) (e.g., estimated from some initial training sample, with each projected onto the space of valid densities), we can define the risk for a given estimator as

R(ĝ_k) = D_f(p_a(y), ĝ_k(y)).   (12)

The minimum risk oracle estimator ĝ_{k*}(y) can then be defined via

k* = argmin_k R(ĝ_k) = argmin_k D_f(p_a(y), ĝ_k(y)).   (13)

A model aggregation oracle can be defined more generally as g̃(y) = Σ_k β_k ĝ_k(y) where

β = argmin_{β ∈ B} D_f(p_a(y), Σ_{k=1}^K β_k ĝ_k(y))   (14)

for some appropriate selection set, e.g., the standard simplex B = {(β_1, ..., β_K) ∈ R^K : β_k ≥ 0, Σ_k β_k = 1} for convex aggregation [Rigollet and Tsybakov, 2007, Tsybakov, 2003]. If one takes B = R^K for linear aggregation, then f-divergences may not be well-defined, so this might naturally only be used in the D_f = L_2 setting.

Note that the proposed target parameters in Section 3.1 correspond to the aggregation target in (14) if we replace R^d with the relevant space B.
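As a small numerical illustration of the convex aggregation oracle (14) with D_f the squared L_2 distance: with K = 2 candidates the simplex is one-dimensional, so the oracle weight can be found by grid search. The Gaussian "truth" and candidates below are our own toy choices, with p_a treated as known purely for illustration:

```python
import math

def gauss(mu, sd=1.0):
    return lambda y: math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

p_a = gauss(0.5)                    # toy "true" density, known here for illustration
g1, g2 = gauss(0.0), gauss(1.0)     # two candidate density estimators

def l2_risk(w, lo=-8.0, hi=9.0, m=2000):
    """Squared L2 distance between p_a and the aggregate w*g1 + (1 - w)*g2."""
    dy = (hi - lo) / m
    return sum((p_a(y) - (w * g1(y) + (1 - w) * g2(y))) ** 2 * dy
               for y in (lo + (i + 0.5) * dy for i in range(m)))

# Grid search over the simplex {(w, 1 - w) : 0 <= w <= 1}; by symmetry the
# oracle here should put equal weight on the two candidates.
w_star = min((i / 200 for i in range(201)), key=l2_risk)
print(round(w_star, 2))
```

Risk (12) plays the same role for model selection: computing l2_risk for each candidate separately and taking the argmin is exactly the oracle (13). In practice p_a is of course unknown, and the oracle risks must be replaced with estimates.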
However, since model selection as defined in Equation (13) does not satisfy the smoothness assumptions we relied on in Section 3.1.3, it can be useful in practice to estimate the risk separately for all K candidates; this is more akin to the effect estimation problem in Section 3.2, except where the density p_0(y) in (11) is replaced with a candidate estimator ĝ_k(y) (e.g., which may be estimated on a separate independent sample/fold and conditioned upon, and so treated as fixed).

4 Efficiency Bounds

In this section we present a crucial von Mises expansion (i.e., distributional Taylor expansion) for generic density functionals, which yields efficient influence functions for the projection parameters and density effects of interest, and thus nonparametric efficiency bounds [Bickel et al., 1993, van der Laan and Robins, 2003]. The latter can be further formalized as local minimax lower bounds [van der Vaart, 2002].

Throughout we make reference to the linear map T ↦ φ_a(T; P) defined as

φ_a(T; P) = {1(A = a)/π_a(X)} {T − E(T | X, A = a)} + E(T | X, A = a) − E{E(T | X, A = a)}   (15)

which takes a random variable T (and distribution P) and outputs the efficient influence function for the functional E{E(T | X, A = a)}. Note we drop the dependence of φ_a(T; P) on (X, A) for simplicity; at times we also drop the dependence on P if the context is clear. In all our examples, T = h(Y) will be a known or P-dependent function of Y; the functionals we consider all have influence functions consisting of terms of the above form, but with different and non-standard choices of T = h(Y), depending on the model and distance being used.

Recall that in Corollary 1 we showed the relevant moment m(β) reduces to a functional of the counterfactual density for particular distances.
Therefore our first result gives a von Mises-style expansion for generic smooth integral functionals of the counterfactual density. This result paves the way for later expansions and efficiency bounds, and may be of independent interest in other problems involving different counterfactual density functionals.

Lemma 1.
Let $\psi = \psi(P) = \int h(p_a(y))\, dy$ for some twice continuously differentiable function h. Then $\psi$ satisfies the von Mises expansion
$$\psi(\bar P) - \psi(P) = \int \phi_a\big(h'(\bar p_a(Y)); \bar P\big)\, d(\bar P - P) + R_2(\bar P, P) \quad (16)$$
where
$$R_2(\bar P, P) = \int\!\!\int h'(\bar p_a(y))\Big\{\frac{\pi_a(x)}{\bar\pi_a(x)} - 1\Big\}\big\{\bar\eta_a(y \mid x) - \eta_a(y \mid x)\big\}\, dy\, dP(x) + \frac{1}{2}\int h''(p^*_a(y))\big\{\bar p_a(y) - p_a(y)\big\}^2\, dy,$$
where $p^*_a(y)$ lies between $\bar p_a(y)$ and $p_a(y)$.

Lemma 1 has several important consequences. First, it indicates how one can correct the first-order bias of a plug-in estimator $\psi(\hat P)$ of counterfactual density functionals: by estimating the first term in the expansion and subtracting it off. This is how standard semiparametric estimators (particularly of the one-step variety) based on influence functions are constructed [Bickel et al., 1993, Chernozhukov et al., 2018a, van der Laan and Robins, 2003], and our proposed estimators in the next section do precisely this. Second, since the remainder term is quadratic in the nuisance functions, it implies that $\psi(P)$ is pathwise differentiable with efficient influence function $\phi_a(h'(p_a(Y)))$; for this fact we refer to Lemma 2 in the Appendix.

In this subsection we use Lemma 1 to detail the efficient influence function for the moment $m(\beta)$ at a fixed $\beta$, as well as the projection parameter $\beta^*$ and projected density $g(y; \beta^*)$. These efficient influence functions yield local minimax lower bounds, as well as estimators that can attain the nonparametric efficiency bounds under generic high-level rate conditions on nuisance estimators, which will be proved in Section 5.

Theorem 1. Assume f is twice differentiable and denote partial derivatives as $f'_j(q_1, q_2) = \frac{\partial}{\partial q_j} f(q_1, q_2)$ and similarly $f''_{jk}(q_1, q_2) = \frac{\partial^2}{\partial q_j \partial q_k} f(q_1, q_2)$.
Then, under an unrestricted nonparametric model, the efficient influence function for $m(\beta)$ is given by $\phi_a(\gamma_f(Y; \beta))$ where
$$\gamma_f(y; \beta) \equiv \gamma_f(y; \beta, p_a) = \frac{\partial g(y; \beta)}{\partial \beta}\Big\{f'_1\big(p_a(y), g(y; \beta)\big) + g(y; \beta)\, f''_{12}\big(p_a(y), g(y; \beta)\big)\Big\}.$$
The efficient influence functions for $\beta^*$ and $g(y; \beta^*)$ are similarly given by
$$-\Big\{\frac{\partial m(\beta)}{\partial \beta^T}\Big\}^{-1}\phi_a\big(\gamma_f(Y; \beta)\big)\Big|_{\beta = \beta^*} \quad \text{and} \quad -\frac{\partial g(y; \beta)}{\partial \beta^T}\Big\{\frac{\partial m(\beta)}{\partial \beta^T}\Big\}^{-1}\phi_a\big(\gamma_f(Y; \beta)\big)\Big|_{\beta = \beta^*} \quad (17)$$
respectively.

The efficient influence functions given in Theorem 1 are analogous to those of usual ATE-type parameters, but with the crucial difference that they correspond to means of $\gamma_f(Y^a; \beta)$, not $Y^a$ itself. This is what we should expect based on the result in Lemma 1, since the $\gamma_f$ transformation is the derivative of the integrand in the moment condition (8) given in Proposition 1. Note also that the form of $\gamma_f$ indicates that the efficiency bound for $\beta^*$ (i.e., the variance of the efficient influence function) will be adversely affected when the model g is sensitive to small changes in $\beta$, or when the distance is sensitive to small changes in its arguments, since then the derivatives in $\gamma_f$ will be large.

In the next corollary, we give the particular form of the efficient influence functions when $D_f$ is the $L_2$ and KL divergence, and the approximating models are a linear series and exponential family.

Corollary 4.
For $L_2$ and KL divergence the quantity $\gamma_f$ from Theorem 1 reduces to
$$\gamma_f(y; \beta) = \begin{cases} -2\, \partial g(y; \beta)/\partial \beta & \text{if } D_f = L_2 \\ -\, \partial \log g(y; \beta)/\partial \beta & \text{if } D_f = \mathrm{KL}. \end{cases}$$
Further, if either
1. $D_f = L_2$ and $g(y; \beta) = q(y) + \beta^T b(y)$ is the truncated series in Example 1b, or
2. $D_f = \mathrm{KL}$ and $g(y; \beta) = \exp\{\beta^T b(y) - C(\beta)\}$ is the exponential family in Example 1a,
then the efficient influence function for $m(\beta)$ is proportional to $\phi_a(b(Y))$. The proportionality constant is $-2$ for $D_f = L_2$, and $-1$ for $D_f = \mathrm{KL}$.

Corollary 4 shows that the efficient influence function takes a common simple form for projections onto a linear series using the $L_2$ distance, and for projections onto an exponential family using the KL divergence. Further, this efficient influence function simply corresponds to that of the counterfactual mean vector $E\{b(Y^a)\}$, for b a known basis function vector. Thus the influence function conveniently reduces to that of the mean of a transformed version of the counterfactual outcome, with no dependence on $\beta$. As mentioned after Corollary 2, this allows for adapting existing theory and methods for average treatment effects to the density estimation context.

The following result summarizes the local minimax lower bound implied by the form of the efficient influence function in Theorem 1, as in Corollary 2.6 of van der Vaart [2002].

Corollary 5.
Let $\sigma^2 = \sigma^2_P$ denote the variance of the efficient influence function from (17). The local minimax risk for $\beta^*$ is lower bounded as
$$\inf_{\delta > 0}\; \liminf_{n \to \infty}\; \sup_{\mathrm{TV}(\bar P, P) < \delta} E_{\bar P}\Big[\ell\Big\{\sqrt{n}\big(\hat\beta - \beta^*(\bar P)\big)\Big\}\Big] \geq E\big\{\ell(\sigma Z)\big\}$$
for any estimator $\hat\beta$, where $\ell : \mathbb{R}^p \mapsto [0, \infty)$ is any subconvex loss function and Z is a standard Gaussian vector.

Corollary 5 follows from Corollary 2.6 of van der Vaart [2002]. It shows that the worst-case mean squared error of any estimator, locally near the true P, cannot be smaller than the efficiency bound, asymptotically and after scaling by $\sqrt{n}$. This gives an important benchmark for efficient estimation of projection parameters of the counterfactual density: no estimator can have mean squared error uniformly better than the variance of the efficient influence function (divided by n), without adding extra assumptions to the nonparametric model we consider.

Now we give the efficient influence function for the density effect parameters in (11). Unlike the projected densities in the previous subsection, the density effect parameters depend on both counterfactual densities of interest for comparison.
Theorem 2.
In an unrestricted nonparametric model, the efficient influence function for the density effect $\psi_f = \int f(p_1(y), p_0(y))\, p_0(y)\, dy$ is given by $\phi_1(\lambda_1(Y)) + \phi_0(\lambda_0(Y))$ where
$$\lambda_1(y) = p_0(y)\, f'_1\big(p_1(y), p_0(y)\big)$$
$$\lambda_0(y) = f\big(p_1(y), p_0(y)\big) + p_0(y)\, f'_2\big(p_1(y), p_0(y)\big).$$

As with the result for $\beta^*$ in Theorem 1, the efficient influence function for $\psi_f$ in Theorem 2 consists of inverse probability weighted residuals, plus a "plug-in"-type term, similar to ATE parameters. However, again this corresponds to the influence function for a transformed version of the outcome, depending on the counterfactual densities and choice of distance f. The efficient influence function simplifies somewhat for $L_2$ and KL divergence, as indicated in the following corollary. Expressions for other f-divergences are in Section B.1 in the Appendix.

Corollary 6. If $D_f = L_2$, then the efficient influence function for $\psi_f$ is
$$2\,(\phi_1 - \phi_0)\big(p_1(Y) - p_0(Y)\big).$$
If $D_f = \mathrm{KL}$, then the efficient influence function for $\psi_f$ is
$$\phi_1\bigg(\log\bigg(\frac{p_1(Y)}{p_0(Y)}\bigg)\bigg) - \phi_0\bigg(\frac{p_1(Y)}{p_0(Y)}\bigg).$$

The fact that $\lambda_1 = -\lambda_0$ for $L_2$ projections simplifies the form of our proposed estimators, as we will detail further in the next section. We also note that the influence function reduces to zero when $p_1 = p_0$, which presents some complications for inference; this will be discussed in the next section as well.

As mentioned in Section 3.3, for the purposes of model selection and aggregation it is also useful to consider the distance between $p_a$ and a fixed candidate g; we give the corresponding efficient influence function here.

Proposition 2.
In an unrestricted nonparametric model, the efficient influence function for $\Delta_f(g) = \int f(p_a(y), g(y))\, g(y)\, dy$ for g fixed and known is given by
$$\phi_a\Big(g(Y)\, f'_1\big(p_a(Y), g(Y)\big)\Big).$$
If $D_f = L_2$ then this influence function reduces to $2\,\phi_a\big(p_a(Y) - g(Y)\big)$.

In this section we present doubly robust-style estimators of the proposed density functions and density effects, based on the functional expansions from Lemma 1 and the efficient influence function results in Theorems 1–2. We study their rates of convergence, and show they can be $n^{-1/2}$-consistent and asymptotically efficient under weak nonparametric conditions.

Here let $\hat\pi_a(x)$ and $\hat\eta_a(y \mid x)$ denote initial estimators of the propensity score and conditional density functions $\pi_a(x) = P(A = a \mid X = x)$ and $\eta_a(y \mid x) = \frac{\partial}{\partial y} P(Y \leq y \mid X = x, A = a)$, for example based on generic regression estimators and their numerical derivatives (or for the latter one can use a regression of a kernel-transformed version of the outcome). Also let $\hat p_a(y) = \mathbb{P}_n\{\hat\eta_a(y \mid X)\}$ denote the plug-in estimator of the counterfactual density under A = a, where $\mathbb{P}_n\{h(Z)\} = \frac{1}{n}\sum_i h(Z_i)$, and let
$$\hat m(\beta) \equiv \int \frac{\partial g(y; \beta)}{\partial \beta}\Big\{f\big(\hat p_a(y), g(y; \beta)\big) + g(y; \beta)\, f'_2\big(\hat p_a(y), g(y; \beta)\big)\Big\}\, dy \quad (18)$$
denote the plug-in estimator of the moment condition $m(\beta)$, and similarly for $\psi_f$.

Remark 4. Although we suggest basing (18) on the plug-in estimator $\hat p_a(y) = \mathbb{P}_n\{\hat\eta_a(y \mid X)\}$ of the counterfactual density, one could just as well use other estimators (e.g., inverse-probability-weighted, or doubly robust, as in Kim et al. [2018]).
Nonetheless, all results in this paper will only depend on high-level second-order rate conditions for estimating $p_a(y)$, which would be satisfied for the simple plug-in estimator as long as similar conditions hold for the underlying density estimator $\hat\eta_a(y \mid x)$. We prove this in Appendix B.5, showing that the mean squared error of $\hat p_a(y)$ is upper bounded by an integrated version of that of $\hat\eta_a(y \mid x)$.

To ease notation we let $\hat\phi_a(T) = \phi_a(T; \hat P)$ denote the estimated version of the efficient influence function given in (15). Then our proposed projection estimators are given by approximate solutions in $\beta$ (up to $o_P(1/\sqrt{n})$ error) to
$$\hat m(\beta) + \mathbb{P}_n\Big\{\hat\phi_a\big(\hat\gamma_f(Y; \beta)\big)\Big\} = o_P(1/\sqrt{n}). \quad (19)$$
In other words the estimators are one-step bias-corrected estimators (of the moment condition and the parameter itself, respectively), which take the plug-in estimator and add an estimate of the bias by averaging an estimate of the influence function.

Remark 5. For simplicity, in the following results we assume the various nuisance estimates in $\hat P$ are constructed from a single separate independent sample, of the same size n as the estimation sample on which $\mathbb{P}_n$ operates. Alternatively, if the same observations are used both for estimating nuisance functions and averaging estimates of the influence function, one generally needs to rely on empirical process conditions to avoid overfitting. In practice, with iid data, one can always obtain separate independent samples by randomly splitting the data in half (or in folds); further, to regain full sample size efficiency one can always swap the samples, repeat the procedure, and average the results, popularly called cross-fitting and used for example by Bickel and Ritov [1988], Chernozhukov et al. [2018a], Robins et al. [2008], Zheng and van der Laan [2010].
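The split-and-swap scheme described here can be sketched abstractly; a toy illustration, where `fit_nuisance` and `estimate` are hypothetical placeholders for the nuisance-fitting and influence-function-averaging steps:

```python
# Toy illustration of split-and-swap (2-fold cross-fitting): fit nuisances on
# one half, average the estimated influence function on the other half, swap
# the roles, and average the two resulting estimates.
import random

def cross_fit(data, fit_nuisance, estimate, seed=0):
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    fold1 = [data[i] for i in idx[:half]]
    fold2 = [data[i] for i in idx[half:]]
    est1 = estimate(fit_nuisance(fold1), fold2)  # nuisances from fold1, average on fold2
    est2 = estimate(fit_nuisance(fold2), fold1)  # roles swapped
    return 0.5 * (est1 + est2)

# Degenerate example: the "nuisance" is trivial and the estimator is a sample
# mean, so cross-fitting simply recovers the overall mean.
est = cross_fit([1.0, 2.0, 3.0, 4.0],
                fit_nuisance=lambda fold: None,
                estimate=lambda nuis, fold: sum(fold) / len(fold))
```

The design choice here is that the estimation step only ever sees nuisance fits trained on the other fold, which is what removes the need for empirical process (Donsker-type) conditions on the nuisance estimators.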
In this paper, to simplify notation we always analyze a single split procedure, with the understanding that extending to an analysis of an average across independent splits is straightforward.

Our first propositions give the form of the plug-in and bias-corrected projection estimators when using a linear series with $L_2$ distance, and an exponential family model with KL divergence, which take a particularly simple form.

Proposition 3. If $D_f = L_2$, the support of Y is $[0, 1]$, and $g(y; \beta) = 1 + \beta^T b(y)$ is the truncated series in Example 1b, with $b(\cdot)$ an orthogonal series with $\int b_j(y)\, dy = 0$ and $\int b_j(y) b_k(y)\, dy = \mathbb{1}(j = k)$, then the plug-in estimator of $\beta^*$ is $\mathbb{P}_n\{\hat\mu_a(X; b)\}$, where $\hat\mu_a(x; b)$ is an estimate of $\mu_a(x; b) = E\{b(Y) \mid X = x, A = a\}$. In contrast, the proposed one-step estimator in (19) is given by
$$\hat\beta = \mathbb{P}_n\bigg[\frac{\mathbb{1}(A = a)}{\hat\pi_a(X)}\big\{b(Y) - \hat\mu_a(X; b)\big\} + \hat\mu_a(X; b)\bigg]. \quad (20)$$

Proposition 4. If $D_f = \mathrm{KL}$ and $g(y; \beta) = \exp\{\beta^T b(y) - C(\beta)\}$ is the exponential family in Example 1a, then the plug-in estimator solving $\hat m(\hat\beta) = 0$ is the solution in $\beta$ to
$$\int \Big[b(y) - \mathbb{P}_n\{\hat\mu_a(X; b)\}\Big] \exp\big\{\beta^T b(y)\big\}\, dy = 0$$
where $\hat\mu_a(x; b)$ is an estimate of $\mu_a(x; b) = E\{b(Y) \mid X = x, A = a\}$. In contrast, the proposed one-step estimator in (19) is given by the solution in $\beta$ to
$$\int \bigg(b(y) - \mathbb{P}_n\bigg[\frac{\mathbb{1}(A = a)}{\hat\pi_a(X)}\big\{b(Y) - \hat\mu_a(X; b)\big\} + \hat\mu_a(X; b)\bigg]\bigg) \exp\big\{\beta^T b(y)\big\}\, dy = 0. \quad (21)$$
Propositions 3–4 show that the plug-in and bias-corrected estimators for $L_2$ and KL projections solve simple estimating equations, which only require one to first estimate the components $E\{\mu_a(X; b)\}$; importantly, straightforward doubly robust estimators as in (21) are available, and do not depend on the estimating equation parameter $\beta$. This is not necessarily true for other model/distance combinations; in general $\hat\gamma_f$ would have to be estimated at each $\beta$ in order to solve (19), which could be quite computationally intensive.

Next we give the main result of this section, which shows the rate of convergence for the proposed estimator. Importantly the rate involves products of nuisance estimation errors, allowing for $n^{-1/2}$ consistency and asymptotic normality in nonparametric models, even when the nuisance estimators are generic and flexibly fit.

Theorem 3.
Let $\eta = (\pi_a, \eta_a)$, and $\varphi(Z; \beta, \eta) = m(\beta; \eta) + \phi_a(\gamma_f(Y; \beta); \eta)$. Assume:
1. The functions $\gamma_f$ and $1/\hat\pi_a$ are bounded above by some constant, and $\gamma_f$ is differentiable in $p_a(y)$, with derivative bounded uniformly above by $\delta$.
2. The function class $\{\varphi(z; \beta, \eta) : \beta \in \mathbb{R}^p\}$ is Donsker in $\beta$ for any fixed $\eta$.
3. The estimators are consistent in the sense that $\hat\beta - \beta^* = o_P(1)$ and $\|\hat\eta - \eta\| = o_P(1)$.
4. The map $\beta \mapsto P\{\varphi(Z; \beta, \eta)\}$ is differentiable at $\beta^*$ uniformly in $\eta$, with nonsingular derivative matrix $\frac{\partial}{\partial \beta} P\{\varphi(Z; \beta, \eta)\}\big|_{\beta = \beta^*} = V(\beta^*, \eta)$, where $V(\beta^*, \hat\eta) \stackrel{p}{\to} V(\beta^*, \eta)$.
Then
$$\hat\beta - \beta^* = -V(\beta^*, \eta)^{-1}(\mathbb{P}_n - P)\Big\{\phi_a\big(\gamma_f(Y; \beta^*)\big)\Big\} + O_P\Big(\|\hat\pi_a - \pi_a\|\, \|\hat\eta_a - \eta_a\| + \delta\, \|\hat p_a - p_a\|^2\Big) + o_P\Big(\frac{1}{\sqrt{n}}\Big).$$

Remark 6. In a slight abuse of notation, Theorem 3 holds when we define $\|\hat\eta_a - \eta_a\| = \|\zeta_a\| \equiv \{\int \zeta_a(x)^2\, dP(x)\}^{1/2}$ for integrated error $\zeta_a(x) = \int |\hat\eta_a(y \mid x) - \eta_a(y \mid x)|\, dy$. This implies it also holds if we define $\|\hat\eta_a - \eta_a\|^2 = \int\!\!\int \{\hat\eta_a(y \mid x) - \eta_a(y \mid x)\}^2\, dy\, dP(x)$, or $\|\hat\eta_a - \eta_a\|^2 = \int \{\hat\eta_a(y \mid x) - \eta_a(y \mid x)\}^2\, dP(y, x)$ if $\eta_a(y \mid x)$ is bounded from below.

Importantly, Theorem 3 shows that $\hat\beta$ attains substantially faster rates than its nuisance estimators $\hat\eta$, and can be asymptotically efficient under weak nonparametric conditions, for example attaining the minimax lower bound in Corollary 5. First we give some description of the assumed conditions. The Donsker condition (2) ensures the influence function is not too complex as a function of $\beta$ (though allowing arbitrary complexity in $\eta$).
Condition 3 merely requires consistency of $(\hat\beta, \hat\eta)$ at any rate. Condition 4 requires some smoothness in $\beta$, so as to allow a delta method argument. These conditions ensure $\hat\beta$ has a rate of convergence that is second-order in the nuisance estimation error, thus attaining faster rates than the nuisance estimators. Thus, for example, under standard $n^{-1/4}$-type rate conditions on $\hat\eta$, the estimator $\hat\beta$ is $n^{-1/2}$-consistent, asymptotically normal, and efficient. Importantly, these rates can be attained under smoothness, sparsity, or other structural conditions (e.g., additive modeling or bounded variation assumptions, etc.). For instance, if it is assumed that all d-dimensional nuisance functions lie in a Hölder class with smoothness index s (i.e., partial derivatives up to order s exist and are Lipschitz) then the assumption of Theorem 3 would be satisfied when $s > d/2$, i.e., the smoothness index is at least half the dimension. Alternatively, if the functions are s-sparse then one would need $s = o(\sqrt{n})$ up to log factors, as in Farrell [2015]. In these cases, asymptotically valid 95% confidence intervals can be constructed via the simple Wald form, $\hat\beta \pm 1.96\,\sqrt{\mathrm{diag}\big[\widehat{\mathrm{cov}}\{\hat\phi_a(\hat\gamma_f(Y; \hat\beta))\}/n\big]}$.

Remark 7. In some prominent cases (for example, $L_2$ and KL projections, as shown in Corollary 4), the function $\gamma_f$ does not depend on the counterfactual density $p_a(y)$ at all, so its derivative is exactly zero and $\delta = 0$. In this case the second term in the second-order remainder in Theorem 3 drops out, making the proposed approach doubly robust in the usual sense, requiring no rate conditions on the initial pilot estimate of the counterfactual density.

5.2 Density Effects

Here we present doubly robust-style estimators of the density effects described in Section 3.2, and study their rate of convergence.
As before we first construct initial estimators $\hat\pi_a(x)$, $\hat\eta_a(y \mid x)$, and $\hat p_a(y) = \mathbb{P}_n\{\hat\eta_a(y \mid X)\}$ of the propensity score and conditional and counterfactual densities. Estimated versions of $\phi_a(T)$ and $\lambda_a$ defined in Theorem 2 follow accordingly. Then the density effect estimators we propose are defined as
$$\hat\psi_f = \int f\big(\hat p_1(y), \hat p_0(y)\big)\, \hat p_0(y)\, dy + \mathbb{P}_n\Big\{\hat\phi_1\big(\hat\lambda_1(Y)\big) + \hat\phi_0\big(\hat\lambda_0(Y)\big)\Big\}, \quad (22)$$
which can again be viewed as one-step bias-corrected estimators, with plug-in bias estimated via an average of the estimated influence function. In practice, rather than estimating the conditional density $\eta_a$ and integrating over its y argument, one could instead regress for example $\hat\lambda_a$ on X for the integral terms in the estimated influence function.

Proposition 5. If $D_f = L_2$ then the proposed density effect estimator can be written as
$$\hat\psi_f = 2\,\mathbb{P}_n\bigg(\frac{2A - 1}{\hat\pi_A(X)}\Big[\big\{\hat p_1(Y) - \hat p_0(Y)\big\} - \int \big\{\hat p_1(y) - \hat p_0(y)\big\}\hat\eta_A(y \mid X)\, dy\Big] + \int \big\{\hat p_1(y) - \hat p_0(y)\big\}\big\{\hat\eta_1(y \mid X) - \hat\eta_0(y \mid X)\big\}\, dy\bigg) - \int \big\{\hat p_1(y) - \hat p_0(y)\big\}^2\, dy.$$

Thus for $L_2$ distance the estimator takes twice an estimate of the mean of $(\hat p_1 - \hat p_0)(Y^1) - (\hat p_1 - \hat p_0)(Y^0)$, which is $\int (\hat p_1 - \hat p_0)(p_1 - p_0)$, and subtracts a plug-in estimate of the $L_2$ distance. This is analogous to the standard one-step estimator of the expected (observational) density $\int p(x)^2\, dx$ [Bickel and Ritov, 1988], which takes twice an estimate of the mean of $\hat p(X)$, i.e., $2\int \hat p\, p$, and subtracts the plug-in estimate $\int \hat p^2$.
For the expected density, the bias is just the integrated squared difference between $\hat p$ and p; in contrast, in our setting, we show next that there is an additional doubly robust error term, due to the confounding adjustment required for estimating counterfactual densities.

Theorem 4.
Assume $\lambda_a$ and $1/\hat\pi_a$ are bounded above by some constant for $a = 0, 1$, and $\lambda_a$ is differentiable in $p_a(y)$, with derivative bounded uniformly above by $\delta_a$. Then
$$\hat\psi_f - \psi_f = (\mathbb{P}_n - P)\Big\{\phi_1\big(\lambda_1(Y)\big) + \phi_0\big(\lambda_0(Y)\big)\Big\} + O_P\bigg(\sum_{a=0}^1 \Big(\|\hat\pi_a - \pi_a\|\, \|\hat\eta_a - \eta_a\| + \delta_a\, \|\hat p_a - p_a\|^2\Big)\bigg) + o_P\Big(\frac{1}{\sqrt{n}}\Big).$$

Theorem 4 (whose proof mimics that of Theorem 3) shows that $\hat\psi_f$ can attain faster rates than its nuisance estimators, and can be asymptotically efficient under weak nonparametric conditions. The conditions and the form of the convergence rate are similar to those of Theorem 3, so we refer to our discussion there for more details. However we do comment on a few differences. First, for the density functions targeted in Theorem 3, the moment condition $m(\beta)$, and resulting influence functions and estimators, can have a complicated dependence on $\beta$; in contrast, this is not an issue for the density effect $\psi_f$ since the influence function is linear in the parameter. Thus extra smoothness conditions on the influence function used in Theorem 3 are not required in Theorem 4. Second, although in Theorem 3 the derivative bound $\delta$ can be exactly zero in some prominent cases, in general in Theorem 4 this will not be the case (e.g., for $L_2$ distance the derivative of $\lambda_a$ is a nonzero constant). Therefore, for efficient estimation of density effects, we in general need an initial density estimator converging at $n^{-1/4}$ rate.
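Given estimated influence-function values, the resulting Wald confidence interval takes a simple form; a minimal sketch (`wald_ci` is a hypothetical helper, not part of npcausal):

```python
# Minimal sketch of an influence-function-based Wald interval: the standard
# error is the empirical standard deviation of the estimated influence-function
# values divided by sqrt(n).
import math

def wald_ci(psi_hat, if_vals, z=1.96):
    n = len(if_vals)
    mean = sum(if_vals) / n
    var = sum((v - mean) ** 2 for v in if_vals) / n  # empirical variance
    se = math.sqrt(var / n)
    return psi_hat - z * se, psi_hat + z * se

# Toy influence-function values with mean 0 and variance 1, so se = 1/sqrt(n):
lo, hi = wald_ci(0.5, [-1.0, 1.0, -1.0, 1.0])
```

As discussed next, such intervals are only valid away from the null where the influence function is non-degenerate.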
However recall that, as described in Remark 4, there exist nonparametric counterfactual density estimators with error upper bounded by $\|\hat\pi_a - \pi_a\|$ or $\|\hat\eta_a - \eta_a\|$, so that $\|\hat p_a - p_a\|^2$ would be of smaller or similar order compared to the product error preceding it.

There is a third distinction in density effect estimation. Under usual $n^{-1/4}$ rate conditions on the nuisance estimators, Theorem 4 suggests 95% confidence intervals of the form
$$\hat\psi_f \pm 1.96\,\sqrt{\widehat{\mathrm{cov}}\Big\{\hat\phi_1\big(\hat\lambda_1(Y)\big) + \hat\phi_0\big(\hat\lambda_0(Y)\big)\Big\}/n}. \quad (23)$$
These intervals are asymptotically valid as usual when $p_1 \neq p_0$, but not when $p_1 = p_0$, since then the influence function of $\psi_f$ reduces to zero, as mentioned in Section 4.2. This invalidates inference because the first sample average in Theorem 4 is no longer dominant, as with degenerate U-statistics or other estimators whose higher-order terms dominate their von Mises expansions (cf. Sections 12.3 and 20.1.1 of van der Vaart [2000]). However, the presence of nuisance functions complicates things substantially, as noted in other similarly complex functional estimation problems [Luedtke et al., 2019, Williamson et al., 2020], and we are not aware of a general solution. Thus we only recommend using the interval (23) in non-null settings when $p_1 \neq p_0$. A simple albeit ad hoc fix is to use the interval $\hat\psi \pm z_{\alpha/2}\,(s \vee 1/\sqrt{n})$ where $s = \sqrt{\widehat{\mathrm{cov}}\{\hat\phi_1(\hat\lambda_1(Y)) + \hat\phi_0(\hat\lambda_0(Y))\}/n}$. This is valid but conservative near the null.

5.3 Model Selection & Aggregation

Here we briefly describe how the methods of the previous subsections can be used for the purposes of model selection and aggregation, in the same spirit as Tsybakov [2003], van der Laan and Dudoit [2003], and others.
We leave technical details to future work.

First we consider the linear aggregation goal as defined in (14), where $\mathcal{B} = \mathbb{R}^K$. In this setup the methods from Section 5.1 can be straightforwardly adapted, by adding an extra step of sample splitting. We focus on $L_2$ projections since f-divergences may not be well-defined for general linear combinations of candidate estimators. Our proposed approach is as follows:

Step 1.
Randomly split the sample into a training set $D_{n_1}$ and test set $D_{n_2}$.

Step 2.
On the training set $D_{n_1}$, estimate K different models (e.g., K different numbers of basis functions, or K different combinations of linear, exponential family, Gaussian mixture models, etc.), using the estimator in (19) to compute $\hat g_k(y) = g(y; \hat\beta_k)$, $k = 1, \dots, K$.

Step 3.
On the test set $D_{n_2}$, estimate the ($L_2$) projection onto an orthonormal basis of the linear span of $(\hat g_1, \dots, \hat g_K)$, again using the estimator in (19), e.g., with the series model in Example 1b with $q(y) = 0$, to compute an aggregated estimator $\hat g(y) = \sum_k \hat\theta_k \hat g_k(y)$.

Step 4.
Reverse the roles of $D_{n_1}$ and $D_{n_2}$ and average the two resulting aggregates.

Note that inside Steps 2–3, another layer of sample splitting is required to avoid empirical process conditions in estimating the nuisance functions, as discussed in Remark 5. We also note that the cross-fitting in Step 4 could be considered optional if the corresponding efficiency loss was considered negligible; alternatively one could instead implement Steps 1–4 with M different folds, at each step using $M - 1$ folds for training and the other fold as the test set. We conjecture that the above approach can attain the optimal K/n rates for linear density aggregation in the observational case [Rigollet and Tsybakov, 2007], under standard $n^{-1/4}$-type conditions on the nuisance estimators (or weaker, depending on how K scales with n).

For model selection and convex aggregation, we propose a similar procedure, except where in Step 3 variants of the density effect estimators from Section 5.2 are used to estimate the distance between $p_a$ and each of the K candidates estimated from the training split (after projecting each onto the space of valid densities). One can then pick the minimum distance candidate or an appropriately weighted combination, e.g., by finding the convex weights that minimize the estimated distance in the test split. For example, our proposed estimator of the $L_2$ error of a candidate $g_k$ based on Proposition 2 is given by
$$\hat\Delta_f(g_k) = \int \big(\hat p_a(y) - g_k(y)\big)^2\, dy + 2\,\mathbb{P}_n\Big\{\hat\phi_a\big(\hat p_a(Y) - g_k(Y)\big)\Big\}.$$
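Choosing the minimum estimated-risk candidate can be sketched numerically. The following toy illustration uses an IPW-only unbiased sketch of $\int g_k^2 - 2\int g_k p_a$ (a simplification of the risk estimators above, assuming the propensity score is known; all names hypothetical):

```python
# Toy model-selection step: estimate a risk for each candidate density g_k and
# pick the minimizer. The risk is an IPW-only sketch of the quantity
# int g_k^2 - 2 int g_k p_a, which orders candidates by L2 distance to p_a.

def pseudo_l2_risk(y_obs, a_obs, pi, g, grid, dy, a=1):
    n = len(y_obs)
    ipw = sum((1.0 if ai == a else 0.0) / pi * g(yi)
              for yi, ai in zip(y_obs, a_obs)) / n     # estimates int g * p_a
    g_sq = sum(g(y) ** 2 for y in grid) * dy           # discretized int g^2
    return g_sq - 2.0 * ipw

def select_model(risks):
    return min(range(len(risks)), key=lambda k: risks[k])

def g_uniform(y):
    return 1.0

def g_tilted(y):
    return 2.0 * y

grid, dy = [i / 100 for i in range(101)], 0.01
y_obs, a_obs = [0.2, 0.4, 0.6, 0.8], [1, 1, 1, 1]
risks = [pseudo_l2_risk(y_obs, a_obs, 1.0, g, grid, dy)
         for g in (g_uniform, g_tilted)]
best = select_model(risks)
```

The proposed estimators additionally use the regression-based augmentation terms, which this sketch omits for brevity.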
For the purposes of model selection, one can instead use the simpler pseudo-$L_2$ risk
$$\hat\Delta^*_f(g_k) = -2\,\mathbb{P}_n\bigg[\frac{\mathbb{1}(A = a)}{\hat\pi_a(X)}\Big\{g_k(Y) - \int g_k(y)\hat\eta_a(y \mid X)\, dy\Big\} + \int g_k(y)\hat\eta_a(y \mid X)\, dy\bigg] + \int g_k(y)^2\, dy, \quad (24)$$
based on the fact that the $L_2$ distance $\int (p_a - g_k)^2$ equals $\int g_k^2 - 2\int g_k p_a$ plus a term $\int p_a^2$ that does not depend on $g_k$. This is the estimator we use in the data analysis in the next section.

6 Illustration
Here we apply our proposed methods to analyze the effect of combined antiretroviral therapy for treating HIV. All code is given in Appendix A, and the methods are implemented in the npcausal
R package on GitHub (https://github.com/ehkennedy/npcausal). The data we use come from the ACTG 175 randomized trial [Hammer et al., 1996], and are available in the speff2trial
R package. The treatment is whether patients received combination therapy ($A = 1$) versus zidovudine alone ($A = 0$), and the outcome Y is CD4 count at 96 weeks post-baseline. Baseline covariates X include age, weight, Karnofsky score, indicators for race, gender, hemophilia, homosexual activity, drug use, whether symptomatic, and previous zidovudine and antiretroviral use. There are a total of n = 2319 patients in the trial, 797 of whom do not have outcome data (we use R = 1 to denote an observed outcome).

Since we are interested in the density of outcomes had all versus none been treated in the entire population (i.e., had all outcomes been measured), we can view the product indicator $\mathbb{1}(A = a, R = 1)$ as a joint "treatment" variable [van der Laan and Robins, 2003]. In other words our goal is to estimate counterfactual densities under A = 1 and R = 1, versus A = 0 and R = 1. Our methods therefore rely on no unmeasured confounding of A (which holds by virtue of the experimental design) and missingness at random of Y (i.e., $R \perp\!\!\!\perp Y \mid X, A$), which is untestable regardless of whether treatment is randomized. For more details on the trial and data, we refer to Hammer et al. [1996] and Wang et al. [2018].

Throughout our analysis, we used 5-fold cross-fitting, with all nuisance functions estimated by random forests (via the R package ranger [Wright and Ziegler, 2015]). This includes conditional densities $\eta_a$, which we estimated by regressing a Gaussian-kernel-weighted outcome on covariates and treatment, on a grid of y values, with bandwidth chosen by Silverman's rule. Alternative approaches could also be used [Díaz and van der Laan, 2011, Hansen, 2004, Izbicki and Lee, 2017], potentially at the expense of some extra computational burden.

First we used the density effect methods from Section 5.2 to check for evidence of an effect of combination therapy on the density of CD4 count.
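The kernel-weighted-outcome device for estimating conditional densities can be sketched in the covariate-free case, where the regression of $K_h(Y - y)$ collapses to a sample mean, i.e., a kernel density estimate at each grid point. A simplified illustration (not the ranger-based pipeline used in the analysis):

```python
# Covariate-free sketch of the kernel-weighted-outcome device: regressing the
# transformed outcome K_h(Y - y) on an empty covariate set reduces to its
# sample mean, which is exactly a kernel density estimate at grid point y.
import math

def gauss_kernel(u, h):
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def density_via_weighted_outcome(y_sample, grid, h):
    n = len(y_sample)
    return [sum(gauss_kernel(yi - y, h) for yi in y_sample) / n for y in grid]

y_sample = [0.0, 1.0, -1.0, 0.5]
grid = [-5.0 + i * 0.01 for i in range(1001)]
dens = density_via_weighted_outcome(y_sample, grid, h=0.5)
total = sum(dens) * 0.01  # discretized integral; should be close to 1
```

With covariates, the sample mean is simply replaced by a flexible regression of $K_h(Y - y)$ on $(X, A)$, fit separately over the grid of y values.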
Specifically, we used the cross-fit version of the estimator in Proposition 5 to estimate the $L_2$ distance between $p_1$ and $p_0$, with asymptotic variance estimated as usual, via the empirical variance of the estimated influence function. To ease interpretability we rescaled Y to be on the unit interval. The estimated $L_2$ distance was 0.279 with a 95% confidence interval of [0. , 0. ], indicating a statistically significant effect of combination therapy on CD4 count.

To more precisely understand how combination therapy impacted the CD4 distribution, we estimated the counterfactual densities using the methods of Sections 5.1 and 5.3. Specifically, we used $L_2$ projections onto the linear series in Example 1b with the cosine basis (5). We considered a range of models for both densities, including up to 15 basis terms (more than 15 terms did not improve fit). Figure 3 shows estimates of model fit via the pseudo-$L_2$ risk (24), along with confidence intervals, indicating that four basis terms does best for both counterfactual densities. Figure 4 shows the estimated counterfactual CD4 densities using four basis terms, along with pointwise confidence intervals. Since the densities differ more substantially in the lowest CD4 range (e.g., 0–200), this suggests combination therapy may have increased CD4 count most for the high-risk patients with the lowest counts under control (zidovudine).

Figure 3:
Estimates of pseudo-$L_2$ risk for models of increasing dimension (using $L_2$ projections onto linear models with a cosine basis), with gray bars denoting confidence intervals.

Figure 4:
Estimated counterfactual CD4 densities for combination therapy versus zidovudine.

7 Discussion
In this paper we proposed methods for estimating counterfactual densities and corresponding distances and other functionals. We gave nonparametric efficiency bounds and flexible optimal estimators for a wide class of models and projection distances, and for new effects that quantify treatment impacts on the density scale. We also gave methods for data-driven model selection and aggregation in this context, and illustrated the ideas in an application studying effects of antiretroviral therapy on CD4 count.

There are many interesting avenues for future work. In upcoming companion papers, we consider the nonparametric version of the problem (where the target is the density $p_a$ itself and not a projection) as well as non-discrete treatments (where A is for example a continuous dose). Much more work is needed on the computational side since, outside of $L_2$ projections on linear models and KL projections on exponential families, our methods require solving somewhat complicated estimating equations. Other extensions could involve time-varying treatments, instrumental variables, conditional effects, density-optimal treatment regimes, mediation, sensitivity analysis, and more. It is also of interest to apply the methods more broadly, to see if they bring any new insights about treatment mechanisms or ways to adapt treatment policies.

Acknowledgements
Edward Kennedy gratefully acknowledges support from NSF Grant DMS1810979, and Sivaraman Balakrishnan and Larry Wasserman from NSF Grant DMS1713003.
A Appendix: R Code

set.seed(100)

References

A. Abadie. Bootstrap tests for distributional treatment effects in instrumental variable models.
Journal of the American Statistical Association, 97(457):284–292, 2002.
S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
R. Beran. Minimum Hellinger distance estimates for parametric models. The Annals of Statistics, 5(3):445–463, 1977.
P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā, pages 381–393, 1988.
P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press, 1993.
A. Buja, L. Brown, R. Berk, E. George, E. Pitkin, M. Traskin, K. Zhang, and L. Zhao. Models as approximations I: Consequences illustrated with linear regression. Statistical Science, 34(4):523–544, 2019a.
A. Buja, L. Brown, A. K. Kuchibhotla, R. Berk, E. George, and L. Zhao. Models as approximations II: A model-free theory of parametric regression. Statistical Science, 34(4):545–565, 2019b.
J. Chen, D. Zhang, and M. Davidian. A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution. Biostatistics, 3(3):347–360, 2002.
V. Chernozhukov and C. Hansen. An IV model of quantile treatment effects. Econometrica, 73(1):245–261, 2005.
V. Chernozhukov, I. Fernández-Val, and B. Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205–2268, 2013.
V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. M. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018a.
V. Chernozhukov, M. Demirer, E. Duflo, and I. Fernandez-Val. Generic machine learning inference on heterogenous treatment effects in randomized experiments. Technical report, National Bureau of Economic Research, 2018b.
I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
M. Cuellar and E. H. Kennedy. A non-parametric projection-based estimator for the probability of causation, with application to water sanitation in Kenya. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(4):1793–1818, 2020.
I. Díaz. Efficient estimation of quantiles in missing data models. Journal of Statistical Planning and Inference, 190:39–51, 2017.
I. Díaz and M. J. van der Laan. Super learner based conditional density estimation with application to marginal structural models. 2011.
J. DiNardo, N. M. Fortin, and T. Lemieux. Labor market institutions and the distribution of wages, 1973–1992: A semiparametric approach. Econometrica, pages 1001–1044, 1996.
M. H. Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.
S. Firpo. Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75(1):259–276, 2007.
N. Fortin, T. Lemieux, and S. Firpo. Decomposition methods in economics. Handbook of Labor Economics, 4:1–102, 2011.
M. Frölich and B. Melly. Unconditional quantile treatment effects under endogeneity. Journal of Business & Economic Statistics, 31(3):346–357, 2013.
S. M. Hammer, D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley, R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, and M. Niu. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335(15):1081–1090, 1996.
B. E. Hansen. Nonparametric conditional density estimation. Unpublished manuscript, 2004.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233. University of California Press, 1967.
R. Izbicki and A. B. Lee. Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11(2):2800–2831, 2017.
E. H. Kennedy, S. Lorch, and D. S. Small. Robust causal inference with continuous instruments using the local instrumental variable curve. Journal of the Royal Statistical Society: Series B, 81(1):121–143, 2019.
E. H. Kennedy, S. Balakrishnan, and M. G'Sell. Sharp instruments for classifying compliers and generalizing causal effects. The Annals of Statistics, 48(4):2008–2030, 2020.
K. Kim, J. Kim, and E. H. Kennedy. Causal effects based on distributional distances. arXiv:1806.02935, 2018.
A. Luedtke, M. Carone, and M. J. van der Laan. An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):75–99, 2019.
J. A. Machado and J. Mata. Counterfactual decomposition of changes in wage distributions using quantile regression. Journal of Applied Econometrics, 20(4):445–465, 2005.
B. Melly. Decomposition of differences in distribution using quantile regression. Labour Economics, 12(4):577–590, 2005.
R. Neugebauer and M. J. van der Laan. Nonparametric causal effects based on marginal structural models. Journal of Statistical Planning and Inference, 137(2):419–434, 2007.
D. J. Newman et al. Rational approximation to |x|. Michigan Mathematical Journal, 11(1):11–14, 1964.
A. Pinheiro and B. Vidakovic. Estimating the square root of a density via compactly supported wavelets. Computational Statistics & Data Analysis, 25(4):399–415, 1997.
A. Rakhlin, K. Sridharan, A. B. Tsybakov, et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.
A. Rényi et al. On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 1961.
P. Rigollet and A. B. Tsybakov. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16(3):260–280, 2007.
A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.
J. M. Robins and A. Rotnitzky. Comments on: Inference for semiparametric models: Some questions and an answer. Statistica Sinica, 11:920–936, 2001.
J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
J. M. Robins, E. J. Tchetgen Tchetgen, L. Li, and A. W. van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321, 2009.
J. M. Robins, L. Li, R. Mukherjee, E. Tchetgen Tchetgen, and A. W. van der Vaart. Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.
C. Rothe. Nonparametric estimation of distributional policy effects. Journal of Econometrics, 155(1):56–70, 2010.
D. B. Rubin and M. J. van der Laan. Extending marginal structural models through local, penalized, and additive learning. UC Berkeley Division of Biostatistics Working Paper Series, 212:1–20, 2006.
I. Sason and S. Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
V. Semenova and V. Chernozhukov. Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 2020.
A. B. Tsybakov. Optimal rates of aggregation. Learning Theory and Kernel Machines, pages 303–313, 2003.
A. B. Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.
M. J. van der Laan. Statistical inference for variable importance. The International Journal of Biostatistics, 2(1), 2006.
M. J. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, Paper 130, 2003.
M. J. van der Laan and J. M. Robins. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, 2003.
A. W. van der Vaart. Asymptotic Statistics. Cambridge: Cambridge University Press, 2000.
A. W. van der Vaart. Semiparametric statistics. In: Lectures on Probability Theory and Statistics, pages 331–457, 2002.
L. Wang, Y. Zhou, R. Song, and B. Sherwood. Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523):1243–1254, 2018.
Q. Wang and Y. Qin. Empirical likelihood confidence bands for distribution functions with missing responses. Journal of Statistical Planning and Inference, 140(9):2778–2789, 2010.
L. Wasserman. All of Nonparametric Statistics. Springer, 2006.
T. Westling and M. Carone. A unified study of nonparametric inference for monotone functions. The Annals of Statistics, 48(2):1001, 2020.
H. White. Using least squares to approximate unknown regression functions. International Economic Review, pages 149–170, 1980.
H. White. Maximum likelihood estimation of misspecified models. Econometrica, pages 1–25, 1982.
H. White. Estimation, Inference and Specification Analysis. Number 22. Cambridge University Press, 1996.
B. D. Williamson, P. B. Gilbert, N. R. Simon, and M. Carone. A unified approach for inference on algorithm-agnostic variable importance. arXiv preprint arXiv:2004.03683, 2020.
M. N. Wright and A. Ziegler. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409, 2015.
Z. Zhang, Z. Chen, J. F. Troendle, and J. Zhang. Causal inference on quantiles with an obstetric application. Biometrics, 68(3):697–706, 2012.
W. Zheng and M. J. van der Laan. Asymptotic theory for cross-validated targeted maximum likelihood estimation. UC Berkeley Division of Biostatistics Working Paper Series, Paper 273:1–58, 2010.
B Appendix: Proofs
B.1 Proof of Corollaries 1–4 and 6
These corollaries all follow from the distance-specific form of f. For reference we list the relevant quantities here: f(p,q) itself, its partial derivatives with respect to each argument, and the mixed second derivative.

For the L2 distance we have
\[ f(p,q) = \frac{(p-q)^2}{q}, \qquad \frac{\partial f}{\partial p} = 2\left(\frac{p}{q}-1\right), \qquad \frac{\partial f}{\partial q} = 1-\left(\frac{p}{q}\right)^2, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = -\frac{2p}{q^2}. \]
For the KL divergence we have
\[ f(p,q) = \frac{p}{q}\log\left(\frac{p}{q}\right), \qquad \frac{\partial f}{\partial p} = \frac{1}{q}\left\{\log\left(\frac{p}{q}\right)+1\right\}, \qquad \frac{\partial f}{\partial q} = -\frac{p}{q^2}\left\{\log\left(\frac{p}{q}\right)+1\right\}, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = -\frac{1}{q^2}\left\{\log\left(\frac{p}{q}\right)+2\right\}. \]
For the \chi^2 divergence we have
\[ f(p,q) = \left(\frac{p}{q}-1\right)^2, \qquad \frac{\partial f}{\partial p} = \frac{2(p-q)}{q^2}, \qquad \frac{\partial f}{\partial q} = -\frac{2p(p-q)}{q^3}, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = \frac{2(q-2p)}{q^3}. \]
For the Hellinger divergence we have
\[ f(p,q) = \left(\sqrt{\frac{p}{q}}-1\right)^2, \qquad \frac{\partial f}{\partial p} = \frac{1}{q}\left(1-\sqrt{\frac{q}{p}}\right), \qquad \frac{\partial f}{\partial q} = \frac{\sqrt{p}}{q^2}\left(\sqrt{q}-\sqrt{p}\right), \qquad \frac{\partial^2 f}{\partial p\,\partial q} = \frac{1}{2q^2}\left(\sqrt{\frac{q}{p}}-2\right). \]
For the TV* divergence (with \nu_t the smooth approximation to the absolute value function) we have
\[ f(p,q) = \frac{\nu_t(p-q)}{2q}, \qquad \frac{\partial f}{\partial p} = \frac{\nu_t'(p-q)}{2q}, \qquad \frac{\partial f}{\partial q} = -\frac{1}{2q}\left\{\frac{\nu_t(p-q)}{q}+\nu_t'(p-q)\right\}, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = -\frac{1}{2q}\left\{\frac{\nu_t'(p-q)}{q}+\nu_t''(p-q)\right\}. \]

B.2 Proof of Lemma 1

Here we let \( \psi = \psi(P) = \int h(p_a(y))\,dy \) for some twice continuously differentiable function h. We will show that ψ satisfies the von Mises expansion given in Lemma 1. Let \( \bar{p}_a(y) = \int \bar{\eta}_a(y \mid x)\, d\bar{P}(x) \) denote the marginal counterfactual density under \( \bar{P} \).
Note that for the posited influence function, given by
\[ \varphi(z;\bar{P}) = \frac{\mathbb{1}(A=a)}{\bar{\pi}_a(X)}\left\{ h'(\bar{p}_a(Y)) - \int h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid X)\,dy \right\} + \int h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid X)\,dy - \iint h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid x)\,dy\, d\bar{P}(x), \]
we have, by iterated expectation, that its mean under P equals
\begin{align*}
\int \varphi(z;\bar{P})\, dP 
&= \int \frac{\pi_a(x)}{\bar{\pi}_a(x)} \int \left\{ h'(\bar{p}_a(y))\,\eta_a(y\mid x) - h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \iint h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid x)\,dy\, \left\{ dP(x) - d\bar{P}(x) \right\} \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \int h'(\bar{p}_a(y)) \int \left\{ \eta_a(y\mid x)\, dP(x) - \bar{\eta}_a(y\mid x)\, d\bar{P}(x) \right\} dy \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \int h'(\bar{p}_a(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\} dy.
\end{align*}
Therefore the second-order remainder term in the von Mises expansion is
\begin{align*}
R_2(\bar{P}, P) &\equiv \psi(\bar{P}) - \psi(P) - \int \varphi(z;\bar{P})\, d(\bar{P} - P) = \psi(\bar{P}) - \psi(P) + \int \varphi(z;\bar{P})\, dP \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \int h'(\bar{p}_a(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\} dy + \int \left\{ h(\bar{p}_a(y)) - h(p_a(y)) \right\} dy \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) - \frac{1}{2} \int h''(p_a^*(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\}^2 dy,
\end{align*}
where the first equality uses that \( \varphi(\cdot;\bar{P}) \) has mean zero under \( \bar{P} \), and the last line follows by a Taylor expansion with remainder of the mean-value form, with \( p_a^*(y) \) lying between \( \bar{p}_a(y) \) and \( p_a(y) \).

B.3 Proof of Theorem 1

First, for any fixed β, each element of the p-vector
\[ m(\beta) = \int \frac{\partial g(y;\beta)}{\partial \beta} \left\{ f\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial f}{\partial q}\big(p_a(y), g(y;\beta)\big) \right\} dy \]
can be viewed as a density functional \( \int h(p_a(y))\,dy \) for a specific function h, where \( \partial f/\partial q \) denotes the partial derivative of f(p,q) with respect to its second argument. In particular, let \( g_j'(y;\beta) \) denote the jth element of \( \partial g(y;\beta)/\partial \beta \), so that \( h(p_a(y)) = \{h_1(p_a(y)), \ldots, h_p(p_a(y))\}^{\mathrm{T}} \) for
\[ h_j(p_a(y)) = g_j'(y;\beta) \left\{ f\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial f}{\partial q}\big(p_a(y), g(y;\beta)\big) \right\}, \qquad (25) \]
noting that, for a given β value, g(y;β) is a known constant not depending on P. Now we apply Lemma 1 to each component of m.
First note that
\[ h_j'(p_a(y)) = g_j'(y;\beta) \left\{ \frac{\partial f}{\partial p}\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial^2 f}{\partial p\,\partial q}\big(p_a(y), g(y;\beta)\big) \right\} \]
by the chain rule, so that
\[ \gamma_f(y;\beta) = \frac{\partial g(y;\beta)}{\partial \beta} \left\{ \frac{\partial f}{\partial p}\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial^2 f}{\partial p\,\partial q}\big(p_a(y), g(y;\beta)\big) \right\} = \left\{ h_1'(p_a(y)), \ldots, h_p'(p_a(y)) \right\}^{\mathrm{T}}. \]
Therefore Lemma 1 implies that \( m(\beta) = \int h(p_a(y))\,dy \) satisfies the von Mises expansion
\[ \bar{m}(\beta) - m(\beta) = \int \varphi_m(z, \beta; \bar{P})\, d(\bar{P} - P)(z) + R_2(\bar{P}, P), \qquad (26) \]
where
\[ \varphi_m(Z, \beta; \bar{P}) = \frac{\mathbb{1}(A=a)}{\bar{\pi}_a(X)} \left\{ \bar{\gamma}_f(Y;\beta) - \int \bar{\gamma}_f(y;\beta)\,\bar{\eta}_a(y\mid X)\,dy \right\} + \int \bar{\gamma}_f(y;\beta)\,\bar{\eta}_a(y\mid X)\,dy - \iint \bar{\gamma}_f(y;\beta)\,\bar{\eta}_a(y\mid x)\,dy\, d\bar{P}(x) \]
(with \( \bar{\gamma}_f \) defined as \( \gamma_f \) above but with \( \bar{p}_a \) in place of \( p_a \)), and where the jth component of \( R_2(\bar{P}, P) \) is given by
\[ R_{2,j}(\bar{P}, P) = \iint h_j'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) - \frac{1}{2} \int h_j''(p_a^*(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\}^2 dy. \qquad (27) \]
Now we give a lemma showing why finding a von Mises expansion like the above, with second-order remainder, is equivalent to finding the efficient influence function in a nonparametric model. This will prove that \( \varphi_m \) is the efficient influence function for m(β), and will also be useful for later results.

Lemma 2.
Let \( \psi : \mathcal{P} \to \mathbb{R} \) denote a real-valued functional on a nonparametric model, so that the set of distributions \( \mathcal{P} \) does not constrain the tangent space. Assume the functional satisfies
\[ \psi(\bar{P}) - \psi(P) = \int \varphi(z;\bar{P})\, (d\bar{P} - dP) + R_2(\bar{P}, P) \]
for some finite-variance function \( \varphi(z;\bar{P}) \) with mean zero under \( \bar{P} \). Then φ is the efficient influence function if \( \frac{d}{d\epsilon} R_2(P, P_\epsilon)\big|_{\epsilon=0} = 0 \) for any smooth parametric submodel \( P_\epsilon \) with \( P_{\epsilon=0} = P \).

Proof. Recall from Bickel et al. [1993] and van der Vaart [2002] that the efficient influence function is the mean-zero function whose variance equals the nonparametric efficiency bound, and is given by the unique function φ that is a valid submodel score (or limit of such scores) satisfying pathwise differentiability, i.e.,
\[ \frac{d}{d\epsilon} \psi(P_\epsilon) \Big|_{\epsilon=0} = \int \varphi(z;P) \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) \Big|_{\epsilon=0} dP(z) \qquad (28) \]
for \( P_\epsilon \) any smooth parametric submodel (e.g., differentiable in quadratic mean). In a nonparametric model only one such function φ satisfies the above. We will show that the above is satisfied by the function φ in the statement of the lemma.

First note that the assumed expansion implies
\[ \psi(P) - \psi(P_\epsilon) = -\int \varphi(z;P)\, dP_\epsilon + R_2(P, P_\epsilon) \]
for any submodel \( P_\epsilon \). Differentiating with respect to ε gives
\[ \frac{d}{d\epsilon} \psi(P_\epsilon) = \frac{d}{d\epsilon} \int \varphi(z;P)\, dP_\epsilon - \frac{d}{d\epsilon} R_2(P, P_\epsilon) = \int \varphi(z;P) \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) dP_\epsilon - \frac{d}{d\epsilon} R_2(P, P_\epsilon), \]
where the second equality follows from the dominated convergence theorem and uses the fact that \( \frac{d}{d\epsilon} \log dP_\epsilon = \frac{d}{d\epsilon} dP_\epsilon / dP_\epsilon \).
Therefore, evaluating at ε = 0, we have
\[ \frac{d}{d\epsilon} \psi(P_\epsilon) \Big|_{\epsilon=0} = \int \varphi(z;P) \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) \Big|_{\epsilon=0} dP - \frac{d}{d\epsilon} R_2(P, P_\epsilon) \Big|_{\epsilon=0}, \]
which yields the desired pathwise differentiability by the assumption that \( \frac{d}{d\epsilon} R_2(P, P_\epsilon)\big|_{\epsilon=0} = 0 \).

Now we can immediately apply Lemma 2, noting that
\[ \frac{d}{d\epsilon} R_2(P, P_\epsilon) \Big|_{\epsilon=0} = 0 \]
by virtue of the fact that the remainder \( R_{2,j}(P, P_\epsilon) \) in (27) consists of only second-order products of errors between P and \( P_\epsilon \). This follows since applying the product rule yields a sum of two terms, each of which is a product of a derivative term (which may not be zero at ε = 0) and an error term involving differences of components of \( P_\epsilon \) and P (which will be zero at ε = 0). Therefore \( \varphi_m \) is the efficient influence function for the parameter m(β). The efficient influence functions for β and g(y;β) follow similarly, via the chain rule.

B.4 Proof of Theorem 2
From Lemmas 1 and 2, if \( p_0(y) \) were known, the efficient influence function of \( \psi_f = \int f(p_1(y), p_0(y))\, p_0(y)\, dy \) would be
\[ \varphi_1(z;P) = \frac{\mathbb{1}(A=1)}{\pi(1\mid X)} \left\{ h_1'(Y) - \int h_1'(y)\,\eta_1(y\mid X)\,dy \right\} + \int h_1'(y)\,\eta_1(y\mid X)\,dy - \iint h_1'(y)\,\eta_1(y\mid x)\,dy\, dP(x), \]
where
\[ h_1'(y) = h_1'(y; p_1, p_0) = p_0(y)\, \frac{\partial f}{\partial p}\big(p_1(y), p_0(y)\big), \]
with \( \partial f/\partial p \) the partial derivative of f with respect to its first argument. Similarly, if \( p_1(y) \) were known, the efficient influence function of \( \psi_f \) would be
\[ \varphi_0(z;P) = \frac{\mathbb{1}(A=0)}{\pi(0\mid X)} \left\{ h_0'(Y) - \int h_0'(y)\,\eta_0(y\mid X)\,dy \right\} + \int h_0'(y)\,\eta_0(y\mid X)\,dy - \iint h_0'(y)\,\eta_0(y\mid x)\,dy\, dP(x), \]
where
\[ h_0'(y) = h_0'(y; p_1, p_0) = f\big(p_1(y), p_0(y)\big) + p_0(y)\, \frac{\partial f}{\partial q}\big(p_1(y), p_0(y)\big). \]
The result then follows from the fact that the influence function when \( p_1 \) and \( p_0 \) are both unknown is the sum of the two influence functions obtained when \( p_0 \) and \( p_1 \), respectively, are known.

B.5 Proof of Claim in Remark 4
Here we show why rates for estimating \( p_a(y) \) with the plug-in estimator \( \hat{p}_a(y) = \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \) will not be slower than those for estimating \( \hat{\eta}_a(y\mid x) \), by bounding the mean squared error of the former in terms of the latter. To this end we denote the pointwise bias and variance of \( \hat{\eta}_a \) as \( E\{\hat{\eta}_a(y\mid x)\} - \eta_a(y\mid x) = b(y\mid x) \) and \( \mathrm{var}\{\hat{\eta}_a(y\mid x)\} = v(y\mid x) \), respectively. First note for the bias that
\begin{align*}
E\{\hat{p}_a(y)\} - p_a(y) &= E\Big[ \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \Big] - \int \eta_a(y\mid x)\, dP(x) \\
&= E \int \hat{\eta}_a(y\mid x)\, dP(x) - \int \eta_a(y\mid x)\, dP(x) = \int b(y\mid x)\, dP(x),
\end{align*}
where in the second line we used iterated expectation, conditioning on the training sample \( D_n \) used to construct \( \hat{\eta}_a \). For the variance we similarly have
\begin{align*}
\mathrm{var}\{\hat{p}_a(y)\} &= \mathrm{var}\Big( E\big[ \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \mid D_n \big] \Big) + E\Big( \mathrm{var}\big[ \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \mid D_n \big] \Big) \\
&= \mathrm{var} \int \hat{\eta}_a(y\mid x)\, dP(x) + \frac{1}{n} E\Big[ \mathrm{var}\big\{ \hat{\eta}_a(y\mid X) \mid D_n \big\} \Big].
\end{align*}
For the first term above, by Cauchy-Schwarz we have
\[ \mathrm{var} \int \hat{\eta}_a(y\mid x)\, dP(x) = E\left\{ \int \Big[ \hat{\eta}_a(y\mid x) - E\{\hat{\eta}_a(y\mid x)\} \Big] dP(x) \right\}^2 \le E \int \Big[ \hat{\eta}_a(y\mid x) - E\{\hat{\eta}_a(y\mid x)\} \Big]^2 dP(x) = \int v(y\mid x)\, dP(x). \]
And for the second term note that
\begin{align*}
E\Big[ \mathrm{var}\big\{ \hat{\eta}_a(y\mid X) \mid D_n \big\} \Big] &\le E \int \hat{\eta}_a(y\mid x)^2\, dP(x) = \int \left( v(y\mid x) + \Big[ E\{\hat{\eta}_a(y\mid x)\} \Big]^2 \right) dP(x) \\
&= \int \left( v(y\mid x) + \Big[ E\big\{ \hat{\eta}_a(y\mid x) - \eta_a(y\mid x) \big\} + \eta_a(y\mid x) \Big]^2 \right) dP(x) \\
&\le \int \Big\{ v(y\mid x) + 2\, b(y\mid x)^2 + 2\, \eta_a(y\mid x)^2 \Big\}\, dP(x).
\end{align*}
Therefore, as long as \( \int \eta_a(y\mid x)^2\, dP(x) \le C \), we have
\[ E\Big[ \big\{ \hat{p}_a(y) - p_a(y) \big\}^2 \Big] \le \left( 1 + \frac{2}{n} \right) \int E\Big[ \big\{ \hat{\eta}_a(y\mid x) - \eta_a(y\mid x) \big\}^2 \Big]\, dP(x) + \frac{2C}{n}. \]

B.6 Proof of Theorem 3

First we present a master lemma giving the rate of convergence of the solution to a sample-split estimating equation. The logic parallels that of Theorem 5.31 of van der Vaart [2000].
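As a concrete numerical instance of the sample-split estimating-equation setup handled by the lemma below, the following Python sketch solves \( \mathbb{P}_n\{\varphi(Z;\theta,\hat{\eta})\} = 0 \) for the familiar doubly robust (AIPW) mean functional, with nuisances estimated on held-out folds. This is our own illustration, not code from the paper: the data-generating process, the linear nuisance fits, and the function name `crossfit_moment_solve` are all hypothetical assumptions.

```python
import numpy as np

def crossfit_moment_solve(x, a, y, n_folds=2):
    """Sketch: solve P_n{phi(Z; theta, eta-hat)} = 0 for the AIPW mean
    theta = E[Y^1], with nuisances (pi, mu) fit on separate folds."""
    n = len(y)
    folds = np.array_split(np.arange(n), n_folds)
    pseudo = np.zeros(n)
    for k in range(n_folds):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        dtr = np.column_stack([np.ones(len(train)), x[train]])
        dte = np.column_stack([np.ones(len(test)), x[test]])
        # propensity pi(x) via a linear probability fit (illustrative only)
        pihat = np.clip(dte @ np.linalg.lstsq(dtr, a[train], rcond=None)[0], 0.05, 0.95)
        # outcome regression mu(x) = E[Y | A=1, X=x] via least squares on treated units
        tr1 = train[a[train] == 1]
        d1 = np.column_stack([np.ones(len(tr1)), x[tr1]])
        muhat = dte @ np.linalg.lstsq(d1, y[tr1], rcond=None)[0]
        # phi(Z; theta, eta) = A/pi(X)*(Y - mu(X)) + mu(X) - theta, so the
        # solution of P_n{phi} = 0 is the mean of the AIPW pseudo-outcomes
        pseudo[test] = a[test] / pihat * (y[test] - muhat) + muhat
    theta_hat = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(n)
    return theta_hat, se

# toy simulation where the truth is E[Y^1] = E[X] + 1 = 1.5
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 1, n)
a = rng.binomial(1, 0.3 + 0.4 * x)
y = x + a + rng.normal(0, 0.5, n)
theta_hat, se = crossfit_moment_solve(x, a, y)
```

Here the estimating equation has a closed-form solution; for the projection parameters in the main text the same sample-split structure applies, but the equation must be solved numerically.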
Lemma 3.
Let \( \varphi(z; \theta, \eta) \) denote a vector estimating function for a target parameter \( \theta \in \mathbb{R}^p \) and nuisance functions \( \eta \in \mathcal{H} \), for some function space \( \mathcal{H} \). Suppose the true values \( (\theta_0, \eta_0) \) satisfy \( P\{\varphi(Z; \theta_0, \eta_0)\} = 0 \), and define the estimator \( \hat{\theta} \) as an approximate solution to the estimating equation, satisfying
\[ \mathbb{P}_n\{\varphi(Z; \hat{\theta}, \hat{\eta})\} = o_P(1/\sqrt{n}), \]
where \( \hat{\eta} \) is estimated on a separate independent sample. Assume:
1. The function class \( \{\varphi(z; \theta, \eta) : \theta \in \mathbb{R}^p\} \) is Donsker in θ for any fixed η.
2. The estimators are consistent, i.e., \( \hat{\theta} - \theta_0 = o_P(1) \) and \( \|\hat{\eta} - \eta_0\| = o_P(1) \).
3. The map \( \theta \mapsto P\{\varphi(Z; \theta, \eta)\} \) is differentiable at \( \theta_0 \) uniformly in η, with nonsingular derivative matrix \( \frac{\partial}{\partial\theta} P\{\varphi(Z; \theta, \eta)\}\big|_{\theta=\theta_0} = V(\theta_0, \eta) \), where \( V(\theta_0, \hat{\eta}) \overset{p}{\to} V(\theta_0, \eta_0) \).
Then
\[ \hat{\theta} - \theta_0 = -V(\theta_0, \eta_0)^{-1} (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + O_P(R_n) + o_P\left(\frac{1}{\sqrt{n}}\right) \]
for \( R_n = P\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\} \).

Proof. First note that, since \( (\hat{\theta}, \hat{\eta}) \) and \( (\theta_0, \eta_0) \) are approximate and exact solutions of the empirical and population moment conditions, respectively, we have
\begin{align*}
o_P(1/\sqrt{n}) &= \mathbb{P}_n\{\varphi(Z; \hat{\theta}, \hat{\eta})\} - P\{\varphi(Z; \theta_0, \eta_0)\} \\
&= (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + (\mathbb{P}_n - P)\{\varphi(Z; \hat{\theta}, \hat{\eta}) - \varphi(Z; \theta_0, \hat{\eta})\} \qquad (29) \\
&\quad + (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\} \qquad (30) \\
&\quad + P\{\varphi(Z; \hat{\theta}, \hat{\eta}) - \varphi(Z; \theta_0, \hat{\eta})\} + P\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\}, \qquad (31)
\end{align*}
where the second equality follows by simply adding and subtracting terms. The first term in (29) is a simple sample average of a fixed function and so will be asymptotically Gaussian by the central limit theorem. The second term in (29) and the term in (30) are empirical process terms.
The first term in (31) will be linearized in \( (\hat{\theta} - \theta_0) \), while the second term in (31) captures the effect of the nuisance estimation error. We will tackle each of these in turn. Under the Donsker and consistency conditions for \( \hat{\theta} \) in Assumptions 1 and 2, the second term in (29) is \( o_P(1/\sqrt{n}) \) by Lemma 19.24 of van der Vaart [2000]. Under the consistency of \( \hat{\eta} \) in Assumption 2 and the sample splitting, the term in (30) is \( o_P(1/\sqrt{n}) \) by Lemma 2 of Kennedy et al. [2020].

By the differentiability of the map \( \theta \mapsto P\{\varphi(Z; \theta, \eta)\} \) in Assumption 3, the first term in (31) can be expressed as
\[ P\{\varphi(Z; \hat{\theta}, \hat{\eta}) - \varphi(Z; \theta_0, \hat{\eta})\} = V(\theta_0, \hat{\eta})(\hat{\theta} - \theta_0) + o_P(\|\hat{\theta} - \theta_0\|) = V(\theta_0, \eta_0)(\hat{\theta} - \theta_0) + o_P(\|\hat{\theta} - \theta_0\|), \]
where the last line follows by the consistency of \( V(\theta_0, \hat{\eta}) \) in Assumption 3. Therefore we have
\[ o_P(1/\sqrt{n}) = (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + V(\theta_0, \eta_0)(\hat{\theta} - \theta_0) + R_n + o_P(\|\hat{\theta} - \theta_0\|), \]
where we let \( R_n = P\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\} \) denote the second term in (31), or equivalently
\[ \hat{\theta} - \theta_0 = -V(\theta_0, \eta_0)^{-1} (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + O_P(R_n) + o_P(\|\hat{\theta} - \theta_0\|) + o_P(1/\sqrt{n}), \]
by the nonsingularity of the derivative matrix in Assumption 3. This implies
\[ \|\hat{\theta} - \theta_0\| \, (1 + o_P(1)) = O_P(1/\sqrt{n} + R_n), \]
so that \( \|\hat{\theta} - \theta_0\| = O_P(1/\sqrt{n} + R_n) \), which gives the result after noting that \( o_P(O_P(1/\sqrt{n} + R_n)) = o_P(1/\sqrt{n} + R_n) \) and that \( O_P(R_n) + o_P(R_n) = O_P(R_n) \).

Now we can apply Lemma 3 to prove Theorem 3.
First note that by definition the estimator satisfies
\[ \mathbb{P}_n\{\phi(Z; \hat{\beta}, \hat{\eta})\} = o_P(1/\sqrt{n}) \qquad \text{for} \qquad \phi(Z; \beta, \eta) = m(\beta; \eta) + \varphi_m(Z; \beta, \eta), \]
and the true values \( (\beta_0, \eta_0) \) satisfy \( P\{\phi(Z; \beta_0, \eta_0)\} = 0 \), again by definition. Conditions 1–3 of Lemma 3 hold by Assumptions 1–4 of Theorem 3, so the result follows by virtue of the fact that
\begin{align*}
R_n &\equiv P\{\phi(Z; \beta_0, \hat{\eta}) - \phi(Z; \beta_0, \eta_0)\} = m(\beta_0; \hat{\eta}) + P\{\varphi_m(Z; \beta_0, \hat{\eta})\} \\
&= P \int h'(\hat{p}_a(y)) \left\{ \frac{\pi_a(X)}{\hat{\pi}_a(X)} - 1 \right\} \left\{ \eta_a(y\mid X) - \hat{\eta}_a(y\mid X) \right\} dy - \frac{1}{2} \int h''(\hat{p}_a^*(y)) \left\{ \hat{p}_a(y) - p_a(y) \right\}^2 dy \\
&\lesssim \|\hat{\pi} - \pi\| \, \|\hat{\eta}_a - \eta_a\| + \delta \, \|\hat{p}_a - p_a\|^2,
\end{align*}
where the second line follows from the result given in equation (27) of the proof of Theorem 1, for the vectors h and h' as defined in (25), with \( \hat{p}_a^* \) lying between \( \hat{p}_a \) and \( p_a \), and the last line follows by the Cauchy-Schwarz inequality and boundedness assumptions on h', h'', and \( 1/\hat{\pi}_a \).
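To illustrate how the pieces above fit together numerically, the following Python sketch implements a cross-fit one-step (influence-function-based) estimator of the L2 density distance \( \psi = \int \{p_1(y) - p_0(y)\}^2 dy \) in the simplest randomized setting with known \( \pi \). For \( f(p,q) = (p-q)^2/q \), the functions \( h_1' \) and \( h_0' \) from the proof of Theorem 2 reduce to \( 2(p_1 - p_0) \) and \( -2(p_1 - p_0) \), which is what the correction term uses. This is our own minimal sketch under simplifying assumptions (no confounders, Gaussian kernel density estimates, hypothetical function names), not the implementation in the npcausal package.

```python
import numpy as np

def gauss_kde(train, x, h):
    """Gaussian kernel density estimate at points x, bandwidth h."""
    z = (x[:, None] - train[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))

def l2_distance_onestep(y, a, pi=0.5, n_folds=2):
    """Sketch: cross-fit one-step estimator of psi = int (p1 - p0)^2 dy,
    assuming a randomized treatment with known P(A=1) = pi (no confounders)."""
    n = len(y)
    grid = np.linspace(y.min() - 0.5, y.max() + 0.5, 2001)
    step = grid[1] - grid[0]
    folds = np.array_split(np.arange(n), n_folds)
    contrib = np.zeros(n)
    for k in range(n_folds):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        y1, y0 = y[train][a[train] == 1], y[train][a[train] == 0]
        h1 = y1.std() * len(y1) ** -0.2   # Scott's rule bandwidths
        h0 = y0.std() * len(y0) ** -0.2
        p1g, p0g = gauss_kde(y1, grid, h1), gauss_kde(y0, grid, h0)
        dg = p1g - p0g
        plug = (dg ** 2).sum() * step      # plug-in int (d-hat)^2
        c1 = (dg * p1g).sum() * step       # int d-hat * p1-hat
        c0 = (dg * p0g).sum() * step       # int d-hat * p0-hat
        d_test = gauss_kde(y1, y[test], h1) - gauss_kde(y0, y[test], h0)
        at = a[test]
        # influence-function correction: h1' = 2*d-hat, h0' = -2*d-hat
        corr = 2 * (at / pi) * (d_test - c1) - 2 * ((1 - at) / (1 - pi)) * (d_test - c0)
        contrib[test] = plug + corr
    psi_hat = contrib.mean()
    se = contrib.std(ddof=1) / np.sqrt(n)
    return psi_hat, se

# toy check: arms N(0.6, 0.1) vs N(0.4, 0.1), for which
# psi = (1 - exp(-1)) / (0.1 * sqrt(pi)), roughly 3.57
rng = np.random.default_rng(0)
n = 4000
a = rng.binomial(1, 0.5, n)
y = rng.normal(0.4 + 0.2 * a, 0.1)
psi_hat, se = l2_distance_onestep(y, a)
```

The correction reduces the plug-in bias to second order (here \( -\int (d - \hat{d})^2 \), writing \( d = p_1 - p_0 \)), mirroring the doubly robust remainders derived above, and a Wald-style 95% interval is psi_hat ± 1.96·se.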