Semiparametric Counterfactual Density Estimation
Edward H. Kennedy, Sivaraman Balakrishnan, Larry Wasserman
Department of Statistics & Data Science, Carnegie Mellon University
{edward, siva, larry}@stat.cmu.edu
Abstract
Causal effects are often characterized with averages, which can give an incomplete picture of the underlying counterfactual distributions. Here we consider estimating the entire counterfactual density and generic functionals thereof. We focus on two kinds of target parameters. The first is a density approximation, defined by a projection onto a finite-dimensional model using a generalized distance metric, which includes f-divergences as well as L_p norms. The second is the distance between counterfactual densities, which can be used as a more nuanced effect measure than the mean difference, and as a tool for model selection. We study nonparametric efficiency bounds for these targets, giving results for smooth but otherwise generic models and distances. Importantly, we show how these bounds connect to means of particular non-trivial functions of counterfactuals, linking the problems of density and mean estimation. We go on to propose doubly robust-style estimators for the density approximations and distances, and study their rates of convergence, showing they can be optimally efficient in large nonparametric models. We also give analogous methods for model selection and aggregation, when many models may be available and of interest. Our results all hold for generic models and distances, but throughout we highlight what happens for particular choices, such as L_2 projections on linear models, and KL projections on exponential families. Finally we illustrate by estimating the density of CD4 count among patients with HIV, had all been treated with combination therapy versus zidovudine alone, as well as a density effect. Our results suggest combination therapy may have increased CD4 count most for high-risk patients. Our methods are implemented in the freely available R package npcausal on GitHub.

Keywords: causal inference, density estimation, influence function, model misspecification, semiparametric theory.
1 Introduction

It is very common in causal inference to quantify causal effects with means. The classic average treatment effect (ATE) parameter, for instance, measures the difference in mean outcome had all versus none in a population been treated. This can certainly be a useful summary, but it can also miss potentially important differences in the distributions of the counterfactual outcomes, beyond a simple mean shift. To illustrate, consider the densities in Figure 1, which all have exactly the same mean and variance. These would be indistinguishable with the ATE, or any other measure that did not look past the first two moments.

Figure 1: Densities for six distributions, all with mean zero and variance four.
In general, it can be very practically useful to know the shape of the counterfactual density. If the counterfactual density differed at all under treatment versus control (or under any other generic interventions), this would imply treatment had some effect, even if the ATE were zero. The presence of skew would indicate that some subjects have relatively extreme responses to treatment; next steps could include trying to understand who these subjects are, and why their responses are unusual. Similarly, multimodal structure could point to the existence of underlying subgroups with differential responses to treatment, which could be important for optimizing treatment policies. Contrasting the shape of the density under different interventions could inform hypotheses about how treatment works, e.g., perhaps it works by reducing variance, or driving up negative outcomes. This could help enhance future versions of treatment, or motivate the development of new treatments altogether.

There is a large literature on distributional treatment effects defined in terms of quantiles, or cumulative distribution functions (CDFs), with a similar goal of moving beyond simple mean summaries to study the entire counterfactual distribution [Abadie, 2002, Chernozhukov and Hansen, 2005, Chernozhukov et al., 2013, Díaz, 2017, Firpo, 2007, Fortin et al., 2011, Frölich and Melly, 2013, Machado and Mata, 2005, Melly, 2005, Rothe, 2010, Wang et al., 2018, Wang and Qin, 2010, Zhang et al., 2012]. However, the challenges and methods are substantially different for density estimation. This is largely a result of the fact that the CDF at y is the mean of the thresholded outcome 1(Y ≤ y), so that counterfactual CDF estimation mostly reduces to counterfactual mean estimation, after replacing the outcome with an indicator.
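To make this reduction concrete, here is a minimal simulation sketch. The data-generating process, the logistic propensity score, and the classical inverse-probability-weighted (IPW) estimator below are our own illustrative assumptions, not constructs taken from the paper:

```python
import math
import random

random.seed(0)

# Illustrative data (our assumption): confounder X, binary treatment A, outcome Y.
n = 20000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    pi1 = 1 / (1 + math.exp(-x))           # propensity score P(A = 1 | X = x)
    a = 1 if random.random() < pi1 else 0
    y = x + a + random.gauss(0, 1)         # so Y^1 ~ N(1, sd = sqrt(2))
    data.append((x, a, y))

def ipw_cdf(y0, a=1):
    """IPW estimate of the counterfactual CDF P(Y^a <= y0), i.e., the
    counterfactual mean of the indicator 1(Y <= y0)."""
    pi = lambda x: 1 / (1 + math.exp(-x)) if a == 1 else 1 - 1 / (1 + math.exp(-x))
    return sum((ai == a) * (yi <= y0) / pi(x) for x, ai, yi in data) / n

# Here Y^1 ~ N(1, sqrt(2)), so the true counterfactual CDF at y0 = 1 is 0.5.
print(round(ipw_cdf(1.0), 2))
```

Replacing the indicator 1(Y ≤ y0) with Y itself would recover the usual IPW estimate of the counterfactual mean, which is exactly the sense in which CDF estimation reduces to mean estimation.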
A related difference is that the CDF is pathwise differentiable in a nonparametric model, whereas the density function is not [Bickel et al., 1993, van der Laan and Robins, 2003]. This is also true in the standard observational setup, where CDFs can be estimated at n^{-1/2} rates with sample averages, while density estimation requires more careful balancing of bias and variance, with slower rates arising depending on underlying smoothness [Tsybakov, 2009, Wasserman, 2006]. Beyond this issue of statistical complexity, there are other trade-offs in targeting CDFs versus densities. One is that, although CDFs are easier to estimate nonparametrically, densities are arguably more visually appealing and interpretable to practitioners. We view CDFs and densities as complementary pieces of the distributional puzzle.

Unlike distribution function estimation, the literature on counterfactual density estimation appears much more sparse. In what was perhaps the first study of the problem, DiNardo et al. [1996] used a reweighted kernel estimator to estimate effects of US labor market factors on wages. However, the statistical properties of this proposed approach were not examined. Robins and Rotnitzky [2001] proposed a doubly robust version of the reweighted kernel estimator, and conjectured it would achieve usual density estimation rates under smoothness and other conditions. van der Laan and Dudoit [2003] and Rubin and van der Laan [2006] studied general cross-validation-based approaches for model selection in the presence of nuisance functions, and suggested minimizing counterfactual KL or L_2 loss for density estimation, but did not detail the statistical properties. More recently, Westling and Carone [2020] tackled the related problem of density estimation for right-censored outcomes, proposing new estimators that can attain n^{-1/3} rates, but under an assumption that the density is monotone.
Kim et al. [2018] analyzed a version of the doubly robust estimator from Robins and Rotnitzky [2001], showing its conjectured oracle properties, and used it to estimate the (nonsmooth) L_1 distance between counterfactual densities.

Somewhat surprisingly, none of the work above on counterfactual density estimation considers a semiparametric approach, where the density is approximated with a finite-dimensional model. Our work aims to fill this gap in the literature, while also providing data-driven model selection and aggregation tools. A separate contribution is our study of generic density-based effects, which characterize the distance between counterfactual densities, using a generalized notion of distance that includes f-divergences as well as L_p norms.

The structure of our paper is as follows. After introducing some basics and causal assumptions in Section 2, in Section 3 we detail the different kinds of target parameters we consider. The first (described in Section 3.1) is an approximation of the density itself, defined by a projection onto a finite-dimensional model (3.1.1) using a generalized distance metric (3.1.2), which includes f-divergences as well as L_p norms. Importantly, we show in Section 3.1.3 that projection parameters for smooth models and distances can be framed as solutions to moment conditions, providing a link between counterfactual densities and means (of functions of counterfactuals). The second parameter we consider (described in Section 3.2) is the distance between counterfactual densities, which can be used as a new more nuanced effect measure, or as a tool for model selection (as in Section 3.3). In Section 4 we study nonparametric efficiency bounds, by characterizing the efficient influence functions of approximated density functions in Section 4.1, and density effects in Section 4.2.
These follow from a master lemma in Section 4, which gives a von Mises expansion for generic integral functionals of the counterfactual density, and so may be of independent interest. In Section 5 we propose doubly robust-style estimators for the density approximations and distances, and study their rates of convergence, showing for example that they can be n^{-1/2} consistent, asymptotically normal, and optimally efficient under weak high-level conditions on nuisance estimation error. All our results hold for smooth but otherwise arbitrary models and distances. However, in various corollaries, we also highlight specific expressions for typical choices of models and distances, such as L_2 projections on linear models, and KL projections on exponential families. Finally in Section 6 we use our proposed methods to estimate counterfactual densities and density effects of combination therapy (versus zidovudine alone) on CD4 count, among patients with HIV. Our results show treatment effects beyond a mean shift, suggesting that combination therapy may have increased CD4 count most for high-risk patients.

2 Setup
We assume access to an iid sample (Z_1, ..., Z_n) of Z = (X, A, Y) ∼ P where X ∈ R^d are covariates, A ∈ R is a treatment or exposure, and Y ∈ R is a continuous outcome. We let

π_a(x) = P(A = a | X = x)   (1)

∫_B η_a(y | x) dy = P(Y ∈ B | X = x, A = a) for measurable B   (2)

denote the propensity score (i.e., chance of being treated at level A = a given covariates) and conditional outcome density, respectively. In this work we focus on discrete treatments, but in a companion paper we consider the continuous case. We study “semiparametric” estimation of the covariate-adjusted marginal density

p_a(y) = ∫ η_a(y | x) dP(x)   (3)

i.e., the conditional outcome density averaged over the covariates, as well as functionals thereof.

Remark 1. Although we refer to our work in this paper as semiparametric, in reality it is all done within a fully nonparametric model. As described in more detail starting in Section 3.1, the models we consider are only ever used as tools for defining nonparametric approximations, and corresponding projection parameters, and are never assumed to be correct descriptions of the underlying true data-generating process. Further, our results on estimating counterfactual density functionals (e.g., Sections 3.2 and 4.2) do not require any approximating models, and so are nonparametric in the usual sense.

We note that the density (3) is different from the marginal density of Y, since the treatment is fixed at A = a in the conditioning; it is also not equal to the unadjusted conditional density p(y | a). Instead, (3) is the density p(y^a) of the counterfactual variable Y^a (i.e., the outcome that would have been observed if treatment were set to A = a), if the following assumptions hold:

Assumption 1 (Positivity). P{π_a(X) ≥ ε} = 1 for some ε > 0.

Assumption 2 (Consistency). Y = Y^a if A = a.

Assumption 3 (Exchangeability). A ⊥⊥ Y^a | X.

Positivity ensures all subjects have some chance at receiving treatment level A = a.
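As a purely illustrative sketch of the adjusted density (3): if the conditional outcome density η_a(y | x) were a known Gaussian (our own assumption here, chosen only so the answer is checkable), then p_a(y) is obtained by averaging it over draws from the covariate distribution:

```python
import math
import random

random.seed(1)

# Assumed (illustrative) model: given X = x and A = a, Y ~ N(x + a, 1), so
# eta_a(y | x) is a Gaussian density and (3) averages it over the covariates.
xs = [random.gauss(0, 1) for _ in range(5000)]  # draws from P(x)

def normal_pdf(y, mu, sd=1.0):
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def p_a(y, a):
    """Plug-in for p_a(y) = ∫ eta_a(y | x) dP(x): the conditional outcome
    density averaged over an (empirical) covariate distribution."""
    return sum(normal_pdf(y, x + a) for x in xs) / len(xs)

# Here p_1 approximates the N(1, sqrt(2)) density; as a sanity check it
# should integrate to roughly one over a wide grid.
grid = [-6 + 0.05 * i for i in range(280)]
print(round(sum(p_a(y, 1) for y in grid) * 0.05, 2))
```

In practice η_a and P(x) are of course unknown; the point of the sketch is only that (3) is a mixture of conditional densities, not the marginal density of Y.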
Consistency can be viewed as ruling out interference, for example, where a subject's counterfactual can depend not only on how they were treated, but how other subjects were treated as well. Exchangeability says the treatment is as good as randomized within levels of the observed covariates, and requires that sufficiently many relevant confounders are collected. Each of these assumptions can be weakened in various ways, at the expense of losing point identification of the marginal counterfactual distribution. Nonetheless, under only the positivity assumption, all our statistical results will hold relative to the observational quantity in (3), regardless of whether the causal Assumptions 2-3 are violated or not.

3 Target Parameters
In this section we detail the two kinds of quantities we consider estimating. The first is an approximation of the counterfactual density itself, defined via a projection in some distributional distance. The second is a distance measure, e.g., a density-based causal effect measuring the difference between counterfactual densities in terms of general f- or other divergences. The latter gives a more nuanced picture of how the counterfactual densities differ, compared to the usual ATE, for example. Finally in Section 3.3 we describe how these two kinds of target quantities can be adapted for the purposes of model selection and aggregation.

3.1 Density Approximations

First we consider approximations of the counterfactual density p_a(y) based on some specified model {g(y; β) : β ∈ R^d}. We mostly focus on the finite-dimensional parametric case with β ∈ R^d, but more generally one could take β to be infinite-dimensional in some L_p space, or to belong to a subset of R^d such as the standard simplex. Note that β(a) depends on a but for now we suppress this dependence in the notation and simply write β. Here are some examples.

Example 1a (Exponential family). Let b(y) = {b_1(y), ..., b_d(y)}^T denote a vector of known basis functions. Then we can project onto the exponential family

g(y; β) = exp{β^T b(y) − C(β)}   (4)

where C(β) = log ∫ exp{β^T b(y)} dy so that ∫ g(y; β) dy = 1. Typical exponential family notation fixes one additional coefficient at 1 and sets the corresponding basis function to log h(y) for some known base measure h.

Although we refer to Example 1a as an exponential family, it can just as well be viewed as a truncated series expansion used together with a log link function. In the next example we consider a truncated series with an identity link function.

Example 1b (Truncated series). Let b(y) = {b_1(y), ..., b_d(y)}^T denote a vector of known basis functions, and q(y) a known base density (e.g., uniform).
Then we can project onto the linear basis expansion

g(y; β) = q(y) + Σ_{j=1}^d β_j b_j(y)

where we can take ∫ b_j(y) dy = 0 so that the projection integrates to one. A natural choice when Y ∈ [0, 1] would be to take q(y) = 1 and b(y) the cosine basis

b_j(y) = √2 cos(πjy)   (5)

which satisfies ∫ b_j(y) dy = 0 and ∫ b_j(y) b_k(y) dy = 1(j = k) on the unit interval. One could alternatively take q(y) = 0 and let b_j(y) be (the linear span of) a collection of d candidate densities, in which case the above could be viewed as a linear aggregation [Rigollet and Tsybakov, 2007]. Another related option would be to use a linear approximation for the square root of the density, √g(y; β) = Σ_j β_j b_j(y), so that g(y; β) = Σ_j Σ_k β_j β_k b_j(y) b_k(y) [Chen et al., 2002, Pinheiro and Vidakovic, 1997]. Then the model would integrate to one if the basis functions were orthonormal (∫ b_j(y) b_k(y) dy = 0 for j ≠ k and ∫ b_j(y)^2 dy = 1) and Σ_j β_j^2 = 1.

Example 1c (Gaussian mixture model). Let (µ_1, ..., µ_k) denote a vector of means, (σ_1, ..., σ_k) a vector of positive standard deviations, (ϖ_1, ..., ϖ_k) positive mixing proportions with Σ_j ϖ_j = 1, and φ the standard normal density. Then the standard Gaussian mixture model is

g(y; β) = Σ_{j=1}^k ϖ_j (1/σ_j) φ((y − µ_j)/σ_j)

where β = {(ϖ_1, µ_1, σ_1), ..., (ϖ_k, µ_k, σ_k)}.

Now, based on the above approximations, a primary goal is to estimate the projection parameter

β = argmin_{β ∈ R^d} D_f(p_a(y), g(y; β))   (6)

where D_f is a distributional distance measure of the form

D_f(p, q) = ∫ f(p(y), q(y)) q(y) dy   (7)

for some given discrepancy function f : R^2 → R.

Remark 2.
In contrast to typical f-divergences [Ali and Silvey, 1966, Csiszár, 1967, Rényi et al., 1961, Sason and Verdú, 2016], we allow the function f to have two arguments, one for each distribution; this allows us to capture not only f-divergences but also other distances such as those based on L_p norms. (The usual f-divergence takes f(p, q) to be a function of the ratio p/q alone, and so only depends on the density ratio.) In a slight abuse of terminology we sometimes refer to (7) as a distance, even though for some of our choices of f it will be an asymmetric divergence not satisfying the triangle inequality.

Before giving examples of distances, we first discuss the interpretation of our projection parameter (6). Mathematically, β is the parameter of the best-fitting model of the form g(y; β), i.e., the parameter value that makes g(y; β) closest (in corresponding distance) to the true density p_a(y). If the model g is correctly specified, then D_f(p_a(y), g(y; β)) = 0 and so g(y; β) = p_a(y) is simply the true counterfactual density; however, the projection (6) remains well-defined even under model misspecification. This is akin to the well-known concept of a best linear predictor in standard linear regression [White, 1980]. The projection approach, where a model is not assumed correct but instead only used for defining approximations, has been used widely throughout statistics [Beran, 1977, Buja et al., 2019a,b, Huber, 1967, Rakhlin et al., 2017, Rinaldo and Wasserman, 2010, Tsybakov, 2003, Wasserman, 2006, White, 1982, 1996] as well as in causal inference [Chernozhukov et al., 2018b, Cuellar and Kennedy, 2020, Kennedy et al., 2019, Neugebauer and van der Laan, 2007, Semenova and Chernozhukov, 2020, van der Laan, 2006], though not in the counterfactual density estimation context.

Remark 3.
Since we only use models as tools to define approximations, all our results are formally nonparametric, as mentioned in Remark 1 and illustrated in subsequent theorems. This raises some interesting philosophical issues about the role of assumptions and corresponding bias-variance trade-offs. In particular, we can imagine a rough taxonomy of stances one might take in estimation problems like this one:

(i) model-ist: My finite-dimensional/parametric representation is the correct one.

(ii) model-agnostic: I may use a finite-dimensional model, but I do not know or require that it is a perfectly accurate picture of the truth.

(iii) anti-model-ist: No parametric model I can imagine contains the truth, and I do not care about approximations.

The model-agnostic view is often captured by the famous quotes "All models are wrong but some are useful" (George Box) and "Use models but don't believe them" (possibly due to John Tukey). Of course, in practice, how much one relies on models is a continuum, and so any particular approach may not fall entirely in one of the three camps above. Similarly, our taxonomy uses parametric models as a benchmark, but one could just as well replace them with a different assumption set (e.g., Hölder smoothness above versus below some index s). Nevertheless we find the above framing useful if imperfect. In this paper, we mostly take the stance of the model-agnostic, though we flirt with anti-model-ism in the data-driven model selection approaches of Sections 3.3 and 5.3 (and we are fully anti-model-ist in a companion paper). We also accept that each approach has advantages and disadvantages. The model-ist will do well when the model is correct, but could unknowingly suffer large bias otherwise.
The anti-model-ist is most free from the constraints of human imagination (as they do not need to posit a parametric model), but with a more ambitious target can also suffer larger errors. The model-agnostic has a bit of the best of both worlds: when the model is correct, they may hope to do nearly as well as the model-ist, and when the model is wrong, their inference can still be valid for a still well-defined approximation. Of course, if the model is very wrong, the approximation may not be practically useful, no matter how well-defined it is; thus there can be important challenges in defining a useful approximating model and distance.

Now we give some examples of the distances we focus on in this paper:
Example 2a (L_2). If f(p, q) = (p − q)^2/q then D_f(p, q) = ‖p − q‖^2 is the squared L_2 distance

‖p_a(y) − g(y; β)‖^2 = ∫ (p_a(y) − g(y; β))^2 dy.

Example 2b (Kullback-Leibler). If f(p, q) = (p/q) log(p/q) then D_f(p, q) = KL(p, q) is the Kullback-Leibler divergence

KL(p_a(y), g(y; β)) = ∫ log(p_a(y)/g(y; β)) p_a(y) dy.

Example 2c (χ^2). If f(p, q) = (p/q − 1)^2 then D_f(p, q) = χ^2(p, q) is the χ^2 divergence

χ^2(p_a(y), g(y; β)) = ∫ {p_a(y) − g(y; β)}^2 / g(y; β) dy.

Example 2d (Hellinger). If f(p, q) = (√(p/q) − 1)^2 then D_f(p, q) = H^2(p, q) is the squared Hellinger divergence

H^2(p_a(y), g(y; β)) = ∫ (√p_a(y) − √g(y; β))^2 dy.

Example 2e (Smoothed Total Variation). If f(p, q) = |p − q|/(2q) = {(p − q)/(2q)} sgn(p − q) then D_f(p, q) = TV(p, q) = ½‖p − q‖_1 is the total variation distance (and half the L_1 distance). Note f(p, q) is not differentiable at p/q = 1. Smooth versions can be obtained by approximating the absolute value or sign functions in f. For example, let ν_t(y) be an approximation of the absolute value function |y|, with parameter t controlling the approximation error. For example one could use ν_t(y) = y tanh(ty) or ν_t(y) = y erf(ty) or a best polynomial approximation of degree t. Then taking f(p, q) = f_t(p, q) = ν_t(p − q)/(2q) gives a smoothed total variation D_f(p, q) = TV*(p, q) with

TV*(p_a(y), g(y; β)) = ½ ∫ ν_t{p_a(y) − g(y; β)} dy.

There exist polynomial and rational approximations ν_t(y) of degree t ensuring that |TV(p, q) − TV*(p, q)| is of order t^{-1} and exp(−√t), respectively [Newman et al., 1964].
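The generalized distance (7) is straightforward to evaluate numerically for the discrepancy functions in Examples 2a-2d; the Riemann-sum integrator and the Gaussian test densities below are our own illustrative choices:

```python
import math

def D_f(f, p, q, lo=-10.0, hi=10.0, m=4000):
    """Midpoint-rule evaluation of the generalized distance (7),
    D_f(p, q) = ∫ f(p(y), q(y)) q(y) dy."""
    dy = (hi - lo) / m
    return sum(f(p(lo + (i + 0.5) * dy), q(lo + (i + 0.5) * dy))
               * q(lo + (i + 0.5) * dy) * dy for i in range(m))

f_l2 = lambda p, q: (p - q) ** 2 / q               # Example 2a: squared L2
f_kl = lambda p, q: (p / q) * math.log(p / q)      # Example 2b: Kullback-Leibler
f_chi2 = lambda p, q: (p / q - 1) ** 2             # Example 2c: chi-squared
f_hell = lambda p, q: (math.sqrt(p / q) - 1) ** 2  # Example 2d: squared Hellinger

def gauss(mu, sd=1.0):
    return lambda y: math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

p, q = gauss(1.0), gauss(0.0)
# For two unit-variance Gaussians one mean-shift apart, KL = 1/2 and chi2 = e - 1.
print(round(D_f(f_kl, p, q), 3), round(D_f(f_chi2, p, q), 3))
```

Comparing the numeric values against the Gaussian closed forms is a quick check that each f really induces the divergence claimed above.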
We also note that the Hellinger divergence is closely related to total variation in the sense that H^2(p, q)/2 ≤ TV(p, q) ≤ H(p, q) for any densities p, q.

Figure 2 shows a few projections of a true density onto a truncated trigonometric series with six terms, using four different distances (L_2, Kullback-Leibler, χ^2, and Hellinger). The projections are all very similar in both cases. However, we note that, as discussed for example in Beran [1977], Hellinger projections should be more stable and robust to outliers or contamination, compared to for example KL. The projections are closer to the true density for the first simpler Gaussian mixture, and are more of a rough approximation for the second more complex mixture.
Figure 2:
Projections of a truncated Gaussian mixture (left) and the Bart Simpson density (right) onto a trigonometric basis with six terms, using L_2 distance, along with Kullback-Leibler, χ^2, and Hellinger divergences.

3.1.3 Moment Conditions

The next proposition shows how, for smooth distances, the projection parameter β can be defined more explicitly than in equation (6), as a solution to a population moment condition, involving derivatives of the model g(y; β) and the function f. This links projection parameters to integral functionals of the counterfactual density (i.e., moments of transformations of counterfactuals), which is why our efficiency bounds and estimators in the next section resemble those for means of particular non-trivial functions of counterfactuals.

Proposition 1.
Assume g is differentiable in β, f is differentiable in its second argument with derivative f′(q_1, q_2) = (∂/∂q_2) f(q_1, q_2), and that the minimizer in (6) is unique. Then the projection parameter

β = argmin_{β ∈ R^d} D_f(p_a(y), g(y; β))

can be expressed as a solution to the moment condition m(β) = 0, where

m(β) ≡ ∫ (∂g(y; β)/∂β) {f(p_a(y), g(y; β)) + g(y; β) f′(p_a(y), g(y; β))} dy.   (8)

The proof of Proposition 1 follows from the chain rule; all subsequent proofs are given in Appendix B. Throughout we assume there is a unique solution to m(β) = 0. Next we show how the moment condition defining β simplifies for particular distances.

Corollary 1.
The quantity f(p_a(y), g(y; β)) + g(y; β) f′(p_a(y), g(y; β)) in the integrand of the moment (8) equals

2{g(y; β) − p_a(y)}   if D_f = L_2
−p_a(y)/g(y; β)   if D_f = KL
1 − {p_a(y)/g(y; β)}^2   if D_f = χ^2
1 − √(p_a(y)/g(y; β))   if D_f = H^2
−ν′_t{p_a(y) − g(y; β)}/2   if D_f = TV*.

Corollary 1 shows how the moment m(β) essentially reduces to functionals of the counterfactual density for particular distances: simple means for L_2 and KL, a quadratic functional for χ^2, and a square root functional for H^2. For the smoothed TV distance, it depends on the form of the absolute value approximation (e.g., for ν_t a degree-t polynomial approximation, the moment m(β) would be an integral of a degree-(t − 1) polynomial in the counterfactual density).

In the following corollaries we show how the form of the moment condition is particularly straightforward when based on L_2 or KL divergence with series models and exponential families, respectively.

Corollary 2. If D_f = L_2 then

m(β) = 2 ∫ (∂g(y; β)/∂β) {g(y; β) − p_a(y)} dy.

Therefore if the support of Y is [0, 1], and g(y; β) = 1 + β^T b(y) is the truncated series in Example 1b then

β = E{b(Y^a)}   (9)

when b(·) is an orthogonal series with ∫ b_j(y) dy = 0 and ∫ b_j(y) b_k(y) dy = 1(j = k).

Corollary 2 shows that when using orthogonal series with L_2 projections, there is a closed form for β, given by a simple mean of a known function of the counterfactual outcome. Estimation and inference for parameters like (9) is relatively well-understood [Robins et al., 2009, 2017], which allows exploiting existing theory and methods in the density estimation context.

Corollary 3.
If D_f = KL then

m(β) = −E{(∂/∂β) log g(Y^a; β)},

and so if g(y; β) = exp{β^T b(y) − C(β)} is the exponential family in Example 1a then

m(β) = (∂/∂β) C(β) − E{b(Y^a)} = ∫ b(y) [exp{β^T b(y) − C(β)} − p_a(y)] dy.   (10)

Similarly, for KL divergence, the moment m(β) is simply the expected score under counterfactual density g(y; β). Therefore, just as in the non-counterfactual setting, the parameter values that maximize a posited likelihood are also those that minimize KL divergence [Huber, 1967, White, 1982]. When one also uses an exponential family, the solution to m(β) = 0 corresponds to an intuitive "moment matching", i.e., finding the value of β that equates expectations of b(·) under g to those under the distribution of Y^a.

3.2 Density Effects

In addition to estimating projections of the counterfactual density onto a finite-dimensional model, in this section we also consider estimation of distributional distances themselves. The main focus is on density-based effects measuring the distance between counterfactual densities in terms of L_p^p distances and f-divergences. These effects can detect more nuanced distinctions between the distributions of Y^1 and Y^0, beyond simple differences-in-means captured by standard average treatment effects.

More specifically, we consider the distance between p_1 and p_0 given by

ψ_f = D_f(p_1(y), p_0(y)) = ∫ f(p_1(y), p_0(y)) p_0(y) dy   (11)

for discrepancy functions f as discussed in the previous subsection. In this setup we do not require approximating the densities p_a(y) with finite-dimensional models, and instead consider estimating ψ_f in a fully nonparametric model.

3.3 Model Selection & Aggregation

In practice one may not have an approximating model such as (4) available a priori. In these cases it would be natural to instead set up a sequence of models, and use the data to choose among them.
In standard regression and density estimation problems, simple cross-validation procedures are available for this task; however, because our goal is estimation of a more nuanced counterfactual density, these require some refinement, in the same spirit as van der Laan and Dudoit [2003]. Thus in this section we describe how the target quantities of Sections 3.1 and 3.2 can be adapted for the purposes of model selection and aggregation.

Specifically, for a set of estimators {ĝ_k(y) : k = 1, ..., K} of p_a(y) (e.g., estimated from some initial training sample, with each projected onto the space of valid densities), we can define the risk for a given estimator as

R(ĝ_k) = D_f(p_a(y), ĝ_k(y)).   (12)

The minimum risk oracle estimator ĝ_{k*}(y) can then be defined via

k* = argmin_k R(ĝ_k) = argmin_k D_f(p_a(y), ĝ_k(y)).   (13)

A model aggregation oracle can be defined more generally as g̃(y) = Σ_k β_k ĝ_k(y) where

β = argmin_{β ∈ B} D_f(p_a(y), Σ_{k=1}^K β_k ĝ_k(y))   (14)

for some appropriate selection set, e.g., the standard simplex B = {(β_1, ..., β_K) ∈ R^K : β_k ≥ 0, Σ_k β_k = 1} for convex aggregation [Rigollet and Tsybakov, 2007, Tsybakov, 2003]. If one takes B = R^K for linear aggregation, then f-divergences may not be well-defined, so this might naturally only be used in the D_f = L_2 setting.

Note that the proposed target parameters in Section 3.1 correspond to the aggregation target in (14) if we replace R^d with the relevant space B.
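As a small numerical illustration of the convex aggregation oracle (14) with D_f the squared L_2 distance: with K = 2 candidates the simplex is one-dimensional, so the oracle weight can be found by grid search. The Gaussian "truth" and candidates below are our own toy choices, with p_a treated as known purely for illustration:

```python
import math

def gauss(mu, sd=1.0):
    return lambda y: math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

p_a = gauss(0.5)                    # toy "true" density, known here for illustration
g1, g2 = gauss(0.0), gauss(1.0)     # two candidate density estimators

def l2_risk(w, lo=-8.0, hi=9.0, m=2000):
    """Squared L2 distance between p_a and the aggregate w*g1 + (1 - w)*g2."""
    dy = (hi - lo) / m
    return sum((p_a(y) - (w * g1(y) + (1 - w) * g2(y))) ** 2 * dy
               for y in (lo + (i + 0.5) * dy for i in range(m)))

# Grid search over the simplex {(w, 1 - w) : 0 <= w <= 1}; by symmetry the
# oracle here should put equal weight on the two candidates.
w_star = min((i / 200 for i in range(201)), key=l2_risk)
print(round(w_star, 2))
```

Risk (12) plays the same role for model selection: computing l2_risk for each candidate separately and taking the argmin is exactly the oracle (13). In practice p_a is of course unknown, and the oracle risks must be replaced with estimates.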
However, since model selection as defined in Equation (13) does not satisfy the smoothness assumptions we relied on in Section 3.1.3, it can be useful in practice to estimate the risk separately for all K candidates; this is more akin to the effect estimation problem in Section 3.2, except where the density p_0(y) in (11) is replaced with a candidate estimator ĝ_k(y) (e.g., which may be estimated on a separate independent sample/fold and conditioned upon, and so treated as fixed).

4 Efficiency Bounds

In this section we present a crucial von Mises expansion (i.e., distributional Taylor expansion) for generic density functionals, which yields efficient influence functions for the projection parameters and density effects of interest, and thus nonparametric efficiency bounds [Bickel et al., 1993, van der Laan and Robins, 2003]. The latter can be further formalized as local minimax lower bounds [van der Vaart, 2002].

Throughout we make reference to the linear map T ↦ φ_a(T; P) defined as

φ_a(T; P) = {1(A = a)/π_a(X)} {T − E(T | X, A = a)} + E(T | X, A = a) − E{E(T | X, A = a)}   (15)

which takes a random variable T (and distribution P) and outputs the efficient influence function for the functional E{E(T | X, A = a)}. Note we drop the dependence of φ_a(T; P) on (X, A) for simplicity; at times we also drop the dependence on P if the context is clear. In all our examples, T = h(Y) will be a known or P-dependent function of Y; the functionals we consider all have influence functions consisting of terms of the above form, but with different and non-standard choices of T = h(Y), depending on the model and distance being used.

Recall that in Corollary 1 we showed the relevant moment m(β) reduces to a functional of the counterfactual density for particular distances.
Therefore our first result gives a von Mises-style expansion for generic smooth integral functionals of the counterfactual density. This result paves the way for later expansions and efficiency bounds, and may be of independent interest in other problems involving different counterfactual density functionals.

Lemma 1.
Let $\psi = \psi(P) = \int h(p_a(y))\, dy$ for some twice continuously differentiable function h. Then $\psi$ satisfies the von Mises expansion
$$\psi(\bar P) - \psi(P) = \int \phi_a\big(h'(\bar p_a(Y)); \bar P\big)\, d(\bar P - P) + R_2(\bar P, P) \quad (16)$$
where
$$R_2(\bar P, P) = \int\!\!\int h'(\bar p_a(y))\Big\{\frac{\pi_a(x)}{\bar\pi_a(x)} - 1\Big\}\big\{\bar\eta_a(y \mid x) - \eta_a(y \mid x)\big\}\, dy\, dP(x) + \frac{1}{2}\int h''(p^*_a(y))\big\{\bar p_a(y) - p_a(y)\big\}^2\, dy,$$
where $p^*_a(y)$ lies between $\bar p_a(y)$ and $p_a(y)$.

Lemma 1 has several important consequences. First, it indicates how one can correct the first-order bias of a plug-in estimator $\psi(\hat P)$ of counterfactual density functionals: by estimating the first term in the expansion and subtracting it off. This is how standard semiparametric estimators (particularly of the one-step variety) based on influence functions are constructed [Bickel et al., 1993, Chernozhukov et al., 2018a, van der Laan and Robins, 2003], and our proposed estimators in the next section do precisely this. Second, since the remainder term is quadratic in the nuisance functions, it implies that $\psi(P)$ is pathwise differentiable with efficient influence function $\phi_a(h'(p_a(Y)))$; for this fact we refer to Lemma 2 in the Appendix.

In this subsection we use Lemma 1 to detail the efficient influence function for the moment $m(\beta)$ at a fixed $\beta$, as well as the projection parameter $\beta^*$ and projected density $g(y; \beta^*)$. These efficient influence functions yield local minimax lower bounds, as well as estimators that can attain the nonparametric efficiency bounds under generic high-level rate conditions on nuisance estimators, which will be proved in Section 5.

Theorem 1. Assume f is twice differentiable and denote partial derivatives as $f'_j(q_1, q_2) = \frac{\partial}{\partial q_j} f(q_1, q_2)$ and similarly $f''_{jk}(q_1, q_2) = \frac{\partial^2}{\partial q_j \partial q_k} f(q_1, q_2)$.
Then, under an unrestricted nonparametric model, the efficient influence function for $m(\beta)$ is given by $\phi_a(\gamma_f(Y; \beta))$ where
$$\gamma_f(y; \beta) \equiv \gamma_f(y; \beta, p_a) = \frac{\partial g(y; \beta)}{\partial \beta}\Big\{f'_1\big(p_a(y), g(y; \beta)\big) + g(y; \beta)\, f''_{12}\big(p_a(y), g(y; \beta)\big)\Big\}.$$
The efficient influence functions for $\beta^*$ and $g(y; \beta^*)$ are similarly given by
$$-\Big\{\frac{\partial m(\beta)}{\partial \beta^T}\Big\}^{-1}\phi_a\big(\gamma_f(Y; \beta)\big)\Big|_{\beta = \beta^*} \quad \text{and} \quad -\frac{\partial g(y; \beta)}{\partial \beta^T}\Big\{\frac{\partial m(\beta)}{\partial \beta^T}\Big\}^{-1}\phi_a\big(\gamma_f(Y; \beta)\big)\Big|_{\beta = \beta^*} \quad (17)$$
respectively.

The efficient influence functions given in Theorem 1 are analogous to those of usual ATE-type parameters, but with the crucial difference that they correspond to means of $\gamma_f(Y^a; \beta)$, not $Y^a$ itself. This is what we should expect based on the result in Lemma 1, since the $\gamma_f$ transformation is the derivative of the integrand in the moment condition (8) given in Proposition 1. Note also that the form of $\gamma_f$ indicates that the efficiency bound for $\beta^*$ (i.e., the variance of the efficient influence function) will be adversely affected when the model g is sensitive to small changes in $\beta$, or when the distance is sensitive to small changes in its arguments, since then the derivatives in $\gamma_f$ will be large.

In the next corollary, we give the particular form of the efficient influence functions when $D_f$ is the $L_2$ and KL divergence, and the approximating models are a linear series and exponential family.

Corollary 4.
For $L_2$ and KL divergence the quantity $\gamma_f$ from Theorem 1 reduces to
$$\gamma_f(y; \beta) = \begin{cases} -2\, \partial g(y; \beta)/\partial \beta & \text{if } D_f = L_2 \\ -\, \partial \log g(y; \beta)/\partial \beta & \text{if } D_f = \mathrm{KL}. \end{cases}$$
Further, if either
1. $D_f = L_2$ and $g(y; \beta) = q(y) + \beta^T b(y)$ is the truncated series in Example 1b, or
2. $D_f = \mathrm{KL}$ and $g(y; \beta) = \exp\{\beta^T b(y) - C(\beta)\}$ is the exponential family in Example 1a,
then the efficient influence function for $m(\beta)$ is proportional to $\phi_a(b(Y))$. The proportionality constant is $-2$ for $D_f = L_2$, and $-1$ for $D_f = \mathrm{KL}$.

Corollary 4 shows that the efficient influence function takes a common simple form for projections onto a linear series using the $L_2$ distance, and for projections onto an exponential family using the KL divergence. Further, this efficient influence function simply corresponds to that of the counterfactual mean vector $E\{b(Y^a)\}$, for b a known basis function vector. Thus the influence function conveniently reduces to that of the mean of a transformed version of the counterfactual outcome, with no dependence on $\beta$. As mentioned after Corollary 2, this allows for adapting existing theory and methods for average treatment effects to the density estimation context.

The following result summarizes the local minimax lower bound implied by the form of the efficient influence function in Theorem 1, as in Corollary 2.6 of van der Vaart [2002].

Corollary 5.
Let $\sigma^2 = \sigma^2_P$ denote the variance of the efficient influence function from (17). The local minimax risk for $\beta^*$ is lower bounded as
$$\inf_{\delta > 0}\; \liminf_{n \to \infty}\; \sup_{\mathrm{TV}(\bar P, P) < \delta} E_{\bar P}\Big[\ell\Big\{\sqrt{n}\big(\hat\beta - \beta^*(\bar P)\big)\Big\}\Big] \geq E\big\{\ell(\sigma Z)\big\}$$
for any estimator $\hat\beta$, where $\ell : \mathbb{R}^p \mapsto [0, \infty)$ is any subconvex loss function and Z is a standard Gaussian vector.

Corollary 5 follows from Corollary 2.6 of van der Vaart [2002]. It shows that the worst-case mean squared error of any estimator, locally near the true P, cannot be smaller than the efficiency bound, asymptotically and after scaling by $\sqrt{n}$. This gives an important benchmark for efficient estimation of projection parameters of the counterfactual density: no estimator can have mean squared error uniformly better than the variance of the efficient influence function (divided by n), without adding extra assumptions to the nonparametric model we consider.

Now we give the efficient influence function for the density effect parameters in (11). Unlike the projected densities in the previous subsection, the density effect parameters depend on both counterfactual densities of interest for comparison.
Theorem 2.
In an unrestricted nonparametric model, the efficient influence function for the density effect $\psi_f = \int f(p_1(y), p_0(y))\, p_0(y)\, dy$ is given by $\phi_1(\lambda_1(Y)) + \phi_0(\lambda_0(Y))$ where
$$\lambda_1(y) = p_0(y)\, f'_1\big(p_1(y), p_0(y)\big)$$
$$\lambda_0(y) = f\big(p_1(y), p_0(y)\big) + p_0(y)\, f'_2\big(p_1(y), p_0(y)\big).$$

As with the result for $\beta^*$ in Theorem 1, the efficient influence function for $\psi_f$ in Theorem 2 consists of inverse probability weighted residuals, plus a "plug-in"-type term, similar to ATE parameters. However, again this corresponds to the influence function for a transformed version of the outcome, depending on the counterfactual densities and choice of distance f. The efficient influence function simplifies somewhat for $L_2$ and KL divergence, as indicated in the following corollary. Expressions for other f-divergences are in Section B.1 in the Appendix.

Corollary 6. If $D_f = L_2$, then the efficient influence function for $\psi_f$ is
$$2\,(\phi_1 - \phi_0)\big(p_1(Y) - p_0(Y)\big).$$
If $D_f = \mathrm{KL}$, then the efficient influence function for $\psi_f$ is
$$\phi_1\bigg(\log\bigg(\frac{p_1(Y)}{p_0(Y)}\bigg)\bigg) - \phi_0\bigg(\frac{p_1(Y)}{p_0(Y)}\bigg).$$

The fact that $\lambda_1 = -\lambda_0$ for $L_2$ projections simplifies the form of our proposed estimators, as we will detail further in the next section. We also note that the influence function reduces to zero when $p_1 = p_0$, which presents some complications for inference; this will be discussed in the next section as well.

As mentioned in Section 3.3, for the purposes of model selection and aggregation it is also useful to consider the distance between $p_a$ and a fixed candidate g; we give the corresponding efficient influence function here.

Proposition 2.
In an unrestricted nonparametric model, the efficient influence function for $\Delta_f(g) = \int f(p_a(y), g(y))\, g(y)\, dy$ for g fixed and known is given by
$$\phi_a\Big(g(Y)\, f'_1\big(p_a(Y), g(Y)\big)\Big).$$
If $D_f = L_2$ then this influence function reduces to $2\,\phi_a\big(p_a(Y) - g(Y)\big)$.

In this section we present doubly robust-style estimators of the proposed density functions and density effects, based on the functional expansions from Lemma 1 and the efficient influence function results in Theorems 1–2. We study their rates of convergence, and show they can be $n^{-1/2}$-consistent and asymptotically efficient under weak nonparametric conditions.

Here let $\hat\pi_a(x)$ and $\hat\eta_a(y \mid x)$ denote initial estimators of the propensity score and conditional density functions $\pi_a(x) = P(A = a \mid X = x)$ and $\eta_a(y \mid x) = \frac{\partial}{\partial y} P(Y \leq y \mid X = x, A = a)$, for example based on generic regression estimators and their numerical derivatives (or for the latter one can use a regression of a kernel-transformed version of the outcome). Also let $\hat p_a(y) = \mathbb{P}_n\{\hat\eta_a(y \mid X)\}$ denote the plug-in estimator of the counterfactual density under A = a, where $\mathbb{P}_n\{h(Z)\} = \frac{1}{n}\sum_i h(Z_i)$, and let
$$\hat m(\beta) \equiv \int \frac{\partial g(y; \beta)}{\partial \beta}\Big\{f\big(\hat p_a(y), g(y; \beta)\big) + g(y; \beta)\, f'_2\big(\hat p_a(y), g(y; \beta)\big)\Big\}\, dy \quad (18)$$
denote the plug-in estimator of the moment condition $m(\beta)$, and similarly for $\psi_f$.

Remark 4. Although we suggest basing (18) on the plug-in estimator $\hat p_a(y) = \mathbb{P}_n\{\hat\eta_a(y \mid X)\}$ of the counterfactual density, one could just as well use other estimators (e.g., inverse-probability-weighted, or doubly robust, as in Kim et al. [2018]).
Nonetheless, all results in this paper will only depend on high-level second-order rate conditions for estimating $p_a(y)$, which would be satisfied for the simple plug-in estimator as long as similar conditions hold for the underlying density estimator $\hat\eta_a(y \mid x)$. We prove this in Appendix B.5, showing that the mean squared error of $\hat p_a(y)$ is upper bounded by an integrated version of that of $\hat\eta_a(y \mid x)$.

To ease notation we let $\hat\phi_a(T) = \phi_a(T; \hat P)$ denote the estimated version of the efficient influence function given in (15). Then our proposed projection estimators are given by approximate solutions in $\beta$ (up to $o_P(1/\sqrt{n})$ error) to
$$\hat m(\beta) + \mathbb{P}_n\Big\{\hat\phi_a\big(\hat\gamma_f(Y; \beta)\big)\Big\} = o_P(1/\sqrt{n}). \quad (19)$$
In other words the estimators are one-step bias-corrected estimators (of the moment condition and the parameter itself, respectively), which take the plug-in estimator and add an estimate of the bias by averaging an estimate of the influence function.

Remark 5. For simplicity, in the following results we assume the various nuisance estimates in $\hat P$ are constructed from a single separate independent sample, of the same size n as the estimation sample on which $\mathbb{P}_n$ operates. Alternatively, if the same observations are used both for estimating nuisance functions and averaging estimates of the influence function, one generally needs to rely on empirical process conditions to avoid overfitting. In practice, with iid data, one can always obtain separate independent samples by randomly splitting the data in half (or in folds); further, to regain full sample size efficiency one can always swap the samples, repeat the procedure, and average the results, popularly called cross-fitting and used for example by Bickel and Ritov [1988], Chernozhukov et al. [2018a], Robins et al. [2008], Zheng and van der Laan [2010].
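The split-and-swap scheme described here can be sketched abstractly; a toy illustration, where `fit_nuisance` and `estimate` are hypothetical placeholders for the nuisance-fitting and influence-function-averaging steps:

```python
# Toy illustration of split-and-swap (2-fold cross-fitting): fit nuisances on
# one half, average the estimated influence function on the other half, swap
# the roles, and average the two resulting estimates.
import random

def cross_fit(data, fit_nuisance, estimate, seed=0):
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    fold1 = [data[i] for i in idx[:half]]
    fold2 = [data[i] for i in idx[half:]]
    est1 = estimate(fit_nuisance(fold1), fold2)  # nuisances from fold1, average on fold2
    est2 = estimate(fit_nuisance(fold2), fold1)  # roles swapped
    return 0.5 * (est1 + est2)

# Degenerate example: the "nuisance" is trivial and the estimator is a sample
# mean, so cross-fitting simply recovers the overall mean.
est = cross_fit([1.0, 2.0, 3.0, 4.0],
                fit_nuisance=lambda fold: None,
                estimate=lambda nuis, fold: sum(fold) / len(fold))
```

The design choice here is that the estimation step only ever sees nuisance fits trained on the other fold, which is what removes the need for empirical process (Donsker-type) conditions on the nuisance estimators.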
In this paper, to simplify notation we always analyze a single split procedure, with the understanding that extending to an analysis of an average across independent splits is straightforward.

Our first propositions give the form of the plug-in and bias-corrected projection estimators when using a linear series with $L_2$ distance, and an exponential family model with KL divergence, which take a particularly simple form.

Proposition 3. If $D_f = L_2$, the support of Y is $[0, 1]$, and $g(y; \beta) = 1 + \beta^T b(y)$ is the truncated series in Example 1b, with $b(\cdot)$ an orthogonal series with $\int b_j(y)\, dy = 0$ and $\int b_j(y) b_k(y)\, dy = \mathbb{1}(j = k)$, then the plug-in estimator of $\beta^*$ is $\mathbb{P}_n\{\hat\mu_a(X; b)\}$, where $\hat\mu_a(x; b)$ is an estimate of $\mu_a(x; b) = E\{b(Y) \mid X = x, A = a\}$. In contrast, the proposed one-step estimator in (19) is given by
$$\hat\beta = \mathbb{P}_n\bigg[\frac{\mathbb{1}(A = a)}{\hat\pi_a(X)}\big\{b(Y) - \hat\mu_a(X; b)\big\} + \hat\mu_a(X; b)\bigg]. \quad (20)$$

Proposition 4. If $D_f = \mathrm{KL}$ and $g(y; \beta) = \exp\{\beta^T b(y) - C(\beta)\}$ is the exponential family in Example 1a, then the plug-in estimator solving $\hat m(\hat\beta) = 0$ is the solution in $\beta$ to
$$\int \Big[b(y) - \mathbb{P}_n\{\hat\mu_a(X; b)\}\Big] \exp\big\{\beta^T b(y)\big\}\, dy = 0$$
where $\hat\mu_a(x; b)$ is an estimate of $\mu_a(x; b) = E\{b(Y) \mid X = x, A = a\}$. In contrast, the proposed one-step estimator in (19) is given by the solution in $\beta$ to
$$\int \bigg(b(y) - \mathbb{P}_n\bigg[\frac{\mathbb{1}(A = a)}{\hat\pi_a(X)}\big\{b(Y) - \hat\mu_a(X; b)\big\} + \hat\mu_a(X; b)\bigg]\bigg) \exp\big\{\beta^T b(y)\big\}\, dy = 0. \quad (21)$$
Propositions 3–4 show that the plug-in and bias-corrected estimators for $L_2$ and KL projections solve simple estimating equations, which only require one to first estimate the components $E\{\mu_a(X; b)\}$; importantly, straightforward doubly robust estimators as in (21) are available, and do not depend on the estimating equation parameter $\beta$. This is not necessarily true for other model/distance combinations; in general $\hat\gamma_f$ would have to be estimated at each $\beta$ in order to solve (19), which could be quite computationally intensive.

Next we give the main result of this section, which shows the rate of convergence for the proposed estimator. Importantly the rate involves products of nuisance estimation errors, allowing for $n^{-1/2}$ consistency and asymptotic normality in nonparametric models, even when the nuisance estimators are generic and flexibly fit.

Theorem 3.
Let $\eta = (\pi_a, \eta_a)$, and $\varphi(Z; \beta, \eta) = m(\beta; \eta) + \phi_a(\gamma_f(Y; \beta); \eta)$. Assume:
1. The functions $\gamma_f$ and $1/\hat\pi_a$ are bounded above by some constant, and $\gamma_f$ is differentiable in $p_a(y)$, with derivative bounded uniformly above by $\delta$.
2. The function class $\{\varphi(z; \beta, \eta) : \beta \in \mathbb{R}^p\}$ is Donsker in $\beta$ for any fixed $\eta$.
3. The estimators are consistent in the sense that $\hat\beta - \beta^* = o_P(1)$ and $\|\hat\eta - \eta\| = o_P(1)$.
4. The map $\beta \mapsto P\{\varphi(Z; \beta, \eta)\}$ is differentiable at $\beta^*$ uniformly in $\eta$, with nonsingular derivative matrix $\frac{\partial}{\partial \beta} P\{\varphi(Z; \beta, \eta)\}\big|_{\beta = \beta^*} = V(\beta^*, \eta)$, where $V(\beta^*, \hat\eta) \stackrel{p}{\to} V(\beta^*, \eta)$.
Then
$$\hat\beta - \beta^* = -V(\beta^*, \eta)^{-1}(\mathbb{P}_n - P)\Big\{\phi_a\big(\gamma_f(Y; \beta^*)\big)\Big\} + O_P\Big(\|\hat\pi_a - \pi_a\|\, \|\hat\eta_a - \eta_a\| + \delta\, \|\hat p_a - p_a\|^2\Big) + o_P\Big(\frac{1}{\sqrt{n}}\Big).$$

Remark 6. In a slight abuse of notation, Theorem 3 holds when we define $\|\hat\eta_a - \eta_a\| = \|\zeta_a\| \equiv \{\int \zeta_a(x)^2\, dP(x)\}^{1/2}$ for integrated error $\zeta_a(x) = \int |\hat\eta_a(y \mid x) - \eta_a(y \mid x)|\, dy$. This implies it also holds if we define $\|\hat\eta_a - \eta_a\|^2 = \int\!\!\int \{\hat\eta_a(y \mid x) - \eta_a(y \mid x)\}^2\, dy\, dP(x)$, or $\|\hat\eta_a - \eta_a\|^2 = \int \{\hat\eta_a(y \mid x) - \eta_a(y \mid x)\}^2\, dP(y, x)$ if $\eta_a(y \mid x)$ is bounded from below.

Importantly, Theorem 3 shows that $\hat\beta$ attains substantially faster rates than its nuisance estimators $\hat\eta$, and can be asymptotically efficient under weak nonparametric conditions, for example attaining the minimax lower bound in Corollary 5. First we give some description of the assumed conditions. The Donsker condition (2) ensures the influence function is not too complex as a function of $\beta$ (though allowing arbitrary complexity in $\eta$).
Condition 3 merely requires consistency of $(\hat\beta, \hat\eta)$ at any rate. Condition 4 requires some smoothness in $\beta$, so as to allow a delta method argument. These conditions ensure $\hat\beta$ has a rate of convergence that is second-order in the nuisance estimation error, thus attaining faster rates than the nuisance estimators. Thus, for example, under standard $n^{-1/4}$-type rate conditions on $\hat\eta$, the estimator $\hat\beta$ is $n^{-1/2}$-consistent, asymptotically normal, and efficient. Importantly, these rates can be attained under smoothness, sparsity, or other structural conditions (e.g., additive modeling or bounded variation assumptions, etc.). For instance, if it is assumed that all d-dimensional nuisance functions lie in a Hölder class with smoothness index s (i.e., partial derivatives up to order s exist and are Lipschitz) then the assumption of Theorem 3 would be satisfied when $s > d/2$, i.e., the smoothness index is at least half the dimension. Alternatively, if the functions are s-sparse then one would need $s = o(\sqrt{n})$ up to log factors, as in Farrell [2015]. In these cases, asymptotically valid 95% confidence intervals can be constructed via the simple Wald form, $\hat\beta \pm 1.96\,\sqrt{\mathrm{diag}\big[\widehat{\mathrm{cov}}\{\hat\phi_a(\hat\gamma_f(Y; \hat\beta))\}/n\big]}$.

Remark 7. In some prominent cases (for example, $L_2$ and KL projections, as shown in Corollary 4), the function $\gamma_f$ does not depend on the counterfactual density $p_a(y)$ at all, so its derivative is exactly zero and $\delta = 0$. In this case the second term in the second-order remainder in Theorem 3 drops out, making the proposed approach doubly robust in the usual sense, requiring no rate conditions on the initial pilot estimate of the counterfactual density.

5.2 Density Effects

Here we present doubly robust-style estimators of the density effects described in Section 3.2, and study their rate of convergence.
As before we first construct initial estimators $\hat\pi_a(x)$, $\hat\eta_a(y \mid x)$, and $\hat p_a(y) = \mathbb{P}_n\{\hat\eta_a(y \mid X)\}$ of the propensity score and conditional and counterfactual densities. Estimated versions of $\phi_a(T)$ and $\lambda_a$ defined in Theorem 2 follow accordingly. Then the density effect estimators we propose are defined as
$$\hat\psi_f = \int f\big(\hat p_1(y), \hat p_0(y)\big)\, \hat p_0(y)\, dy + \mathbb{P}_n\Big\{\hat\phi_1\big(\hat\lambda_1(Y)\big) + \hat\phi_0\big(\hat\lambda_0(Y)\big)\Big\}, \quad (22)$$
which can again be viewed as one-step bias-corrected estimators, with plug-in bias estimated via an average of the estimated influence function. In practice, rather than estimating the conditional density $\eta_a$ and integrating over its y argument, one could instead regress for example $\hat\lambda_a$ on X for the integral terms in the estimated influence function.

Proposition 5. If $D_f = L_2$ then the proposed density effect estimator can be written as
$$\hat\psi_f = 2\,\mathbb{P}_n\bigg(\frac{2A - 1}{\hat\pi_A(X)}\Big[\big\{\hat p_1(Y) - \hat p_0(Y)\big\} - \int \big\{\hat p_1(y) - \hat p_0(y)\big\}\hat\eta_A(y \mid X)\, dy\Big] + \int \big\{\hat p_1(y) - \hat p_0(y)\big\}\big\{\hat\eta_1(y \mid X) - \hat\eta_0(y \mid X)\big\}\, dy\bigg) - \int \big\{\hat p_1(y) - \hat p_0(y)\big\}^2\, dy.$$

Thus for $L_2$ distance the estimator takes twice an estimate of the mean of $(\hat p_1 - \hat p_0)(Y^1) - (\hat p_1 - \hat p_0)(Y^0)$, which is $\int (\hat p_1 - \hat p_0)(p_1 - p_0)$, and subtracts a plug-in estimate of the $L_2$ distance. This is analogous to the standard one-step estimator of the expected (observational) density $\int p(x)^2\, dx$ [Bickel and Ritov, 1988], which takes twice an estimate of the mean of $\hat p(X)$, i.e., $2\int \hat p\, p$, and subtracts the plug-in estimate $\int \hat p^2$.
For the expected density, the bias is just the integrated squared difference between $\hat p$ and p; in contrast, in our setting, we show next that there is an additional doubly robust error term, due to the confounding adjustment required for estimating counterfactual densities.

Theorem 4.
Assume $\lambda_a$ and $1/\hat\pi_a$ are bounded above by some constant for $a = 0, 1$, and $\lambda_a$ is differentiable in $p_a(y)$, with derivative bounded uniformly above by $\delta_a$. Then
$$\hat\psi_f - \psi_f = (\mathbb{P}_n - P)\Big\{\phi_1\big(\lambda_1(Y)\big) + \phi_0\big(\lambda_0(Y)\big)\Big\} + O_P\bigg(\sum_{a=0}^1 \Big(\|\hat\pi_a - \pi_a\|\, \|\hat\eta_a - \eta_a\| + \delta_a\, \|\hat p_a - p_a\|^2\Big)\bigg) + o_P\Big(\frac{1}{\sqrt{n}}\Big).$$

Theorem 4 (whose proof mimics that of Theorem 3) shows that $\hat\psi_f$ can attain faster rates than its nuisance estimators, and can be asymptotically efficient under weak nonparametric conditions. The conditions and the form of the convergence rate are similar to those of Theorem 3, so we refer to our discussion there for more details. However we do comment on a few differences. First, for the density functions targeted in Theorem 3, the moment condition $m(\beta)$, and resulting influence functions and estimators, can have a complicated dependence on $\beta$; in contrast, this is not an issue for the density effect $\psi_f$ since the influence function is linear in the parameter. Thus extra smoothness conditions on the influence function used in Theorem 3 are not required in Theorem 4. Second, although in Theorem 3 the derivative bound $\delta$ can be exactly zero in some prominent cases, in general in Theorem 4 this will not be the case (e.g., for $L_2$ distance the derivative of $\lambda_a$ is a nonzero constant). Therefore, for efficient estimation of density effects, we in general need an initial density estimator converging at $n^{-1/4}$ rate.
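Given estimated influence-function values, the resulting Wald confidence interval takes a simple form; a minimal sketch (`wald_ci` is a hypothetical helper, not part of npcausal):

```python
# Minimal sketch of an influence-function-based Wald interval: the standard
# error is the empirical standard deviation of the estimated influence-function
# values divided by sqrt(n).
import math

def wald_ci(psi_hat, if_vals, z=1.96):
    n = len(if_vals)
    mean = sum(if_vals) / n
    var = sum((v - mean) ** 2 for v in if_vals) / n  # empirical variance
    se = math.sqrt(var / n)
    return psi_hat - z * se, psi_hat + z * se

# Toy influence-function values with mean 0 and variance 1, so se = 1/sqrt(n):
lo, hi = wald_ci(0.5, [-1.0, 1.0, -1.0, 1.0])
```

As discussed next, such intervals are only valid away from the null where the influence function is non-degenerate.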
However recall that, as described in Remark 4, there exist nonparametric counterfactual density estimators with error upper bounded by $\|\hat\pi_a - \pi_a\|$ or $\|\hat\eta_a - \eta_a\|$, so that $\|\hat p_a - p_a\|^2$ would be of smaller or similar order compared to the product error preceding it.

There is a third distinction in density effect estimation. Under usual $n^{-1/4}$ rate conditions on the nuisance estimators, Theorem 4 suggests 95% confidence intervals of the form
$$\hat\psi_f \pm 1.96\,\sqrt{\widehat{\mathrm{cov}}\Big\{\hat\phi_1\big(\hat\lambda_1(Y)\big) + \hat\phi_0\big(\hat\lambda_0(Y)\big)\Big\}/n}. \quad (23)$$
These intervals are asymptotically valid as usual when $p_1 \neq p_0$, but not when $p_1 = p_0$, since then the influence function of $\psi_f$ reduces to zero, as mentioned in Section 4.2. This invalidates inference because the first sample average in Theorem 4 is no longer dominant, as with degenerate U-statistics or other estimators whose higher-order terms dominate their von Mises expansions (cf. Sections 12.3 and 20.1.1 of van der Vaart [2000]). However, the presence of nuisance functions complicates things substantially, as noted in other similarly complex functional estimation problems [Luedtke et al., 2019, Williamson et al., 2020], and we are not aware of a general solution. Thus we only recommend using the interval (23) in non-null settings when $p_1 \neq p_0$. A simple albeit ad hoc fix is to use the interval $\hat\psi \pm z_{\alpha/2}\,(s \vee 1/\sqrt{n})$ where $s = \sqrt{\widehat{\mathrm{cov}}\{\hat\phi_1(\hat\lambda_1(Y)) + \hat\phi_0(\hat\lambda_0(Y))\}/n}$. This is valid but conservative near the null.

5.3 Model Selection & Aggregation

Here we briefly describe how the methods of the previous subsections can be used for the purposes of model selection and aggregation, in the same spirit as Tsybakov [2003], van der Laan and Dudoit [2003], and others.
We leave technical details to future work.

First we consider the linear aggregation goal as defined in (14), where $\mathcal{B} = \mathbb{R}^K$. In this setup the methods from Section 5.1 can be straightforwardly adapted, by adding an extra step of sample splitting. We focus on $L_2$ projections since f-divergences may not be well-defined for general linear combinations of candidate estimators. Our proposed approach is as follows:

Step 1.
Randomly split the sample into a training set $D_{n_1}$ and test set $D_{n_2}$.

Step 2.
On the training set $D_{n_1}$, estimate K different models (e.g., K different numbers of basis functions, or K different combinations of linear, exponential family, Gaussian mixture models, etc.), using the estimator in (19) to compute $\hat g_k(y) = g(y; \hat\beta_k)$, $k = 1, \dots, K$.

Step 3.
On the test set $D_{n_2}$, estimate the ($L_2$) projection onto an orthonormal basis of the linear span of $(\hat g_1, \dots, \hat g_K)$, again using the estimator in (19), e.g., with the series model in Example 1b with $q(y) = 0$, to compute an aggregated estimator $\hat g(y) = \sum_k \hat\theta_k \hat g_k(y)$.

Step 4.
Reverse the roles of $D_{n_1}$ and $D_{n_2}$ and average the two resulting aggregates.

Note that inside Steps 2–3, another layer of sample splitting is required to avoid empirical process conditions in estimating the nuisance functions, as discussed in Remark 5. We also note that the cross-fitting in Step 4 could be considered optional if the corresponding efficiency loss was considered negligible; alternatively one could instead implement Steps 1–4 with M different folds, at each step using $M - 1$ folds for training and the other fold as the test set. We conjecture that the above approach can attain the optimal K/n rates for linear density aggregation in the observational case [Rigollet and Tsybakov, 2007], under standard $n^{-1/4}$-type conditions on the nuisance estimators (or weaker, depending on how K scales with n).

For model selection and convex aggregation, we propose a similar procedure, except where in Step 3 variants of the density effect estimators from Section 5.2 are used to estimate the distance between $p_a$ and each of the K candidates estimated from the training split (after projecting each onto the space of valid densities). One can then pick the minimum distance candidate or an appropriately weighted combination, e.g., by finding the convex weights that minimize the estimated distance in the test split. For example, our proposed estimator of the $L_2$ error of a candidate $g_k$ based on Proposition 2 is given by
$$\hat\Delta_f(g_k) = \int \big(\hat p_a(y) - g_k(y)\big)^2\, dy + 2\,\mathbb{P}_n\Big\{\hat\phi_a\big(\hat p_a(Y) - g_k(Y)\big)\Big\}.$$
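Choosing the minimum estimated-risk candidate can be sketched numerically. The following toy illustration uses an IPW-only unbiased sketch of $\int g_k^2 - 2\int g_k p_a$ (a simplification of the risk estimators above, assuming the propensity score is known; all names hypothetical):

```python
# Toy model-selection step: estimate a risk for each candidate density g_k and
# pick the minimizer. The risk is an IPW-only sketch of the quantity
# int g_k^2 - 2 int g_k p_a, which orders candidates by L2 distance to p_a.

def pseudo_l2_risk(y_obs, a_obs, pi, g, grid, dy, a=1):
    n = len(y_obs)
    ipw = sum((1.0 if ai == a else 0.0) / pi * g(yi)
              for yi, ai in zip(y_obs, a_obs)) / n     # estimates int g * p_a
    g_sq = sum(g(y) ** 2 for y in grid) * dy           # discretized int g^2
    return g_sq - 2.0 * ipw

def select_model(risks):
    return min(range(len(risks)), key=lambda k: risks[k])

def g_uniform(y):
    return 1.0

def g_tilted(y):
    return 2.0 * y

grid, dy = [i / 100 for i in range(101)], 0.01
y_obs, a_obs = [0.2, 0.4, 0.6, 0.8], [1, 1, 1, 1]
risks = [pseudo_l2_risk(y_obs, a_obs, 1.0, g, grid, dy)
         for g in (g_uniform, g_tilted)]
best = select_model(risks)
```

The proposed estimators additionally use the regression-based augmentation terms, which this sketch omits for brevity.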
For the purposes of model selection, one can instead use the simpler pseudo-$L_2$ risk
$$\hat\Delta^*_f(g_k) = -2\,\mathbb{P}_n\bigg[\frac{\mathbb{1}(A = a)}{\hat\pi_a(X)}\Big\{g_k(Y) - \int g_k(y)\hat\eta_a(y \mid X)\, dy\Big\} + \int g_k(y)\hat\eta_a(y \mid X)\, dy\bigg] + \int g_k(y)^2\, dy, \quad (24)$$
based on the fact that the $L_2$ distance $\int (p_a - g_k)^2$ equals $\int g_k^2 - 2\int g_k p_a$ plus a term $\int p_a^2$ that does not depend on $g_k$. This is the estimator we use in the data analysis in the next section.

6 Illustration
Here we apply our proposed methods to analyze the effect of combined antiretroviral therapy for treating HIV. All code is given in Appendix A, and the methods are implemented in the npcausal
R package on GitHub (https://github.com/ehkennedy/npcausal). The data we use come from the ACTG 175 randomized trial [Hammer et al., 1996], and are available in the speff2trial
R package. The treatment is whether patients received combination therapy ($A = 1$) versus zidovudine alone ($A = 0$), and the outcome Y is CD4 count at 96 weeks post-baseline. Baseline covariates X include age, weight, Karnofsky score, indicators for race, gender, hemophilia, homosexual activity, drug use, whether symptomatic, and previous zidovudine and antiretroviral use. There are a total of n = 2319 patients in the trial, 797 of whom do not have outcome data (we use R = 1 to denote an observed outcome).

Since we are interested in the density of outcomes had all versus none been treated in the entire population (i.e., had all outcomes been measured), we can view the product indicator $\mathbb{1}(A = a, R = 1)$ as a joint "treatment" variable [van der Laan and Robins, 2003]. In other words our goal is to estimate counterfactual densities under A = 1 and R = 1, versus A = 0 and R = 1. Our methods therefore rely on no unmeasured confounding of A (which holds by virtue of the experimental design) and missingness at random of Y (i.e., $R \perp\!\!\!\perp Y \mid X, A$), which is untestable regardless of whether treatment is randomized. For more details on the trial and data, we refer to Hammer et al. [1996] and Wang et al. [2018].

Throughout our analysis, we used 5-fold cross-fitting, with all nuisance functions estimated by random forests (via the R package ranger [Wright and Ziegler, 2015]). This includes conditional densities $\eta_a$, which we estimated by regressing a Gaussian-kernel-weighted outcome on covariates and treatment, on a grid of y values, with bandwidth chosen by Silverman's rule. Alternative approaches could also be used [Díaz and van der Laan, 2011, Hansen, 2004, Izbicki and Lee, 2017], potentially at the expense of some extra computational burden.

First we used the density effect methods from Section 5.2 to check for evidence of an effect of combination therapy on the density of CD4 count.
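The kernel-weighted-outcome device for estimating conditional densities can be sketched in the covariate-free case, where the regression of $K_h(Y - y)$ collapses to a sample mean, i.e., a kernel density estimate at each grid point. A simplified illustration (not the ranger-based pipeline used in the analysis):

```python
# Covariate-free sketch of the kernel-weighted-outcome device: regressing the
# transformed outcome K_h(Y - y) on an empty covariate set reduces to its
# sample mean, which is exactly a kernel density estimate at grid point y.
import math

def gauss_kernel(u, h):
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def density_via_weighted_outcome(y_sample, grid, h):
    n = len(y_sample)
    return [sum(gauss_kernel(yi - y, h) for yi in y_sample) / n for y in grid]

y_sample = [0.0, 1.0, -1.0, 0.5]
grid = [-5.0 + i * 0.01 for i in range(1001)]
dens = density_via_weighted_outcome(y_sample, grid, h=0.5)
total = sum(dens) * 0.01  # discretized integral; should be close to 1
```

With covariates, the sample mean is simply replaced by a flexible regression of $K_h(Y - y)$ on $(X, A)$, fit separately over the grid of y values.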
Specifically, we used the cross-fit version of the estimator in Proposition 5 to estimate the $L_2$ distance between $p_1$ and $p_0$, with asymptotic variance estimated as usual, via the empirical variance of the estimated influence function. To ease interpretability we rescaled Y to be on the unit interval. The estimated $L_2$ distance was 0.279 with a 95% confidence interval of [0. , 0. ], indicating a statistically significant effect of combination therapy on CD4 count.

To more precisely understand how combination therapy impacted the CD4 distribution, we estimated the counterfactual densities using the methods of Sections 5.1 and 5.3. Specifically, we used $L_2$ projections onto the linear series in Example 1b with the cosine basis (5). We considered a range of models for both densities, including up to 15 basis terms (more than 15 terms did not improve fit). Figure 3 shows estimates of model fit via the pseudo-$L_2$ risk (24), along with confidence intervals, indicating that four basis terms does best for both counterfactual densities. Figure 4 shows the estimated counterfactual CD4 densities using four basis terms, along with pointwise confidence intervals. Since the densities differ more substantially in the lowest CD4 range (e.g., 0–200), this suggests combination therapy may have increased CD4 count most for the high-risk patients with the lowest counts under control (zidovudine).

Figure 3:
Estimates of pseudo-$L_2$ risk for models of increasing dimension (using $L_2$ projections onto linear models with a cosine basis), with gray bars denoting confidence intervals.

Figure 4:
Estimated counterfactual CD4 densities for combination therapy versus zidovudine.

7 Discussion
In this paper we proposed methods for estimating counterfactual densities and corresponding distances and other functionals. We gave nonparametric efficiency bounds and flexible optimal estimators for a wide class of models and projection distances, and for new effects that quantify treatment impacts on the density scale. We also gave methods for data-driven model selection and aggregation in this context, and illustrated the ideas in an application studying effects of antiretroviral therapy on CD4 count.

There are many interesting avenues for future work. In upcoming companion papers, we consider the nonparametric version of the problem (where the target is the density $p_a$ itself and not a projection) as well as non-discrete treatments (where A is for example a continuous dose). Much more work is needed on the computational side since, outside of $L_2$ projections on linear models and KL projections on exponential families, our methods require solving somewhat complicated estimating equations. Other extensions could involve time-varying treatments, instrumental variables, conditional effects, density-optimal treatment regimes, mediation, sensitivity analysis, and more. It is also of interest to apply the methods more broadly, to see if they bring any new insights about treatment mechanisms or ways to adapt treatment policies.

Acknowledgements
Edward Kennedy gratefully acknowledges support from NSF Grant DMS1810979, and Sivaraman Balakrishnan and Larry Wasserman from NSF Grant DMS1713003.
A Appendix: R Code

set.seed(100)

References

A. Abadie. Bootstrap tests for distributional treatment effects in instrumental variable models.
Journal of the American Statistical Association, 97(457):284–292, 2002.
S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
R. Beran. Minimum Hellinger distance estimates for parametric models. The Annals of Statistics, 5(3):445–463, 1977.
P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā, pages 381–393, 1988.
P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press, 1993.
A. Buja, L. Brown, R. Berk, E. George, E. Pitkin, M. Traskin, K. Zhang, and L. Zhao. Models as approximations I: Consequences illustrated with linear regression. Statistical Science, 34(4):523–544, 2019a.
A. Buja, L. Brown, A. K. Kuchibhotla, R. Berk, E. George, and L. Zhao. Models as approximations II: A model-free theory of parametric regression. Statistical Science, 34(4):545–565, 2019b.
J. Chen, D. Zhang, and M. Davidian. A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution. Biostatistics, 3(3):347–360, 2002.
V. Chernozhukov and C. Hansen. An IV model of quantile treatment effects. Econometrica, 73(1):245–261, 2005.
V. Chernozhukov, I. Fernández-Val, and B. Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205–2268, 2013.
V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. M. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018a.
V. Chernozhukov, M. Demirer, E. Duflo, and I. Fernandez-Val. Generic machine learning inference on heterogenous treatment effects in randomized experiments. Technical report, National Bureau of Economic Research, 2018b.
I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
M. Cuellar and E. H. Kennedy. A non-parametric projection-based estimator for the probability of causation, with application to water sanitation in Kenya. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(4):1793–1818, 2020.
I. Díaz. Efficient estimation of quantiles in missing data models. Journal of Statistical Planning and Inference, 190:39–51, 2017.
I. Díaz and M. J. van der Laan. Super learner based conditional density estimation with application to marginal structural models. 2011.
J. DiNardo, N. M. Fortin, and T. Lemieux. Labor market institutions and the distribution of wages, 1973–1992: A semiparametric approach. Econometrica, pages 1001–1044, 1996.
M. H. Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.
S. Firpo. Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75(1):259–276, 2007.
N. Fortin, T. Lemieux, and S. Firpo. Decomposition methods in economics. Handbook of Labor Economics, 4:1–102, 2011.
M. Frölich and B. Melly. Unconditional quantile treatment effects under endogeneity. Journal of Business & Economic Statistics, 31(3):346–357, 2013.
S. M. Hammer, D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley, R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, and M. Niu. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335(15):1081–1090, 1996.
B. E. Hansen. Nonparametric conditional density estimation. Unpublished manuscript, 2004.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233. University of California Press, 1967.
R. Izbicki and A. B. Lee. Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11(2):2800–2831, 2017.
E. H. Kennedy, S. Lorch, and D. S. Small. Robust causal inference with continuous instruments using the local instrumental variable curve. Journal of the Royal Statistical Society: Series B, 81(1):121–143, 2019.
E. H. Kennedy, S. Balakrishnan, and M. G'Sell. Sharp instruments for classifying compliers and generalizing causal effects. The Annals of Statistics, 48(4):2008–2030, 2020.
K. Kim, J. Kim, and E. H. Kennedy. Causal effects based on distributional distances. arXiv:1806.02935, 2018.
A. Luedtke, M. Carone, and M. J. van der Laan. An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):75–99, 2019.
J. A. Machado and J. Mata. Counterfactual decomposition of changes in wage distributions using quantile regression. Journal of Applied Econometrics, 20(4):445–465, 2005.
B. Melly. Decomposition of differences in distribution using quantile regression. Labour Economics, 12(4):577–590, 2005.
R. Neugebauer and M. J. van der Laan. Nonparametric causal effects based on marginal structural models. Journal of Statistical Planning and Inference, 137(2):419–434, 2007.
D. J. Newman et al. Rational approximation to |x|. Michigan Mathematical Journal, 11(1):11–14, 1964.
A. Pinheiro and B. Vidakovic. Estimating the square root of a density via compactly supported wavelets. Computational Statistics & Data Analysis, 25(4):399–415, 1997.
A. Rakhlin, K. Sridharan, A. B. Tsybakov, et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.
A. Rényi et al. On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 1961.
P. Rigollet and A. B. Tsybakov. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16(3):260–280, 2007.
A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.
J. M. Robins and A. Rotnitzky. Comments on: Inference for semiparametric models: Some questions and an answer. Statistica Sinica, 11:920–936, 2001.
J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
J. M. Robins, E. J. Tchetgen Tchetgen, L. Li, and A. W. van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321, 2009.
J. M. Robins, L. Li, R. Mukherjee, E. Tchetgen Tchetgen, and A. W. van der Vaart. Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.
C. Rothe. Nonparametric estimation of distributional policy effects. Journal of Econometrics, 155(1):56–70, 2010.
D. B. Rubin and M. J. van der Laan. Extending marginal structural models through local, penalized, and additive learning. UC Berkeley Division of Biostatistics Working Paper Series, 212:1–20, 2006.
I. Sason and S. Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
V. Semenova and V. Chernozhukov. Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 2020.
A. B. Tsybakov. Optimal rates of aggregation. Learning Theory and Kernel Machines, pages 303–313, 2003.
A. B. Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.
M. J. van der Laan. Statistical inference for variable importance. The International Journal of Biostatistics, 2(1), 2006.
M. J. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, Paper 130, 2003.
M. J. van der Laan and J. M. Robins. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, 2003.
A. W. van der Vaart. Asymptotic Statistics. Cambridge: Cambridge University Press, 2000.
A. W. van der Vaart. Semiparametric statistics. In: Lectures on Probability Theory and Statistics, pages 331–457, 2002.
L. Wang, Y. Zhou, R. Song, and B. Sherwood. Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523):1243–1254, 2018.
Q. Wang and Y. Qin. Empirical likelihood confidence bands for distribution functions with missing responses. Journal of Statistical Planning and Inference, 140(9):2778–2789, 2010.
L. Wasserman. All of Nonparametric Statistics. Springer, 2006.
T. Westling and M. Carone. A unified study of nonparametric inference for monotone functions. The Annals of Statistics, 48(2):1001, 2020.
H. White. Using least squares to approximate unknown regression functions. International Economic Review, pages 149–170, 1980.
H. White. Maximum likelihood estimation of misspecified models. Econometrica, pages 1–25, 1982.
H. White. Estimation, Inference and Specification Analysis. Number 22. Cambridge University Press, 1996.
B. D. Williamson, P. B. Gilbert, N. R. Simon, and M. Carone. A unified approach for inference on algorithm-agnostic variable importance. arXiv preprint arXiv:2004.03683, 2020.
M. N. Wright and A. Ziegler. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409, 2015.
Z. Zhang, Z. Chen, J. F. Troendle, and J. Zhang. Causal inference on quantiles with an obstetric application. Biometrics, 68(3):697–706, 2012.
W. Zheng and M. J. van der Laan. Asymptotic theory for cross-validated targeted maximum likelihood estimation. UC Berkeley Division of Biostatistics Working Paper Series, Paper 273:1–58, 2010.
B Appendix: Proofs
B.1 Proof of Corollaries 1–4 and 6
These corollaries all follow from the distance-specific form of f. For reference we list the relevant quantities here: f(p,q) itself, its partial derivatives with respect to each argument, and the mixed second derivative.

For the L2 distance we have
\[ f(p,q) = \frac{(p-q)^2}{q}, \qquad \frac{\partial f}{\partial p} = 2\left(\frac{p}{q}-1\right), \qquad \frac{\partial f}{\partial q} = 1-\left(\frac{p}{q}\right)^2, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = -\frac{2p}{q^2}. \]
For the KL divergence we have
\[ f(p,q) = \frac{p}{q}\log\left(\frac{p}{q}\right), \qquad \frac{\partial f}{\partial p} = \frac{1}{q}\left\{\log\left(\frac{p}{q}\right)+1\right\}, \qquad \frac{\partial f}{\partial q} = -\frac{p}{q^2}\left\{\log\left(\frac{p}{q}\right)+1\right\}, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = -\frac{1}{q^2}\left\{\log\left(\frac{p}{q}\right)+2\right\}. \]
For the \chi^2 divergence we have
\[ f(p,q) = \left(\frac{p}{q}-1\right)^2, \qquad \frac{\partial f}{\partial p} = \frac{2(p-q)}{q^2}, \qquad \frac{\partial f}{\partial q} = -\frac{2p(p-q)}{q^3}, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = \frac{2(q-2p)}{q^3}. \]
For the Hellinger divergence we have
\[ f(p,q) = \left(\sqrt{\frac{p}{q}}-1\right)^2, \qquad \frac{\partial f}{\partial p} = \frac{1}{q}\left(1-\sqrt{\frac{q}{p}}\right), \qquad \frac{\partial f}{\partial q} = \frac{\sqrt{p}}{q^2}\left(\sqrt{q}-\sqrt{p}\right), \qquad \frac{\partial^2 f}{\partial p\,\partial q} = \frac{1}{2q^2}\left(\sqrt{\frac{q}{p}}-2\right). \]
For the TV* divergence (with \nu_t the smooth approximation to the absolute value function) we have
\[ f(p,q) = \frac{\nu_t(p-q)}{2q}, \qquad \frac{\partial f}{\partial p} = \frac{\nu_t'(p-q)}{2q}, \qquad \frac{\partial f}{\partial q} = -\frac{1}{2q}\left\{\frac{\nu_t(p-q)}{q}+\nu_t'(p-q)\right\}, \qquad \frac{\partial^2 f}{\partial p\,\partial q} = -\frac{1}{2q}\left\{\frac{\nu_t'(p-q)}{q}+\nu_t''(p-q)\right\}. \]

B.2 Proof of Lemma 1

Here we let \( \psi = \psi(P) = \int h(p_a(y))\,dy \) for some twice continuously differentiable function h. We will show that ψ satisfies the von Mises expansion given in Lemma 1. Let \( \bar{p}_a(y) = \int \bar{\eta}_a(y \mid x)\, d\bar{P}(x) \) denote the marginal counterfactual density under \( \bar{P} \).
Note that for the posited influence function, given by
\[ \varphi(z;\bar{P}) = \frac{\mathbb{1}(A=a)}{\bar{\pi}_a(X)}\left\{ h'(\bar{p}_a(Y)) - \int h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid X)\,dy \right\} + \int h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid X)\,dy - \iint h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid x)\,dy\, d\bar{P}(x), \]
we have, by iterated expectation, that its mean under P equals
\begin{align*}
\int \varphi(z;\bar{P})\, dP 
&= \int \frac{\pi_a(x)}{\bar{\pi}_a(x)} \int \left\{ h'(\bar{p}_a(y))\,\eta_a(y\mid x) - h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \iint h'(\bar{p}_a(y))\,\bar{\eta}_a(y\mid x)\,dy\, \left\{ dP(x) - d\bar{P}(x) \right\} \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \int h'(\bar{p}_a(y)) \int \left\{ \eta_a(y\mid x)\, dP(x) - \bar{\eta}_a(y\mid x)\, d\bar{P}(x) \right\} dy \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \int h'(\bar{p}_a(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\} dy.
\end{align*}
Therefore the second-order remainder term in the von Mises expansion is
\begin{align*}
R_2(\bar{P}, P) &\equiv \psi(\bar{P}) - \psi(P) - \int \varphi(z;\bar{P})\, d(\bar{P} - P) = \psi(\bar{P}) - \psi(P) + \int \varphi(z;\bar{P})\, dP \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) + \int h'(\bar{p}_a(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\} dy + \int \left\{ h(\bar{p}_a(y)) - h(p_a(y)) \right\} dy \\
&= \iint h'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) - \frac{1}{2} \int h''(p_a^*(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\}^2 dy,
\end{align*}
where the first equality uses that \( \varphi(\cdot;\bar{P}) \) has mean zero under \( \bar{P} \), and the last line follows by a Taylor expansion with remainder of the mean-value form, with \( p_a^*(y) \) lying between \( \bar{p}_a(y) \) and \( p_a(y) \).

B.3 Proof of Theorem 1

First, for any fixed β, each element of the p-vector
\[ m(\beta) = \int \frac{\partial g(y;\beta)}{\partial \beta} \left\{ f\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial f}{\partial q}\big(p_a(y), g(y;\beta)\big) \right\} dy \]
can be viewed as a density functional \( \int h(p_a(y))\,dy \) for a specific function h, where \( \partial f/\partial q \) denotes the partial derivative of f(p,q) with respect to its second argument. In particular, let \( g_j'(y;\beta) \) denote the jth element of \( \partial g(y;\beta)/\partial \beta \), so that \( h(p_a(y)) = \{h_1(p_a(y)), \ldots, h_p(p_a(y))\}^{\mathrm{T}} \) for
\[ h_j(p_a(y)) = g_j'(y;\beta) \left\{ f\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial f}{\partial q}\big(p_a(y), g(y;\beta)\big) \right\}, \qquad (25) \]
noting that, for a given β value, g(y;β) is a known constant not depending on P. Now we apply Lemma 1 to each component of m.
First note that
\[ h_j'(p_a(y)) = g_j'(y;\beta) \left\{ \frac{\partial f}{\partial p}\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial^2 f}{\partial p\,\partial q}\big(p_a(y), g(y;\beta)\big) \right\} \]
by the chain rule, so that
\[ \gamma_f(y;\beta) = \frac{\partial g(y;\beta)}{\partial \beta} \left\{ \frac{\partial f}{\partial p}\big(p_a(y), g(y;\beta)\big) + g(y;\beta)\, \frac{\partial^2 f}{\partial p\,\partial q}\big(p_a(y), g(y;\beta)\big) \right\} = \left\{ h_1'(p_a(y)), \ldots, h_p'(p_a(y)) \right\}^{\mathrm{T}}. \]
Therefore Lemma 1 implies that \( m(\beta) = \int h(p_a(y))\,dy \) satisfies the von Mises expansion
\[ \bar{m}(\beta) - m(\beta) = \int \varphi_m(z, \beta; \bar{P})\, d(\bar{P} - P)(z) + R_2(\bar{P}, P), \qquad (26) \]
where
\[ \varphi_m(Z, \beta; \bar{P}) = \frac{\mathbb{1}(A=a)}{\bar{\pi}_a(X)} \left\{ \bar{\gamma}_f(Y;\beta) - \int \bar{\gamma}_f(y;\beta)\,\bar{\eta}_a(y\mid X)\,dy \right\} + \int \bar{\gamma}_f(y;\beta)\,\bar{\eta}_a(y\mid X)\,dy - \iint \bar{\gamma}_f(y;\beta)\,\bar{\eta}_a(y\mid x)\,dy\, d\bar{P}(x) \]
(with \( \bar{\gamma}_f \) defined as \( \gamma_f \) above but with \( \bar{p}_a \) in place of \( p_a \)), and where the jth component of \( R_2(\bar{P}, P) \) is given by
\[ R_{2,j}(\bar{P}, P) = \iint h_j'(\bar{p}_a(y)) \left\{ \frac{\pi_a(x)}{\bar{\pi}_a(x)} - 1 \right\} \left\{ \eta_a(y\mid x) - \bar{\eta}_a(y\mid x) \right\} dy\, dP(x) - \frac{1}{2} \int h_j''(p_a^*(y)) \left\{ p_a(y) - \bar{p}_a(y) \right\}^2 dy. \qquad (27) \]
Now we give a lemma showing why finding a von Mises expansion like the above, with second-order remainder, is equivalent to finding the efficient influence function in a nonparametric model. This will prove that \( \varphi_m \) is the efficient influence function for m(β), and will also be useful for later results.

Lemma 2.
Let \( \psi : \mathcal{P} \to \mathbb{R} \) denote a real-valued functional on a nonparametric model, so that the set of distributions \( \mathcal{P} \) does not constrain the tangent space. Assume the functional satisfies
\[ \psi(\bar{P}) - \psi(P) = \int \varphi(z;\bar{P})\, (d\bar{P} - dP) + R_2(\bar{P}, P) \]
for some finite-variance function \( \varphi(z;\bar{P}) \) with mean zero under \( \bar{P} \). Then φ is the efficient influence function if \( \frac{d}{d\epsilon} R_2(P, P_\epsilon)\big|_{\epsilon=0} = 0 \) for any smooth parametric submodel \( P_\epsilon \) with \( P_{\epsilon=0} = P \).

Proof. Recall from Bickel et al. [1993] and van der Vaart [2002] that the efficient influence function is the mean-zero function whose variance equals the nonparametric efficiency bound, and is given by the unique function φ that is a valid submodel score (or limit of such scores) satisfying pathwise differentiability, i.e.,
\[ \frac{d}{d\epsilon} \psi(P_\epsilon) \Big|_{\epsilon=0} = \int \varphi(z;P) \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) \Big|_{\epsilon=0} dP(z) \qquad (28) \]
for \( P_\epsilon \) any smooth parametric submodel (e.g., differentiable in quadratic mean). In a nonparametric model only one such function φ satisfies the above. We will show that the above is satisfied by the function φ in the statement of the lemma.

First note that the assumed expansion implies
\[ \psi(P) - \psi(P_\epsilon) = -\int \varphi(z;P)\, dP_\epsilon + R_2(P, P_\epsilon) \]
for any submodel \( P_\epsilon \). Differentiating with respect to ε gives
\[ \frac{d}{d\epsilon} \psi(P_\epsilon) = \frac{d}{d\epsilon} \int \varphi(z;P)\, dP_\epsilon - \frac{d}{d\epsilon} R_2(P, P_\epsilon) = \int \varphi(z;P) \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) dP_\epsilon - \frac{d}{d\epsilon} R_2(P, P_\epsilon), \]
where the second equality follows from the dominated convergence theorem and uses the fact that \( \frac{d}{d\epsilon} \log dP_\epsilon = \frac{d}{d\epsilon} dP_\epsilon / dP_\epsilon \).
Therefore, evaluating at ε = 0, we have
\[ \frac{d}{d\epsilon} \psi(P_\epsilon) \Big|_{\epsilon=0} = \int \varphi(z;P) \left( \frac{d}{d\epsilon} \log dP_\epsilon \right) \Big|_{\epsilon=0} dP - \frac{d}{d\epsilon} R_2(P, P_\epsilon) \Big|_{\epsilon=0}, \]
which yields the desired pathwise differentiability by the assumption that \( \frac{d}{d\epsilon} R_2(P, P_\epsilon)\big|_{\epsilon=0} = 0 \).

Now we can immediately apply Lemma 2, noting that
\[ \frac{d}{d\epsilon} R_2(P, P_\epsilon) \Big|_{\epsilon=0} = 0 \]
by virtue of the fact that the remainder \( R_{2,j}(P, P_\epsilon) \) in (27) consists of only second-order products of errors between P and \( P_\epsilon \). This follows since applying the product rule yields a sum of two terms, each of which is a product of a derivative term (which may not be zero at ε = 0) and an error term involving differences of components of \( P_\epsilon \) and P (which will be zero at ε = 0). Therefore \( \varphi_m \) is the efficient influence function for the parameter m(β). The efficient influence functions for β and g(y;β) follow similarly, via the chain rule.

B.4 Proof of Theorem 2
From Lemmas 1 and 2, if \( p_0(y) \) were known, the efficient influence function of \( \psi_f = \int f(p_1(y), p_0(y))\, p_0(y)\, dy \) would be
\[ \varphi_1(z;P) = \frac{\mathbb{1}(A=1)}{\pi(1\mid X)} \left\{ h_1'(Y) - \int h_1'(y)\,\eta_1(y\mid X)\,dy \right\} + \int h_1'(y)\,\eta_1(y\mid X)\,dy - \iint h_1'(y)\,\eta_1(y\mid x)\,dy\, dP(x), \]
where
\[ h_1'(y) = h_1'(y; p_1, p_0) = p_0(y)\, \frac{\partial f}{\partial p}\big(p_1(y), p_0(y)\big), \]
with \( \partial f/\partial p \) the partial derivative of f with respect to its first argument. Similarly, if \( p_1(y) \) were known, the efficient influence function of \( \psi_f \) would be
\[ \varphi_0(z;P) = \frac{\mathbb{1}(A=0)}{\pi(0\mid X)} \left\{ h_0'(Y) - \int h_0'(y)\,\eta_0(y\mid X)\,dy \right\} + \int h_0'(y)\,\eta_0(y\mid X)\,dy - \iint h_0'(y)\,\eta_0(y\mid x)\,dy\, dP(x), \]
where
\[ h_0'(y) = h_0'(y; p_1, p_0) = f\big(p_1(y), p_0(y)\big) + p_0(y)\, \frac{\partial f}{\partial q}\big(p_1(y), p_0(y)\big). \]
The result then follows from the fact that the influence function when \( p_1 \) and \( p_0 \) are both unknown is the sum of the two influence functions obtained when \( p_0 \) and \( p_1 \), respectively, are known.

B.5 Proof of Claim in Remark 4
Here we show why rates for estimating \( p_a(y) \) with the plug-in estimator \( \hat{p}_a(y) = \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \) will not be slower than those for estimating \( \hat{\eta}_a(y\mid x) \), by bounding the mean squared error of the former in terms of the latter. To this end we denote the pointwise bias and variance of \( \hat{\eta}_a \) as \( E\{\hat{\eta}_a(y\mid x)\} - \eta_a(y\mid x) = b(y\mid x) \) and \( \mathrm{var}\{\hat{\eta}_a(y\mid x)\} = v(y\mid x) \), respectively. First note for the bias that
\begin{align*}
E\{\hat{p}_a(y)\} - p_a(y) &= E\Big[ \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \Big] - \int \eta_a(y\mid x)\, dP(x) \\
&= E \int \hat{\eta}_a(y\mid x)\, dP(x) - \int \eta_a(y\mid x)\, dP(x) = \int b(y\mid x)\, dP(x),
\end{align*}
where in the second line we used iterated expectation, conditioning on the training sample \( D_n \) used to construct \( \hat{\eta}_a \). For the variance we similarly have
\begin{align*}
\mathrm{var}\{\hat{p}_a(y)\} &= \mathrm{var}\Big( E\big[ \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \mid D_n \big] \Big) + E\Big( \mathrm{var}\big[ \mathbb{P}_n\{\hat{\eta}_a(y\mid X)\} \mid D_n \big] \Big) \\
&= \mathrm{var} \int \hat{\eta}_a(y\mid x)\, dP(x) + \frac{1}{n} E\Big[ \mathrm{var}\big\{ \hat{\eta}_a(y\mid X) \mid D_n \big\} \Big].
\end{align*}
For the first term above, by Cauchy-Schwarz we have
\[ \mathrm{var} \int \hat{\eta}_a(y\mid x)\, dP(x) = E\left\{ \int \Big[ \hat{\eta}_a(y\mid x) - E\{\hat{\eta}_a(y\mid x)\} \Big] dP(x) \right\}^2 \le E \int \Big[ \hat{\eta}_a(y\mid x) - E\{\hat{\eta}_a(y\mid x)\} \Big]^2 dP(x) = \int v(y\mid x)\, dP(x). \]
And for the second term note that
\begin{align*}
E\Big[ \mathrm{var}\big\{ \hat{\eta}_a(y\mid X) \mid D_n \big\} \Big] &\le E \int \hat{\eta}_a(y\mid x)^2\, dP(x) = \int \left( v(y\mid x) + \Big[ E\{\hat{\eta}_a(y\mid x)\} \Big]^2 \right) dP(x) \\
&= \int \left( v(y\mid x) + \Big[ E\big\{ \hat{\eta}_a(y\mid x) - \eta_a(y\mid x) \big\} + \eta_a(y\mid x) \Big]^2 \right) dP(x) \\
&\le \int \Big\{ v(y\mid x) + 2\, b(y\mid x)^2 + 2\, \eta_a(y\mid x)^2 \Big\}\, dP(x).
\end{align*}
Therefore, as long as \( \int \eta_a(y\mid x)^2\, dP(x) \le C \), we have
\[ E\Big[ \big\{ \hat{p}_a(y) - p_a(y) \big\}^2 \Big] \le \left( 1 + \frac{2}{n} \right) \int E\Big[ \big\{ \hat{\eta}_a(y\mid x) - \eta_a(y\mid x) \big\}^2 \Big]\, dP(x) + \frac{2C}{n}. \]

B.6 Proof of Theorem 3

First we present a master lemma giving the rate of convergence of the solution to a sample-split estimating equation. The logic parallels that of Theorem 5.31 of van der Vaart [2000].
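As a concrete numerical instance of the sample-split estimating-equation setup handled by the lemma below, the following Python sketch solves \( \mathbb{P}_n\{\varphi(Z;\theta,\hat{\eta})\} = 0 \) for the familiar doubly robust (AIPW) mean functional, with nuisances estimated on held-out folds. This is our own illustration, not code from the paper: the data-generating process, the linear nuisance fits, and the function name `crossfit_moment_solve` are all hypothetical assumptions.

```python
import numpy as np

def crossfit_moment_solve(x, a, y, n_folds=2):
    """Sketch: solve P_n{phi(Z; theta, eta-hat)} = 0 for the AIPW mean
    theta = E[Y^1], with nuisances (pi, mu) fit on separate folds."""
    n = len(y)
    folds = np.array_split(np.arange(n), n_folds)
    pseudo = np.zeros(n)
    for k in range(n_folds):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        dtr = np.column_stack([np.ones(len(train)), x[train]])
        dte = np.column_stack([np.ones(len(test)), x[test]])
        # propensity pi(x) via a linear probability fit (illustrative only)
        pihat = np.clip(dte @ np.linalg.lstsq(dtr, a[train], rcond=None)[0], 0.05, 0.95)
        # outcome regression mu(x) = E[Y | A=1, X=x] via least squares on treated units
        tr1 = train[a[train] == 1]
        d1 = np.column_stack([np.ones(len(tr1)), x[tr1]])
        muhat = dte @ np.linalg.lstsq(d1, y[tr1], rcond=None)[0]
        # phi(Z; theta, eta) = A/pi(X)*(Y - mu(X)) + mu(X) - theta, so the
        # solution of P_n{phi} = 0 is the mean of the AIPW pseudo-outcomes
        pseudo[test] = a[test] / pihat * (y[test] - muhat) + muhat
    theta_hat = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(n)
    return theta_hat, se

# toy simulation where the truth is E[Y^1] = E[X] + 1 = 1.5
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 1, n)
a = rng.binomial(1, 0.3 + 0.4 * x)
y = x + a + rng.normal(0, 0.5, n)
theta_hat, se = crossfit_moment_solve(x, a, y)
```

Here the estimating equation has a closed-form solution; for the projection parameters in the main text the same sample-split structure applies, but the equation must be solved numerically.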
Lemma 3.
Let \( \varphi(z; \theta, \eta) \) denote a vector estimating function for a target parameter \( \theta \in \mathbb{R}^p \) and nuisance functions \( \eta \in \mathcal{H} \), for some function space \( \mathcal{H} \). Suppose the true values \( (\theta_0, \eta_0) \) satisfy \( P\{\varphi(Z; \theta_0, \eta_0)\} = 0 \), and define the estimator \( \hat{\theta} \) as an approximate solution to the estimating equation, satisfying
\[ \mathbb{P}_n\{\varphi(Z; \hat{\theta}, \hat{\eta})\} = o_P(1/\sqrt{n}), \]
where \( \hat{\eta} \) is estimated on a separate independent sample. Assume:
1. The function class \( \{\varphi(z; \theta, \eta) : \theta \in \mathbb{R}^p\} \) is Donsker in θ for any fixed η.
2. The estimators are consistent, i.e., \( \hat{\theta} - \theta_0 = o_P(1) \) and \( \|\hat{\eta} - \eta_0\| = o_P(1) \).
3. The map \( \theta \mapsto P\{\varphi(Z; \theta, \eta)\} \) is differentiable at \( \theta_0 \) uniformly in η, with nonsingular derivative matrix \( \frac{\partial}{\partial\theta} P\{\varphi(Z; \theta, \eta)\}\big|_{\theta=\theta_0} = V(\theta_0, \eta) \), where \( V(\theta_0, \hat{\eta}) \overset{p}{\to} V(\theta_0, \eta_0) \).
Then
\[ \hat{\theta} - \theta_0 = -V(\theta_0, \eta_0)^{-1} (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + O_P(R_n) + o_P\left(\frac{1}{\sqrt{n}}\right) \]
for \( R_n = P\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\} \).

Proof. First note that, since \( (\hat{\theta}, \hat{\eta}) \) and \( (\theta_0, \eta_0) \) are approximate and exact solutions of the empirical and population moment conditions, respectively, we have
\begin{align*}
o_P(1/\sqrt{n}) &= \mathbb{P}_n\{\varphi(Z; \hat{\theta}, \hat{\eta})\} - P\{\varphi(Z; \theta_0, \eta_0)\} \\
&= (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + (\mathbb{P}_n - P)\{\varphi(Z; \hat{\theta}, \hat{\eta}) - \varphi(Z; \theta_0, \hat{\eta})\} \qquad (29) \\
&\quad + (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\} \qquad (30) \\
&\quad + P\{\varphi(Z; \hat{\theta}, \hat{\eta}) - \varphi(Z; \theta_0, \hat{\eta})\} + P\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\}, \qquad (31)
\end{align*}
where the second equality follows by simply adding and subtracting terms. The first term in (29) is a simple sample average of a fixed function and so will be asymptotically Gaussian by the central limit theorem. The second term in (29) and the term in (30) are empirical process terms.
The first term in (31) will be linearized in \( (\hat{\theta} - \theta_0) \), while the second term in (31) captures the effect of the nuisance estimation error. We will tackle each of these in turn. Under the Donsker and consistency conditions for \( \hat{\theta} \) in Assumptions 1 and 2, the second term in (29) is \( o_P(1/\sqrt{n}) \) by Lemma 19.24 of van der Vaart [2000]. Under the consistency of \( \hat{\eta} \) in Assumption 2 and the sample splitting, the term in (30) is \( o_P(1/\sqrt{n}) \) by Lemma 2 of Kennedy et al. [2020].

By the differentiability of the map \( \theta \mapsto P\{\varphi(Z; \theta, \eta)\} \) in Assumption 3, the first term in (31) can be expressed as
\[ P\{\varphi(Z; \hat{\theta}, \hat{\eta}) - \varphi(Z; \theta_0, \hat{\eta})\} = V(\theta_0, \hat{\eta})(\hat{\theta} - \theta_0) + o_P(\|\hat{\theta} - \theta_0\|) = V(\theta_0, \eta_0)(\hat{\theta} - \theta_0) + o_P(\|\hat{\theta} - \theta_0\|), \]
where the last line follows by the consistency of \( V(\theta_0, \hat{\eta}) \) in Assumption 3. Therefore we have
\[ o_P(1/\sqrt{n}) = (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + V(\theta_0, \eta_0)(\hat{\theta} - \theta_0) + R_n + o_P(\|\hat{\theta} - \theta_0\|), \]
where we let \( R_n = P\{\varphi(Z; \theta_0, \hat{\eta}) - \varphi(Z; \theta_0, \eta_0)\} \) denote the second term in (31), or equivalently
\[ \hat{\theta} - \theta_0 = -V(\theta_0, \eta_0)^{-1} (\mathbb{P}_n - P)\{\varphi(Z; \theta_0, \eta_0)\} + O_P(R_n) + o_P(\|\hat{\theta} - \theta_0\|) + o_P(1/\sqrt{n}), \]
by the nonsingularity of the derivative matrix in Assumption 3. This implies
\[ \|\hat{\theta} - \theta_0\| \, (1 + o_P(1)) = O_P(1/\sqrt{n} + R_n), \]
so that \( \|\hat{\theta} - \theta_0\| = O_P(1/\sqrt{n} + R_n) \), which gives the result after noting that \( o_P(O_P(1/\sqrt{n} + R_n)) = o_P(1/\sqrt{n} + R_n) \) and that \( O_P(R_n) + o_P(R_n) = O_P(R_n) \).

Now we can apply Lemma 3 to prove Theorem 3.
First note that by definition the estimator satisfies
\[ \mathbb{P}_n\{\phi(Z; \hat{\beta}, \hat{\eta})\} = o_P(1/\sqrt{n}) \qquad \text{for} \qquad \phi(Z; \beta, \eta) = m(\beta; \eta) + \varphi_m(Z; \beta, \eta), \]
and the true values \( (\beta_0, \eta_0) \) satisfy \( P\{\phi(Z; \beta_0, \eta_0)\} = 0 \), again by definition. Conditions 1–3 of Lemma 3 hold by Assumptions 1–4 of Theorem 3, so the result follows by virtue of the fact that
\begin{align*}
R_n &\equiv P\{\phi(Z; \beta_0, \hat{\eta}) - \phi(Z; \beta_0, \eta_0)\} = m(\beta_0; \hat{\eta}) + P\{\varphi_m(Z; \beta_0, \hat{\eta})\} \\
&= P \int h'(\hat{p}_a(y)) \left\{ \frac{\pi_a(X)}{\hat{\pi}_a(X)} - 1 \right\} \left\{ \eta_a(y\mid X) - \hat{\eta}_a(y\mid X) \right\} dy - \frac{1}{2} \int h''(\hat{p}_a^*(y)) \left\{ \hat{p}_a(y) - p_a(y) \right\}^2 dy \\
&\lesssim \|\hat{\pi} - \pi\| \, \|\hat{\eta}_a - \eta_a\| + \delta \, \|\hat{p}_a - p_a\|^2,
\end{align*}
where the second line follows from the result given in equation (27) of the proof of Theorem 1, for the vectors h and h' as defined in (25), with \( \hat{p}_a^* \) lying between \( \hat{p}_a \) and \( p_a \), and the last line follows by the Cauchy-Schwarz inequality and boundedness assumptions on h', h'', and \( 1/\hat{\pi}_a \).
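To illustrate how the pieces above fit together numerically, the following Python sketch implements a cross-fit one-step (influence-function-based) estimator of the L2 density distance \( \psi = \int \{p_1(y) - p_0(y)\}^2 dy \) in the simplest randomized setting with known \( \pi \). For \( f(p,q) = (p-q)^2/q \), the functions \( h_1' \) and \( h_0' \) from the proof of Theorem 2 reduce to \( 2(p_1 - p_0) \) and \( -2(p_1 - p_0) \), which is what the correction term uses. This is our own minimal sketch under simplifying assumptions (no confounders, Gaussian kernel density estimates, hypothetical function names), not the implementation in the npcausal package.

```python
import numpy as np

def gauss_kde(train, x, h):
    """Gaussian kernel density estimate at points x, bandwidth h."""
    z = (x[:, None] - train[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))

def l2_distance_onestep(y, a, pi=0.5, n_folds=2):
    """Sketch: cross-fit one-step estimator of psi = int (p1 - p0)^2 dy,
    assuming a randomized treatment with known P(A=1) = pi (no confounders)."""
    n = len(y)
    grid = np.linspace(y.min() - 0.5, y.max() + 0.5, 2001)
    step = grid[1] - grid[0]
    folds = np.array_split(np.arange(n), n_folds)
    contrib = np.zeros(n)
    for k in range(n_folds):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        y1, y0 = y[train][a[train] == 1], y[train][a[train] == 0]
        h1 = y1.std() * len(y1) ** -0.2   # Scott's rule bandwidths
        h0 = y0.std() * len(y0) ** -0.2
        p1g, p0g = gauss_kde(y1, grid, h1), gauss_kde(y0, grid, h0)
        dg = p1g - p0g
        plug = (dg ** 2).sum() * step      # plug-in int (d-hat)^2
        c1 = (dg * p1g).sum() * step       # int d-hat * p1-hat
        c0 = (dg * p0g).sum() * step       # int d-hat * p0-hat
        d_test = gauss_kde(y1, y[test], h1) - gauss_kde(y0, y[test], h0)
        at = a[test]
        # influence-function correction: h1' = 2*d-hat, h0' = -2*d-hat
        corr = 2 * (at / pi) * (d_test - c1) - 2 * ((1 - at) / (1 - pi)) * (d_test - c0)
        contrib[test] = plug + corr
    psi_hat = contrib.mean()
    se = contrib.std(ddof=1) / np.sqrt(n)
    return psi_hat, se

# toy check: arms N(0.6, 0.1) vs N(0.4, 0.1), for which
# psi = (1 - exp(-1)) / (0.1 * sqrt(pi)), roughly 3.57
rng = np.random.default_rng(0)
n = 4000
a = rng.binomial(1, 0.5, n)
y = rng.normal(0.4 + 0.2 * a, 0.1)
psi_hat, se = l2_distance_onestep(y, a)
```

The correction reduces the plug-in bias to second order (here \( -\int (d - \hat{d})^2 \), writing \( d = p_1 - p_0 \)), mirroring the doubly robust remainders derived above, and a Wald-style 95% interval is psi_hat ± 1.96·se.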