Adversarial Estimation of Riesz Representers
Victor Chernozhukov, Whitney Newey, Rahul Singh, Vasilis Syrgkanis
December 2020
Abstract
We provide an adversarial approach to estimating Riesz representers of linear functionals within arbitrary function spaces. We prove oracle inequalities based on the localized Rademacher complexity of the function space used to approximate the Riesz representer and the approximation error. These inequalities imply fast finite-sample mean-squared-error rates for many function spaces of interest, such as high-dimensional sparse linear functions, neural networks and reproducing kernel Hilbert spaces. Our approach offers a new way of estimating Riesz representers with a plethora of recently introduced machine learning techniques. We show how our estimator can be used in the context of de-biasing structural/causal parameters in semi-parametric models, for automated orthogonalization of moment equations, and for estimating the stochastic discount factor in the context of asset pricing.
Many problems in econometrics, statistics, causal inference, and finance involve linear functionals of unknown functions:
$$\theta_0(g) = \mathbb{E}[m(Z; g)]$$
where $Z$ denotes a random vector and $g : \mathcal{X} \to \mathbb{R}$ is a function in some space $\mathcal{G}$. A linear functional that is mean-square continuous with respect to the $L_2$ norm can be written in a more benign and useful manner. Formally, for a given linear functional $\theta_0(\cdot)$, there exists a function $a_0$ such that for any $g \in \mathcal{G}$:
$$\theta_0(g) = \mathbb{E}[a_0(X)\, g(X)].$$
This result is known as the Riesz representation theorem, and the function $a_0$ is the Riesz representer of the linear functional. Evaluation of the linear functional $\theta_0(g)$ can be achieved by simply taking the inner product between $a_0$ and $g$.

Knowing the Riesz representer of a linear functional is a critical building block in a variety of learning problems. For instance, in semi-parametric models, $g_0$ is an unknown regression function and $\theta_0(g_0)$ is a causal or structural parameter of interest. The Riesz representer $a_0$ of the functional $\theta_0(\cdot)$ can be used to de-bias the plug-in estimator and construct semi-parametrically efficient estimators of the parameter $\theta_0(g_0)$. In asset pricing applications, the Riesz representer corresponds to the stochastic discount factor, which is of primary interest when pricing financial derivatives.

Irrespective of the downstream application, the goal of this paper is to derive an estimator for the Riesz representer of any linear functional, when given access to $n$ samples of the random vector $Z$ and a target function space $\mathcal{A}$ that can well approximate the function $a_0$. We propose and analyze an estimator $\hat{a}$ with small mean-squared error. Formally, with probability (w.p.) $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 = \sqrt{\mathbb{E}\big[(\hat{a}(X) - a_0(X))^2\big]} \leq \epsilon_{n,\zeta}.$$

We consider estimation of the Riesz representer within some function space $\mathcal{A}$ and propose an adversarial estimator based on regularized variants of the following min-max criterion:
$$\hat{a} = \arg\min_{a \in \mathcal{A}} \max_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( m(Z_i; f) - a(X_i) \cdot f(X_i) - f(X_i)^2 \big).$$

For simplicity of exposition, throughout the paper we consider scalar-valued functions $g$. All our results naturally extend to vector-valued functions $g$, and estimate a vector-valued Riesz representer that satisfies $\theta_0(g) = \mathbb{E}[a_0(X)' g(X)]$.
We derive oracle inequalities for this estimator as a function of the localized Rademacher complexity of the function space $\mathcal{A}$ and the approximation error $\epsilon_n = \min_{a \in \mathcal{A}} \|a - a_0\|_2$. We show that as long as the function class $\mathcal{F}$ contains the star-hull of differences of functions in $\mathcal{A}$, i.e. $\mathcal{F} := \{r\,(a - a') : a, a' \in \mathcal{A}, r \in [0, 1]\}$, then the adversarial estimator achieves, w.p. $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 = O\left(\epsilon_n + \delta_n + \sqrt{\frac{\log(1/\zeta)}{n}}\right)$$
where $\delta_n$ is the critical radius of the function classes $\mathcal{F}$ and $m \circ \mathcal{F} = \{Z \to m(Z; f) : f \in \mathcal{F}\}$. The critical radius of a function class is a widely used quantity in statistical learning theory that allows one to argue fast estimation rates that are nearly optimal. For instance, for parametric function classes, the critical radius is of order $n^{-1/2}$, leading to fast parametric rates (as compared to the $n^{-1/4}$ rate that would be achievable via looser uniform deviation bounds).

Moreover, the critical radius has been analyzed and derived for a variety of function spaces of interest, such as neural networks, high-dimensional linear functions, reproducing kernel Hilbert spaces, and VC-subgraph classes. Thus our general theorem allows us to appeal to these characterizations and provide oracle rates for a family of Riesz representer estimators. Prior work on estimating Riesz representers only considered particular high-dimensional parametric classes and derived specialized estimators for the function space of interest. Our adversarial estimator provides a single approach that tackles generic function spaces in a uniform manner.

We also examine the computational aspect of our estimator. We provide examples of how estimation can be achieved in a computationally efficient manner for several function spaces of interest.

Finally, we show how our estimator can be used in the context of estimating causal or structural parameters in semi-parametric models. Specifically, our mean-square rate for the Riesz representer is sufficiently fast to achieve semi-parametric efficiency and asymptotic normality of the causal or structural parameter. This learning problem arises in two important domains for economic research: causal inference and asset pricing.
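As a concrete illustration of these objects (a standard textbook example rather than a result specific to this paper), consider the average treatment effect functional with $X = (D, W)$, $m(Z; g) = g(1, W) - g(0, W)$, and propensity score $\pi(w) := \Pr[D = 1 \mid W = w]$. Under overlap, for any $g$,
$$\mathbb{E}[g(1, W) - g(0, W)] = \mathbb{E}\!\left[\left(\frac{D}{\pi(W)} - \frac{1 - D}{1 - \pi(W)}\right) g(X)\right],$$
so the Riesz representer takes the familiar inverse-propensity form $a_0(d, w) = \frac{d}{\pi(w)} - \frac{1 - d}{1 - \pi(w)}$.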
Automated De-biasing of Causal Estimates.
In causal inference, a variety of treatment effects and policy effects can be formulated as functionals, i.e., scalar summaries, of an underlying regression [36]. Formally, the causal parameter $\theta_0 = \theta_0(g_0) = \mathbb{E}[m(Z; g_0)]$ is a functional $\theta_0(\cdot)$ of the nuisance parameter $g_0(x) := \mathbb{E}[Y \mid X = x]$. In this paper, we consider a variety of treatment and policy effects, including:

1. Average treatment effect (ATE): $\theta_0 = \mathbb{E}[g_0(1, W) - g_0(0, W)]$, where $X = (D, W)$ consists of treatment and covariates.
2. Average policy effect: $\theta_0 = \int g_0(x)\, d\mu(x)$, where $\mu(x) = F_1(x) - F_0(x)$ is the difference of two distribution functions.
3. Policy effect from transporting covariates: $\theta_0 = \mathbb{E}[g_0(t(X)) - g_0(X)]$.
4. Cross effect: $\theta_0 = \mathbb{E}[D\, g_0(0, W)]$, where $X = (D, W)$ consists of treatment and covariates.
5. Regression decomposition: $\mathbb{E}[Y \mid D = 1] - \mathbb{E}[Y \mid D = 0] = \theta_{0,\mathrm{response}} + \theta_{0,\mathrm{composition}}$, where
$$\theta_{0,\mathrm{response}} = \mathbb{E}[g_0(1, W) \mid D = 1] - \mathbb{E}[g_0(0, W) \mid D = 1], \qquad \theta_{0,\mathrm{composition}} = \mathbb{E}[g_0(0, W) \mid D = 1] - \mathbb{E}[g_0(0, W) \mid D = 0].$$
6. Average treatment on the treated (ATT): $\theta_0 = \mathbb{E}[g_0(1, W) \mid D = 1] - \mathbb{E}[g_0(0, W) \mid D = 1]$, where $X = (D, W)$ consists of treatment and covariates.
7. Local average treatment effect (LATE): $\theta_0 = \frac{\mathbb{E}[g_0(1, W) - g_0(0, W)]}{\mathbb{E}[h_0(1, W) - h_0(0, W)]}$, where $X = (V, W)$ consists of instrument and covariates and $h_0(x) := \mathbb{E}[D \mid X = x]$ is a second regression.

More generally, our results extend to parameters defined implicitly by $\mathbb{E}[m(Z; g_0; \theta_0)] = 0$, such as partially linear regression and partially linear instrumental variable regression.

If the regression $g_0$ is learned by a regularized estimator $\hat{g}$, then estimation of the causal parameter $\theta_0$ by a plug-in estimator $\mathbb{E}_n[m(Z; \hat{g})]$ is badly biased. The solution is to use a de-biased formulation of the causal parameter instead: $\theta_0 = \mathbb{E}[m(Z; g_0) + a_0(X)\{Y - g_0(X)\}]$. Observe that $a_0$ arises in the bias correction term. We re-visit this example in Section 6.
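To make the bias correction concrete, here is a minimal Python sketch of the de-biased moment for the ATE example, assuming fitted estimates `g_hat(d, w)` of the regression and `a_hat(d, w)` of the Riesz representer are already available (both function names are illustrative placeholders rather than part of the paper's code):

```python
import numpy as np

def debiased_ate(y, d, w, g_hat, a_hat):
    """De-biased ATE estimate E_n[m(Z; g) + a(X){Y - g(X)}] for X = (D, W).

    y: (n,) outcomes, d: (n,) binary treatments, w: (n, k) covariates.
    g_hat(d, w): estimated regression E[Y | D = d, W = w], vectorized.
    a_hat(d, w): estimated Riesz representer, vectorized.
    """
    # Plug-in part: m(Z; g) = g(1, W) - g(0, W)
    plug_in = g_hat(np.ones_like(d), w) - g_hat(np.zeros_like(d), w)
    # Bias correction: a(X) * (Y - g(X)), evaluated at the observed treatment
    correction = a_hat(d, w) * (y - g_hat(d, w))
    return np.mean(plug_in + correction)
```

With cross-fitting, each summand would be evaluated using nuisance estimates trained on the other folds, as described in Section 6.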
Fundamental Asset Pricing Equation.

In asset pricing, a variety of financial models deliver the same fundamental asset pricing equation. This equation is of both theoretical and practical interest. Theoretically, it elucidates why asset prices or returns are what they are. Practically, it can be used to identify trading opportunities when assets are mis-priced. The asset pricing equation follows from two weak assumptions: free portfolio formation, and the law of one price. In Appendix B.2, we review the derivation for a general audience.

Formally, the fundamental asset pricing equation is $p_{t,i} = \mathbb{E}_t[m_{t+1}\, x_{t+1,i}]$, where $p_{t,i}$ is the price of asset $i$ at time $t$, $x_{t+1,i}$ is the payoff of asset $i$ at time $t+1$, and $m_{t+1}$ is the market-wide stochastic discount factor (SDF) at time $t+1$. The expectation is conditional on information $(I_t, I_{t,i})$ known at time $t$: $I_t$ are macroeconomic conditioning variables that are not asset specific, e.g. inflation rates and market return; $I_{t,i}$ are asset-specific characteristics, e.g. the size or book-to-market ratio of firm $i$ at time $t$. The asset pricing equation encompasses stocks, bonds, and options. We clarify its many instantiations in Table 1, where $d_{t+1}$ is the dividend, $C$ is the call price, $S_T$ is the stock price at expiration, and $K$ is the strike price.

The same asset pricing equation can be derived from either a model of complete markets for contingent claims, or a model of investor utility maximization. Free portfolio formation is a weaker assumption on market structure than the existence of complete markets for contingent claims. The law of one price is a weaker assumption on preference structure than investor utility maximization. We present these additional derivations in Appendix B.2. The SDF has many additional names: marginal rate of substitution, state price density, and pricing kernel. Each name corresponds to a different derivation of the asset pricing equation, starting from different first principles.

Asset            Price $p_t$    Payoff $x_{t+1}$
Stock            $p_t$          $p_{t+1} + d_{t+1}$
Bond             $p_t$          $1$
Option           $C$            $\max\{S_T - K, 0\}$
Return           $1$            $R_{t+1}$
Excess return    $0$            $R^e_{t+1}$

Table 1: Generality of asset pricing equation

The fundamental asset pricing equation can also be parametrized in terms of returns. If an investor pays one dollar for an asset $i$ today, the gross rate of return $R_{t+1,i}$ is how many dollars the investor receives tomorrow; formally, the price is $p_{t,i} = 1$ and the payoff is $x_{t+1,i} = R_{t+1,i}$ by definition. Next consider what happens when an investor borrows a dollar today at the interest rate $R^f_{t+1}$ and buys an asset $i$ that gives the gross rate of return $R_{t+1,i}$ tomorrow. From the perspective of the investor, who paid nothing out-of-pocket, the price is $p_{t,i} = 0$ while the payoff is the excess rate of return $R^e_{t+1,i} := R_{t+1,i} - R^f_{t+1}$, leading to the asset pricing equation $0 = \mathbb{E}_t[m_{t+1}\, R^e_{t+1,i}]$.

Following [29], we focus on the latter excess-return parametrization of the asset pricing equation. Taking expectations yields the unconditional moment restriction
$$0 = \mathbb{E}[m_{t+1}\, R^e_{t+1,i}\, z(I_t, I_{t,i})] = \mathbb{E}\big[\mathbb{E}[m_{t+1} \mid R^e_{t+1,i}, I_t, I_{t,i}]\, R^e_{t+1,i}\, z(I_t, I_{t,i})\big], \qquad \forall z(\cdot).$$
Our framework nests this final expression. Specifically,
$$\theta_0(g) = 0, \qquad g(R^e_{t+1,i}, I_t, I_{t,i}) = R^e_{t+1,i}\, z(I_t, I_{t,i}), \qquad a_0(R^e_{t+1,i}, I_t, I_{t,i}) = \mathbb{E}[m_{t+1} \mid R^e_{t+1,i}, I_t, I_{t,i}].$$
By estimating $a_0$, which is the projection of the SDF onto excess returns and other available information, one can pin down the price of any hypothetical asset.
Classical Semi-parametric Statistics.

Classical semi-parametric statistical theory studies the asymptotic properties of statistical quantities that are functionals of a density or a regression over a low-dimensional domain [82, 60, 65, 101, 77, 108, 128, 23, 91, 106, 129, 24, 92, 3, 93, 4, 123, 79, 5]. Any continuous linear functional has a Riesz representer. In this classical theory, the Riesz representer appears in the influence function and therefore in the asymptotic variance of semi-parametric estimators [91]. We depart from classical theory by considering the high-dimensional setting.
De-biased Machine Learning and Targeted Maximum Likelihood.
Because the Riesz representer appears in the asymptotic variance of semi-parametric estimators, it can be incorporated into estimation to ensure semi-parametric efficiency. In practice, this can be achieved by introducing a de-biasing term into the estimating equation [60, 24, 133, 15, 16, 17, 18, 69, 70, 71, 124, 100, 37, 96, 103, 66, 67, 68, 27, 135, 136]. In doubly robust estimating equations for regression functionals, the de-biasing term is the product between the Riesz representer and the regression residual [107, 106, 127, 126, 84, 122]. The more general principle at play is Neyman orthogonality: the learning problem for the functional of interest becomes orthogonal to the learning problems for both the regression and the Riesz representer [97, 98, 129, 104, 134, 17, 18, 36, 14, 35, 51].

De-biased machine learning and targeted maximum likelihood combine the algorithmic insight of doubly-robust moment functions with the algorithmic insight of sample splitting [22, 113, 77, 129, 104]. In doing so, these frameworks facilitate a general analysis of residuals such that the target functional is $\sqrt{n}$-consistent under minimal assumptions on the estimators used for the regression and Riesz representer [112, 110, 111, 127, 134, 126, 44, 125, 74, 73]. In particular, any machine learning estimators are permitted that satisfy $\sqrt{n}\, \|\hat{g} - g_0\|_2 \cdot \|\hat{a} - a_0\|_2 \to 0$ [35, 36].

The Riesz representer may be a difficult object to estimate. Even for simple regression functionals such as policy effects, its closed form involves ratios of densities. In restricted models, where the regression is known to belong to a certain function class, there is the further difficulty of projecting the Riesz representer accordingly. A recent literature explores the possibility of directly estimating the Riesz representer, without estimating its components or even knowing its functional form [105, 95, 9, 39, 40, 62, 63, 117, 109]. A crucial insight, on which we build, is that the Riesz representer is directly identified from data.

[63] observe that to de-bias an average moment, it is sufficient to estimate an empirical analogue of the Riesz representer that approximately satisfies the Riesz representer moment equation on the $n$ samples. They propose a parametric min-max criterion to estimate $n$ parameters corresponding to the $n$ evaluations of the empirical Riesz representer. Unlike [63], we provide a guarantee on learning the true Riesz representer, we approximate the Riesz representer within non-parametric function spaces, and our result therefore has broader application beyond causal inference. Importantly, [63] require that the same sample used to estimate the $n$ parameters is used in final-stage estimation of the causal parameter. As such, the analysis requires that the regression function $g_0$ lies in a Donsker class, a restriction that precludes many machine learning estimators. By contrast, our adversarial estimator provides fast estimation rates with respect to the true Riesz representer and hence can be used in combination with cross-fitting and sample splitting to eliminate the Donsker assumption.

Adversarial Estimation.
The Riesz representation theorem can be viewed as a continuum of unconditional moment restrictions. The non-parametric instrumental variable problem, based on a conditional moment restriction, also implies a continuum of unconditional moment restrictions [94, 56, 25, 31, 42, 32, 33, 30]. A central insight of this work is that the min-max approach for conditional moment models may be adapted to the problem of learning the Riesz representer. In a min-max approach, the continuum of unconditional moment restrictions is enforced adversarially over a set of test functions [54, 8, 45].

The fundamental advantage of the min-max approach is its unified analysis over arbitrary function classes. In particular, via local Rademacher analysis, one can derive an abstract bound that encompasses sparse linear models, neural networks, and RKHS methods [78, 12]. As such, the min-max approach is actually a family of algorithms adaptive to a variety of data settings with a unified guarantee [90, 80, 81].
Machine Learning Approaches to Causal Inference and Asset Pricing.
By pursuing a min-max approach, our work relates to previous work that incorporates a variety of machine learning methods into causal inference. Much work on de-biased machine learning focuses on sparse and approximately sparse models [39, 40, 38]. A neural network estimator with a mean-square rate has been successfully used to learn the nuisance regression in semiparametric estimation [34, 49] and to learn the structural function in nonparametric instrumental variable regression [59, 20, 45]. A more recent literature incorporates RKHS methods into causal inference due to their convenient closed-form solutions and strong performance on smooth designs [99, 116, 89, 118, 88].

Finally, our work provides a theoretical foundation for a growing literature that incorporates machine learning into asset pricing. We follow the asset pricing literature in framing the problem of learning a stochastic discount factor as the problem of learning a Riesz representer [57]. Specifically, we propose a deep min-max approach based on free portfolio formation and the law of one price [11, 29]. This approach differs from deep learning approaches that predict asset prices via nonparametric regression [86, 50, 55, 21]. Unlike previous work, we prove mean-square rates for the stochastic discount factor, and we prove $\sqrt{n}$-consistency and semiparametric efficiency for expected asset prices.

For any function space $\mathcal{G}$, let $\mathrm{star}(\mathcal{G}) := \{r\, g : g \in \mathcal{G}, r \in [0, 1]\}$ denote the star hull. Let $\partial \mathcal{G} := \{g - g' : g, g' \in \mathcal{G}\}$ denote the space of differences. We will consider estimators that estimate Riesz representers within some function space $\mathcal{A}$, equipped with some norm $\|\cdot\|_{\mathcal{A}}$. Moreover, let $\langle \cdot, \cdot \rangle_2$ be the inner product associated with the $L_2$ norm, i.e. $\langle a, a' \rangle_2 := \mathbb{E}_X[a(X)\, a'(X)]$. Given this notation, we define the class:
$$\mathcal{F} := \mathrm{star}(\partial \mathcal{A}) := \{r\,(a - a') : a, a' \in \mathcal{A}, r \in [0, 1]\}$$
and assume that the norm $\|\cdot\|_{\mathcal{A}}$ extends naturally to the larger space $\mathcal{F}$. Moreover, let $\mathbb{E}_n[\cdot]$ denote the empirical average and $\|\cdot\|_{2,n}$ the empirical $L_2$ norm, i.e. $\|g\|_{2,n} := \sqrt{\mathbb{E}_n[g(X)^2]}$.

Consider the following adversarial estimator:
$$\hat{a} = \arg\min_{a \in \mathcal{A}} \max_{f \in \mathcal{F}}\; \mathbb{E}_n[m(Z; f) - a(X) \cdot f(X)] - \|f\|_{2,n}^2 - \lambda \|f\|_{\mathcal{A}}^2 + \mu \|a\|_{\mathcal{A}}^2 \qquad (1)$$

Remark 1 (Population limit). Consider the population limit of our criterion where $n \to \infty$ and $\lambda, \mu \to 0$. Then our criterion is:
$$\max_{f \in \mathcal{F}}\; \mathbb{E}[m(Z; f) - a(X) \cdot f(X)] - \|f\|_2^2.$$
By the definition of the Riesz representer we thus have:
$$\max_{f \in \mathcal{F}}\; \mathbb{E}[m(Z; f) - a(X) \cdot f(X)] - \|f\|_2^2 = \max_{f \in \mathcal{F}}\; \mathbb{E}\big[(a_0(X) - a(X)) \cdot f(X) - f(X)^2\big] = \frac{1}{4}\, \mathbb{E}\big[(a_0(X) - a(X))^2\big] =: \frac{1}{4}\, \|a - a_0\|_2^2.$$
Thus our empirical criterion converges to the mean-squared-error criterion in the population limit, even though we don't have access to unbiased samples from $a_0(X)$. In Appendix A, we examine the relationship between $\mathcal{G}$ and the Riesz representer space $\mathcal{A}$.

Remark 2 (Norm-Based Regularization). The extra vanishing norm-based regularization can be avoided if one knows a bound on the norm of the true $a_0$. In that case, one can impose a hard norm constraint on the hypothesis spaces $\mathcal{A}$ and $\bar{\mathcal{A}}$ and optimize over these norm-constrained sub-spaces. However, regularization allows the estimator to be adaptive to the true norm of $a_0$, without knowledge of it.

Remark 3 (Mis-specification). We in fact allow for $a_0 \notin \mathcal{A}$, and incur an extra bias part in our estimation error of the form of $\min_{a \in \mathcal{A}} \|a - a_0\|_2$. Thus $\mathcal{A}$ need only be an $L_2$-norm approximating sequence of function spaces.
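To illustrate the estimator in Equation (1) in its simplest instance, the following Python sketch takes both $\mathcal{A}$ and $\mathcal{F}$ to be the same finite linear sieve $\{x \mapsto \langle \rho, \phi(x) \rangle\}$ and sets $\lambda = \mu = 0$; in that special case the inner maximization is an explicit quadratic and the saddle point has a closed form, so this is a sanity-check illustration of the criterion rather than the paper's general algorithm (the feature map `phi` and the ATE moment in the usage note are illustrative choices):

```python
import numpy as np

def adversarial_riesz_linear(X_phi, M_hat, ridge=1e-8):
    """Closed-form saddle point of E_n[m(Z; f) - a(X) f(X)] - ||f||_{2,n}^2
    when a(x) = <rho, phi(x)> and f(x) = <beta, phi(x)> range over the same
    linear sieve (lambda = mu = 0).

    X_phi : (n, p) matrix with rows phi(X_i).
    M_hat : (p,) vector with entries E_n[m(Z; phi_j)].
    """
    n, p = X_phi.shape
    Sigma = X_phi.T @ X_phi / n               # E_n[phi(X) phi(X)']
    # Inner max over beta is quadratic: beta* = 0.5 * Sigma^{-1} (M - Sigma rho);
    # plugging back in, the outer min over rho is solved by Sigma rho = M.
    rho_hat = np.linalg.solve(Sigma + ridge * np.eye(p), M_hat)
    return rho_hat                            # a_hat(x) = phi(x) @ rho_hat

# Usage for the ATE moment: X_phi[i] = phi(d_i, w_i) and
# M_hat[j] = mean over i of (phi_j(1, w_i) - phi_j(0, w_i)).
```

For linear sieves this recovers a familiar least-squares-type Riesz estimator; the value of the adversarial formulation is that it extends beyond such classes.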
We now provide fast convergence rates of our regularized minimax estimator, parameterized by the critical radii of the function classes:
$$\mathcal{F}_B := \{f \in \mathcal{F} : \|f\|_{\mathcal{A}} \leq B\}, \qquad m \circ \mathcal{F}_B := \{m(\cdot; f) : f \in \mathcal{F}_B\}$$
for some appropriately defined constant $B$. The critical radius of a function class $\mathcal{F}$ with range in $[-1, 1]$ is defined as any solution $\delta_n$ to the inequality $\mathcal{R}(\delta; \mathcal{F}) \leq \delta^2$, with:
$$\mathcal{R}(\delta; \mathcal{F}) = \mathbb{E}\left[\sup_{f \in \mathcal{F} : \|f\|_2 \leq \delta} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(X_i)\right]$$
where $\epsilon_1, \ldots, \epsilon_n$ are independent Rademacher random variables drawn equiprobably in $\{-1, 1\}$. For VC-subgraph function classes with constant VC dimension the critical radius is of the order of $\sqrt{\log(n)/n}$. The critical radius has been characterized for many other function classes such as reproducing kernel Hilbert spaces, neural networks and high-dimensional linear functions (c.f. [130] and Section 4).

We will also require the following norm-dominance condition:

ASSUMPTION 1 (Mean-Squared Continuity). For some constant $M \geq 1$, the following property holds:
$$\forall f \in \mathcal{F} : \quad \sqrt{\mathbb{E}[m(Z; f)^2]} \leq M\, \|f\|_2.$$

Observe that boundedness of the operator $\theta(g)$ implies $|\mathbb{E}[m(Z; g)]| \leq M \|g\|_2$. Mean-squared continuity is a stronger condition than boundedness, since $|\mathbb{E}[m(Z; g)]| \leq \mathbb{E}[|m(Z; g)|] \leq \sqrt{\mathbb{E}[m(Z; g)^2]}$. In Appendix B.1, we verify this condition for a variety of popular functionals.

Example 1 (Mean-Squared Continuity for ATE). Let $X = (D, W)$ consist of treatment and covariates. In the case of treatment effect estimation, the above is implied by a non-parametric overlap condition, i.e. $\Pr[D = 1 \mid w] \in (1/M, 1 - 1/M)$ for some $M \in (1, \infty)$. Then observe that:
$$\mathbb{E}[(g(1, W) - g(0, W))^2] \leq 2\, \mathbb{E}[g(1, W)^2 + g(0, W)^2] \leq 2 M\, \mathbb{E}\big[\Pr[D = 1 \mid W]\, g(1, W)^2 + \Pr[D = 0 \mid W]\, g(0, W)^2\big] = 2 M\, \|g\|_2^2.$$

THEOREM 1. Assume that mean-squared continuity holds for some constant $M \geq 1$ and that for some $B \geq 1$, the functions in $\mathcal{F}_B$ and $m \circ \mathcal{F}_B$ have uniformly bounded ranges in $[-1, 1]$. Let:
$$\delta := \delta_n + \epsilon_n + c_0 \sqrt{\frac{\log(c_1/\zeta)}{n}},$$
for universal constants $c_0, c_1$, where $\delta_n$ upper bounds the critical radii of $\mathcal{F}_B$, $m \circ \mathcal{F}_B$ and $\epsilon_n$ upper bounds the bias $\min_{a \in \mathcal{A}} \|a - a_0\|_2$. Let $a_* = \arg\min_{a \in \mathcal{A}} \|a - a_0\|_2$. Then the estimator in Equation (1), with $\mu \geq \lambda \geq \delta^2/B^2$, satisfies w.p. $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 \leq O\big(M \delta + \sqrt{\mu}\, \|a_*\|_{\mathcal{A}}\big).$$
For $\mu \leq C \delta^2/B^2$, for some constant $C$, the latter is $O\big(\delta \max\{M, \|a_*\|_{\mathcal{A}}/B\}\big)$.
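The quantity $\delta_n$ appearing in Theorem 1 can also be illustrated numerically (this is not part of the paper's procedure): the following Python sketch approximates the localized empirical Rademacher complexity of a small finite dictionary of functions by Monte Carlo and finds the smallest $\delta$ satisfying $\mathcal{R}(\delta; \mathcal{F}) \leq \delta^2$ by bisection; the dictionary and data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def localized_rademacher(fX, delta, n_draws=200):
    """Monte Carlo estimate of E[ sup_{f: ||f||_{2,n} <= delta} (1/n) sum_i eps_i f(X_i) ]
    for a finite dictionary. fX has shape (num_functions, n): precomputed f_j(X_i)."""
    n = fX.shape[1]
    norms = np.sqrt(np.mean(fX ** 2, axis=1))          # empirical norms ||f_j||_{2,n}
    active = fX[norms <= delta]                         # localization step
    if active.shape[0] == 0:
        return 0.0
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))    # Rademacher draws
    sups = np.max(eps @ active.T / n, axis=1)           # sup over the dictionary per draw
    return float(np.mean(sups))

def critical_radius(fX, lo=1e-4, hi=1.0, iters=40):
    """Approximately the smallest delta with R(delta; F) <= delta**2, via bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if localized_rademacher(fX, mid) <= mid ** 2:
            hi = mid
        else:
            lo = mid
    return hi

# Placeholder dictionary: 20 bounded functions evaluated on n = 500 sample points.
X = rng.uniform(-1, 1, size=500)
fX = np.stack([np.clip(np.sin((j + 1) * X), -1, 1) for j in range(20)])
print(critical_radius(fX))
```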
Remark 4. Suppose we only want to approximate the Riesz representer with respect to the weaker distance metric $\|\cdot\|_{\mathcal{F}}$ defined as:
$$\|a\|_{\mathcal{F}}^2 = \sup_{f \in \mathcal{F}}\ \langle a, f \rangle_2 - \|f\|_2^2 \;\leq\; \|a\|_2^2.$$
Then Theorem 1 can be adapted to show that $\|\hat{a} - a_0\|_{\mathcal{F}} \leq \delta \max\{M, \|a_*\|_{\mathcal{A}}/B\}$, where now the approximation rate is $\epsilon_n = \inf_{a \in \mathcal{A}} \|a - a_0\|_{\mathcal{F}}$. Observe that, by completing the square,
$$\|a\|_{\mathcal{F}}^2 = \frac{1}{4}\|a\|_2^2 - \inf_{f \in \mathcal{F}} \big\|a/2 - f\big\|_2^2,$$
so $\|a\|_{\mathcal{F}}$ is controlled by the projection of $a$ on $\mathcal{F}$ (here we use the fact that $\mathcal{F}$ is star-convex). Hence, it is sufficient that $\mathcal{A}$ approximates $a_0$ in this weak sense: for some $a_* \in \mathcal{A}$, the projection of $a_* - a_0$ on $\mathcal{F}$ is at most $\epsilon_n$. Thus any component of $a_0$ that is orthogonal to $\mathcal{F}$ can be ignored, since if we write $a_0 = a_{0,\perp} + a_{0,\parallel}$ with $\sup_{f \in \mathcal{F}} \langle a_{0,\perp}, f \rangle_2 = 0$, then $\|a_0 - a_*\|_{\mathcal{F}} = \|a_{0,\parallel} - a_*\|_{\mathcal{F}}$.

Our proof uses similar ideas as in the proof of Theorem 1 of [45], where an adversarial estimator was considered for the case of non-parametric instrumental variable regression. Theorem 1 of [45] provides bounds on a weaker metric than the mean-squared-error metric and requires bounds on the critical radius of more complicated function spaces.

As a corollary of Theorem 1, we can obtain a bound for the un-regularized estimator with $\lambda = \mu = 0$, where the function classes $\mathcal{F}$ and $\mathcal{G}$ are already norm-constrained, e.g. $\|f\|_{\mathcal{A}} \leq U$ for all $f \in \mathcal{F}$, which also implies that $\|a\|_{\mathcal{A}} \leq U$ for all $a \in \mathcal{A}$, such that functions in $\mathcal{F}$ and $\mathcal{G}$ have uniformly bounded range. This can be achieved by using the above norm-constrained definitions of $\mathcal{F}$ and $\mathcal{G}$ and taking the limit of Theorem 1 when $B \to \infty$. In that case, $\mathcal{F}_B \to \mathcal{F}$, $m \circ \mathcal{F}_B \to m \circ \mathcal{F}$ and $\lambda, \mu$ are allowed to take the value zero. This leads to the corollary below. (The metric $\|\cdot\|_{\mathcal{F}}$ satisfies the triangle inequality,
$$\|a + b\|_{\mathcal{F}} \leq \sqrt{\sup_{f \in \mathcal{F}} \big(\langle a, f \rangle_2 - \|f\|_2^2\big) + \sup_{f \in \mathcal{F}} \big(\langle b, f \rangle_2 - \|f\|_2^2\big)} \leq \|a\|_{\mathcal{F}} + \|b\|_{\mathcal{F}},$$
and satisfies $\|0\|_{\mathcal{F}} = 0$, but it is not necessarily homogeneous, i.e. in general $\|\lambda a\|_{\mathcal{F}} \neq |\lambda|\, \|a\|_{\mathcal{F}}$ for $\lambda \in \mathbb{R}$.)

Corollary 2. Assume that mean-squared continuity holds for some constant $M \geq 1$ and that the functions in $\mathcal{F}$ and $m \circ \mathcal{F}$ have uniformly bounded ranges in $[-1, 1]$. Let:
$$\delta := \delta_n + \epsilon_n + c_0 \sqrt{\frac{\log(c_1/\zeta)}{n}},$$
for universal constants $c_0, c_1$, where $\delta_n$ upper bounds the critical radii of $\mathcal{F}$, $m \circ \mathcal{F}$ and $\epsilon_n$ upper bounds the bias $\min_{a \in \mathcal{A}} \|a - a_0\|_2$. The estimator in Equation (1), with $\lambda = \mu = 0$, satisfies:
$$\|\hat{a} - a_0\|_2 \leq O(M \delta).$$

$\ell_1$-Penalty

We will use the following notation:
$$\mathrm{span}_\kappa(\mathcal{F}) := \left\{\sum_{i=1}^{p} w_i f_i : f_i \in \mathcal{F}, \|w\|_1 \leq \kappa, p \leq \infty\right\}$$

Theorem 3.
Consider a set of test functions $\mathcal{F} := \cup_{i=1}^{d} \mathcal{F}_i$ that is decomposable as a union of $d$ symmetric test function spaces $\mathcal{F}_i$, and suppose that $\mathcal{A}$ is star-convex. Consider the adversarial estimator:
$$\hat{a} = \arg\min_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}}\; \mathbb{E}_n[m(Z; f) - a(X) \cdot f(X)] + \lambda \|a\|_{\mathcal{A}} \qquad (2)$$
Let $m \circ \mathcal{F}_i = \{m(\cdot; f) : f \in \mathcal{F}_i\}$ and
$$\delta_{n,\zeta} := 2 \max_{i \in [d]} \big(\mathcal{R}(\mathcal{F}_i) + \mathcal{R}(m \circ \mathcal{F}_i)\big) + c_0 \sqrt{\frac{\log(c_1 d/\zeta)}{n}},$$
for some universal constants $c_0, c_1$, and $B_{n,\lambda,\zeta} := \|a_0\|_{\mathcal{A}} + \delta_{n,\zeta}/\lambda$. Suppose that $\lambda \geq \delta_{n,\zeta}$ and:
$$\forall a \in \mathcal{A}_{B_{n,\lambda,\zeta}} \text{ with } \|a - a_0\|_2 \geq \delta_{n,\zeta} : \quad \frac{a - a_0}{\|a - a_0\|_2} \in \mathrm{span}_\kappa(\mathcal{F}).$$
Then $\hat{a}$ satisfies, w.p. $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 \leq \kappa \big(2 (\|a_0\|_{\mathcal{A}} + 1)\, \mathcal{R}(\mathcal{A}) + \delta_{n,\zeta} + \lambda (\|a_0\|_{\mathcal{A}} - \|\hat{a}\|_{\mathcal{A}})\big).$$

We now instantiate our two main theorems for several function classes of interest. Throughout this section we will use the following convenient characterization of the critical radius of a function class. Corollary 14.3 and Proposition 14.25 of [130] imply that the critical radius of any function class $\mathcal{F}$, uniformly bounded in $[-b, b]$, is of the same order as any solution $\delta$ to the inequality:
$$\frac{c}{\sqrt{n}} \int_{\delta^2/(2b)}^{\delta} \sqrt{\log N_n(\epsilon; B_n(\delta; \mathcal{F}))}\, d\epsilon \leq \frac{\delta^2}{b} \qquad (3)$$
for a universal constant $c$, where $B_n(\delta; \mathcal{F}) = \{f \in \mathcal{F} : \|f\|_{2,n} \leq \delta\}$ and $N_n(\epsilon; \mathcal{F})$ is the empirical $\ell_2$ covering number at approximation level $\epsilon$, i.e. the size of the smallest $\epsilon$-cover of $\mathcal{F}$ with respect to the empirical $\ell_2$ metric.

Sparse Linear Functions

Consider the class of $s$-sparse linear functions in $p$ dimensions, with bounded coefficients, i.e., $\mathcal{A}_{\mathrm{splin}} := \{x \to \langle \theta, x \rangle : \|\theta\|_0 \leq s, \|\theta\|_\infty \leq b\}$; then observe that $\mathcal{F}$ is also a class of sparse linear functions (of sparsity at most $2s$) with bounded coefficients in $[-b, b]$. Moreover, suppose that the $\ell_\infty$ norm of the covariates $x$ is bounded. The critical radius $\delta_n$ is of order $O\big(\sqrt{s \log(p\, n)/n}\big)$. It is easy to see that the $\epsilon$-covering number of such a function class is of order $N_n(\epsilon; \mathcal{F}) = O\big(\binom{p}{s}\, (b/\epsilon)^s\big) \leq O\big((p\, b/\epsilon)^s\big)$, since it suffices to choose the support of the coefficients and then place a uniform $\epsilon$-grid on the support. Thus we get that Equation (3) is satisfied for $\delta = O\big(\sqrt{s \log(p\, b) \log(n)/n}\big)$. Moreover, observe that if $m(Z; f)$ is $L$-Lipschitz in $f$ with respect to the $\ell_\infty$ norm, then the covering number of $m \circ \mathcal{F}$ is also of the same order. Thus we can apply Corollary 2 to get:

Corollary 4 (Sparse Linear Riesz Representer). The estimator presented in Corollary 2, with $\mathcal{A} = \mathcal{A}_{\mathrm{splin}}$, satisfies w.p. $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 \leq O\left(\min_{a \in \mathcal{A}_{\mathrm{splin}}} \|a - a_0\|_2 + \sqrt{\frac{s \log(p\, b) \log(n)}{n}} + \sqrt{\frac{\log(1/\zeta)}{n}}\right)$$

The latter corollary required a hard sparsity constraint. However, our second main theorem, Theorem 3, allows us to prove a similar guarantee for the relaxed version of $\ell_1$-bounded high-dimensional linear function classes. For this corollary we require a restricted eigenvalue condition, which is typical for such relaxations.

Corollary 5 (Sparse Linear Riesz Representer with Restricted Eigenvalue). Suppose that $a_0(x) = \langle \theta_0, x \rangle$ with $\|\theta_0\|_0 \leq s$, $\|\theta_0\|_1 \leq B$ and $\|\theta_0\|_\infty \leq 1$. Moreover, suppose that the covariance matrix $V = \mathbb{E}[x x']$ satisfies the restricted eigenvalue condition:
$$\forall \nu \in \mathbb{R}^p \text{ s.t. } \|\nu_{S^c}\|_1 \leq \|\nu_S\|_1 + \delta_{n,\zeta}/\lambda : \quad \nu^\top V \nu \geq \gamma\, \|\nu\|_2^2.$$
Let $\mathcal{A} = \{x \to \langle \theta, x \rangle : \theta \in \mathbb{R}^p\}$, $\|\langle \theta, \cdot \rangle\|_{\mathcal{A}} = \|\theta\|_1$, and $\mathcal{F} = \{x \to \xi\, x_i : i \in [p], \xi \in \{-1, 1\}\}$. Then the estimator presented in Equation (2) with $\lambda \leq \gamma/s$ satisfies, w.p. $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 \leq O\left(\max\left\{1, \frac{\lambda s}{\gamma}\right\} \sqrt{\frac{s}{\gamma}} \left((\|\theta_0\|_1 + 1)\sqrt{\frac{\log(p)}{n}} + \sqrt{\frac{\log(p/\zeta)}{n}}\right)\right)$$

Remark 5 (Restricted Eigenvalue). We note that if the unrestricted minimum eigenvalue of $V$ is at least $\gamma$, then the restricted eigenvalue condition always holds. Moreover, observe that we only require a condition on the population covariance matrix $V$ and not on the empirical covariance matrix.

Neural Networks

Suppose that the function class $\mathcal{A}$ can be represented as a ReLU-activation neural network with depth $L$ and width $W$, denoted as $\mathcal{A}_{\mathrm{nnet}(L,W)}$. Then observe that functions in $\mathcal{F}$ can be represented as neural networks with depth $L + 1$ and width $2W$. Moreover, we assume that functions in $m \circ \mathcal{F}$ are also representable by neural networks of depth $O(L)$ and width $O(W)$. Finally, suppose that the covariates are distributed in a way that the outputs of $\mathcal{F}$ and $m \circ \mathcal{F}$ are uniformly bounded in $[-b, b]$.

Then by the $L_1$ covering numbers for VC classes of [61], the bounds of Theorem 14.1 of [7] and Theorem 6 of [13], one can show that the critical radius of $\mathcal{F}$ and $m \circ \mathcal{F}$ is of the order of $\delta_n = O\big(\sqrt{L\, W \log(W) \log(b) \log(n)/n}\big)$ (c.f. the proof of Example 3 of [51] for a detailed derivation). Thus we can apply Corollary 2 to get:

Corollary 6 (Neural Network Riesz Representer). Suppose that $\mathcal{A} = \mathcal{A}_{\mathrm{nnet}(L,W)}$, and that $m \circ \mathcal{F}$ is representable as a neural network with depth $O(L)$ and width $O(W)$. Moreover, the input covariates are such that functions in $\mathcal{F}$ and $m \circ \mathcal{F}$ are uniformly bounded in $[-b, b]$. Then the estimator presented in Corollary 2 satisfies w.p. $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 \leq O\left(\min_{a \in \mathcal{A}_{\mathrm{nnet}(L,W)}} \|a - a_0\|_2 + \sqrt{\frac{L\, W \log(W) \log(b) \log(n)}{n}} + \sqrt{\frac{\log(1/\zeta)}{n}}\right)$$

If the true Riesz representer $a_0$ is representable as a ReLU neural network, then the first term vanishes and we achieve an almost parametric rate. For non-parametric Hölder function classes, one can easily combine the latter corollary with approximation results for ReLU-activation neural networks presented in [131, 132]. These approximation results typically require that the depth and the width of the neural network grow as some function of the approximation error $\epsilon$, leading to errors of the form $O\big(\epsilon + \sqrt{L(\epsilon)\, W(\epsilon) \log(W(\epsilon)) \log(b) \log(n)/n} + \sqrt{\log(1/\zeta)/n}\big)$. Optimally balancing $\epsilon$ then typically leads to almost tight non-parametric rates, of the same order as those presented in Theorem 1 of [48].

Reproducing Kernel Hilbert Spaces

Suppose that $a_0$ lies in a Reproducing Kernel Hilbert Space (RKHS) with kernel $K$, denoted as $\mathcal{A}_{\mathrm{rkhs}(K)}$, with the norm $\|\cdot\|_{\mathcal{A}}$ being the RKHS norm. Then observe that $\mathcal{F}$ is the same function space. Moreover, we assume that $m \circ \mathcal{F}$ also lies in an RKHS with a potentially different kernel $\tilde{K}$. Finally, suppose that the input covariates are such that for some constant $B$, functions in $\mathcal{F}_B$ and $m \circ \mathcal{F}_B$ are bounded in $[-1, 1]$.

Let $\{\hat{\lambda}_j\}_{j=1}^{n}$ be the eigenvalues of the $n \times n$ empirical kernel matrix, with entries $K(x_i, x_j)/n$. Similarly, let $\{\hat{\mu}_j\}_{j=1}^{n}$ be the eigenvalues of the empirical kernel matrix of $\tilde{K}$. Then by Corollary 13.18 of [130], we can derive the following corollary of Theorem 1:

Corollary 7 (RKHS Riesz Representer). Suppose that $\mathcal{A} = \mathcal{A}_{\mathrm{rkhs}(K)}$, $a_0 \in \mathcal{A}_{\mathrm{rkhs}(K)}$, and that $m \circ \mathcal{F} \subseteq \mathcal{A}_{\mathrm{rkhs}(\tilde{K})}$. Let $\{\hat{\lambda}_j\}_{j=1}^{n}$ and $\{\hat{\mu}_j\}_{j=1}^{n}$ be the eigenvalues of the empirical kernel matrices of $K$ and $\tilde{K}$, correspondingly.
Let $\delta_n$ be any solution to the inequalities:
$$B \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^{n} \min\{\hat{\lambda}_j, \delta^2\}} \leq \delta^2, \qquad B \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^{n} \min\{\hat{\mu}_j, \delta^2\}} \leq \delta^2.$$
Moreover, suppose the input covariates are such that functions in $\mathcal{F}_B$ and $m \circ \mathcal{F}_B$ are uniformly bounded in $[-1, 1]$. Then the estimator presented in Theorem 1 satisfies w.p. $1 - \zeta$:
$$\|\hat{a} - a_0\|_2 \leq O\left(\|a_0\|_{\mathcal{A}} \left(\delta_n + \sqrt{\frac{\log(1/\zeta)}{n}}\right)\right).$$

We note that the latter estimator does not need to know the RKHS norm of the true function $a_0$. Instead it automatically adapts to the unknown RKHS norm. Moreover, note that the bound $\delta_n$ is solely based on empirically observable quantities, as it is a function of the empirical eigenvalues. Thus these empirical quantities can be used as a data-adaptive diagnostic of the error.

Finally, we note that for particular kernels a more explicit bound can be derived as a function of the eigendecay. For instance, for the Gaussian kernel, which has an exponential eigendecay, Example 13.21 of [130] derives that the solution to the eigenvalue inequality scales as $O\big(\sqrt{\log(n)/n}\big)$, thus leading to almost parametric rates: $\|\hat{a} - a_0\|_2 \leq O\big(\|a_0\|_{\mathcal{A}} \sqrt{\log(n)/n}\big)$.
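The data-adaptive bound above is easy to evaluate in practice. Below is a minimal Python sketch, under the reconstruction of the eigenvalue inequality displayed above, that computes the smallest $\delta$ satisfying $B\sqrt{2/n}\,\big(\sum_j \min\{\hat{\lambda}_j, \delta^2\}\big)^{1/2} \leq \delta^2$ by bisection from the empirical kernel matrix; the Gaussian kernel and the simulated data are placeholder choices.

```python
import numpy as np

def kernel_critical_radius(K, B=1.0, lo=1e-6, hi=10.0, iters=60):
    """Smallest delta with B * sqrt(2/n) * sqrt(sum_j min(lambda_j, delta^2)) <= delta^2,
    where lambda_j are eigenvalues of the normalized kernel matrix K / n."""
    n = K.shape[0]
    eigs = np.linalg.eigvalsh(K / n)
    eigs = np.clip(eigs, 0.0, None)            # guard against tiny negative eigenvalues
    def lhs(delta):
        return B * np.sqrt(2.0 / n) * np.sqrt(np.sum(np.minimum(eigs, delta ** 2)))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if lhs(mid) <= mid ** 2:
            hi = mid
        else:
            lo = mid
    return hi

# Placeholder data and Gaussian kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
print(kernel_critical_radius(K))
```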
Computation

In this section we discuss computational aspects of the optimization problem implied by our adversarial estimator. We show how in many cases the min-max optimization problem can be solved in a computationally efficient manner, and we also discuss practical heuristics for cases where the problem is non-convex (e.g. in the case of neural networks).

Sparse Linear Functions

For the case of sparse linear functions, the estimator in Theorem 3 requires solving the following optimization problem:
$$\min_{\theta \in \mathbb{R}^p : \|\theta\|_1 \leq B}\; \max_{i \in [2p]}\; \mathbb{E}_n[m(Z; f_i) - f_i(X) \langle \theta, X \rangle] + \lambda \|\theta\|_1 \qquad (4)$$
where $f_i(X) = X_i$ for $i \in \{1, \ldots, p\}$ and $f_i(X) = -X_{i-p}$ for $i \in \{p+1, \ldots, 2p\}$. This can be solved via sub-gradient descent, which would yield an $\epsilon$-approximate solution after $O(p/\epsilon^2)$ steps. This can be improved to $O(\log(p)/\epsilon)$ steps if one views it as a zero-sum game and uses simultaneous gradient descent, where the $\theta$-player uses Optimistic Follow-the-Regularized-Leader with an entropic regularizer and the $f$-player uses Optimistic Hedge over probability distributions on the finite set of test functions (analogous to Proposition 13 of [45]).

To present the algorithm it will be convenient to re-write the problem so that the maximizing player optimizes over distributions in the $2p$-dimensional simplex, i.e.:
$$\min_{\theta \in \mathbb{R}^p : \|\theta\|_1 \leq B}\; \max_{w \in \mathbb{R}^{2p}_{\geq 0} : \|w\|_1 = 1}\; \mathbb{E}_n[m(Z; \langle w, f \rangle) - \langle w, f(X) \rangle \langle \theta, X \rangle] + \lambda \|\theta\|_1$$
where $f = (f_1, \ldots, f_{2p})$ denotes the vector of the $2p$ test functions. Moreover, to avoid the non-smoothness of the $\ell_1$ penalty it will be convenient to introduce the augmented vector $V = (X; -X)$ and for the minimizing player to optimize over the positive orthant of a $2p$-dimensional vector $\rho = (\rho^+; \rho^-)$, with an $\ell_1$-bounded norm, such that in the end $\theta = \rho^+ - \rho^-$. Then we can re-write the problem as:
$$\min_{\rho \in \mathbb{R}^{2p}_{\geq 0} : \|\rho\|_1 \leq B}\; \max_{w \in \mathbb{R}^{2p}_{\geq 0} : \|w\|_1 = 1}\; \mathbb{E}_n[m(Z; \langle w, f \rangle) - \langle w, V \rangle \langle \rho, V \rangle] + \lambda \sum_{i=1}^{2p} \rho_i$$
where we also noted that $\langle w, f(X) \rangle = \langle w, V \rangle$.
Proposition 8. Consider the algorithm that, for $t = 1, \ldots, T$, sets:
$$\tilde{\rho}_{i,t+1} = \tilde{\rho}_{i,t}\, \exp\big\{-2\eta B\,(-\mathbb{E}_n[V_i \langle V, w_t \rangle] + \lambda) + \eta B\,(-\mathbb{E}_n[V_i \langle V, w_{t-1} \rangle] + \lambda)\big\}, \qquad \rho_{t+1} = \tilde{\rho}_{t+1} \min\left\{1, \frac{B}{\|\tilde{\rho}_{t+1}\|_1}\right\},$$
$$\tilde{w}_{i,t+1} = \tilde{w}_{i,t}\, \exp\big\{2\eta\, \mathbb{E}_n[m(Z; f_i) - V_i \langle V, \rho_t \rangle] - \eta\, \mathbb{E}_n[m(Z; f_i) - V_i \langle V, \rho_{t-1} \rangle]\big\}, \qquad w_{t+1} = \frac{\tilde{w}_{t+1}}{\|\tilde{w}_{t+1}\|_1},$$
with $\tilde{\rho}_{i,-1} = \tilde{\rho}_{i,0} = 1/e$ and $\tilde{w}_{i,-1} = \tilde{w}_{i,0} = 1/(2p)$, and returns $\bar{\rho} = \frac{1}{T} \sum_{t=1}^{T} \rho_t$. Then for $\eta = \frac{1}{4\, \|\mathbb{E}_n[V V^\top]\|_\infty}$, after
$$T = 16\, \|\mathbb{E}_n[V V^\top]\|_\infty\; \frac{B \log(B \vee 1) + (B + 1) \log(2p)}{\epsilon}$$
iterations, the parameter $\bar{\theta} = \bar{\rho}^+ - \bar{\rho}^-$ is an $\epsilon$-approximate solution to the minimax problem in Equation (4).
(For a matrix $A$, we denote $\|A\|_\infty = \max_{i,j} |A_{ij}|$.)

Neural Networks

When the function spaces $\mathcal{A}$ and $\mathcal{F}$ are represented as deep neural networks, the optimization problem is highly non-convex. This is the case even if we were just solving a square-loss minimization problem. On top of this, we also need to deal with the non-convexity and non-smoothness introduced by the min-max structure of our estimator.

Luckily, the optimization problem that we are facing is similar to the one encountered in training Generative Adversarial Networks, i.e. we need to solve a non-convex, non-concave zero-sum game, where the strategy of each of the two players is the parameter vector of a neural net. There has been a surge of recent work proposing iterative optimization algorithms inspired by convex-concave zero-sum game theory (see, e.g., the Optimistic Adam algorithm of [43], also utilized in the recent work of [20, 45] in the context of solving moment problems).

Reproducing Kernel Hilbert Spaces

Recall that the estimator is
$$\hat{a} = \arg\min_{a \in \mathcal{A}} \max_{f \in \mathcal{F}}\; \mathbb{E}_n[m(Z; f) - a(X) \cdot f(X)] - \|f\|_{2,n}^2 - \lambda \|f\|_{\mathcal{A}}^2 + \mu \|a\|_{\mathcal{A}}^2.$$
In this section, we derive a closed-form solution for $\hat{a}$ that can be computed from matrix operations. Towards this end, we impose additional structure on the problem. If $\mathcal{G} = \mathcal{F} = \mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), then the projection $a_0^{\min}$ of any Riesz representer $a_0$ into $\mathcal{G}$ is clearly an element of $\mathcal{H}$ as well, so we can take $\mathcal{A} = \mathcal{H}$. Also assume that the functional $m$ satisfies $m(z; f) = m(x; f)$. Moreover, let the functional be such that it evaluates the function at some arguments. For example, for the ATE, $m(z; f) = f(1, w) - f(0, w)$ where $z = x = (d, w)$. This property holds for treatment effects and policy effects, and it ensures that $m(\cdot; f) \in \mathcal{H}$.

Denote the kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and the feature map $\phi : x \mapsto k(x, \cdot)$. Denote the kernel matrix $K_{XX}$ with $(i, j)$-th entry $k(x_i, x_j)$, and the feature matrix $\Phi$ with $i$-th row $\phi(x_i)'$. Hence $K_{XX} = \Phi \Phi'$.

By the reproducing property, $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}$. Moreover, since $m$ is a linear functional, we can define the linear operator $M : \mathcal{H} \to \mathcal{H}$, $f(\cdot) \mapsto m(\cdot; f)$, whereby $m(x; f) = [M f](x) = \langle M f, \phi(x) \rangle_{\mathcal{H}} = \langle f, M^* \phi(x) \rangle_{\mathcal{H}}$, where $M^*$ is the adjoint of $M$. Define the matrix $\Phi^{(m)} := \Phi M$ with $i$-th row $\phi(x_i)' M$. Finally, define $\Psi$ as the matrix with $2n$ rows constructed by concatenating $\Phi$ and $\Phi^{(m)}$, and denote the induced kernel matrix by $K := \Psi \Psi'$. Formally,
$$\Psi := \begin{bmatrix} \Phi \\ \Phi^{(m)} \end{bmatrix}, \qquad K := \begin{bmatrix} K^{(1)} & K^{(2)} \\ K^{(3)} & K^{(4)} \end{bmatrix} := \begin{bmatrix} \Phi \Phi' & \Phi (\Phi^{(m)})' \\ \Phi^{(m)} \Phi' & \Phi^{(m)} (\Phi^{(m)})' \end{bmatrix}.$$
The blocks $\{K^{(j)}\}_{j \in [4]} \in \mathbb{R}^{n \times n}$, and hence $K \in \mathbb{R}^{2n \times 2n}$, can be computed from data, though they depend on the choice of moment.

Proposition 9 (Computing kernel matrices). For example, for the ATE,
$$[K^{(1)}]_{ij} = k((d_i, w_i), (d_j, w_j)), \qquad [K^{(2)}]_{ij} = k((1, w_i), (d_j, w_j)) - k((0, w_i), (d_j, w_j)),$$
$$[K^{(3)}]_{ij} = k((d_i, w_i), (1, w_j)) - k((d_i, w_i), (0, w_j)),$$
$$[K^{(4)}]_{ij} = k((1, w_i), (1, w_j)) - k((1, w_i), (0, w_j)) - k((0, w_i), (1, w_j)) + k((0, w_i), (0, w_j)).$$
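As a minimal Python sketch of Proposition 9 (the Gaussian kernel and array names are illustrative placeholders), the four kernel blocks for the ATE can be assembled as follows:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian kernel matrix between rows of a and rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def ate_kernel_blocks(d, w, gamma=1.0):
    """K1..K4 from Proposition 9 for x = (d, w) and m(x; f) = f(1, w) - f(0, w)."""
    x = np.column_stack([d, w])                       # observed (d_i, w_i)
    x1 = np.column_stack([np.ones_like(d), w])        # counterfactual (1, w_i)
    x0 = np.column_stack([np.zeros_like(d), w])       # counterfactual (0, w_i)
    K1 = rbf(x, x, gamma)
    K2 = rbf(x1, x, gamma) - rbf(x0, x, gamma)
    K3 = rbf(x, x1, gamma) - rbf(x, x0, gamma)
    K4 = rbf(x1, x1, gamma) - rbf(x1, x0, gamma) - rbf(x0, x1, gamma) + rbf(x0, x0, gamma)
    return K1, K2, K3, K4
```

The full $2n \times 2n$ matrix $K$ then stacks these blocks as in the display above.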
We proceed in steps. First we prove the existence of a closed form for the maximizer
$$\hat{f} = \arg\max_{f \in \mathcal{H}}\; \mathbb{E}_n[m(X; f) - a(X) \cdot f(X)] - \|f\|_{2,n}^2 - \lambda \|f\|_{\mathcal{H}}^2$$
by extending the classic representation theorem of [76, 114].

Proposition 10 (Representation of maximizer). $\hat{f} = \Psi' \hat{\gamma}$ for some $\hat{\gamma} \in \mathbb{R}^{2n}$.

Appealing to this abstract result, we derive the closed-form expression for the maximizer in terms of kernel matrices.
Proposition 11 (Closed form of maximizer).
$$\hat{\gamma} = \Delta^{-1} \left[n\, \Psi M' \hat{\mu} - \begin{bmatrix} K^{(1)} \\ K^{(3)} \end{bmatrix} \Phi a \right], \quad \text{where} \quad \Delta := \begin{bmatrix} K^{(1)} K^{(1)} & K^{(1)} K^{(2)} \\ K^{(3)} K^{(1)} & K^{(3)} K^{(2)} \end{bmatrix} + n \lambda K \in \mathbb{R}^{2n \times 2n}, \qquad \hat{\mu} := \frac{1}{n} \sum_{i=1}^{n} \phi(x_i).$$

Next we prove the existence of a closed form for the minimizer
$$\hat{a} = \arg\min_{a \in \mathcal{H}}\; \mathbb{E}_n[m(X; \hat{f}) - a(X) \cdot \hat{f}(X)] - \|\hat{f}\|_{2,n}^2 - \lambda \|\hat{f}\|_{\mathcal{H}}^2 + \mu \|a\|_{\mathcal{H}}^2$$
by appealing to the classic representation theorem of [76, 114].

Proposition 12 (Representation of minimizer). $\hat{a} = \Phi' \hat{\beta}$ for some $\hat{\beta} \in \mathbb{R}^n$.

Again, with this abstract result in hand, we derive the closed-form expression for the minimizer in terms of kernel matrices.
Proposition 13 (Closed form of minimizer).
$$\hat{\beta} = \left\{n\, \Omega \Delta^{-1} \begin{bmatrix} K^{(1)} K^{(1)} \\ K^{(3)} K^{(1)} \end{bmatrix} + 2 \mu \cdot K^{(1)}\right\}^{-1} \Omega \Delta^{-1} \Psi M' \hat{\mu},$$
where
$$\Omega := \begin{bmatrix} K^{(1)} K^{(1)} \\ K^{(3)} K^{(1)} \end{bmatrix}' - \begin{bmatrix} K^{(1)} K^{(1)} \\ K^{(3)} K^{(1)} \end{bmatrix}' \Delta^{-1} \begin{bmatrix} K^{(1)} K^{(1)} & K^{(1)} K^{(2)} \\ K^{(3)} K^{(1)} & K^{(3)} K^{(2)} \end{bmatrix} - n \lambda \begin{bmatrix} K^{(1)} K^{(1)} \\ K^{(3)} K^{(1)} \end{bmatrix}' \Delta^{-1} K \in \mathbb{R}^{n \times 2n}.$$

For practical use, we require a way to evaluate the minimizer using only kernel operations. Evaluation directly follows from the closed-form expression.
Corollary 14 (Evaluation of minimizer).
$$\hat{a}(x) = K_{xX} \left\{n\, \Omega \Delta^{-1} \begin{bmatrix} K^{(1)} K^{(1)} \\ K^{(3)} K^{(1)} \end{bmatrix} + 2 \mu \cdot K^{(1)}\right\}^{-1} \Omega \Delta^{-1} V,$$
where $V \in \mathbb{R}^{2n}$ is defined by
$$v_j = \begin{cases} \frac{1}{n} \sum_{i=1}^{n} [K^{(2)}]_{ji} & \text{if } j \in [n], \\ \frac{1}{n} \sum_{i=1}^{n} [K^{(4)}]_{(j-n)\, i} & \text{if } j \in \{n + 1, \ldots, 2n\}. \end{cases}$$

Oracle-Based Training

Consider the estimator with $\lambda = \mu = 0$:
$$\min_{a \in \mathcal{A}} \max_{f \in \mathcal{F}}\; \mathbb{E}_n\big[m(Z; f) - a(X) \cdot f(X) - f(X)^2\big] =: \ell(a, f) \qquad (5)$$
We can solve this optimization problem by treating it as a zero-sum game, where one player controls $a$ and the other player controls $f$. Observe that the game is convex in $a$ (in fact linear) and concave in $f$. Thus, we can solve this zero-sum game by having the $f$-player run a no-regret algorithm at each period $t \in \{1, \ldots, T\}$ and the $a$-player best responding to the current choice of the $f$-player. Observe that for any fixed $f$, the best response of the $a$-player is the solution to:
$$a_t = \arg\min_{a \in \mathcal{A}} -\mathbb{E}_n[a(X) \cdot f(X)] = \arg\max_{a \in \mathcal{A}} \mathbb{E}_n[a(X) \cdot f(X)].$$
In other words, the $a$-player wants to match the sign of the function $f$. Thus the best response of the $a$-player is equivalent to a weighted classification oracle, where the label is $Y_i = \mathrm{sign}(f(X_i))$ and the weight is $w_i = |f(X_i)|$.

Finally, we need to solve the no-regret problem for the $f$-player. If the function space $\mathcal{F}$ is a convex space, then we can simply run the Follow-the-Leader (FTL) algorithm, where at every period the algorithm maximizes the empirical past reward:
$$f_t = \arg\max_{f \in \mathcal{F}}\; \mathbb{E}_n\big[m(Z; f) - \bar{a}_{t-1}(X) \cdot f(X) - f(X)^2\big].$$
Suppose that the empirical operator $\mathbb{E}_n[m(Z; \cdot)]$ is bounded with operator norm upper bounded by $M_n \geq 1$ and that the function class $\mathcal{F}$ is convex. Consider the algorithm where at each period $t \in \{1, \ldots, T\}$: $f_t = \arg\max_{f \in \mathcal{F}} \ell(\bar{a}_{t-1}, f)$.

Suppose our goal is to estimate $\theta_0 = \theta_0(g_0)$, where $g_0 = \mathbb{E}[Y \mid X]$. We have access to an estimate $\hat{g}$ of $g_0$. We consider the de-biased moment:
$$m_a(Z; g) = m(Z; g) + a(X)'(Y - g(X)).$$
For simplicity of exposition, we present the remainder of the section for the case of a single-valued regression function. Consider the following cross-fitted estimate (a code sketch of this construction appears below):

• Partition the $n$ samples into $K$ folds $P_1, \ldots, P_K$.
• For each partition, estimate $\hat{a}_k$, $\hat{g}_k$ based on all out-of-fold data.
• Construct the estimate:
$$\hat{\theta} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in P_k} m_{\hat{a}_k}(Z_i; \hat{g}_k).$$

Lemma 16. Suppose that $K = \Theta(1)$ and that:
$$\forall k \in [K] : \quad \sqrt{n}\, \mathbb{E}[(a_0(X) - \hat{a}_k(X))(\hat{g}_k(X) - g_0(X))] \to_p 0 \qquad (6)$$
and that for some $a_*$ and $g_*$ (not necessarily equal to $a_0$ and $g_0$), we have that for all $k \in [K]$: $\|\hat{a}_k - a_*\|_2 \to 0$ and $\|\hat{g}_k - g_*\|_2 \to 0$. Assume that Condition 1 is satisfied and the variables $Y$, $g(X)$, $a(X)$ are bounded a.s. for all $g \in \mathcal{G}$ and $a \in \mathcal{A}$. Then if we let $\sigma_*^2 := \mathrm{Var}(m_{a_*}(Z; g_*))$:
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \to_d N(0, \sigma_*^2).$$

A sufficient condition for Condition (6) is that $\sqrt{n}\, \|\hat{a}_k - a_0\|_2\, \|\hat{g}_k - g_0\|_2 \to_p 0$, which is a condition on the product of the two RMSE rates. However, observe that Condition (6) is much weaker, as it implies that our Riesz estimate $\hat{a}$ only needs to approximately satisfy the representer moment for test functions of the form $\hat{g}_k - g_0$. Thus, if we assume that $\hat{g}$ satisfies an RMSE consistency rate $\|\hat{g} - g_0\|_2 \leq r_n$, then it suffices that it satisfies the moment for any $g \in \mathcal{G}$ with $\|g - \hat{g}\|_2 \leq r_n$, i.e. it suffices that it is a local Riesz representer around $\hat{g}$. This can potentially make the Riesz estimation task much simpler than estimating a global Riesz representer. We formalize this observation in Appendix C.
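Here is a minimal Python sketch of the cross-fitted construction above, assuming user-supplied training routines `fit_riesz` and `fit_reg` (placeholder names) that return functions evaluating $\hat{a}_k$ and $\hat{g}_k$, and using the ATE-style moment; it illustrates the sample-splitting scheme rather than the paper's implementation:

```python
import numpy as np

def cross_fit_debiased(y, d, w, fit_riesz, fit_reg, n_folds=5, seed=0):
    """Cross-fitted de-biased estimate (1/n) sum_k sum_{i in P_k} m_{a_k}(Z_i; g_k)
    for the ATE moment m(Z; g) = g(1, W) - g(0, W)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds          # random partition into K folds
    contributions = np.zeros(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        a_k = fit_riesz(y[train], d[train], w[train])   # returns callable a_k(d, w)
        g_k = fit_reg(y[train], d[train], w[train])     # returns callable g_k(d, w)
        plug_in = g_k(np.ones(test.sum()), w[test]) - g_k(np.zeros(test.sum()), w[test])
        correction = a_k(d[test], w[test]) * (y[test] - g_k(d[test], w[test]))
        contributions[test] = plug_in + correction
    theta_hat = contributions.mean()
    # Plug-in standard error based on the empirical variance of the moment.
    se = contributions.std(ddof=1) / np.sqrt(n)
    return theta_hat, se
```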
Moreover, observe that the theorem does not require consistency of both nuisance functions. Only one of the two nuisance functions needs to be consistent, while the other must simply converge to some limit function. For instance, as long as $\sqrt{n}\, \|\hat{a} - a_0\|_2 \to 0$ or $\sqrt{n}\, \|\hat{g} - g_0\|_2 \to 0$, then the result continues to hold. (The a.s. boundedness condition can be relaxed to bounded fourth moments of $Y$, $g(X)$, $a(X)$, as long as we strengthen the requirement to fourth-moment convergence to $a_*$, $g_*$, i.e. $\|\hat{a}_k - a_*\|_4$, $\|\hat{g}_k - g_*\|_4 \to_p 0$.) It may of course be difficult to achieve a root-$n$ rate for either $a_0$ or $g_0$. However, we can still show that the de-biased moment satisfies a double robustness property: if one nuisance is inconsistent, as long as the other is root-$n$ consistent, then asymptotic normality of the causal parameter still holds. This result is presented in Appendix F.2. The result is analogous to the one provided in [19], where an estimator with such a property was presented within the targeted maximum likelihood framework.

Consider the algorithm where no cross-fitting or sample splitting is employed:

• Estimate $\hat{a}$, $\hat{g}$ on all the samples.
• Construct the estimate: $\hat{\theta} = \mathbb{E}_n[m_{\hat{a}}(Z; \hat{g})]$.

Lemma 17 (Normality via Localized Complexities). Suppose that:
$$\sqrt{n}\, \mathbb{E}[(a_0(X) - \hat{a}(X))(\hat{g}(X) - g_0(X))] \to_p 0 \qquad (7)$$
and that for some $a_*$ and $g_*$ (not necessarily equal to $a_0$ and $g_0$), we have $\|\hat{a} - a_*\|_2, \|\hat{g} - g_*\|_2 = o_p(r_n)$. Assume that Condition 1 is satisfied and the variables $Y$, $g(X)$, $a(X)$ are bounded a.s. for all $g \in \mathcal{G}$ and $a \in \mathcal{A}$. Moreover, assume that with high probability $\|\hat{g}\|_{\mathcal{G}} \leq B$ and $\|\hat{a}\|_{\mathcal{A}} \leq B$. Let $\delta_{n,*} = \delta_n + c_0 \sqrt{\log(c_1 n)/n}$ for some appropriately defined universal constants $c_0, c_1$, where $\delta_n$ is a bound on the critical radius of $\mathcal{G}_B$, $m \circ \mathcal{G}_B$ and $\mathcal{A}_B$, and is also at least $\sqrt{\log\log(n)/n}$. If
$$\sqrt{n}\big(\delta_{n,*}\, r_n + \delta_{n,*}^2\big) \to 0,$$
then if we let $\sigma_*^2 := \mathrm{Var}(m_{a_*}(Z; g_*))$:
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \to_d N(0, \sigma_*^2).$$

Suppose we use for both $\hat{a}$ and $\hat{g}$ an $\ell_1$-constrained linear function class in $p$ dimensions and that $a_*$, $g_*$ are sparse linear functions with support size $s$. Moreover, if $B = \|a_*\|_1 + o(1)$ and $B = \|g_*\|_1 + o(1)$, and the covariates satisfy a restricted eigenvalue condition, then one can show that $\delta_{n,*} = O\big(\sqrt{s \log(p)/n}\big)$ (a simplification obtained by assuming $s \log(p) > \log(n)$). Then as long as $r_n \to 0$, the condition is satisfied. Moreover, for such function classes, we will typically have that $r_n = O\big(\sqrt{s \log(p)/n}\big)$. Therefore, the required condition is that $\frac{s \log(p)}{\sqrt{n}} = o(1)$, or equivalently $s = o(\sqrt{n}/\log(p))$.

Of theoretical interest, it seems that without sample splitting, the analysis essentially goes through for general function classes that are not Donsker. With sample splitting, we would require from Condition (8) that $\frac{\sqrt{s_a s_g}\, \log(p)}{n} = o(n^{-1/2})$, where $s_a$, $s_g$ are the sparsity bounds on $a_0$ and $g_0$, respectively. Simplifying, with sample splitting we require $\sqrt{s_a s_g} = o(\sqrt{n}/\log(p))$. By contrast, without sample splitting, we require this condition for both $s_a$ and $s_g$. Beyond this difference, the conditions on the sparsity of the function classes seem comparable.

We also provide a proof of asymptotic normality without sample splitting for uniformly stable estimators.
This proof technique handles cases beyond Donsker classes or classes with small critical radius, since stability is not only a property of the function class but also of the estimation algorithm. Thus, it could potentially apply to large neural net classes trained via few iterations of stochastic gradient descent [58] or sub-bagged ensembles of overfitting estimators [47].

Lemma 18 (Normality via Uniform Stability). Suppose that:
$$\sqrt{n}\, \mathbb{E}[(a_0(X) - \hat{a}(X))(\hat{g}(X) - g_0(X))] \to_p 0 \qquad (8)$$
and that for some $a_*$ and $g_*$ (not necessarily equal to $a_0$ and $g_0$), we have:
$$\mathbb{E}\big[\|\hat{a} - a_*\|_2^2\big],\ \mathbb{E}\big[\|\hat{g} - g_*\|_2^2\big] = O(r_n^2).$$
Assume that Condition 1 is satisfied and the variables $Y$, $g(X)$, $a(X)$ are bounded a.s. for all $g \in \mathcal{G}$ and $a \in \mathcal{A}$. Suppose that the algorithm for estimating $\hat{h} := (\hat{a}, \hat{g})$ is symmetric across samples and satisfies $\beta_n$-mean-squared stability, i.e.:
$$\mathbb{E}_Z\Big[\big\|\hat{h}(Z) - \hat{h}^{-i}(Z)\big\|_\infty^2\Big] \leq \beta_n$$
where $\hat{h}^{-i}$ is the function that the estimation algorithm would produce if sample $i$ were removed from the training set. If $r_n^2 + n\, \beta_n\, r_n \to 0$, then if we let $\sigma_*^2 := \mathrm{Var}(m_{a_*}(Z; g_*))$:
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \to_d N(0, \sigma_*^2).$$

Uniform stability of sub-bagged ensemble estimators. If we use sub-bagging and return as an estimate the average of a base estimator over subsamples of size $s < n$, then the sub-bagged estimate is $\beta_n := s/n$-uniformly stable (see e.g. [47]). If the bias of the base estimator decays as some function $\mathrm{bias}(s)$, then typically sub-bagged estimators will achieve $r_n = \sqrt{s/n} + \mathrm{bias}(s)$ (see e.g. [10, 75, 121]). Thus we need that $n\, \beta_n\, r_n = \sqrt{s^3/n} + s\, \mathrm{bias}(s) \to 0$. As long as $s = o(n^{1/3})$ and $\mathrm{bias}(s) = o(1/s)$, the conditions of the latter theorem hold. The recent work of [121] shows that in a high-dimensional regression setting, with $p \gg n$ and only $r \ll p, n$ of the variables being $\mu$-strictly relevant variables, i.e. leading to a decrease in explained variance of at least $\mu$ (for some constant $\mu > 0$), the bias of a deep Breiman tree trained on $s$ data points decays exponentially in $s$. Moreover, a deep Breiman forest where each tree is trained on $s = O(r \log(p)/\mu) = o(n^{1/3})$ samples, drawn without replacement, will achieve $r_n = O\big(\sqrt{s\, r/n}\big)$. Thus sub-bagged deep Breiman random forests satisfy the conditions of the theorem in the case of sparse high-dimensional non-parametric regression.

(The stability notion was originally defined in [72] and used to derive improved bounds on $k$-fold cross-validation. It is weaker than the well-studied uniform stability [26]. See [47, 28, 2] for more discussion.)
Orthogonalizing Non-Linear Moments

Suppose our goal is to estimate the solution $\theta_0$ to a non-linear moment problem that depends on a regression function $g_0$, i.e.:
$$\mathbb{E}[m(Z; \theta_0, g_0)] = 0.$$
One way to construct a Neyman-orthogonal moment that is robust to first-stage errors of the regression is to introduce a bias correction term that involves the Riesz representer of the functional derivative of the moment with respect to $g$, i.e.:
$$m_a(Z; \theta, g) = m(Z; \theta, g) + a(X)'(Y - g(X))$$
where $a_0(X)$ is the Riesz representer of the functional derivative of $m$ with respect to $g$, i.e.:
$$f_0(g) := \frac{\partial}{\partial \tau} \mathbb{E}[m(Z; \theta_0, g_0 + \tau (g - g_0))]\Big|_{\tau = 0} = \mathbb{E}[a_0(X)' g(X)].$$
The Riesz representer $a_0$ can be estimated in a first stage as follows:

• Estimate the regression function $\hat{g}$.
• Estimate a preliminary $\tilde{\theta}$ using the non-orthogonal moment condition.
• Calculate algebraically, or through automatic differentiation, the Gateaux derivative functional:
$$\hat{f}(g) = \frac{\partial}{\partial \tau} \mathbb{E}\big[m(Z; \tilde{\theta}, \hat{g} + \tau (g - \hat{g}))\big]\Big|_{\tau = 0}.$$
• Apply the adversarial Riesz representer estimator for the functional $\hat{f}(g)$, to estimate $a_0$.

Following a similar analysis as in Section 5 of [40], one can show that the moment $m_a$ satisfies Neyman orthogonality. Moreover, assuming that the moment function is sufficiently smooth, the estimator outlined above will achieve faster than $n^{-1/4}$ rates. These two properties are sufficient to show that the estimator for $\theta_0$, based on the orthogonal moment and using cross-fitting, will be root-$n$ asymptotically normal.

One caveat of the approach outlined above is the burden of either calculating the Gateaux derivative algebraically or auto-differentiating the moment. One can bypass this difficulty, and reduce to evaluation oracles of the moment, by taking arbitrarily small approximations of the Gateaux derivative. In particular, the third step could be replaced by defining:
$$\hat{f}_\epsilon(g) = \frac{1}{\epsilon}\Big(\mathbb{E}\big[m(Z; \tilde{\theta}, \hat{g} + \epsilon (g - \hat{g})) - m(Z; \tilde{\theta}, \hat{g})\big]\Big).$$
For sufficiently small $\epsilon$, the approximation error $\|\hat{f}_\epsilon - \hat{f}\|$ is negligible. Moreover, $\hat{f}_\epsilon$ only requires black-box access to evaluations of the moment function to be computed.
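As a minimal Python sketch of the finite-difference step above (the moment function `m`, preliminary estimate `theta_tilde`, and regression `g_hat` are placeholders supplied by the user), the approximate derivative functional can be evaluated as:

```python
import numpy as np

def gateaux_fd(m, theta_tilde, g_hat, g, Z, eps=1e-4):
    """Finite-difference approximation of the Gateaux derivative functional
    f_eps(g) = (1/eps) * E_n[ m(Z; theta, g_hat + eps*(g - g_hat)) - m(Z; theta, g_hat) ].

    m(Z, theta, g): vectorized moment evaluations, one per sample, for a callable g.
    g_hat, g: callables mapping covariates to predictions.
    """
    g_pert = lambda x: g_hat(x) + eps * (g(x) - g_hat(x))   # perturbed regression
    base = m(Z, theta_tilde, g_hat)
    pert = m(Z, theta_tilde, g_pert)
    return np.mean(pert - base) / eps
```

Only evaluations of the moment are needed, so the same routine applies when the moment has no convenient closed-form derivative.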
References

[1] . Accessed: 2020-09-15.
[2] Karim Abou-Moustafa and Csaba Szepesvári. An exponential tail bound for lq stable learning rules. Volume 98 of Proceedings of Machine Learning Research, pages 31–63, Chicago, Illinois, 22–24 Mar 2019. PMLR.
[3] Chunrong Ai and Xiaohong Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.
[4] Chunrong Ai and Xiaohong Chen. Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics, 141(1):5–43, 2007.
[5] Chunrong Ai and Xiaohong Chen. The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions. Journal of Econometrics, 170(2):442–457, 2012.
[6] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv e-prints, page arXiv:1811.04918, November 2018.
[7] Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[8] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 214–223, 2017.
[9] Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: Debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
[10] Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. arXiv e-prints, page arXiv:1610.01271, October 2016.
[11] Ravi Bansal and Salim Viswanathan. No arbitrage and arbitrage pricing: A new approach. The Journal of Finance, 48(4):1231–1262, 1993.
[12] Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[13] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20:63–1, 2019.
[14] Alexandre Belloni, Victor Chernozhukov, Ivan Fernández-Val, and Christian Hansen. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298, 2017.
[15] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference for high-dimensional sparse econometric models. arXiv:1201.0220, 2011.
[16] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
[17] Alexandre Belloni, Victor Chernozhukov, and Kengo Kato. Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika, 102(1):77–94, 2014.
[18] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Pivotal estimation via square-root lasso in nonparametric regression. The Annals of Statistics, 42(2):757–788, 2014.
[19] D Benkeser, M Carone, M J Van Der Laan, and P B Gilbert. Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880, 2017.
[20] Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems, pages 3559–3569, 2019.
[21] Daniele Bianchi, Matthias Büchner, and Andrea Tamoni. Bond risk premiums with machine learning. The Review of Financial Studies, 2020.
[22] Peter J Bickel. On adaptive estimation. The Annals of Statistics, pages 647–671, 1982.
[23] Peter J Bickel, Chris AJ Klaassen, Ya'acov Ritov, and Jon A Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Johns Hopkins University Press, 1993.
[24] Peter J Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, pages 381–393, 1988.
[25] Richard Blundell, Xiaohong Chen, and Dennis Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613–1669, 2007.
[26] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[27] Jelena Bradic and Mladen Kolar. Uniform inference for high-dimensional quantile regression: Linear functionals and regression rank scores. arXiv:1702.06209, 2017.
[28] Alain Celisse and Benjamin Guedj. Stability revisited: New generalisation bounds for the leave-one-out. arXiv e-prints, page arXiv:1608.06412, August 2016.
[29] Luyang Chen, Markus Pelger, and Jason Zhu. Deep learning in asset pricing. Available at SSRN 3350138, 2019.
Deep learning in asset pricing. Available atSSRN 3350138 , 2019.[30] Xiaohong Chen and Timothy M Christensen. Optimal sup-norm rates and uniform inferenceon nonlinear functionals of nonparametric iv regression. Quantitative Economics , 9(1):39–84,2018. 2431] Xiaohong Chen and Demian Pouzo. Efficient estimation of semiparametric conditional mo-ment models with possibly nonsmooth residuals. Journal of Econometrics , 152(1):46–60,2009.[32] Xiaohong Chen and Demian Pouzo. Estimation of nonparametric conditional moment modelswith possibly nonsmooth generalized residuals. Econometrica , 80(1):277–321, 2012.[33] Xiaohong Chen and Demian Pouzo. Sieve wald and qlr inferences on semi/nonparametricconditional moment models. Econometrica , 83(3):1013–1079, 2015.[34] Xiaohong Chen and Halbert White. Improved rates and asymptotic normality for nonpara-metric neural network estimators. IEEE Transactions on Information Theory , 45(2):682–691,1999.[35] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen,Whitney Newey, and James Robins. Double/debiased machine learning for treatment andstructural parameters: Double/debiased machine learning. The Econometrics Journal , 21(1),2018.[36] Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, andJames M Robins. Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033 ,2016.[37] Victor Chernozhukov, Christian Hansen, and Martin Spindler. Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics ,7(1):649–688, 2015.[38] Victor Chernozhukov, Denis Nekipelov, Vira Semenova, and Vasilis Syrgkanis. Plug-in regu-larized estimation of high-dimensional parameters in nonlinear semiparametric models. arXivpreprint arXiv:1806.04823 , 2018.[39] Victor Chernozhukov, Whitney Newey, and Rahul Singh. Double/de-biased machine learn-ing of global and local parameters using regularized Riesz representers. arXiv preprintarXiv:1802.08667 , 2018.[40] Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Automatic debiased machinelearning of causal and structural effects. arXiv preprint arXiv:1809.05224 , 8, 2018.[41] John H Cochrane. Asset Pricing: Revised Edition . Princeton University Press, 2009.[42] Serge Darolles, Yanqin Fan, Jean-Pierre Florens, and Eric Renault. Nonparametric instru-mental regression. Econometrica , 79(5):1541–1565, 2011.[43] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training ganswith optimism. CoRR , abs/1711.00141, 2017.[44] Iván Díaz and Mark J van der Laan. Targeted data adaptive estimation of the causal dose–response curve. Journal of Causal Inference , 1(2):171–192, 2013.[45] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimation ofconditional moment models. arXiv preprint arXiv:2006.07201 , 2020.2546] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provablyoptimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 , 2018.[47] André Elisseeff, Massimiliano Pontil, et al. Leave-one-out error and stability of learning algo-rithms with applications. NATO science series sub series iii computer and systems sciences ,190:111–130, 2003.[48] Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation andinference. Econometrica , 2018.[49] Max H Farrell, Tengyuan Liang, and Sanjog Misra. 
Deep neural networks for estimation andinference: Application to causal effects and other semiparametric estimands. arXiv preprintarXiv:1809.09953 , 2018.[50] Guanhao Feng, Jingyu He, and Nicholas G Polson. Deep learning for predicting asset returns. arXiv preprint arXiv:1804.09314 , 2018.[51] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv:1901.09036 ,2019.[52] Dylan J. Foster and Vasilis Syrgkanis. Orthogonal Statistical Learning. arXiv e-prints , pagearXiv:1901.09036, January 2019.[53] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior , 29(1):79 – 103, 1999.[54] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, SherjilOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neuralinformation processing systems , 27:2672–2680, 2014.[55] Shihao Gu, Bryan Kelly, and Dacheng Xiu. Autoencoder asset pricing models. Journal ofEconometrics , 2020.[56] Peter Hall, Joel L Horowitz, et al. Nonparametric methods for inference in the presence ofinstrumental variables. The Annals of Statistics , 33(6):2904–2929, 2005.[57] Lars Peter Hansen and Ravi Jagannathan. Assessing specification errors in stochastic discountfactor models. The Journal of Finance , 52(2):557–590, 1997.[58] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability ofstochastic gradient descent. In Proceedings of the 33rd International Conference on Interna-tional Conference on Machine Learning - Volume 48 , ICML’16, pages 1225–1234. JMLR.org,2016.[59] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexibleapproach for counterfactual prediction. In Doina Precup and Yee Whye Teh, editors, Pro-ceedings of the 34th International Conference on Machine Learning , volume 70 of Proceedingsof Machine Learning Research , pages 1414–1423, International Convention Centre, Sydney,Australia, 06–11 Aug 2017. PMLR.[60] Rafail Z Hasminskii and Ildar A Ibragimov. On the nonparametric estimation of functionals.In Proceedings of the Second Prague Symposium on Asymptotic Statistics , 1979.2661] David Haussler. Sphere packing numbers for subsets of the boolean n-cube with boundedvapnik-chervonenkis dimension. J. Comb. Theory, Ser. A , 69(2):217–232, 1995.[62] David A Hirshberg and Stefan Wager. Debiased inference of average partial effects in single-index models. arXiv:1811.02547 , 2018.[63] David A Hirshberg and Stefan Wager. Augmented minimax linear estimation. arXiv:1712.00038v5 , 2019.[64] Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the conver-gence of single-call stochastic extra-gradient methods. arXiv e-prints , page arXiv:1908.08465,August 2019.[65] I Ibragimov and R Has’minskii. Statistical estimation, vol. 16 of. Applications of Mathematics ,1981.[66] Jana Jankova and Sara Van De Geer. Confidence intervals for high-dimensional inverse co-variance estimation. Electronic Journal of Statistics , 9(1):1205–1229, 2015.[67] Jana Jankova and Sara Van De Geer. Confidence regions for high-dimensional generalizedlinear models under sparsity. arXiv:1610.01353 , 2016.[68] Jana Jankova and Sara Van De Geer. Semiparametric efficiency bounds for high-dimensionalmodels. The Annals of Statistics , 46(5):2336–2359, 2018.[69] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing forhigh-dimensional regression. 
The Journal of Machine Learning Research , 15(1):2869–2909,2014.[70] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regressionunder the Gaussian random design model: Asymptotic theory. IEEE Transactions on Infor-mation Theory , 60(10):6522–6554, 2014.[71] Adel Javanmard and Andrea Montanari. Debiasing the lasso: Optimal sample size for Gaus-sian designs. The Annals of Statistics , 46(6A):2593–2622, 2018.[72] Satyen Kale, Ravi Kumar, and Sergei Vassilvitskii. Cross-validation and mean-square stability.In In Proceedings of the Second Symposium on Innovations in Computer Science (ICS2011 ,pages 487–495, 2011.[73] Edward H Kennedy. Optimal doubly robust estimation of heterogeneous causal effects. arXiv:2004.14497 , 2020.[74] Edward H Kennedy, Zongming Ma, Matthew D McHugh, and Dylan S Small. Nonparametricmethods for doubly robust estimation of continuous treatment effects. Journal of the RoyalStatistical Society: Series B, Statistical Methodology , 79(4):1229, 2017.[75] Khashayar Khosravi, Greg Lewis, and Vasilis Syrgkanis. Non-Parametric Inference Adaptiveto Intrinsic Dimension. arXiv e-prints , page arXiv:1901.03719, January 2019.[76] George Kimeldorf and Grace Wahba. Some results on Tchebycheffian spline functions. Journalof Mathematical Analysis and Applications , 33(1):82–95, 1971.2777] Chris AJ Klaassen. Consistent estimation of the influence function of locally asymptoticallylinear estimators. The Annals of Statistics , pages 1548–1562, 1987.[78] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of functionlearning. High Dimensional Probability II , 47:443–459, 2000.[79] Michael R Kosorok. Introduction to empirical processes and semiparametric inference .Springer Science & Business Media, 2007.[80] Guillaume Lecué and Shahar Mendelson. Regularization and the small-ball method ii: com-plexity dependent error rates. The Journal of Machine Learning Research , 18(1):5356–5403,2017.[81] Guillaume Lecué and Shahar Mendelson. Regularization and the small-ball method i: Sparserecovery. Ann. Statist. , 46(2):611–641, 04 2018.[82] B Ya Levit. On the efficiency of a class of non-parametric estimates. Theory of Probability &Its Applications , 20(4):723–740, 1976.[83] Luofeng Liao, You-Lin Chen, Zhuoran Yang, Bo Dai, Zhaoran Wang, and Mladen Kolar.Provably efficient neural estimation of structural equation model: An adversarial approach. arXiv preprint arXiv:2007.01290 , 2020.[84] Alexander R Luedtke and Mark J Van Der Laan. Statistical inference for the mean outcomeunder a possibly non-unique optimal treatment strategy. Annals of Statistics , 44(2):713, 2016.[85] Andreas Maurer. A vector-contraction inequality for rademacher complexities. In Interna-tional Conference on Algorithmic Learning Theory , pages 3–17. Springer, 2016.[86] Marcial Messmer. Deep learning and the cross-section of expected returns. Available at SSRN3081555 , 2017.[87] Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtárik, and Yura Malitsky.Revisiting Stochastic Extragradient. arXiv e-prints , page arXiv:1905.11373, May 2019.[88] Krikamol Muandet, Wittawat Jitkrittum, and Jonas Kübler. Kernel conditional moment testvia maximum moment restriction. arXiv preprint arXiv:2002.09225 , 2020.[89] Krikamol Muandet, Arash Mehrjou, Si Kai Lee, and Anant Raj. Dual iv: A single stageinstrumental variable regression. arXiv preprint arXiv:1910.12358 , 2019.[90] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. 
A unifiedframework for high-dimensional analysis of m -estimators with decomposable regularizers. Statist. Sci. , 27(4):538–557, 11 2012.[91] Whitney K Newey. The asymptotic variance of semiparametric estimators. Econometrica ,pages 1349–1382, 1994.[92] Whitney K Newey, Fushing Hsieh, and James M Robins. Undersmoothing and bias correctedfunctional estimation. Technical report, MIT Department of Economics, 1998.2893] Whitney K Newey, Fushing Hsieh, and James M Robins. Twicing kernels and a small biasproperty of semiparametric estimators. Econometrica , 72(3):947–962, 2004.[94] Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametricmodels. Econometrica , 71(5):1565–1578, 2003.[95] Whitney K Newey and James R Robins. Cross-fitting and fast remainder rates for semipara-metric estimation. arXiv:1801.09138 , 2018.[96] Matey Neykov, Yang Ning, Jun S Liu, and Han Liu. A unified theory of confidence regions andtesting for high-dimensional estimating equations. Statistical Science , 33(3):427–443, 2018.[97] Jerzy Neyman. Optimal asymptotic tests of composite statistical hypotheses. In Probabilityand Statistics , page 416–444. Wiley, 1959.[98] Jerzy Neyman. C ( α ) tests and their use. Sankhy¯a: The Indian Journal of Statistics, SeriesA , pages 1–21, 1979.[99] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv:1712.04912 , 2017.[100] Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions forsparse high dimensional models. The Annals of Statistics , 45(1):158–195, 2017.[101] Johann Pfanzagl. Lecture notes in statistics. Contributions to a general asymptotic statisticaltheory , 13, 1982.[102] Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictablesequences. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,editors, Advances in Neural Information Processing Systems 26 , pages 3066–3074. CurranAssociates, Inc., 2013.[103] Zhao Ren, Tingni Sun, Cun-Hui Zhang, and Harrison H Zhou. Asymptotic normality and opti-malities in estimation of large Gaussian graphical models. The Annals of Statistics , 43(3):991–1026, 2015.[104] James Robins, Lingling Li, Eric Tchetgen, Aad van der Vaart, et al. Higher order influencefunctions and minimax estimation of nonlinear functionals. In Probability and statistics:essays in honor of David A. Freedman , pages 335–421. Institute of Mathematical Statistics,2008.[105] James Robins, Mariela Sued, Quanhong Lei-Gomez, and Andrea Rotnitzky. Comment on"performance of double-robust estimators when inverse probability weights are highly vari-able". Statistical Science , 22(4):544–559, 2007.[106] James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regressionmodels with missing data. Journal of the American Statistical Association , 90(429):122–129,1995.[107] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Analysis of semiparametric regres-sion models for repeated outcomes in the presence of missing data. Journal of the AmericanStatistical Association , 90(429):106–121, 1995.29108] Peter M Robinson. Root-n-consistent semiparametric regression. Econometrica: Journal ofthe Econometric Society , pages 931–954, 1988.[109] Dominik Rothenhäusler and Bin Yu. Incremental causal effects. arXiv:1907.13258 , 2019.[110] Dan Rubin and Mark J van der Laan. A general imputation methodology for nonparametricregression with censored data. 
Technical report, UC Berkeley Division of Biostatistics, 2005.[111] Daniel Rubin and Mark J van der Laan. Extending marginal structural models through local,penalized, and additive learning. Technical report, UC Berkeley Division of Biostatistics,2006.[112] Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonignorabledrop-out using semiparametric nonresponse models. Journal of the American Statistical As-sociation , 94(448):1096–1120, 1999.[113] Anton Schick. On asymptotically efficient estimation in semiparametric models. The Annalsof Statistics , 14(3):1139–1151, 1986.[114] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory , pages 416–426. Springer, 2001.[115] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory toalgorithms . Cambridge university press, 2014.[116] Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression.In Advances in Neural Information Processing Systems , pages 4595–4607, 2019.[117] Rahul Singh and Liyang Sun. De-biased machine learning in instrumental variable models fortreatment effects. arXiv preprint arXiv:1909.05244 , 2019.[118] Rahul Singh, Liyuan Xu, and Arthur Gretton. Kernel methods for policy evaluation: Treat-ment effects, mediation analysis, and off-policy planning. arXiv preprint arXiv:2010.04855 ,2020.[119] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimizationlandscape of over-parameterized shallow neural networks. IEEE Transactions on InformationTheory , 65(2):742–769, 2019.[120] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence ofregularized learning in games. In Advances in Neural Information Processing Systems , pages2989–2997, 2015.[121] Vasilis Syrgkanis and Manolis Zampetakis. Estimation and inference with trees and forests inhigh dimensions. volume 125 of Proceedings of Machine Learning Research , pages 3453–3454.PMLR, 09–12 Jul 2020.[122] B Toth and MJ van der Laan. TMLE for marginal structural models based on an instrument.Technical report, UC Berkeley Division of Biostatistics, 2016.[123] Anastasios Tsiatis. Semiparametric theory and missing data . Springer Science & BusinessMedia, 2007. 30124] Sara Van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. On asymptoticallyoptimal confidence regions and tests for high-dimensional models. The Annals of Statistics ,42(3):1166–1202, 2014.[125] Mark J van der Laan and Alexander R Luedtke. Targeted learning of an optimal dynamictreatment, and statistical inference for its mean outcome. Technical report, UC BerkeleyDivision of Biostatistics, 2014.[126] Mark J Van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observationaland Experimental Data . Springer Science & Business Media, 2011.[127] Mark J Van Der Laan and Daniel Rubin. Targeted maximum likelihood learning. TheInternational Journal of Biostatistics , 2(1), 2006.[128] Aad Van Der Vaart et al. On differentiable functionals. The Annals of Statistics , 19(1):178–204, 1991.[129] Aad W Van der Vaart. Asymptotic Statistics , volume 3. Cambridge University Press, 2000.[130] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint , volume 48.Cambridge University Press, 2019.[131] Dmitry Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks ,94:103–114, 2017.[132] Dmitry Yarotsky. 
Optimal approximation of continuous functions by very deep relu networks. arXiv preprint arXiv:1802.03620 , 2018.[133] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parame-ters in high dimensional linear models. Journal of the Royal Statistical Society: Series B(Statistical Methodology) , 76(1):217–242, 2014.[134] Wenjing Zheng and Mark J Van Der Laan. Asymptotic theory for cross-validated targetedmaximum likelihood estimation. 2010.[135] Yinchu Zhu and Jelena Bradic. Breaking the curse of dimensionality in regression. arXiv:1708.00430. , 2017.[136] Yinchu Zhu and Jelena Bradic. Linear hypothesis testing in dense high-dimensional linearmodels. Journal of the American Statistical Association , 113(524):1583–1600, 2018. A Unrestricted and Restricted Models In the context of semi-parametric statistics, recall that the causal parameter θ = θ ( g ) = E [ m ( Z ; g )] is a functional m of the underlying regression g ( x ) := E [ Y | X = x ] . In an unrestricted model, weassume g ∈ L ( P ) , the space of square integrable functions. In a restricted model, additionalinformation about g can be encoded by the restriction g ∈ G ⊂ L ( P ) , where G is some convexfunction space. In this section, we give an account of Riesz representation in restricted models,following the notation and technical lemmas of [39].31enote G := span ( G ) and ¯ G := closure ( G ) . Define the modulus of continuity of g θ ( g ) by L := sup g ∈G\{ } | θ ( g ) |k g k Definition 1 (RR and minimal RR) . A RR of the functional θ ( g ) is a ∈ L ( P ) s.t. θ ( g ) = E [ g ( X ) a ( X )] , ∀ g ∈ G If a ∈ ¯ G , then it is the minimal RR and we denote it by a min0 . Any RR can be reduced to theminimal RR by projecting it onto ¯ G . Lemma 19 (Lemma 1 of [39]) . We have the following results1. If L < ∞ then there exists a unique minimal RR a min0 and L = k a min0 k 2. If there exists a RR a with k a k < ∞ then L = k a min0 k ≤ k a k < ∞ , where a min0 is theunique minimal RR obtained by projecting a onto ¯ G In both cases, g θ ( g ) can be extended to ¯ G or to all of L ( P ) with modulus of continuity L To interpret these results, consider a toy example of vectors in R rather than functions in L ( P ) .Suppose the functional of interest is θ : R → R , ( x, y, z ) x + 2 y + 3 z Moreover, assume g ∈ G where G is the ( x, y ) -plane, though the ambient space is R . Then anyvector of the form a = (1 , , c ) with c ∈ R is a valid RR. The unique minimal RR is a min0 = (1 , , .As an aside, the vector a = (1 , , is a universal RR; it holds for any choice of G ⊂ R , not justthe ( x, y ) -plane. From any RR, we can obtain the minimal RR by projection onto the ( x, y ) -plane.In [39, Theorem 2], we see that it is better to use a min0 rather than any a to attain full semi-parametric efficiency (unless of course G = L ( P ) so there is no difference). By the stated lemma,we know how to obtain a min0 from any a : projection onto ¯ G .When do these technical issues arise? In the semi-parametric literature, a popular restricted modelis the additive model. It is an important setting where G is not dense in L ( P ) . We present adefinition of the additive model, then a technical lemma about the minimal RR in an additivemodel. Definition 2 (Additive model) . Suppose that1. the regression g is additive in components x = ( x (1) , x (2) ) : g ( x ) = g (1)0 ( x (1) ) + g (2)0 ( x (2) ) g (1)0 ∈ G (1)0 , a dense subset of L ( P (1) ) , where P (1) is the distribution of X (1) 3. 
the functional depends on only the first component: m ( z ; g ) = m ( z ; g (1) ) emma 20 (Lemma 6 of [39]) . Assume an additive model. Consider any RR a ∈ L ( P ) . Then ∀ g ∈ G θ ( g ) = θ ( g (1) ) = ˆ a min0 ( x (1) ) g (1) ( x (1) ) d P (1) , a min0 ( x (1) ) = E [ a ( X ) | X (1) = x (1) ] and k a min0 k q ≤ k a k q , ∀ q ∈ [1 , ∞ ] This preservation of order and contraction of norm is helpful in analysis.Finally, we quote some projection geometry for sumspaces from [23, Appendix A.4]. Suppose H and H are closed subspaces of a Hilbert space H . Lemma 21. If H ⊥ H then the projection onto the sumspace H + H is the sum of the projectionsonto H and H More generally, H may not be orthogonal to H . Denote by P i the orthogonal projection onto H i ,and denote by Q i := I − P i the projection onto H ⊥ i . Denote by Π the projection onto the closureof H + H Lemma 22 (Corollary 1 of [23]) . For any h ∈ H [ I − ( Q Q ) m ] h → Π h, m → ∞ Stronger versions of this result are available that provide quantitative rates of convergence and thatallow for r ≥ subspaces. B Examples B.1 Causal Inference Recall the definition of mean-squared continuity: ∃ M ≥ s.t. ∀ f ∈ F : p E [ m ( Z ; f ) ] ≤ M k f k We verify mean-square continuity for several important functionals.1. Average treatment effect (ATE): θ = E [ g (1 , W ) − g (0 , W )] To lighten notation, let π ( w ) := P ( D = 1 | W = w ) be the propensity score. Assume π ( w ) ∈ (cid:0) M , − M (cid:1) for M ∈ (1 , ∞ ) . Then E [ g (1 , W ) − g (0 , W )] ≤ E [ g (1 , W ) + g (0 , W ) ] ≤ M E (cid:2) π ( W ) g (1 , W ) + [1 − π ( W )] g (0 , W ) (cid:3) = 2 M E [ g ( X )] 33. Average policy effect: θ = ´ g ( x ) dµ ( x ) where µ ( x ) = F ( x ) − F ( x ) Denote the densities corresponding to distributions ( F, F , F ) by ( f, f , f ) . Assume f ( x ) f ( x ) ≤√ M and f ( x ) f ( x ) ≤ √ M for M ∈ [0 , ∞ ) . In this example, m ( Z ; g ) = m ( g ) . E [ m ( Z ; g )] = { m ( g ) } = (cid:26) ˆ g ( x ) dµ ( x ) (cid:27) = (cid:26) E (cid:20) g ( X ) (cid:26) f ( X ) f ( X ) − f ( X ) f ( X ) (cid:27)(cid:21)(cid:27) ≤ n √ M E | g ( X ) | o ≤ M E [ g ( X )] 3. Policy effect from transporting covariates: θ = E [ g ( t ( X )) − g ( X )] Denote the density of t ( X ) by f t ( x ) . Assume f t ( x ) f ( x ) ≤ M for M ∈ [0 , ∞ ) . Then E [ g ( t ( X )) − g ( X )] ≤ E [ g ( t ( X )) + g ( X ) ]= 2 E (cid:20) g ( X ) (cid:26) f t ( X ) f ( X ) − (cid:27)(cid:21) ≤ M + 1) E [ g ( X )] 4. Cross effect: θ = E [ Dg (0 , W )] Assume π ( w ) < − M for some M ∈ (1 , ∞ ) . Then E [ Dg (0 , W )] ≤ E [ g (0 , W )] ≤ M E [ { − π ( W ) } g (0 , W ) ] ≤ M E [ g ( X )] 5. Regression decomposition: E [ Y | D = 1] − E [ Y | D = 0] = θ response + θ composition where θ response = E [ g (1 , W ) | D = 1] − E [ g (0 , W ) | D = 1] θ composition = E [ g (0 , W ) | D = 1] − E [ g (0 , W ) | D = 0] Assume π ( w ) < − M for some M ∈ (1 , ∞ ) . Then re-write the target parameters in termsof the cross effect. θ response = E [ DY ] − E [ Dg (0 , W )] E [ D ] θ composition = E [ Dγ (0 , W )] E [ D ] − E [(1 − D ) Y ] E [1 − D ] We implement DML for the cross effect, empirical means for the population means, then deltamethod. 34. Average treatment on the treated (ATT): θ = E [ g (1 , W ) | D = 1] − E [ g (0 , W ) | D = 1] Assume π ( w ) < − M for some M ∈ (1 , ∞ ) . Then re-write the target parameters in termsof the cross effect. θ = E [ DY ] − E [ Dg (0 , W )] E [ D ] We implement DML for the cross effect, empirical means for the population means, then deltamethod.7. 
Local average treatment effect (LATE):
\[ \theta_0 = \frac{E[g_0(1, W) - g_0(0, W)]}{E[h_0(1, W) - h_0(0, W)]} \]
The result follows from the view of LATE as a ratio of two ATEs.

B.2 Asset Pricing

We present three proofs of the existence of the stochastic discount factor. These arguments are quoted from the excellent exposition of [41].

1. Marginal rate of substitution in a consumption model.
Consider an investor with utility function $U(c_t, c_{t+1}) = u(c_t) + \beta E_t[u(c_{t+1})]$, where $u$ is period utility, $c_t$ is consumption at time $t$, and $\beta$ is a subjective discount factor. Denote by $e_t$ the original consumption level, and by $\xi$ the amount of the asset the consumer buys. The consumer solves the optimization problem
\[ \max_{\xi} \; u(c_t) + \beta E_t[u(c_{t+1})] \quad \text{s.t.} \quad c_t = e_t - p_t \xi, \quad c_{t+1} = e_{t+1} + x_{t+1} \xi. \]
Substituting the constraints into the objective, the first-order condition yields
\[ p_t = E_t\left[ \beta \frac{u'(c_{t+1})}{u'(c_t)}\, x_{t+1} \right], \qquad m_{t+1} = \beta \frac{u'(c_{t+1})}{u'(c_t)}. \]
The same first-order condition arises in the longer-term objective $E_t\big[ \sum_{j=0}^{\infty} \beta^j u(c_{t+j}) \big]$.

2. State price density in a contingent claim model with complete markets.
For simplicity, consider a two-period model with $S$ possible states of nature tomorrow. A contingent claim is a security that pays one dollar tomorrow in one state $s$ only, and $pc_t(s)$ is the price today of that contingent claim. In a complete market, investors can buy any contingent claims. If there are complete contingent claims, the state price density exists, and it is equal to the contingent claim price divided by the state probability. Let $x_{t+1}(s)$ denote an asset's payoff in state of nature $s$. The asset's price must equal the value of the contingent claims of which it is a bundle. Let $\pi_{t+1}(s)$ be the probability that state $s$ occurs conditional on information available today. Then
\[ p_t = \sum_{s} pc_t(s)\, x_{t+1}(s) = \sum_{s} \pi_{t+1}(s)\, \frac{pc_t(s)}{\pi_{t+1}(s)}\, x_{t+1}(s), \qquad m_{t+1}(s) = \frac{pc_t(s)}{\pi_{t+1}(s)}. \]

3. Pricing kernel from the law of one price.
Let $X$ be the set of all payoffs that investors can purchase (or the subset of tradeable payoffs used in a particular study). For example, if there are complete contingent claims to $S$ states of nature then $X = \mathbb{R}^S$. More generally, markets are incomplete, so $X \subset \mathbb{R}^S$. Free portfolio formation means that $x, x' \in X$ implies $ax + bx' \in X$ for any $a, b \in \mathbb{R}$. This assumption rules out short sales constraints, bid-ask spreads, and leverage limitations. Let $p_t(x)$ denote the price at time $t$ of the asset that delivers payoff $x$ at time $t+1$. The law of one price means $p_t(ax + bx') = a\, p_t(x) + b\, p_t(x')$; in other words, asset pricing is a linear functional over a vector space. This assumption says that investors cannot make instantaneous profits by repackaging portfolios, and it would be satisfied in a market that has already reached equilibrium. Given free portfolio formation and the law of one price, there exists a unique payoff $m^*_{t+1} \in X$ such that $p_t(x) = E_t[m^*_{t+1} x]$ for all $x \in X$; $m^*_{t+1}$ is called the mimicking portfolio. Unless markets are complete, there are infinitely many SDFs that satisfy $p_t(x) = E_t[m_{t+1} x]$, of the form $m_{t+1} = m^*_{t+1} + \epsilon$ where $\epsilon \in X^{\perp}$. An incomplete market can be interpreted as a restricted model, and the mimicking portfolio can be interpreted as a minimal Riesz representer, in the sense discussed in Section A.
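As an illustration of how the examples above are used downstream, the following sketch specializes the adversarial criterion to the ATE functional $m(Z; g) = g(1, W) - g(0, W)$ with a common linear dictionary for both the Riesz representer class and the test-function class. In that special case the inner maximization is quadratic in the test-function coefficients, and the min-max estimator reduces to a ridge-regularized linear system. The dictionary `phi`, the function names, and the least-squares outcome model in the usage comments are illustrative choices; cross-fitting and the $\|\cdot\|_{\mathcal{A}}$ penalty are omitted for brevity, so this is a simplified sketch rather than the general estimator analyzed in the text.

```python
import numpy as np

def phi(d, w):
    """Illustrative linear dictionary phi(d, w) = (1, d, w, d*w)."""
    d = np.asarray(d, dtype=float).reshape(-1, 1)
    w = np.asarray(w, dtype=float).reshape(d.shape[0], -1)
    return np.hstack([np.ones_like(d), d, w, d * w])

def adversarial_riesz_ate(D, W, reg=1e-3):
    """Riesz representer for the ATE functional m(Z; f) = f(1, W) - f(0, W),
    with a(x) = <rho, phi(x)> and test functions f(x) = <nu, phi(x)>.
    Maximizing (1/n) sum_i [ m(Z_i; f) - a(X_i) f(X_i) - f(X_i)^2 ] over nu
    in closed form and then minimizing over rho reduces to solving
        Sigma_hat rho = M_hat,
    with Sigma_hat = E_n[phi phi'] and M_hat = E_n[phi(1, W) - phi(0, W)].
    A small ridge term is added only for numerical stability."""
    X = phi(D, W)
    M_hat = (phi(np.ones_like(D), W) - phi(np.zeros_like(D), W)).mean(axis=0)
    Sigma_hat = X.T @ X / X.shape[0]
    rho = np.linalg.solve(Sigma_hat + reg * np.eye(X.shape[1]), M_hat)
    return lambda d, w: phi(d, w) @ rho          # a_hat(d, w)

def debiased_ate(D, W, Y, g_hat, a_hat):
    """Debiased moment: theta_hat = E_n[ g(1,W) - g(0,W) + a(X) (Y - g(X)) ]."""
    plug_in = g_hat(np.ones_like(D), W) - g_hat(np.zeros_like(D), W)
    correction = a_hat(D, W) * (Y - g_hat(D, W))
    return float(np.mean(plug_in + correction))

# Example usage (outcome model fit by least squares on the same dictionary;
# in practice g_hat and a_hat would be estimated with cross-fitting):
#   beta = np.linalg.lstsq(phi(D, W), Y, rcond=None)[0]
#   g_hat = lambda d, w: phi(d, w) @ beta
#   a_hat = adversarial_riesz_ate(D, W)
#   theta_hat = debiased_ate(D, W, Y, g_hat, a_hat)
```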
C Local Riesz Representer Convergence Rate

Suppose that we constrain the test functions to lie in
\[ \mathcal{F}(r_n) = \{ f \in \mathrm{star}(\partial(\mathcal{G} - \hat{g})) : \|f\|_2 \leq r_n \} \]
and consider the estimator
\[ \inf_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}(r_n)} \Psi_n(a, f). \]
Then by a localized concentration bound we have, for all $a \in \mathcal{A}$ and $f \in \mathcal{F}(r_n)$:
\[ |\Psi_n(a, f) - \Psi(a, f)| \leq O\big( \delta_{n,\zeta} \| m(\cdot; f) - a\, f \|_2 + \delta_{n,\zeta}^2 \big) \leq O\big( (M+1)\, \delta_{n,\zeta} \|f\|_2 + \delta_{n,\zeta}^2 \big) \leq O\big( (M+1)\, \delta_{n,\zeta}\, r_n + \delta_{n,\zeta}^2 \big) =: \epsilon_n, \]
where $\delta_{n,\zeta} = \delta_n + c_0 \sqrt{\log(c_1/\zeta)/n}$ and $\delta_n$ bounds the critical radius of the function class $\{ Z \to m(Z; f) - a(X) f(X) : f \in \mathcal{F}(r_n), a \in \mathcal{A} \}$. Thus we have:
\[ \sup_{f \in \mathcal{F}(r_n)} \Psi(\hat{a}, f) - \epsilon_n \leq \sup_{f \in \mathcal{F}(r_n)} \Psi_n(\hat{a}, f) \leq \sup_{f \in \mathcal{F}(r_n)} \Psi_n(a_*, f) \leq \sup_{f \in \mathcal{F}(r_n)} \Psi(a_*, f) + \epsilon_n = \inf_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}(r_n)} \Psi(a, f) + \epsilon_n, \]
and therefore
\[ \sup_{f \in \mathcal{F}(r_n)} \Psi(\hat{a}, f) \leq \inf_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}(r_n)} \Psi(a, f) + 2 \epsilon_n. \]
Moreover, if $a_0$ is a local Riesz representer, i.e. it satisfies the Riesz equation for differences with $\hat{g}$ of any function in $\mathcal{G}$ within a ball of radius $r_n$ around $\hat{g}$, then:
\[ \inf_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}(r_n)} \Psi(a, f) = \inf_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}(r_n)} \langle a_0 - a, f \rangle \leq r_n \inf_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}(1)} \langle a_0 - a, f \rangle \leq r_n \inf_{a \in \mathcal{A}} \| a_0 - a \|_2. \]
Thus if $g_0$ lies within a ball of radius $r_n$ of $\hat{g}$, we conclude that:
\[ E[(a_0(X) - \hat{a}(X)) (\hat{g}(X) - g_0(X))] \leq O\Big( M\, r_n\, \delta_{n,\zeta} + \delta_{n,\zeta}^2 + r_n \inf_{a \in \mathcal{A}} \| a_0 - a \|_2 \Big). \]
If for instance $r_n \delta_{n,\zeta} = o(n^{-1/2})$, $\delta_{n,\zeta}^2 = o(n^{-1/2})$ and $r_n \inf_{a \in \mathcal{A}} \| a_0 - a \|_2 = o(n^{-1/2})$, then we can conclude that:
\[ \sqrt{n}\, E[(a_0(X) - \hat{a}(X)) (\hat{g}(X) - g_0(X))] \to_p 0. \]
If $a_0 \in \mathcal{A}$ and both $\mathcal{A}$ and $\mathcal{G}$ are VC-subgraph classes with constant VC dimension, then it can be shown that $\delta_{n,\zeta} = O\big( \sqrt{\log(n/\zeta)/n} \big)$. Thus for the above conditions to hold, it suffices that $r_n = o(1)$, i.e. that $\hat{g}$ is RMSE-consistent.

Finally, observe that we need $\mathcal{A}$ to have a small approximation error to $a_0$, not with respect to the $\|\cdot\|_2$ norm, but rather with respect to the weaker norm:
\[ \| a_0 - a \|_{\mathcal{F}} = \sup_{f \in \mathcal{F}(1)} \langle a_0 - a, f \rangle. \]
Thus $a$ does not need to match the component of $a_0$ that is orthogonal to the subspace $\mathcal{F}$. If, for instance, we assume that $\mathcal{F}$ lies in the space spanned by the top $K$ eigenfunctions of a reproducing kernel Hilbert space, then it suffices to consider $\mathcal{A}$ the space spanned by those functions too; then $\inf_{a \in \mathcal{A}} \| a_0 - a \|_{\mathcal{F}} = 0$. For instance, if $\mathcal{G}$ is a finite-dimensional linear function space and $g_0 \in \mathcal{G}$, then it suffices to consider an $\mathcal{A}$ that is also finite-dimensional linear, even if the true $a_0$ does not lie in that sub-space. Then all the conditions of Lemma 16 will be satisfied, even if $\hat{a}$ is never consistent for $a_0$.

D Proofs from Section 3

For convenience, throughout this section we will use the notation:
\[ \Psi(a, f) := E[m(Z; f) - a(X) \cdot f(X)] = E[(a_0(X) - a(X)) f(X)] \quad \text{(by the Riesz representation)}, \]
\[ \Psi_n(a, f) := \frac{1}{n} \sum_{i=1}^n \big( m(Z_i; f) - a(X_i) \cdot f(X_i) \big). \]

D.1 Proof of Theorem 1

Proof. Let:
\[ \Psi_n^{\lambda}(a, f) = \Psi_n(a, f) - \|f\|_{2,n}^2 - \lambda \|f\|_{\mathcal{A}}^2, \qquad \Psi^{\lambda}(a, f) = \Psi(a, f) - \|f\|_2^2 - \lambda \|f\|_{\mathcal{A}}^2. \]
Thus our estimate can be written as:
\[ \hat{a} := \arg\min_{a \in \mathcal{A}} \sup_{f \in \mathcal{F}} \Psi_n^{\lambda}(a, f) + \mu \|a\|_{\mathcal{A}}^2. \]

Relating empirical and population regularization. As a preliminary observation, we have by Theorem 14.1 of [130] that, w.p. $1 - \zeta$:
\[ \forall f \in \mathcal{F}_B : \big| \|f\|_{2,n}^2 - \|f\|_2^2 \big| \leq \tfrac{1}{2} \|f\|_2^2 + \delta^2, \]
for our choice of $\delta := \delta_n + c_0 \sqrt{\log(c_1/\zeta)/n}$, where $\delta_n$ upper bounds the critical radius of $\mathcal{F}_B$ and $c_0, c_1$ are universal constants.
Moreover, for any f , with k f k A ≥ B , we can consider the function f √ B/ k f k A , which also belongs to F B , since F is star-convex. Thus we can apply the above lemmato this re-scaled function and multiply both sides by k f k A /B , leading to: ∀ f ∈ F s.t. k f k A ≥ B : (cid:12)(cid:12) k f k n, − k f k (cid:12)(cid:12) ≤ k f k + δ k f k A B Thus overall, we have: ∀ f ∈ F : (cid:12)(cid:12) k f k n, − k f k (cid:12)(cid:12) ≤ k f k + δ max (cid:26) , k f k A B (cid:27) (9)Thus we have that w.p. − ζ : ∀ f ∈ F : λ k f k A + k f k ,n ≥ λ k f k A + 12 k f k − δ max (cid:26) , k f k A B (cid:27) ≥ (cid:18) λ − δ B (cid:19) k f k A + 12 k f k − δ Assuming that λ ≥ δ B , we have that, the latter is at least: ∀ f ∈ F : λ k f k A + k f k ,n ≥ λ k f k A + 12 k f k − δ Upper bounding centered empirical sup-loss. We now argue that the centered empiricalsup-loss: sup f ∈F (Ψ n (ˆ a, f ) − Ψ n ( a ∗ , f )) = sup f ∈F E n [( a ∗ ( X ) − ˆ a ( X )) f ( X )] 38s small. By the definition of ˆ a : sup f ∈F Ψ λn (ˆ a, f ) ≤ sup f ∈F Ψ λn ( a ∗ , f ) + µ (cid:0) k a ∗ k A − k ˆ a k A (cid:1) (10)By Lemma 7 of [52], the fact that m ( Z ; f ) − a ∗ ( X ) f ( X ) is -Lipschitz with respect to the vector ( m ( Z ; f ) , f ( z )) (since a ∗ ( X ) ∈ [ − , ) and by our choice of δ := δ n + c q log( c /ζ ) n , where δ n is anupper bound on the critical radius of F B and m ◦ F B , w.p. − ζ : ∀ f ∈ F B : | Ψ n ( a ∗ , f ) − Ψ( a ∗ , f ) | ≤ O (cid:16) δ (cid:16) k f k + p E [ m ( Z ; f ) ] (cid:17) + δ (cid:17) = O (cid:0) δ M k f k + δ (cid:1) where we have invoked Assumption 1. Thus, if k f k A ≥ √ B , we can apply the latter inequality forthe function f √ B/ k f k A , which falls in F B , and then multiply both sides by k f k A / √ B (invokingthe linearity of the operator Ψ n ( a, f ) with respect to f ) to get: ∀ f ∈ F : | Ψ n ( a ∗ , f ) − Ψ( a ∗ , f ) | ≤ O (cid:18) δ M k f k + δ max (cid:26) , k f k A √ B (cid:27)(cid:19) (11)By Equations (10) and (11), we have that w.p. 
− ζ , for some universal constant C : sup f ∈F Ψ λn ( a ∗ , f ) = sup f ∈F (cid:0) Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) ≤ sup f ∈F (cid:18) Ψ( a ∗ , f ) + Cδ + Cδ √ B k f k A + CM δ k f k − k f k ,n − λ k f k A (cid:19) ≤ sup f ∈F (cid:18) Ψ( a ∗ , f ) + Cδ + Cδ √ B k f k A + CM δ k f k − k f k − λ k f k A + δ (cid:19) ≤ sup f ∈F Ψ λ/ ( a ∗ , f ) + O (cid:0) δ (cid:1) + sup f ∈F (cid:18) Cδ √ B k f k A − λ k f k A (cid:19) + sup f ∈F (cid:18) CM δ k f k − k f k (cid:19) Moreover, observe that for any norm k · k and any constants a, b > : sup f ∈F (cid:0) a k f k − b k f k (cid:1) ≤ a b Thus if we assume that λ ≥ δ /B , we have: sup f ∈F (cid:18) Cδ √ B k f k A − λ k f k A (cid:19) ≤ C δ Bλ ≤ C δ sup f ∈F (cid:18) CM δ k f k − k f k (cid:19) ≤ C M δ Thus we have: sup f ∈F Ψ λn ( a ∗ , f ) ≤ sup f ∈F Ψ λ/ ( a ∗ , f ) + O (cid:0) M δ (cid:1) sup f ∈F Ψ λn (ˆ a, f ) = sup f ∈F (cid:0) Ψ n (ˆ a, f ) − Ψ n ( a ∗ , f ) + Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) ≥ sup f ∈F (cid:0) Ψ n (ˆ a, f ) − Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) + inf f ∈F (cid:0) Ψ n ( a ∗ , f ) + k f k ,n + λ k f k A (cid:1) Observe that since Ψ n ( a, f ) is a linear operator of f and F is a symmetric class, we have: inf f ∈F (cid:0) Ψ n ( a ∗ , f ) + k f k ,n + λ k f k A (cid:1) = inf f ∈F (cid:0) Ψ n ( a ∗ , − f ) + k f k ,n + λ k f k A (cid:1) = inf f ∈F (cid:0) − Ψ n ( a ∗ , f ) + k f k ,n + λ k f k A (cid:1) = − sup f ∈F (cid:0) Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) = − sup f ∈F Ψ λn ( a ∗ , f ) Combining this with Equation (10) yields: sup f ∈F (cid:0) Ψ n (ˆ a, f ) − Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) ≤ f ∈F Ψ λn ( a ∗ , f ) + µ (cid:0) k a ∗ k A − k ˆ a k A (cid:1) ≤ f ∈F Ψ λ/ ( a ∗ , f ) + µ (cid:0) k a ∗ k A − k ˆ a k A (cid:1) + O (cid:0) M δ (cid:1) Lower bounding centered empirical sup-loss. First observe that: Ψ n ( a, f ) − Ψ n ( a ∗ , f ) = E n [( a ∗ ( X ) − a ( X )) f ( X )] Let ∆ = a ∗ − ˆ a . Suppose that k ∆ k ≥ δ and let r = δ k ∆ k ∈ [0 , / . Then observe that since ∆ ∈ F and F is star-convex, we also have that r ∆ ∈ F . Thus sup f ∈F (cid:0) Ψ n (ˆ a, f ) − Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) ≥ Ψ n (ˆ a, r ∆) − Ψ n ( a ∗ , r ∆) − r k ∆ k ,n − λr k ∆ k A = r E n (cid:2) ( a ∗ ( X ) − ˆ a ( X )) (cid:3) − r k ∆ k ,n − λr k ∆ k A = r k ∆ k ,n − r k ∆ k ,n − λr k ∆ k A ≥ r k ∆ k ,n − r k ∆ k ,n − λ k ∆ k A Moreover, since δ n upper bounds the critical radius of F B and by Equation (9): r k ∆ k ,n ≤ r (cid:18) k ∆ k + δ + δ k ∆ k A B (cid:19) ≤ δ + δ k ∆ k A B ≤ δ + λ k ∆ k A Thus we get: sup f ∈F (cid:0) Ψ n (ˆ a, f ) − Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) ≥ r k ∆ k ,n − δ − λ k ∆ k A δ n upper bounds the critical radius of F B and by Equation (9): k ∆ k ,n ≥ k ∆ k − δ B k ∆ k A − δ ≥ k ∆ k − λ k ∆ k A − δ Thus we have: sup f ∈F (cid:0) Ψ n (ˆ a, f ) − Ψ n ( a ∗ , f ) − k f k ,n − λ k f k A (cid:1) ≥ r k ∆ k − δ − λ k ∆ k A ≥ δ k ∆ k − δ − λ k ∆ k A Combining upper and lower bound. Combining the upper and lower bound on the centeredpopulation sup-loss we get that w.p. − ζ : either k ∆ k ≤ δ or: δ k ∆ k ≤ O (cid:0) M δ (cid:1) + 2 sup f ∈F Ψ λ/ ( a ∗ , f ) + 3 λ k ∆ k A + µ (cid:0) k a ∗ k A − k ˆ a k A (cid:1) We now control the last part. 
Since µ ≥ λ : λ k ∆ k A + µ (cid:0) k a ∗ k A − k ˆ a k A (cid:1) ≤ λ (cid:0) k a ∗ k A + k ˆ a k A (cid:1) + µ (cid:0) k a ∗ k A − k ˆ a k A (cid:1) ≤ µ k a ∗ k A We can then conclude that: δ k ∆ k ≤ O (cid:0) M δ (cid:1) + 2 sup f ∈F Ψ λ/ ( a ∗ , f ) + 2 µ k a ∗ k A Dividing over by δ/ , we get: k ∆ k ≤ O (cid:0) M δ (cid:1) + 8 δ sup f ∈F Ψ λ/ ( a ∗ , f ) + 8 µδ k a ∗ k A Thus either k ∆ k ≤ δ or the latter inequality holds. Thus in any case the latter inequality holds. Upper bounding population sup-loss at minimum. Observe that by the definition of theRiesz representer: sup f ∈F Ψ λ/ ( a ∗ , f ) = sup f ∈F E [( a ( X ) − a ∗ ( X )) f ( z )] − k f k − λ k f k A ≤ sup f ∈F E [( a ( X ) − a ∗ ( X )) f ( z )] − k f k = k a − a ∗ k Concluding. Concluding we get that w.p. − ζ : k ˆ a − a ∗ k ≤ O (cid:0) M δ (cid:1) + 8 δ k a ∗ − a k + 8 µδ k a ∗ k A By a trinagle inequality we get: k ˆ a − a k ≤ O (cid:0) M δ (cid:1) + 8 δ k a ∗ − a k + k a ∗ − a k + 8 µδ k a ∗ k A a ∗ = arg min a ∈A k a − a k and using the fact that δ ≥ ǫ n , we get: k ˆ a − a k ≤ O (cid:16) M δ + k a ∗ − a k + µδ k a ∗ k A (cid:17) ≤ O (cid:16) M δ + µδ k a ∗ k A (cid:17) D.2 Proof of Theorem 3 Proof. By the definition of ˆ a : ≤ sup f Ψ n (ˆ a, f ) ≤ sup f Ψ n ( a , f ) + λ ( k a k A − k ˆ a k A ) Let δ n,ζ = max i (cid:0) R ( F i ) + R ( m ◦ F i ) (cid:1) + c r log( c /ζ ) n for some universal constants c , c . By Theorem 26.5 and 26.9 of [115], and since F i is a symmetricclass and since k a k ∞ ≤ , w.p. − ζ : ∀ f ∈ F i : | Ψ n ( a , f ) − Ψ( a , f ) | ≤ δ n,ζ Since Ψ( a , f ) = 0 for all f ∈ F , we have that, w.p. − ζ : k ˆ a k A ≤ k a k A + δ n,ζ /λ Let B n,λ,ζ = ( k a k H + δ n,ζ /λ ) , A B · F i := { a · f : a ∈ A B , f ∈ F i } and ǫ n,λ,ζ = max i (cid:0) R ( A B n,λ,ζ · F i ) + R ( m ◦ F i ) (cid:1) + c r log( c /ζ ) n for some universal constants c , c , then again by Theorem 26.5 and 26.9 of [115], ∀ a ∈ A B n,λ,ζ , f ∈ F iU | Ψ n ( a, f ) − Ψ( a, f ) | ≤ ǫ n,λ,ζ By a union bound over the d function classes composing F , we have that w.p. − ζ : sup f ∈F Ψ n ( a , f ) ≤ sup f ∈F Ψ( a , f ) + δ n,ζ/d = δ n,ζ/d and sup f ∈F Ψ n (ˆ a, f ) ≥ sup f ∈F Ψ(ˆ a, f ) − ǫ n,λ,ζ/d If k ˆ a − a k ≤ δ n,ζ , then the theorem follows immediately. Thus we consider the case when k ˆ a − a k ≥ δ n,ζ . Since, by assumption, for any a ∈ A B with k a − a k ≥ δ n,ζ it holds that42 − a k a − a k ∈ span κ ( F ) , we have a − ˆ a k a − ˆ a k = P pi =1 w i f i , with p < ∞ , k w k ≤ κ and f i ∈ F . Thus: sup f ∈F Ψ(ˆ a, f ) ≥ κ p X i =1 w i Ψ(ˆ a, f i ) = 1 κ Ψ ˆ a, X i w i f i ! = 1 κ k ˆ a − a k Ψ(ˆ a, a − ˆ a )= 1 κ k ˆ a − a k E [( a ( X ) − ˆ a ( X )) ]= 1 κ k ˆ a − a k Combining all the above we have, w.p. − ζ : k ˆ a − a k ≤ κ (cid:0) ǫ n,λ,ζ/d + δ n,ζ/d + λ ( k a k A − k ˆ a k A ) (cid:1) Moreover, since functions in A and F are bounded in [ − , , we have that the function a · f is -Lipschitz with respect to the vector of functions ( a, f ) . Thus we can apply a vector version of thecontraction inequality [85] to get that: R ( A B n,λ,z · F i ) ≤ (cid:0) R ( A B n,λ,z ) + R ( F i ) (cid:1) Finally, we have that since A is star-convex: R ( A B n,λ,z ) ≤ p B n,λ,z R ( A ) Leading to the final bound of: k ˆ a − a k ≤ κ (cid:16) k a k A + δ n,ζ /λ ) R ( A ) + 2 d max i =1 (cid:0) R ( F i ) + R ( m ◦ F i ) (cid:1)(cid:17) + κ c r log( c d/ζ ) n + λ ( k a k A − k ˆ a k A ) ! Since λ ≥ δ n,ζ , we get the result. D.3 Proof of Corollary 5 Proof. 
Consider any ˆ a = h ˆ θ, ·i ∈ A B n,λ,ζ and let ν = ˆ θ − θ , then: δ n,ζ /λ + k θ k ≥ k ˆ θ k = k θ + ν k = k θ + ν S k + k ν S c k ≥ k θ k − k ν S k + k ν S c k Thus: k ν S c k ≤ k ν S k + δ n,ζ /λ and ν lies in the restricted cone for which the restricted eigenvalue of V holds. Moreover, since | S | = s : k ν k ≤ k ν S k + δ n,ζ /λ ≤ √ s k ν S k + δ n,ζ /λ ≤ √ s k ν k + δ n,ζ /λ ≤ r sγ ν ⊤ V ν + δ n,ζ /λ k ˆ a − a k = p E [ h ν, x i ] = √ ν ⊤ V ν Thus we have: ˆ a ( x ) − a ( x ) k ˆ a − a k = p X i =1 ν i √ ν ⊤ V ν x i Thus for any ˆ a ∈ A B n,λ,ζ , we can write ˆ a − a k ˆ a − a k as P pi =1 w i f i , with f i ∈ F and: k w k = k ν k √ ν ⊤ V ν ≤ r sγ + δ n,ζ λ k ˆ a − a k . Thus: ˆ a − a k ˆ a − a k ∈ span κ ( F ) for κ = 2 q sγ + δ n,ζ λ k ˆ a − a k .Moreover, observe that by the triangle inequality: k a k A − k ˆ a k A = k θ k − k ˆ θ k ≤ k θ − ˆ θ k = k ν k ≤ r sγ ν ⊤ V ν + δ n,ζ /λ Moreover, by standard results on the Rademacher complexity of linear function classes (see e.g.Lemma 26.11 of [115]), we have R ( A B ) ≤ B q p ) n max x ∈X k x k ∞ and R ( F i ) , R ( m ◦ F i ) ≤ q n max x ∈X k x k ∞ (the latter via the fact that each F i ; and therefore also m ◦ F i ; containsonly two elements and invoking Masart’s lemma). Thus invoking Theorem 3: k ˆ a − a k ≤ (cid:18) r sγ + δ n,ζ λ k ˆ a − a k (cid:19) · k θ k + 1) r log(2 p ) n + δ n,ζ + λ r sγ k ˆ a − a k ! The right hand side is upper bounded by the sum of the following four terms: Q := 2 r sγ k θ k + 1) r log(2 p ) n + δ n,ζ ! Q := (cid:18) δ n,ζ λ k ˆ a − a k (cid:19) k θ k + 1) r log(2 p ) n + δ n,ζ ! Q := 2 λ sγ k ˆ a − a k Q := δ n,ζ r sγ If k ˆ a − a k ≥ q sγ δ n,ζ and setting λ ≤ γ s , yields: Q ≤ λ r γs k θ k + 1) r log(2 p ) n + δ n,ζ ! Q ≤ k ˆ a − a k Q on the left-hand-side and dividing by / , we have: k ˆ a − a k ≤ 43 ( Q + Q + Q ) = 43 max (cid:26)r sγ , λ r γs (cid:27) 20 ( k θ k + 1) r log(2 p ) n + 11 δ n,ζ ! On the other hand if k ˆ a − a k ≤ q sγ δ n,ζ , then the latter inequality trivially holds. Thus it alwaysholds. E Proofs from Section 5 E.1 Proof of Proposition 8 Proposition 23. Consider an online linear optimization algorithm over a convex strategy space S and consider the OFTRL algorithm with a -strongly convex regularizer with respect to some norm k · k on space S : f t = arg min f ∈ S f ⊤ X τ ≤ t ℓ τ + ℓ t + 1 η R ( f ) Let k · k ∗ denote the dual norm of k · k and R = sup f ∈ S R ( f ) − inf f ∈ S R ( f ) . Then for any f ∗ ∈ S : T X t =1 ( f t − f ∗ ) ⊤ ℓ t ≤ Rη + η T X t =1 k ℓ t − ℓ t − k ∗ − η T X t =1 k f t − f t − k Proof. The proof follows by observing that Proposition 7 in [120] holds verbatim for any convexstrategy space S and not necessarily the simplex. Proposition 24. Consider a minimax objective: min θ ∈ Θ max w ∈ W ℓ ( θ, w ) . Suppose that Θ , W areconvex sets and that ℓ ( θ, w ) is convex in θ for every w and concave in θ for any w . Let k · k Θ and k · k W be arbitrary norms in the corresponding spaces. Moreover, suppose that the followingLipschitzness properties are satisfied: ∀ θ ∈ Θ , w, w ′ ∈ W : k∇ θ ℓ ( θ, w ) − ∇ θ ℓ ( θ, w ′ ) k Θ , ∗ ≤ L k w − w ′ k W ∀ w ∈ W, θ, θ ′ ∈ Θ : k∇ w ℓ ( θ, w ) − ∇ w ℓ ( θ ′ , w ) k W, ∗ ≤ L k θ − θ ′ k Θ where k · k Θ , ∗ and k · k W, ∗ correspond to the dual norms of k · k Θ , k · k W . 
Consider the algorithmwhere at each iteration each player updates their strategy based on: θ t +1 = arg min θ ∈ Θ θ ⊤ X τ ≤ t ∇ θ ℓ ( θ τ , w τ ) + ∇ θ ℓ ( θ t , w t ) + 1 η R min ( θ ) w t +1 = arg max w ∈ W w T X τ ≤ t ∇ w ℓ ( θ τ , w τ ) + ∇ w ℓ ( θ t , w t ) − η R max ( w ) uch that R min is -strongly convex in the set Θ with respect to norm k · k Θ and R max is -stronglyconvex in the set W with respect to norm k·k W and with any step-size η ≤ L . Then the parameters ¯ θ = T P Tt =1 θ t and ¯ w = T P Tt =1 w t correspond to an R ∗ η · T -approximate equilibrium and hence ¯ θ is a R ∗ ηT -approximate solution to the minimax objective, where R is defined as: R ∗ := max (cid:26) sup θ ∈ Θ R min ( θ ) − inf θ ∈ Θ R min ( θ ) , sup w ∈ W R max ( w ) − inf w ∈ W R max ( w ) (cid:27) Proof. The proposition is essentially a re-statement of Theorem 25 of [120] (which in turn is anadaptation of Lemma 4 of [102]), specialized to the case of the OFTRL algorithm and to the case ofa two-player convex-concave zero-sum game, which implies that the if the sum of regrets of playersis at most ǫ , then the pair of average solutions corresponds to an ǫ -equilibrium (see e.g. [53] andLemma 4 of [102]). Proof of Proposition 8 Let R E ( x ) = P pi =1 x i log( x i ) . For the space Θ := { ρ ∈ R p : ρ ≥ , k ρ k ≤ B } , the entropic regularizer is B -strongly convex with respect to the ℓ norm and hencewe can set R min ( ρ ) = B R E ( ρ ) . Similarly, for the space W := { w ∈ R p : w ≥ , k w k = 1 } , theentropic regularizer is -strongly convex with respect to the ℓ norm and thus we can set R max ( w ) = R E ( w ) . For this choice of regularizers, the update rules can be easily verified to have a closed formsolution provided in Proposition 8, by writing the Lagrangian of each OFTRL optimization problemand invoking strong duality. Further, we can verify the lipschitzness conditions. Since the dual ofthe ℓ norm is the ℓ ∞ norm, ∇ ρ ℓ ( ρ, w ) = E n [ V V ⊤ ] w + λ and thus: k∇ ρ ℓ ( ρ, w ) − ∇ ρ ℓ ( ρ, w ′ ) k ∞ = k E n [ V V ⊤ ]( w − w ′ ) k ∞ ≤ k E n [ V V ⊤ ] k ∞ k w − w ′ k k∇ w ℓ ( ρ, w ) − ∇ w ℓ ( ρ ′ , w ) k ∞ = k E n [ V V ⊤ ]( ρ − ρ ′ ) k ∞ ≤ k E n [ V V ⊤ ] k ∞ k ρ − ρ ′ k Thus we have L = k E n [ V V ⊤ ] k ∞ . Finally, observe that: sup ρ ∈ Θ B R E ( ρ ) − inf ρ ∈ Θ B R E ( ρ ) = B log( B ∨ 1) + B log(2 p )sup w ∈ W R E ( w ) − inf w ∈ W R E ( w ) = log(2 p ) Thus we can take R ∗ = B log( B ∨ 1) + ( B + 1) log(2 p ) . Thus if we set η = k E n [ V V ⊤ ] k ∞ , then wehave that after T iterations, ¯ θ = ¯ ρ + − ¯ ρ − is an ǫ ( T ) -approximate solution to the minimax problem,with ǫ ( T ) = 16 k E n [ V V ⊤ ] k ∞ B log( B ∨ 1) + ( B + 1) log(2 p ) T . Combining all the above with Proposition 24 yields the proof of Proposition 8. E.2 Proof of Proposition 15 Observe that the loss function − ℓ ( a, · ) is strongly convex in f with respect to the k · k ,n norm, i.e.: − D ff ℓ ( a, f )[ ν, ν ] ≥ E n [ ν ( X ) ] ℓ ( a, f ) − ℓ ( a ′ , f ) = E n [( a ( X ) − a ′ ( X )) · f ( X )] is an k a − a ′ k ,n -Lipschitz function with respect to the ℓ ,n norm (via a Cauchy-Schwarz inequality).Thus we can conclude that (see Lemma 1 in [1]): k f t − f t +1 k ,n ≤ k ¯ a Proof. 
For example for ATE [ K (3) ] ij = [Φ ( m ) Φ ′ ] ij = h M ∗ φ ( x i ) , φ ( x j ) i = h φ ( x i ) , M φ ( x j ) i = h φ ( d i , w i ) , φ (1 , w j ) − φ (0 , w j ) i = k (( d i , w i ) , (1 , w j )) − k (( d i , w i ) , (0 , w j )) [ K (4) ] ij = [Φ ( m ) (Φ ( m ) ) ′ ] ij = h M ∗ φ ( x i ) , M ∗ φ ( x j ) i = h φ ( x i ) , M M ∗ φ ( x j ) i = h φ ( x i ) , M ∗ φ (1 , w j ) − M ∗ φ (0 , w j ) i = h M φ ( x i ) , φ (1 , w j ) − φ (0 , w j ) i = h φ (1 , w i ) − φ (0 , w i ) , φ (1 , w j ) − φ (0 , w j ) i = k ((1 , w i ) , (1 , w j )) − k ((1 , w i ) , (0 , w j )) − k ((0 , w i ) , (1 , w j )) + k ((0 , w i ) , (0 , w j )) E.4 Proof of Proposition 10 Proof. Write the objective as E ( f ) := 1 n n X i =1 h f, M ∗ φ ( x i ) i H − a ( x i ) h f, φ ( x i ) i H − h f, φ ( x i ) i H − λ k f k H Recall that for an RKHS, evaluation is a continuous functional represented as the inner productwith the feature map. Due to the ridge penalty, the stated objective is coercive and strongly convexw.r.t f . Hence it has a unique maximizer ˆ f that obtains the maximum.Write ˆ f = ˆ f n + ˆ f ⊥ n where ˆ f n ∈ row (Ψ) and ˆ f ⊥ n ∈ null (Ψ) . Substituting this decomposition of ˆ f into the objective, we see that E ( ˆ f ) = E ( ˆ f n ) − λ k ˆ f ⊥ n k H Therefore E ( ˆ f ) ≤ E ( ˆ f n ) Since ˆ f is the unique maximizer, ˆ f = ˆ f n . E.5 Proof of Proposition 11 Proof. Write the objective as E ( f ) = 1 n n X i =1 h M f, φ ( x i ) i − h a, φ ( x i ) ih f, φ ( x i ) i − h f, φ ( x i ) i − λ h f, f i = f ′ M ′ ˆ µ − f ′ ˆ T a − f ′ ˆ T f − λf ′ f where ˆ µ := n P ni =1 φ ( x i ) and ˆ T := n P ni =1 φ ( x i ) ⊗ φ ( x i ) . Appealing to the representer theorem E ( γ ) = γ ′ Ψ M ′ ˆ µ − γ ′ Ψ ˆ T a − γ ′ Ψ ˆ T Ψ ′ γ − λγ ′ ΨΨ ′ γ = γ ′ Ψ M ′ ˆ µ − n γ ′ (cid:20) K (1) K (3) (cid:21) Φ a − n γ ′ (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) γ − λγ ′ Kγ Ψ M ′ ˆ µ − n (cid:20) K (1) K (3) (cid:21) Φ a − n (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ˆ γ − λK ˆ γ = 0 Hence ˆ γ = 12 (cid:20) n (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) + λK (cid:21) − (cid:20) Ψ M ′ ˆ µ − n (cid:20) K (1) K (3) (cid:21) Φ a (cid:21) = 12 (cid:20)(cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) + nλK (cid:21) − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (3) (cid:21) Φ a (cid:21) E.6 Proof of Proposition 12 Proof. Observe that ˆ f ( x ) = h ˆ f , φ ( x ) i = φ ( x ) ′ Ψ ′ ˆ γ = 12 φ ( x ) ′ Ψ ′ ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (3) (cid:21) Φ a (cid:21) m ( x ; ˆ f ) = h M ˆ f , φ ( x ) i = 12 φ ( x ) ′ M Ψ ′ ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (3) (cid:21) Φ a (cid:21) k ˆ f k H = ˆ γ ′ ΨΨ ′ ˆ γ = 14 ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (3) (cid:21) Φ a (cid:21) ′ ∆ − K ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (3) (cid:21) Φ a (cid:21) Write the objective as E ( a ) = 1 n n X i =1 m ( x i ; ˆ f ) − h a, φ ( x i ) i ˆ f ( x i ) − ˆ f ( x i ) − λ k ˆ f k H + µ k a k H where the various terms involving ˆ f only depend on a in the form Φ a . Due to the ridge penalty,the stated objective is coercive and strongly convex w.r.t a . Hence it has a unique maximizer ˆ a that obtains the maximum.Write ˆ a = ˆ a n + ˆ a ⊥ n where ˆ a n ∈ row (Φ) and ˆ a ⊥ n ∈ null (Φ) . Substituting this decomposition of ˆ a intothe objective, we see that E (ˆ a ) = E (ˆ a n ) + µ k ˆ a ⊥ n k H Therefore E (ˆ a ) ≥ E (ˆ a n ) Since ˆ a is the unique minimizer, ˆ a = ˆ a n . 49 .7 Proof of Proposition 13 Proof. 
Write the objective as E ( a ) = ˆ f ′ M ′ ˆ µ − ˆ f ′ ˆ T a − ˆ f ′ ˆ T ˆ f − λ ˆ f ′ ˆ f + µa ′ a E ( β ) = ˆ γ ′ Ψ M ′ ˆ µ − ˆ γ ′ Ψ ˆ T Φ ′ β − ˆ γ ′ Ψ ˆ T Ψ ′ ˆ γ − λ ˆ γ ′ ΨΨ ′ ˆ γ + µβ ′ ΦΦ ′ β = ˆ γ ′ Ψ M ′ ˆ µ − n ˆ γ ′ (cid:20) K (1) K (1) K (3) K (1) (cid:21) β − n ˆ γ ′ (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ˆ γ − λ ˆ γ ′ K ˆ γ + µβ ′ K (1) β = X j =1 E j where E = ˆ γ ′ Ψ M ′ ˆ µE = − n ˆ γ ′ (cid:20) K (1) K (1) K (3) K (1) (cid:21) βE = − n ˆ γ ′ (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ˆ γE = − λ ˆ γ ′ K ˆ γE = µβ ′ K (1) β Recall that ˆ γ = 12 ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (3) (cid:21) Φ a (cid:21) = 12 ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) Hence ˆ γ ′ = 12 " n ˆ µ ′ M Ψ ′ − β ′ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − We analyze each term1. E E = 12 " n ˆ µ ′ M Ψ ′ − β ′ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − Ψ M ′ ˆ µ Hence ∂E ∂β = − (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − Ψ M ′ ˆ µ E = 12 n " β ′ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ − n ˆ µ ′ M Ψ ′ ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β = 12 n β ′ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β − 12 ˆ µ ′ M Ψ ′ ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β Hence ∂E ∂β = 1 n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β − (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − Ψ M ′ ˆ µ E E = − n (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) Note that ∂∂s [ x − As ] ′ W [ x − As ] = − A ′ W ( x − As ) Therefore ∂E ∂β = 12 n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) E E = − λ (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) ′ ∆ − K ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) Note that ∂∂s [ x − As ] ′ W [ x − As ] = − A ′ W ( x − As ) Therefore ∂E ∂β = λ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − K ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) E ∂E ∂β = 2 µ · K (1) β − (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − Ψ M ′ ˆ µ + 1 n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β − (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − Ψ M ′ ˆ µ + 12 n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) + λ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − K ∆ − (cid:20) n Ψ M ′ ˆ µ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β (cid:21) + 2 µ · K XX ˆ β Grouping terms (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − Ψ M ′ ˆ µ − n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ∆ − n Ψ M ′ ˆ µ − λ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − K ∆ − n Ψ M ′ ˆ µ = 1 n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β − n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β − λ 
(cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − K ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) β + 2 µ · K (1) β Define Ω := (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) − nλ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − K We simplify each side of the equation1. LHS ((cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) − nλ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − K ) ∆ − Ψ M ′ ˆ µ = Ω∆ − Ψ M ′ ˆ µ 52. RHS (cid:26) n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ − n (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − (cid:20) K (1) K (1) K (1) K (2) K (3) K (1) K (3) K (2) (cid:21) − λ (cid:20) K (1) K (1) K (3) K (1) (cid:21) ′ ∆ − K ! ∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) + 2 µ · K XX (cid:27) ˆ β = (cid:26) n Ω∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) + 2 µ · K (1) (cid:27) ˆ β E.8 Proof of Corollary 14 Proof. ˆ a ( x ) = h ˆ a, φ ( x ) i = φ ( x ) ′ Φ ′ ˆ β = K xX (cid:26) n Ω∆ − (cid:20) K (1) K (1) K (3) K (1) (cid:21) + 2 µ · K (1) (cid:27) − Ω∆ − Ψ M ′ ˆ µ What remains is an account of how to evaluate V := Ψ M ′ ˆ µ ∈ R n .There are two cases1. j ∈ [ n ] Observe that the j -th element of V is v j = φ ( x j ) ′ M ′ ˆ µ = 1 n n X i =1 φ ( x j ) ′ M ′ φ ( x i ) Moreover φ ( x j ) ′ M ′ φ ( x i ) = h φ ( x j ) , M ∗ φ ( x i ) i = [ K (2) ] ji Therefore v j = 1 n n X i =1 [ K (2) ] ji j ∈ { n + 1 , ..., n } Observe that the j -th element of V is v j = φ ( x j ) ′ M M ′ ˆ µ = 1 n n X i =1 φ ( x j ) ′ M M ′ φ ( x i ) Moreover φ ( x j ) ′ M M ′ φ ( x i ) = h M ∗ φ ( x j ) , M ∗ φ ( x i ) i = [ K (4) ] ji v j = 1 n n X i =1 [ K (4) ] ji F Proofs from Section 6 F.1 Proof of Lemma 16 Proof. Observe that θ = E [ m a ( Z ; g )] for all a . Moreover, we can decompose: ˆ θ − θ = 1 n K X k =1 X i ∈ P k ( m ˆ a k ( Z i ; ˆ g ) − E Z [ m ˆ a k ( Z ; ˆ g k )]) + 1 K K X k =1 ( E Z [ m ˆ a k ( Z ; ˆ g k )] − E Z [ m ˆ a k ( Z ; g )])= 1 n K X k =1 X i ∈ P k ( m ˆ a k ( Z i ; ˆ g k ) − E Z [ m ˆ a k ( Z ; ˆ g k )]) + 1 K K X k =1 E X [( a ( X ) − ˆ a k ( X )) (ˆ g k ( X ) − g ( X ))] Thus as long as K = Θ(1) and: √ n E X [( a ( X ) − ˆ a k ( X )) (ˆ g k ( X ) − g ( X ))] → p we have that: √ n (cid:16) ˆ θ − θ (cid:17) = √ n n K X k =1 X i ∈ P k ( m ˆ a k ( Z i ; ˆ g k ) − E Z [ m ˆ a k ( Z ; ˆ g k )]) | {z } A + o p (1) Suppose that for some a ∗ and g ∗ (not necessarily equal to a and g ), we have that: k ˆ a k − a ∗ k → p and k ˆ g k − g ∗ k → p . Then we can further decompose A as: A = E n [ m a ∗ ( Z ; g ∗ )] − E Z [ m a ∗ ( Z ; g ∗ )] + 1 n K X k =1 X i ∈ P k m ˆ a k ( Z i ; ˆ g k ) − m a ∗ ( Z i ; g ∗ ) − E Z [ m ˆ a k ( Z ; ˆ g k ) − m a ∗ ( Z ; g ∗ )] | {z } V i Denote with: B := 1 n K X k =1 X i ∈ P k V i =: 1 n K X k =1 B k As long as n E [ B ] → , then we have that √ nB → p . The second moment of each B k is: E (cid:2) B k (cid:3) = X i,j ∈ P k E [ V i V j ] = X i,j ∈ P k E [ E [ V i V j | ˆ g k ]] = X i ∈ P k E (cid:2) V i (cid:3) i = j , V i is independentof V j and mean-zero, conditional on the nuisance ˆ g k estimated on the out-of-fold data for fold k .Moreover, by Jensen’s inequality with respect to K P Kk =1 B k E [ B ] = E n K X k =1 B k ! = K n E K K X k =1 B k ! 
F Proofs from Section 6

F.1 Proof of Lemma 16

Proof. Observe that $\theta = E[m_a(Z; g)]$ for all $a$. Moreover, we can decompose:
$$\hat\theta - \theta = \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k}\left(m_{\hat a_k}(Z_i; \hat g_k) - E_Z[m_{\hat a_k}(Z; \hat g_k)]\right) + \frac{1}{K}\sum_{k=1}^{K}\left(E_Z[m_{\hat a_k}(Z; \hat g_k)] - E_Z[m_{\hat a_k}(Z; g)]\right)$$
$$= \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k}\left(m_{\hat a_k}(Z_i; \hat g_k) - E_Z[m_{\hat a_k}(Z; \hat g_k)]\right) + \frac{1}{K}\sum_{k=1}^{K}E_X[(a(X) - \hat a_k(X))(\hat g_k(X) - g(X))].$$
Thus as long as $K = \Theta(1)$ and, for each $k \in [K]$,
$$\sqrt{n}\, E_X[(a(X) - \hat a_k(X))(\hat g_k(X) - g(X))] \to_p 0,$$
we have that $\sqrt{n}(\hat\theta - \theta) = \sqrt{n}\,A + o_p(1)$, where $A := \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k}(m_{\hat a_k}(Z_i; \hat g_k) - E_Z[m_{\hat a_k}(Z; \hat g_k)])$.
Suppose that for some $a^*$ and $g^*$ (not necessarily equal to $a$ and $g$), we have that $\|\hat a_k - a^*\|_2 \to_p 0$ and $\|\hat g_k - g^*\|_2 \to_p 0$. Then we can further decompose $A$ as:
$$A = E_n[m_{a^*}(Z; g^*)] - E_Z[m_{a^*}(Z; g^*)] + \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k}\underbrace{m_{\hat a_k}(Z_i; \hat g_k) - m_{a^*}(Z_i; g^*) - E_Z[m_{\hat a_k}(Z; \hat g_k) - m_{a^*}(Z; g^*)]}_{V_i}.$$
Denote with:
$$B := \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k} V_i =: \frac{1}{n}\sum_{k=1}^{K} B_k.$$
As long as $n E[B^2] \to 0$, we have that $\sqrt{n}\, B \to_p 0$. The second moment of each $B_k$ is:
$$E[B_k^2] = \sum_{i,j \in P_k} E[V_i V_j] = \sum_{i,j \in P_k} E\left[E[V_i V_j \mid \hat a_k, \hat g_k]\right] = \sum_{i \in P_k} E[V_i^2],$$
where in the last equality we used that, due to cross-fitting, for any $i \neq j$, $V_i$ is independent of $V_j$ and mean-zero, conditional on the nuisances $\hat a_k, \hat g_k$ estimated on the out-of-fold data for fold $k$. Moreover, by Jensen's inequality with respect to $\frac{1}{K}\sum_{k=1}^{K} B_k$:
$$E[B^2] = E\left[\left(\frac{1}{n}\sum_{k=1}^{K} B_k\right)^2\right] = \frac{K^2}{n^2}\, E\left[\left(\frac{1}{K}\sum_{k=1}^{K} B_k\right)^2\right] \le \frac{K}{n^2}\sum_{k=1}^{K} E[B_k^2] = \frac{K}{n^2}\sum_{k=1}^{K}\sum_{i \in P_k} E[V_i^2] = \frac{K}{n^2}\sum_{i=1}^{n} E[V_i^2].$$
Finally, observe that $E[V_i^2] \to 0$, by mean-squared continuity of the moment and by boundedness of the Riesz representer function class, the function class $\mathcal{G}$ and the variable $Y$. More elaborately:
$$E[V_i^2] \le E\left[\left(m_{\hat a_k}(Z_i; \hat g_k) - m_{a^*}(Z_i; g^*)\right)^2\right] \le 2\, E\left[\left(m(Z_i; \hat g_k) - m(Z_i; g^*)\right)^2\right] + 2\, E\left[\left(\hat a_k(X)(Y - \hat g_k(X)) - a^*(X)(Y - g^*(X))\right)^2\right].$$
The latter term can further be bounded as:
$$2\, E\left[\left(\hat a_k(X)(Y - \hat g_k(X)) - a^*(X)(Y - g^*(X))\right)^2\right] \le 4\, E\left[(\hat a_k(X) - a^*(X))^2 (Y - \hat g_k(X))^2\right] + 4\, E\left[a^*(X)^2 (g^*(X) - \hat g_k(X))^2\right] \le 4C\, E\left[\|\hat a_k - a^*\|_2^2 + \|\hat g_k - g^*\|_2^2\right],$$
assuming that $(Y - \hat g_k(X))^2 \le C$ and $a^*(X)^2 \le C$ a.s. Finally, by linearity of the operator and mean-squared continuity, we have:
$$E\left[\left(m(Z_i; \hat g_k) - m(Z_i; g^*)\right)^2\right] = E\left[\left(m(Z_i; \hat g_k - g^*)\right)^2\right] \le M\, E\left[\|\hat g_k - g^*\|_2^2\right].$$
Thus we have:
$$E[V_i^2] \le (2M + 4C)\, E\left[\|\hat a_k - a^*\|_2^2 + \|\hat g_k - g^*\|_2^2\right] \to 0.$$
Thus as long as $K = \Theta(1)$, we have that:
$$n E[B^2] \le \frac{K}{n}\sum_{i=1}^{n} E[V_i^2] \le (2M + 4C)\, K\, E\left[\|\hat g_k - g^*\|_2^2 + \|\hat a_k - a^*\|_2^2\right] \to 0,$$
and we can conclude the result that:
$$\sqrt{n}\left(\hat\theta - \theta\right) = \sqrt{n}\left(E_n[m_{a^*}(Z; g^*)] - E_Z[m_{a^*}(Z; g^*)]\right) + o_p(1),$$
where the latter term can be easily argued, invoking the Central Limit Theorem, to be asymptotically normal $N(0, \sigma_*^2)$ with $\sigma_*^2 = \mathrm{Var}(m_{a^*}(Z; g^*))$.
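To make the construction analyzed in this proof concrete, here is a minimal sketch (ours, not from the paper) of the cross-fitted debiased estimator $\hat\theta$ together with a simple plug-in standard error. It assumes the doubly robust moment $m_a(Z; g) = m(Z; g) + a(X)(Y - g(X))$, which matches the terms appearing in the bounds above; the function and argument names are hypothetical.

import numpy as np

def cross_fitted_theta(X, Y, folds, m, fit_g, fit_a):
    # folds : list of index arrays P_1, ..., P_K partitioning range(n)
    # m(x, g) : the functional's moment m(Z; g) evaluated at one observation
    # fit_g, fit_a : return callables g_hat_k, a_hat_k fit on the out-of-fold data
    n = len(Y)
    scores = np.empty(n)
    for P_k in folds:
        train = np.setdiff1d(np.arange(n), P_k)
        g_hat = fit_g(X[train], Y[train])   # regression nuisance for fold k
        a_hat = fit_a(X[train], Y[train])   # Riesz representer nuisance for fold k
        for i in P_k:
            scores[i] = m(X[i], g_hat) + a_hat(X[i]) * (Y[i] - g_hat(X[i]))
    theta_hat = scores.mean()               # theta_hat = E_n[m_{a_k}(Z; g_k)]
    se = scores.std(ddof=1) / np.sqrt(n)    # plug-in estimate of sigma_* / sqrt(n)
    return theta_hat, se

Under the conditions of Lemma 16, the reported standard error is consistent for $\sigma_*/\sqrt{n}$ because the scores converge in mean square to $m_{a^*}(Z; g^*)$.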
F.2 Proof of Normality without Consistency

Lemma 25. Suppose that $K = \Theta(1)$ and that for some $a^*$ and $g^*$ (not necessarily equal to $a$ and $g$), we have that for all $k \in [K]$: $\|\hat a_k - a^*\|_2 \to_p 0$ and $\|\hat g_k - g^*\|_2 \to_p 0$. Assume that:
$$\forall k \in [K]: \quad \sqrt{n}\, E[(a^*(X) - \hat a_k(X))(\hat g_k(X) - g^*(X))] \to_p 0,$$
and that $\hat g_k$ admits an asymptotically linear representation around the truth $g$, i.e.:
$$\sqrt{|P_k|}\left(\hat g_k(X) - g(X)\right) = \frac{1}{\sqrt{|P_k|}}\sum_{i \in P_k}\psi(X, Z_i; g) + o_p(1)$$
with $E[\psi(X, Z_i; g)\mid X] = 0$, and let:
$$\sigma_*^2 := \mathrm{Var}_{Z_i}\left(m_{a^*}(Z_i; g^*) + E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)]\right).$$
Assume that Condition 1 is satisfied and the variables $Y, g(X), a(X)$ are bounded a.s. for all $g \in \mathcal{G}$ and $a \in \mathcal{A}$. Then:
$$\sqrt{n}\left(\hat\theta - \theta\right) \to_d N(0, \sigma_*^2).$$
Similarly, if $\hat a_k$ has an asymptotically linear representation around the truth, then the statement above holds with:
$$\sigma_*^2 := \mathrm{Var}_{Z_i}\left(m_{a^*}(Z_i; g^*) + E_X[\psi(X, Z_i; a)\,(g(X) - g^*(X))]\right).$$

Proof. Observe that $\theta = E[m_a(Z; g)]$ for all $a$. Moreover, we can decompose:
$$\hat\theta - \theta = \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k}\left(m_{\hat a_k}(Z_i; \hat g_k) - E[m_{\hat a_k}(Z; \hat g_k)]\right) + \frac{1}{K}\sum_{k=1}^{K}\left(E[m_{\hat a_k}(Z; \hat g_k)] - E[m_{\hat a_k}(Z; g)]\right)$$
$$= \underbrace{\frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k}\left(m_{\hat a_k}(Z_i; \hat g_k) - E[m_{\hat a_k}(Z; \hat g_k)]\right)}_{A} + \underbrace{\frac{1}{K}\sum_{k=1}^{K}E[(a(X) - \hat a_k(X))(\hat g_k(X) - g(X))]}_{C}.$$
Suppose that for some $a^*$ and $g^*$ (not necessarily equal to $a$ and $g$), we have that $\|\hat a_k - a^*\|_2 \to_p 0$ and $\|\hat g_k - g^*\|_2 \to_p 0$. Then we can further decompose $A$ as:
$$A = E_n[m_{a^*}(Z; g^*)] - E[m_{a^*}(Z; g^*)] + \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k}\underbrace{m_{\hat a_k}(Z_i; \hat g_k) - m_{a^*}(Z_i; g^*) - E[m_{\hat a_k}(Z; \hat g_k) - m_{a^*}(Z; g^*)]}_{V_i}.$$
Denote with:
$$B := \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in P_k} V_i =: \frac{1}{n}\sum_{k=1}^{K} B_k.$$
As long as $n E[B^2] \to 0$, we have that $\sqrt{n}\, B \to_p 0$. The second moment of each $B_k$ is:
$$E[B_k^2] = \sum_{i,j \in P_k} E[V_i V_j] = \sum_{i,j \in P_k} E\left[E[V_i V_j \mid \hat a_k, \hat g_k]\right] = \sum_{i \in P_k} E[V_i^2],$$
where in the last equality we used the fact that, due to cross-fitting, for any $i \neq j$, $V_i$ is independent of $V_j$ and mean-zero, conditional on the nuisances $\hat a_k, \hat g_k$ estimated on the out-of-fold data for fold $k$. Moreover, by Jensen's inequality with respect to $\frac{1}{K}\sum_{k=1}^{K} B_k$:
$$E[B^2] = E\left[\left(\frac{1}{n}\sum_{k=1}^{K} B_k\right)^2\right] = \frac{K^2}{n^2}\, E\left[\left(\frac{1}{K}\sum_{k=1}^{K} B_k\right)^2\right] \le \frac{K}{n^2}\sum_{k=1}^{K} E[B_k^2] = \frac{K}{n^2}\sum_{k=1}^{K}\sum_{i \in P_k} E[V_i^2] = \frac{K}{n^2}\sum_{i=1}^{n} E[V_i^2].$$
Moreover, $E[V_i^2] \to 0$, by mean-squared continuity of the moment and by boundedness of the Riesz representer function class, the function class $\mathcal{G}$ and the variable $Y$. More elaborately:
$$E[V_i^2] \le E\left[\left(m_{\hat a_k}(Z_i; \hat g_k) - m_{a^*}(Z_i; g^*)\right)^2\right] \le 2\, E\left[\left(m(Z_i; \hat g_k) - m(Z_i; g^*)\right)^2\right] + 2\, E\left[\left(\hat a_k(X)(Y - \hat g_k(X)) - a^*(X)(Y - g^*(X))\right)^2\right].$$
The latter term can further be bounded as:
$$2\, E\left[\left(\hat a_k(X)(Y - \hat g_k(X)) - a^*(X)(Y - g^*(X))\right)^2\right] \le 4\, E\left[(\hat a_k(X) - a^*(X))^2 (Y - \hat g_k(X))^2\right] + 4\, E\left[a^*(X)^2 (g^*(X) - \hat g_k(X))^2\right] \le 4C\, E\left[\|\hat a_k - a^*\|_2^2 + \|\hat g_k - g^*\|_2^2\right],$$
assuming that $(Y - \hat g_k(X))^2 \le C$ and $a^*(X)^2 \le C$ a.s. Finally, by linearity of the operator and mean-squared continuity, we have:
$$E\left[\left(m(Z_i; \hat g_k) - m(Z_i; g^*)\right)^2\right] = E\left[\left(m(Z_i; \hat g_k - g^*)\right)^2\right] \le M\, E\left[\|\hat g_k - g^*\|_2^2\right].$$
Thus we have:
$$E[V_i^2] \le (2M + 4C)\, E\left[\|\hat a_k - a^*\|_2^2 + \|\hat g_k - g^*\|_2^2\right] \to 0.$$
Thus as long as $K = \Theta(1)$, we have that:
$$n E[B^2] \le \frac{K}{n}\sum_{i=1}^{n} E[V_i^2] \le (2M + 4C)\, K\, E\left[\|\hat g_k - g^*\|_2^2 + \|\hat a_k - a^*\|_2^2\right] \to 0,$$
and we can conclude the result that:
$$\sqrt{n}\, A = \sqrt{n}\left(E_n[m_{a^*}(Z; g^*)] - E[m_{a^*}(Z; g^*)]\right) + o_p(1).$$
Now we analyze term $C$. We will prove one of the two conditions in the "or" statement, namely when $\hat g_k$ has an asymptotically linear representation, i.e.
$$\sqrt{|P_k|}\left(\hat g_k(X) - g(X)\right) = \frac{1}{\sqrt{|P_k|}}\sum_{i \in P_k}\psi(X, Z_i; g) + o_p(1)$$
with $E[\psi(X, Z_i; g)\mid X] = 0$. The case when $\hat a_k$ is asymptotically linear can be proved analogously. Let:
$$C_k := E[(a(X) - \hat a_k(X))(\hat g_k(X) - g(X))].$$
We can then write:
$$C_k = E[(a^*(X) - \hat a_k(X))(\hat g_k(X) - g(X))] + E[(a(X) - a^*(X))(\hat g_k(X) - g(X))].$$
Since:
$$\sqrt{|P_k|}\, E[(a^*(X) - \hat a_k(X))(\hat g_k(X) - g(X))] \le \sqrt{|P_k|}\,\|a^* - \hat a_k\|_2\,\|\hat g_k - g\|_2 = \|a^* - \hat a_k\|_2\, O_p(1) = o_p(1),$$
we have that:
$$\sqrt{|P_k|}\, C_k = \sqrt{|P_k|}\, E[(a(X) - a^*(X))(\hat g_k(X) - g(X))] + o_p(1) = \frac{1}{\sqrt{|P_k|}}\sum_{i \in P_k}E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)] + o_p(1).$$
Since $K = \Theta(1)$ and $n/|P_k| \to K$, we then also have that:
$$\sqrt{n}\, C = \frac{\sqrt{n}}{K}\sum_{k=1}^{K}C_k = \frac{\sqrt{K}}{K}\sum_{k=1}^{K}\sqrt{|P_k|}\, C_k + o(1) = \frac{1}{\sqrt{K}}\sum_{k=1}^{K}\frac{1}{\sqrt{|P_k|}}\sum_{i \in P_k}E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)] + o_p(1)$$
$$= \frac{1}{\sqrt{n}}\sum_{k=1}^{K}\sum_{i \in P_k}E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)] + o_p(1) = \frac{1}{\sqrt{n}}\sum_{i \in [n]}E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)] + o_p(1) = \sqrt{n}\, E_n\left[E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)]\right] + o_p(1).$$
Combining the two terms:
$$\sqrt{n}\left(\hat\theta - \theta\right) = \sqrt{n}\left(E_n\left[m_{a^*}(Z; g^*) + E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)]\right] - E[m_{a^*}(Z; g^*)]\right) + o_p(1),$$
where the latter term can be easily argued, invoking the Central Limit Theorem, to be asymptotically normal $N(0, \sigma_*^2)$ with $\sigma_*^2 = \mathrm{Var}_{Z_i}\left(m_{a^*}(Z_i; g^*) + E_X[(a(X) - a^*(X))\,\psi(X, Z_i; g)]\right)$.
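As an illustration of the asymptotic linearity assumed in Lemma 25 (this example is ours and is not part of the original argument): suppose the regression is correctly specified as linear, $g(x) = x'\theta_0$, and $\hat g$ is fit by ordinary least squares on a sample of size $N$, so $\hat g(x) = x'\hat\theta$. Then standard OLS asymptotics give
$$\sqrt{N}\left(\hat g(x) - g(x)\right) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} x'\, E[XX']^{-1} X_i\left(Y_i - X_i'\theta_0\right) + o_p(1),$$
so the representation in the lemma holds with $\psi(x, Z_i; g) = x' E[XX']^{-1} X_i (Y_i - X_i'\theta_0)$, and $E[\psi(x, Z_i; g)\mid x] = x' E[XX']^{-1} E[X_i(Y_i - X_i'\theta_0)] = 0$ because the linear model is correctly specified. In this case the correction term inside $\sigma_*^2$ is $E_X[(a(X) - a^*(X))X']\, E[XX']^{-1} X_i (Y_i - X_i'\theta_0)$.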
F.3 Proof of Lemma 17

Proof. Observe that $\theta = E[m_a(Z; g)]$ for all $a$. Moreover, we can decompose:
$$\hat\theta - \theta = E_n[m_{\hat a}(Z; \hat g)] - E[m_{\hat a}(Z; \hat g)] + E[m_{\hat a}(Z; \hat g)] - E[m_{\hat a}(Z; g)] = E_n[m_{\hat a}(Z; \hat g)] - E[m_{\hat a}(Z; \hat g)] + E[(a(X) - \hat a(X))(\hat g(X) - g(X))].$$
Thus as long as
$$\sqrt{n}\, E[(a(X) - \hat a(X))(\hat g(X) - g(X))] \to_p 0,$$
we have that $\sqrt{n}(\hat\theta - \theta) = \sqrt{n}\, A + o_p(1)$, where $A := E_n[m_{\hat a}(Z; \hat g)] - E[m_{\hat a}(Z; \hat g)]$.
Suppose that for some $a^*$ and $g^*$ (not necessarily equal to $a$ and $g$), we have that $\|\hat a - a^*\|_2 \to_p 0$ and $\|\hat g - g^*\|_2 \to_p 0$. Then we can further decompose $A$ as:
$$A = E_n[m_{a^*}(Z; g^*)] - E[m_{a^*}(Z; g^*)] + E_n[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)] - E[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)].$$
Let $\delta_{n,\zeta} = \delta_n + c_0\sqrt{\frac{\log(c_1/\zeta)}{n}}$, where $\delta_n$ upper bounds the critical radius of the function classes $\mathcal{G}_B$, $m\circ\mathcal{G}_B$ and $\mathcal{A}_B$, where $B$ is set such that these sets contain functions that are uniformly bounded. By a concentration inequality, almost identical to that of Equation (11), we have that w.p. $1 - \zeta$, for all $g \in \mathcal{G}$ and $a \in \mathcal{A}$:
$$\left|E_n[m_a(Z; g) - m_{a^*}(Z; g^*)] - E[m_a(Z; g) - m_{a^*}(Z; g^*)]\right| \le O\left(\delta_{n,\zeta}\left(\|m\circ(g - g^*)\|_2\,\|a\|_{\mathcal{A}} + \|a - a^*\|_2\,\|g\|_{\mathcal{G}} + \|g - g^*\|_2\,\|a\|_{\mathcal{A}}\right) + \delta_{n,\zeta}^2\,\|a\|_{\mathcal{A}}\,\|g\|_{\mathcal{G}}\right).$$
Applying the latter for $\hat g, \hat a$ and invoking the mean-squared continuity, w.p. $1 - \zeta$:
$$\left|E_n[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)] - E[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)]\right| \le O\left(\delta_{n,\zeta}\, M\left(\|\hat a - a^*\|_2\,\|\hat g\|_{\mathcal{G}} + \|\hat g - g^*\|_2\,\|\hat a\|_{\mathcal{A}}\right) + \delta_{n,\zeta}^2\,\|\hat g\|_{\mathcal{G}}\,\|\hat a\|_{\mathcal{A}}\right).$$
If we let $\delta_{n,*} = \delta_n + c_0\sqrt{\frac{\log(c_1 n)}{n}}$, then we have that:
$$\left|E_n[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)] - E[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)]\right| = O_p\left(\delta_{n,*}\, M\left(\|\hat a - a^*\|_2\,\|\hat g\|_{\mathcal{G}} + \|\hat g - g^*\|_2\,\|\hat a\|_{\mathcal{A}}\right) + \delta_{n,*}^2\,\|\hat g\|_{\mathcal{G}}\,\|\hat a\|_{\mathcal{A}}\right).$$
If $\|\hat a - a^*\|_2, \|\hat g - g^*\|_2 = O_p(r_n)$ and $\|\hat a\|_{\mathcal{A}}, \|\hat g\|_{\mathcal{G}} = O_p(1)$, we have that:
$$\left|E_n[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)] - E[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)]\right| = O_p\left(M\,\delta_{n,*}\, r_n + \delta_{n,*}^2\right).$$
Thus as long as $\sqrt{n}\left(\delta_{n,*} r_n + \delta_{n,*}^2\right) \to 0$, we have that:
$$\sqrt{n}\left|E_n[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)] - E[m_{\hat a}(Z; \hat g) - m_{a^*}(Z; g^*)]\right| = o_p(1).$$
Thus we conclude that:
$$\sqrt{n}\left(\hat\theta - \theta\right) = \sqrt{n}\left(E_n[m_{a^*}(Z; g^*)] - E[m_{a^*}(Z; g^*)]\right) + o_p(1),$$
where the latter term can be easily argued, invoking the Central Limit Theorem, to be asymptotically normal $N(0, \sigma_*^2)$ with $\sigma_*^2 = \mathrm{Var}(m_{a^*}(Z; g^*))$.
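For concreteness (a numerical illustration we add; it is not part of the original proof): the condition $\sqrt{n}(\delta_{n,*} r_n + \delta_{n,*}^2) \to 0$ holds, for instance, when $\delta_{n,*} = O(n^{-1/3})$ and $r_n = O(n^{-1/4})$, since then $\sqrt{n}\,\delta_{n,*} r_n = O(n^{1/2 - 1/3 - 1/4}) = O(n^{-1/12}) \to 0$ and $\sqrt{n}\,\delta_{n,*}^2 = O(n^{1/2 - 2/3}) = O(n^{-1/6}) \to 0$. More generally, it suffices that $\delta_{n,*} = o(n^{-1/4})$ and $\delta_{n,*} r_n = o(n^{-1/2})$.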
F.4 Proof of Lemma 18

Proof. Let $h = (a, g)$ and $V(Z; h) = m_a(Z; g) - m_{a^*}(Z; g^*) - E[m_a(Z; g) - m_{a^*}(Z; g^*)]$. We argue that:
$$\sqrt{n}\, E_n\left[V(Z; \hat h)\right] = o_p(1).$$
The remainder of the proof is identical to the proof of Lemma 17. For the above property it suffices to show that $n\, E\left[E_n[V(Z; \hat h)]^2\right] \to 0$.
First we re-write the differences $V(Z; h) - V(Z; h')$:
$$V(Z; h) - V(Z; h') = m(Z; g - g') + (a(X) - a'(X))Y - a(X)g(X) + a'(X)g'(X) - \left(\langle a, g - g'\rangle - \langle a, g\rangle + \langle a', g'\rangle + \langle a - a', g\rangle\right).$$
By mean-squared continuity of the moment and boundedness of the functions we have that:
$$E\left[\left(V(Z; h) - V(Z; h')\right)^2\right] \le c\, E\left[\|h(X) - h'(X)\|_\infty^2\right]$$
for some constant $c$. Moreover, since, for every $x, y$: $x^2 \le y^2 + |x||x - y| + |y||x - y|$:
$$E\left[E_n[V(Z; \hat h)]^2\right] = \frac{1}{n^2}\sum_{i,j}E\left[V(Z_i; \hat h)\, V(Z_j; \hat h)\right]$$
$$\le \frac{1}{n^2}\sum_{i,j}\left(E\left[V(Z_i; \hat h_{-i,j})\, V(Z_j; \hat h_{-i,j})\right] + 2\, E\left[\left|V(Z_i; \hat h_{-i,j})\right|\left|V(Z_j; \hat h_{-i,j}) - V(Z_j; \hat h)\right|\right]\right)$$
$$\le \frac{1}{n^2}\sum_{i,j}\left(E\left[V(Z_i; \hat h_{-i,j})\, V(Z_j; \hat h_{-i,j})\right] + 2\sqrt{E\left[V(Z_i; \hat h_{-i,j})^2\right]}\sqrt{E\left[\left(V(Z_j; \hat h_{-i,j}) - V(Z_j; \hat h)\right)^2\right]}\right)$$
$$\le \frac{1}{n^2}\sum_{i,j}\left(E\left[V(Z_i; \hat h_{-i,j})\, V(Z_j; \hat h_{-i,j})\right] + 2c\sqrt{E\left[V(Z_i; \hat h_{-i,j})^2\right]}\sqrt{E\left[\|\hat h_{-i,j}(X_j) - \hat h(X_j)\|_\infty^2\right]}\right)$$
$$\le \frac{1}{n^2}\sum_{i,j}\left(E\left[V(Z_i; \hat h_{-i,j})\, V(Z_j; \hat h_{-i,j})\right] + 8c\,\beta_{n-2}\sqrt{E\left[V(Z_i; \hat h_{-i,j})^2\right]}\right).$$
For every $i \neq j$ we have:
$$E\left[V(Z_i; \hat h_{-i,j})\, V(Z_j; \hat h_{-i,j})\right] = E\left[E\left[V(Z_i; \hat h_{-i,j})\, V(Z_j; \hat h_{-i,j}) \mid \hat h_{-i,j}\right]\right] = E\left[E\left[V(Z_i; \hat h_{-i,j}) \mid \hat h_{-i,j}\right] E\left[V(Z_j; \hat h_{-i,j}) \mid \hat h_{-i,j}\right]\right] = 0,$$
and
$$\sqrt{E\left[V(Z; \hat h_{-i,j})^2\right]} \le O\left(\sqrt{E\left[\|\hat a_{-i,j} - a^*\|_2^2 + \|\hat g_{-i,j} - g^*\|_2^2\right]}\right) = O(r_{n-2}),$$
$$E\left[V(Z; \hat h_{-i})^2\right] \le O\left(E\left[\|\hat a_{-i} - a^*\|_2^2 + \|\hat g_{-i} - g^*\|_2^2\right]\right) = O(r_{n-1}^2).$$
Thus we get that:
$$n\, E\left[E_n[V(Z; \hat h)]^2\right] \le \frac{1}{n}\sum_{i=1}^{n}E\left[V(Z_i; \hat h_{-i})^2\right] + O\left(n\,\beta_{n-2}\, r_{n-2}\right) = O\left(r_{n-1}^2 + n\,\beta_{n-2}\, r_{n-2}\right).$$
Thus it suffices to assume that $r_{n-1}^2 + n\,\beta_{n-2}\, r_{n-2} \to 0$.