Debiased Kernel Methods
Rahul Singh
MIT Economics [email protected]
Abstract
I propose a practical procedure based on bias correction and sample splitting to calculate confidence intervals for functionals of generic kernel methods, i.e. nonparametric estimators learned in a reproducing kernel Hilbert space (RKHS). For example, an analyst may desire confidence intervals for functionals of kernel ridge regression or kernel instrumental variable regression. The framework encompasses (i) evaluations over discrete domains, (ii) treatment effects of discrete treatments, and (iii) incremental treatment effects of continuous treatments. For the target quantity, whether it is (i)-(iii), I prove pointwise $\sqrt{n}$ consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. I show that the classic assumptions of RKHS learning theory also imply inference.

Introduction

The reproducing kernel Hilbert space (RKHS) is a popular nonparametric setting in machine learning. For example, kernel ridge regression is a classic algorithm for prediction, with impressive attributes. Computationally, it has a closed form solution; empirically, it adapts well to local nonlinearity while preserving smoothness; statistically, it is minimax optimal with respect to mean square error [15]. However, practical confidence intervals for functionals of this classic procedure, under general and weak conditions, have remained elusive.

A recent literature proposes kernel methods for causal learning problems, such as nonparametric instrumental variable regression [81] and nonparametric treatment effects [83, 80], that share the impressive attributes of kernel ridge regression. The motivation for these causal kernel methods is policy evaluation in social science and epidemiology. In social science and epidemiology, however, confidence intervals are essential. By providing confidence intervals, this paper disembarrasses a broad class of machine learning algorithms for practical use.

I provide conceptual, algorithmic, and statistical contributions to RKHS theory.
Conceptual.
The generality of the framework is threefold. First, I unify (i) evaluations over discrete domains, (ii) treatment effects of discrete treatments, and (iii) incremental treatment effects of continuous treatments into one general inference problem. Second, I allow for any nonparametric kernel method with a known learning rate. Third, I allow for the covariates to be discrete or continuous, and low, high, or infinite dimensional. As such, the main results apply to kernel methods trained on texts, graphs, or images.
Algorithmic.
I provide a general purpose procedure to automatically debias (i)-(iii) for pointwise inference. The debiasing is novel, and I show that it has a one line, closed form solution. Its only hyperparameters are kernel hyperparameters and ridge regression penalties, just like kernel ridge regression. The former have well known heuristics, and the latter are easily tuned by cross validation.
Statistical.
For the target quantity, whether it is (i)-(iii), I prove pointwise $\sqrt{n}$ consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. The analysis explicitly accounts for each source of error in any finite sample size. Formally, I show that the classic assumptions of RKHS learning theory also imply inference.

The structure of the paper is as follows. Section 1.1 describes related work. Section 2 articulates the general inference problem. Section 3 proposes the general purpose debiasing procedure and furnishes confidence intervals. Sections 4 and 5 present the main theoretical contributions. Section 6 concludes.
Related work

By focusing on functionals, i.e. scalar summaries, of nonparametric quantities, this paper continues the tradition of classic semiparametric statistical theory [50, 32, 35, 65, 48, 71, 97, 12, 56, 69, 98, 13, 57, 1, 58, 2, 92, 49, 3]. Whereas classic semiparametric theory studies functionals of densities or regressions over low dimensional domains, I study functionals of generic kernel methods over domains that may include discrete treatments and low, high, or infinite dimensional covariates. In classic semiparametric theory, an object called the Riesz representer of the functional appears in asymptotic variance calculations [56, 36]. For exactly the same reasons, it appears in the practical procedure for confidence intervals that I propose.
Debiasing.
In asymptotic inference, the Riesz representer is inevitable. Motivated by this insight, a growing literature directly incorporates the Riesz representer into estimation, which amounts to debiasing known estimators [32, 13, 99, 7, 8, 9, 10, 41, 42, 43, 93, 64, 23, 60, 66, 38, 39, 40, 14, 101, 102]. Doubly robust estimating equations serve this purpose [70, 69, 96, 95, 51, 91]. A geometric perspective on debiasing emphasizes Neyman orthogonality: by debiasing, the learning problem for the functional of interest becomes orthogonal to the learning problem for the underlying nonparametric object [61, 62, 98, 67, 100, 9, 10, 22, 6, 21, 30]. In this work, I debias kernel methods using doubly robust estimating equations. The learning problem for (i)-(iii) becomes orthogonal to the learning problem for the underlying kernel method.
Sample splitting.
With debiasing alone, a key challenge remains: for inference, the function class in which the nonparametric quantity is learned must be Donsker. However, popular nonparametric settings in machine learning such as the RKHS are not Donsker. A solution to this challenging issue is to combine debiasing with sample splitting [11, 77, 48, 98, 67]. The debiased machine learning (DML) and targeted maximum likelihood literatures are responsible for this insight [76, 74, 75, 96, 100, 95, 28, 94, 22, 46, 21, 45]. In particular, DML delivers simple sufficient conditions for inference on functionals in terms of learning rates of the underlying regression and the Riesz representer [22, 21]. The analysis of Section 4 is written at this level of generality. Whereas the existing DML theory is asymptotic, I provide the first finite sample analysis of DML with black box machine learning.
Automatic debiasing.
Since standard statistical learning theory provides rates for nonparametric regression, what remains is estimation and analysis that provides sufficiently fast rates for the Riesz representer. However, the Riesz representer may be a complex object. Even for simple functionals such as policy effects, its closed form involves ratios of densities, which would imply slow rates. A crucial insight is that the Riesz representer is directly identified from data; with some creativity, it can be estimated and analyzed directly, without estimating or analyzing its components [68, 59, 4, 24, 26, 33, 34, 82, 73, 25]. For the RKHS context, I propose a novel estimator of the Riesz representer that generalizes kernel ridge regression. I derive its fast rate in Section 5.
Kernel methods.
Recent developments in the kernel methods literature motivate this project. In particular, a new toolkit addresses causal learning problems such as instrumental variable regression [81, 55, 29], treatment effects under selection on observables [63, 54, 83], and treatment effects with negative controls [80]. Confidence intervals are essential for practical use in policy evaluation, yet they have remained elusive until now. As explained above, a key contribution is fast rate analysis for a new Riesz representer estimator. My analysis directly builds on the seminal work of [15], which provides minimax optimal rates for kernel ridge regression. As such, classic learning theory assumptions imply the learning rate for the Riesz representer and hence inference for the functional.
Framework
The general inference problem is to find a confidence interval for $\theta_0 \in \mathbb{R}$ where
$$\theta_0 = E[m(W, \gamma_0)], \qquad \gamma_0 \in \mathcal{H}.$$
$m : \mathcal{W} \times \mathcal{H} \to \mathbb{R}$ is an abstract formula. $W \in \mathcal{W}$ is a concatenation of random variables in the model excluding the outcome $Y \in \mathcal{Y} \subset \mathbb{R}$. $\mathcal{H}$ is an RKHS, consisting of functions of the form $\gamma : \mathcal{W} \to \mathbb{R}$. I denote its kernel by $k : \mathcal{W} \times \mathcal{W} \to \mathbb{R}$, and its feature map by $\phi : \mathcal{W} \to \mathcal{H}$, $w \mapsto k(w, \cdot)$.

The first set of examples are functionals of kernel ridge regression. These statistical quantities have causal interpretations under the assumption of unconfoundedness, also called selection on observables. The simplest example is evaluation over a discrete domain.

Example 2.1 (Evaluation of nonparametric regression). $W$ are covariates, and $\gamma_0(w) = E[Y \mid W = w]$. Then evaluation at location $w$ is $\theta_0 = \gamma_0(w)$. I require that $W$ are discrete.

Next, I turn to classic treatment effects in causal inference: average treatment effect (ATE), average treatment effect with distribution shift (ATE-DS), average treatment on the treated (ATT), and conditional average treatment effect (CATE). I generalize each of these classic effects to have discrete treatment values rather than binary treatment values. The expressions follow from [72, 20, 37].
Example 2.2 (Treatment effects of discrete treatments under selection on observables). $W = (D, X)$ are treatment and covariates, and $\gamma_0(d, x) = E[Y \mid D = d, X = x]$. For heterogeneous effects, I replace $X$ with $(V, X)$ where $V$ is the low dimensional subcovariate of interest. I require that $(D, V)$ are discrete.

1. ATE. The mean potential outcome of treatment value $d$ is given by $\theta_0 = E[\gamma_0(d, X)]$.
2. ATE-DS. The mean potential outcome of treatment value $d$ under distribution shift is given by $\theta_0 = \tilde{E}[\gamma_0(d, X)]$, where $\tilde{E}$ is the expectation over an alternative population.
3. ATT. The mean potential outcome of treatment value $d'$ for the subpopulation who received treatment value $d$ is given by $\beta_0 = \theta_0 / P(d)$ where $\theta_0 = E[\gamma_0(d', X)\, 1\{D = d\}]$.
4. CATE. The mean potential outcome of treatment value $d$ for the subpopulation with subcovariate value $v$ is given by $\beta_0 = \theta_0 / P(v)$ where $\theta_0 = E[\gamma_0(d, v, X)\, 1\{V = v\}]$.

In a couple of instances, the quantity of interest is $\beta_0 \in \mathbb{R}$, which is a ratio of $\theta_0 \in \mathbb{R}$ and a marginal probability. By standard arguments, inference for $\theta_0$ implies inference for $\beta_0$ by the delta method. In the examples so far, I have restricted treatment to be discrete. I now allow treatment to be continuous. The expression follows from [37, 73].

Example 2.3 (Incremental treatment effects of continuous treatments under selection on observables). $W = (D, X)$ are treatment and covariates, and $\gamma_0(d, x) = E[Y \mid D = d, X = x]$. I now allow $D$ to be continuous. The incremental treatment effect weighted by the density $\omega$ over treatment values is $\theta_0 = E[S(U)\gamma_0(U, X)]$ where $S(u) = -\omega'(u)/\omega(u)$ and $U$ is drawn from $\omega$ independently of $X$.

The examples listed so far are statistical quantities that coincide with causal quantities under the assumption of no unobserved confounding. Often, in observational data, unobserved confounding drives spurious correlations. The next set of examples are statistical quantities that coincide with causal quantities despite unobserved confounding. They are functionals of kernel instrumental variable regression. In instrumental variable regression, the analyst has access to auxiliary variables called instruments. Again, I begin with evaluation.
Example 2.4 (Evaluation of nonparametric instrumental variable regression). $W = (X, Z)$ are covariates and instruments. $\gamma_0(x)$ is the solution to $E[Y \mid Z = z] = E[\gamma_0(X) \mid Z = z]$. Then evaluation at location $x$ is $\theta_0 = \gamma_0(x)$. I require that $X$ are discrete.

Next, I revisit the classic treatment effects in causal inference. This time, I relax the assumption of no unobserved confounding. In its place, I assume access to two auxiliary variables called negative controls, which leads to new statistical quantities. The expressions follow from [52, 53, 90].

Example 2.5 (Treatment effects of discrete treatments under negative controls). $W = (D, X, \tilde{W})$ are treatment, covariates, and negative control outcome. $Z$ is the negative control treatment. $\gamma_0(d, x, \tilde{w})$ is the solution to $E[Y \mid D = d, X = x, Z = z] = E[\gamma_0(D, X, \tilde{W}) \mid D = d, X = x, Z = z]$. For heterogeneous effects, I replace $X$ with $(V, X)$ where $V$ is the low dimensional subcovariate of interest. I require that $(D, V)$ are discrete.

1. ATE. The mean potential outcome of treatment value $d$ is given by $\theta_0 = E[\gamma_0(d, X, \tilde{W})]$.
2. ATE-DS. The mean potential outcome of treatment value $d$ under distribution shift is given by $\theta_0 = \tilde{E}[\gamma_0(d, X, \tilde{W})]$, where $\tilde{E}$ is the expectation over an alternative population.
3. ATT. The mean potential outcome of treatment value $d'$ for the subpopulation who received treatment value $d$ is given by $\beta_0 = \theta_0 / P(d)$ where $\theta_0 = E[\gamma_0(d', X, \tilde{W})\, 1\{D = d\}]$.
4. CATE. The mean potential outcome of treatment value $d$ for the subpopulation with subcovariate value $v$ is given by $\beta_0 = \theta_0 / P(v)$ where $\theta_0 = E[\gamma_0(d, v, X, \tilde{W})\, 1\{V = v\}]$.

As before, the delta method transforms inference about $\theta_0 \in \mathbb{R}$ into inference about $\beta_0 \in \mathbb{R}$. Finally, I allow treatment to be continuous in the negative control setting. The expression generalizes [37, 73].

Example 2.6 (Incremental treatment effects of continuous treatments under negative controls). $W = (D, X, \tilde{W})$ are treatment, covariates, and negative control outcome. $Z$ is the negative control treatment. $\gamma_0(d, x, \tilde{w})$ is the solution to $E[Y \mid D = d, X = x, Z = z] = E[\gamma_0(D, X, \tilde{W}) \mid D = d, X = x, Z = z]$. I now allow $D$ to be continuous. The incremental treatment effect weighted by the density $\omega$ over treatment values is $\theta_0 = E[S(U)\gamma_0(U, X, \tilde{W})]$ where $S(u) = -\omega'(u)/\omega(u)$ and $U$ is drawn from $\omega$ independently of $(X, \tilde{W})$.

To prove validity of the confidence interval, I require that the formula $m$ is mean square continuous.

Assumption 2.1 (Mean square continuity). There exists $\bar{L}_m < \infty$ s.t. $E[m(W, \gamma)^2] \le \bar{L}_m \cdot E[\gamma(W)^2]$.

This condition will be key in Section 4, where I reduce the problem of inference for $\theta_0$ into the problem of learning $(\gamma_0, \alpha_0^{\min})$, where $\alpha_0^{\min}$ is introduced below. It is a powerful condition, yet it is easily satisfied in all of the examples above.

Proposition 2.1 (Verifying continuity for examples). For each of the evaluation and discrete treatment effect examples above, Assumption 2.1 holds if the corresponding propensity scores are bounded away from zero. For the incremental treatment effect examples, Assumption 2.1 holds if $S(u)$ is bounded, and the ratio of $\omega(u)$ over the propensity is bounded.

It is helpful to define the operator $M : \gamma(\cdot) \mapsto m(\cdot, \gamma)$ to represent the formula $m$. Wherever possible, the notation is consistent with companion work [25], in which we propose and analyze confidence intervals for adversarial estimators over generic function spaces.
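The abstract formula $m$ is easiest to see in code. The following is a minimal Python sketch of $m$ for Example 2.1 and for the ATE of Example 2.2; the helper names and the assumption that a fitted regression is available as a callable gamma_hat are mine, for illustration only.

import numpy as np

def m_evaluation(W, gamma_hat, w_loc):
    # Example 2.1: m(W_i, gamma) = gamma(w_loc), the same value for every observation
    return np.full(len(W), gamma_hat(w_loc))

def m_ate(D, X, gamma_hat, d_loc):
    # Example 2.2, ATE: m(W_i, gamma) = gamma(d_loc, X_i), treatment set to d_loc
    return gamma_hat(np.full(len(X), d_loc), X)

# np.mean(m_ate(D, X, gamma_hat, d_loc)) is the plug-in estimator of theta_0;
# Section 3 adds the bias correction alpha_hat(W) * (Y - gamma_hat(W)).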
For the present analysis, I require further structure on the operator $M$.

Assumption 2.2 (Bounded linear operator). $M \in \mathcal{L}(\mathcal{H}, \mathcal{H})$, i.e. $M$ is a bounded linear operator from $\mathcal{H}$ to $\mathcal{H}$.

As long as the RKHS includes constant and Dirac functions with respect to evaluated components, I can show $M : \mathcal{H} \to \mathcal{H}$ for the examples above. Moreover, I can show that this operator is bounded and linear. I appeal to the fact that the RKHS is a space in which evaluation is a bounded linear functional. This is a key point, since $\|M\|_{\mathrm{op}} < \infty$ will underpin learning rates in Section 5.

Proposition 2.2 (Verifying boundedness for examples). Construct the RKHS $\mathcal{H}$ over $\mathcal{W}$ as the tensor product of RKHSs $\{\mathcal{H}_j\}$ over components $\{\mathcal{W}_j\}$ of $\mathcal{W}$, i.e. $k(w, w') = \prod_j k_j(w_j, w'_j)$. Moreover, assume each $k_j$ is bounded by some $\kappa_j < \infty$. If $W_j$ is a discrete random variable, assume there exists a constant $c_j > 0$ s.t. $k_j(w_j, w'_j) - c_j$ is positive definite over $\mathcal{W}_j$. Then in each of the examples above, Assumption 2.2 holds. For incremental effects, I additionally require $S \in \mathcal{H}_d$, which is an indirect restriction on $\omega$.

Proposition 2.3 (Consequences of RKHS construction). If a kernel $k_j$ is discrete then its RKHS $\mathcal{H}_j$ contains Dirac functions. If such a $c_j$ exists, then $\mathcal{H}_j$ contains constant functions. Let $c_j^{\max}$ be the largest such $c_j$ satisfying the desired property. Then $\|1_{\mathcal{H}_j}\|^2_{\mathcal{H}_j} = 1/c_j^{\max}$.

From the closed form expression for a kernel $k_j$, $c_j^{\max}$ is often obvious by inspection. With these results, I can now write
$$m(w, \gamma) = [M\gamma](w) = \langle M\gamma, \phi(w) \rangle_{\mathcal{H}} = \langle \gamma, M^*\phi(w) \rangle_{\mathcal{H}},$$
where $M^*$ is the adjoint of $M$.

An RKHS $\mathcal{H}$ is a subset of $L^2(P)$, the space of square integrable functions with respect to the measure $P$. In this discussion, I temporarily set aside RKHS geometry and focus on $L^2(P)$ geometry. Denote the $L^2(P)$ norm by $\|\cdot\|_2$. In anticipation of later analysis, I consider the restricted model of $\gamma_0 \in \Gamma \subset \mathcal{H} \subset L^2(P)$, where $\Gamma$ is some convex function space. In RKHS learning theory, mean square rates are adaptive to the smoothness of $\gamma_0$, encoded by $\gamma_0 \in \Gamma$.

Proposition 2.4 (Riesz representation). Suppose Assumption 2.1 holds. Further suppose $\gamma_0 \in \Gamma$. Then there exists a Riesz representer $\alpha_0 \in L^2(P)$ s.t. $E[m(W, \gamma)] = E[\alpha_0(W)\gamma(W)]$ for all $\gamma \in \Gamma$. Moreover, there exists a unique minimal Riesz representer $\alpha_0^{\min} \in \mathrm{closure}(\mathrm{span}(\Gamma))$ that satisfies this equation.

Riesz representation delivers a doubly robust formulation of the target $\theta_0 \in \mathbb{R}$. We view $(\gamma_0, \alpha_0^{\min})$ as nuisance parameters that we must learn in order to learn and infer $\theta_0$:
$$0 = E[\psi_0(W)], \qquad \psi_0(w) = \psi(w, \theta_0, \gamma_0, \alpha_0^{\min}), \qquad \psi(w, \theta, \gamma, \alpha) = m(w, \gamma) + \alpha(w)[y - \gamma(w)] - \theta.$$
The term $\alpha(w)[y - \gamma(w)]$ is the product of the Riesz representer and the kernel method residual. It serves as a bias correction for the term $m(w, \gamma)$. This formulation is doubly robust in the sense that the moment function for $\theta_0$ remains valid if either $\gamma_0$ or $\alpha_0^{\min}$ is correct, i.e.
$$0 = E[\psi(W, \theta_0, \gamma_0, \alpha)] \ \forall \alpha \in L^2(P), \qquad 0 = E[\psi(W, \theta_0, \gamma, \alpha_0^{\min})] \ \forall \gamma \in L^2(P).$$
While any Riesz representer will suffice for valid learning and inference, the minimal Riesz representer $\alpha_0^{\min}$ confers semiparametric efficiency [24, Theorem 4.2].
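As a concrete illustration of the doubly robust formulation (my own worked example, using the inverse propensity identity that appears in the proof of Proposition 2.1 in the appendix), consider the ATE of Example 2.2. There, a Riesz representer is the inverse propensity weight, and $\psi$ reduces to the familiar augmented inverse propensity weighted score:
\[
\alpha_0(w) = \frac{1\{D = d\}}{P(D = d \mid X = x)}, \qquad
\psi(w, \theta, \gamma, \alpha_0) = \gamma(d, x) + \frac{1\{D = d\}}{P(D = d \mid X = x)}\bigl[y - \gamma(d, x)\bigr] - \theta .
\]
Either a correct regression $\gamma_0$ or a correct propensity makes this moment function mean zero at $\theta_0$, which is exactly the double robustness described above.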
Algorithm

The goal of this paper is general purpose learning and inference for $\theta_0$, where $\theta_0 \in \mathbb{R}$ may be an (i) evaluation over a discrete domain, (ii) treatment effect of a discrete treatment, or (iii) incremental treatment effect of a continuous treatment. In Section 2, I demonstrated that any such $\theta_0$ is a mean square continuous functional of an RKHS function $\gamma_0 \in \mathcal{H}$ and hence has a unique minimal Riesz representer $\alpha_0^{\min}$. In this section, I describe a meta algorithm to turn kernel estimators $\hat{\gamma}$ of $\gamma_0$ and $\hat{\alpha}$ of $\alpha_0^{\min}$ into an estimator $\hat{\theta}$ of $\theta_0$ such that $\hat{\theta}$ has a valid and practical confidence interval. This meta algorithm is precisely debiased machine learning (DML) [22, 21].

Recall that $\hat{\gamma}$ may be any one of a number of different kernel methods such as kernel ridge regression or kernel instrumental variable regression. To preserve this generality, I do not instantiate a choice of $\hat{\gamma}$; I treat it as a black box. In subsequent analysis, I will only require that $\hat{\gamma}$ converges to $\gamma_0$ in mean square error. This mean square rate is guaranteed by existing learning theory. As such, the inferential theory builds directly on the learning theory.

The target estimator $\hat{\theta}$ as well as its confidence interval will depend on nuisance estimators $\hat{\gamma}$ and $\hat{\alpha}$. For now, I refrain from instantiating the estimator $\hat{\alpha}$ for $\alpha_0^{\min}$. As we will see in subsequent analysis, the general theory only requires that $\hat{\alpha}$ converges to $\alpha_0^{\min}$ in mean square error. This property can be verified by extending classic learning theory arguments, which I reserve for Section 5. As a preview, I will take
$\Gamma = \mathcal{H}_c$, which is a smoother RKHS than $\mathcal{H}$. Conveniently, $\mathrm{closure}(\mathrm{span}(\mathcal{H}_c)) = \mathcal{H}_c$, and $\mathcal{H}_1 = \mathcal{H}$.

Algorithm 3.1 (Target and confidence interval). Partition the sample into folds $\{I_\ell\}_{\ell = 1:L}$. Denote by $I_\ell^c$ the observations not in fold $I_\ell$.
1. For each fold $\ell$, estimate $\hat{\gamma}_\ell$ and $\hat{\alpha}_\ell$ from observations in $I_\ell^c$.
2. Estimate $\hat{\theta}$ as
$$\hat{\theta} = \frac{1}{n} \sum_{\ell=1}^{L} \sum_{i \in I_\ell} \left\{ m(W_i, \hat{\gamma}_\ell) + \hat{\alpha}_\ell(W_i)[Y_i - \hat{\gamma}_\ell(W_i)] \right\}.$$
3. Estimate its $(1 - a) \cdot 100\%$ confidence interval as $\hat{\theta} \pm c_a \hat{\sigma} / \sqrt{n}$, where
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{\ell=1}^{L} \sum_{i \in I_\ell} \left\{ m(W_i, \hat{\gamma}_\ell) + \hat{\alpha}_\ell(W_i)[Y_i - \hat{\gamma}_\ell(W_i)] - \hat{\theta} \right\}^2$$
and $c_a$ is the $1 - a/2$ quantile of the standard Gaussian.

Algorithm 3.1 is a meta algorithm that takes as inputs the sample, the algorithm $\hat{\gamma}$ for $\gamma_0$, and the algorithm $\hat{\alpha}$ for $\alpha_0^{\min}$.
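The following is a minimal Python sketch of Algorithm 3.1. The functions fit_gamma, fit_alpha, and m_of are hypothetical stand-ins supplied by the user, representing any kernel method with a mean square rate, the Riesz representer estimator developed below, and the formula $m$ of Section 2; none of these names come from the paper.

import numpy as np
from scipy.stats import norm

def dml_confidence_interval(W, Y, fit_gamma, fit_alpha, m_of, n_folds=5, a=0.05):
    """Algorithm 3.1: cross-fitted debiased estimate and (1 - a) confidence interval.

    fit_gamma(W_train, Y_train) -> callable gamma_hat(W)
    fit_alpha(W_train)          -> callable alpha_hat(W)
    m_of(W_eval, gamma_hat)     -> array with entries m(W_i, gamma_hat)"""
    n = len(Y)
    folds = np.array_split(np.random.permutation(n), n_folds)
    psi = np.empty(n)
    for idx in folds:
        comp = np.setdiff1d(np.arange(n), idx)          # observations outside the fold
        gamma_hat = fit_gamma(W[comp], Y[comp])
        alpha_hat = fit_alpha(W[comp])
        # doubly robust score: plug-in term plus bias correction
        psi[idx] = m_of(W[idx], gamma_hat) + alpha_hat(W[idx]) * (Y[idx] - gamma_hat(W[idx]))
    theta_hat = psi.mean()
    sigma_hat = np.sqrt(np.mean((psi - theta_hat) ** 2))
    c_a = norm.ppf(1 - a / 2)                            # 1 - a/2 Gaussian quantile
    half = c_a * sigma_hat / np.sqrt(n)
    return theta_hat, (theta_hat - half, theta_hat + half)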
A broad literature already proposes and analyzes estimators $\hat{\gamma}$. I now propose an estimator $\hat{\alpha}$ that extends kernel ridge regression. As such, existing learning guarantees for kernel ridge regression will extend to $\hat{\alpha}$.

For readability, in the main text I focus on the Riesz representer for functionals of kernel ridge regression. The Riesz representer for functionals of kernel instrumental variable regression is a straightforward extension that incurs additional notation, which I reserve for the Appendix. I motivate the new estimator by the following property.

Proposition 3.1 (Riesz representer loss). Suppose the kernel $k(w, w')$ is bounded. Then
$$\alpha_0^{\min} \in \operatorname*{argmin}_{\alpha \in \mathcal{H}} \mathcal{L}(\alpha), \qquad \mathcal{L}(\alpha) = -2\langle \alpha, M^*\mu \rangle_{\mathcal{H}} + \langle \alpha, T\alpha \rangle_{\mathcal{H}},$$
where $T = E[\phi(W) \otimes \phi(W)]$ is the uncentered covariance operator and $\mu = E[\phi(W)]$ is the kernel mean embedding.

The quantities $(\mu, T)$ generalize the sufficient statistics $(\tilde{M}, \tilde{G})$ in automatic debiased machine learning (Auto-DML) [24, 26]. In Auto-DML, which uses $p$ explicit basis functions, $\tilde{M} \in \mathbb{R}^p$ is the moment vector and $\tilde{G} \in \mathbb{R}^{p \times p}$ is the covariance matrix. In my approach, which uses the countable spectrum of the kernel $k$ as implicit basis functions, $\mu \in \mathcal{H}$ is the kernel mean embedding [85] and $T : \mathcal{H} \to \mathcal{H}$ is the covariance operator [31].

Next, I define the regularized Riesz representer in this context. The regularization is a ridge penalty with regularization parameter $\lambda$.

Definition 3.1 (Kernel ridge Riesz representer: Population). $\alpha_\lambda = \operatorname*{argmin}_{\alpha \in \mathcal{H}} \mathcal{L}_\lambda(\alpha)$, where $\mathcal{L}_\lambda(\alpha) = \mathcal{L}(\alpha) + \lambda\|\alpha\|^2_{\mathcal{H}}$.

I define the estimator as the empirical analogue. To lighten notation, I abstract from sample splitting and consider the setting where $n$ observations are used.

Definition 3.2 (Kernel ridge Riesz representer: Sample). $\hat{\alpha} = \operatorname*{argmin}_{\alpha \in \mathcal{H}} \mathcal{L}_{n\lambda}(\alpha)$, where $\mathcal{L}_{n\lambda}(\alpha) = -2\langle \alpha, M^*\hat{\mu} \rangle_{\mathcal{H}} + \langle \alpha, \hat{T}\alpha \rangle_{\mathcal{H}} + \lambda\|\alpha\|^2_{\mathcal{H}}$, $\hat{T} = \frac{1}{n}\sum_{i=1}^{n}[\phi(W_i) \otimes \phi(W_i)]$, and $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\phi(W_i)$.

As a kernel estimator, $\hat{\alpha}$ can be implemented by simple matrix operations. Whereas kernel methods typically involve operations on the matrix $K^{(1)} \in \mathbb{R}^{n \times n}$ with $(i, j)$-th entry $k(W_i, W_j)$, this kernel method involves an extended kernel matrix $K \in \mathbb{R}^{2n \times 2n}$. Intuitively, if $m$ is a partial or complete evaluation, then the same partial or complete evaluations should apply to the dictionary of features.

Wherever possible, I preserve the notation of companion work on adversarial estimation [25]. Given a formula $m$, define $\phi^{(m)}(w) = M^*\phi(w)$. Define the operator $\Phi : \mathcal{H} \to \mathbb{R}^n$ with $i$-th row $\langle \phi(W_i), \cdot \rangle_{\mathcal{H}}$ and likewise the operator $\Phi^{(m)} : \mathcal{H} \to \mathbb{R}^n$ with $i$-th row $\langle \phi^{(m)}(W_i), \cdot \rangle_{\mathcal{H}}$. Finally, define the concatenated operator $\Psi : \mathcal{H} \to \mathbb{R}^{2n}$ constructed by concatenating $\Phi$ and $\Phi^{(m)}$. By construction, $K = \Psi\Psi^*$.
For intuition, it is helpful to conceptualize these operators as matrices. Informally,
$$\Psi = \begin{pmatrix} \Phi \\ \Phi^{(m)} \end{pmatrix}, \qquad K = \begin{pmatrix} K^{(1)} & K^{(2)} \\ K^{(3)} & K^{(4)} \end{pmatrix} = \begin{pmatrix} \Phi\Phi^* & \Phi(\Phi^{(m)})^* \\ \Phi^{(m)}\Phi^* & \Phi^{(m)}(\Phi^{(m)})^* \end{pmatrix}.$$
Note that $\{K^{(j)}\}_{j \in [4]} \in \mathbb{R}^{n \times n}$ and hence $K \in \mathbb{R}^{2n \times 2n}$ can be computed from data, though they depend on the choice of formula $m$.

Proposition 3.2 (Extended kernel matrix: Computation). Suppose the conditions of Proposition 2.2 hold. In the case of complete evaluation with $m(W, \gamma) = \gamma(w)$,
$$K^{(1)}_{ij} = k(W_i, W_j), \qquad K^{(2)}_{ij} = K^{(3)}_{ji} = k(W_i, w), \qquad K^{(4)}_{ij} = \frac{k(W_i, w)\, k(W_j, w)}{c^{\max}},$$
where $c^{\max} = \prod_j c_j^{\max}$. In the case of partial evaluation with $m(W, \gamma) = \gamma(d, X)$,
$$K^{(1)}_{ij} = k(D_i, D_j)\, k(X_i, X_j), \qquad K^{(2)}_{ij} = K^{(3)}_{ji} = k(D_i, d)\, k(X_i, X_j), \qquad K^{(4)}_{ij} = \frac{k(D_i, d)\, k(D_j, d)}{c_d^{\max}}\, k(X_i, X_j).$$

With this additional notation, I prove that $\hat{\alpha}$ has a closed form solution and derive that solution. My argument is an extension of the classic representer theorem for kernel methods [47, 78].

Proposition 3.3 (Generalized representer theorem). There exists some $\rho \in \mathbb{R}^{2n}$ s.t. $\hat{\alpha} = \Psi^*\rho$.

I use this generalized representer theorem to derive the closed form of $\hat{\alpha}$ in terms of matrix operations.

Algorithm 3.2 (Kernel ridge Riesz representer: Computation). Given observations $\{W_i\}$, formula $m$, and evaluation location $w$:
1. Calculate $K^{(j)} \in \mathbb{R}^{n \times n}$ as defined above.
2. Calculate $\Omega \in \mathbb{R}^{2n \times 2n}$ and $u, v \in \mathbb{R}^{2n}$ by
$$\Omega = \begin{pmatrix} K^{(1)}K^{(1)} & K^{(1)}K^{(3)} \\ K^{(3)}K^{(1)} & K^{(3)}K^{(3)} \end{pmatrix}, \qquad v = \begin{pmatrix} K^{(2)} \\ K^{(4)} \end{pmatrix} 1_n, \qquad u_i = \begin{cases} k(W_i, w) & \text{if } i \in \{1, \dots, n\} \\ \tilde{k}(W_{i-n}, w) & \text{if } i \in \{n+1, \dots, 2n\}, \end{cases}$$
where $1_n \in \mathbb{R}^n$ is a vector of ones, $k(w, w') = \langle \phi(w), \phi(w') \rangle_{\mathcal{H}}$, and $\tilde{k}(w, w') = \langle \phi(w), M\phi(w') \rangle_{\mathcal{H}}$.
3. Set $\hat{\alpha}(w) = v^\top(\Omega + n\lambda K)^{-1}u$.

For comparison, if $\hat{\gamma}$ is kernel ridge regression, then $\hat{\gamma}(w) = v^\top(K^{(1)} + n\lambda I)^{-1}u$ where $v_i = Y_i$ and $u_i = k(W_i, w)$.
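A minimal numpy sketch of Algorithm 3.2 for the complete evaluation case, with the kernel ridge regression closed form alongside for comparison. The callables k and k_tilde (for $\tilde{k}$), the constant c_max, and the separation of the functional's location w0 from the point w_eval at which $\hat{\alpha}$ is evaluated are illustrative assumptions on my part; the matrix algebra follows the displays above.

import numpy as np

def krrr_evaluate(W, w0, w_eval, lam, k, k_tilde, c_max):
    """Algorithm 3.2 sketch for complete evaluation m(W, gamma) = gamma(w0).

    k(A, B)       -> matrix with entries k(a_i, b_j)
    k_tilde(A, B) -> matrix with entries <phi(a_i), M phi(b_j)>_H
    Returns alpha_hat(w_eval)."""
    n = W.shape[0]
    K1 = k(W, W)                                  # K1_ij = k(W_i, W_j)
    k0 = k(W, w0[None, :])[:, 0]                  # k(W_i, w0)
    K2 = np.tile(k0[:, None], (1, n))             # K2_ij = k(W_i, w0)
    K3 = K2.T
    K4 = np.outer(k0, k0) / c_max                 # K4_ij = k(W_i, w0) k(W_j, w0) / c_max
    K = np.block([[K1, K2], [K3, K4]])
    Omega = np.block([[K1 @ K1, K1 @ K3], [K3 @ K1, K3 @ K3]])
    v = np.vstack([K2, K4]) @ np.ones(n)
    u = np.concatenate([k(W, w_eval[None, :])[:, 0], k_tilde(W, w_eval[None, :])[:, 0]])
    return v @ np.linalg.solve(Omega + n * lam * K, u)

def krr_evaluate(W, Y, w_eval, lam, k):
    # comparison: kernel ridge regression gamma_hat(w) = Y^T (K1 + n*lam*I)^{-1} k_w
    n = len(Y)
    return Y @ np.linalg.solve(k(W, W) + n * lam * np.eye(n), k(W, w_eval[None, :])[:, 0])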
I write this section at a high level of generality so that it can be used in future work. In particular, I abstract from the RKHS framework. I assume high level conditions and consider black box estimators $(\hat{\gamma}, \hat{\alpha})$. I prove by finite sample arguments that $\hat{\theta}$ defined by Algorithm 3.1 is consistent, and its confidence interval is valid and semiparametrically efficient. In Section 5, I will verify the high level conditions in the RKHS framework.

Towards this end, define $\sigma^2 = E[\psi_0(W)^2]$, $\eta^3 = E|\psi_0(W)|^3$, and $\zeta^4 = E[\psi_0(W)^4]$. Write the Berry-Esseen constant as $c^{BE} = 0.4748$ [79]. The result will be in terms of abstract mean square rates on the nuisances.

Definition 4.1 (Mean square error). Write the conditional mean square error of $(\hat{\gamma}_\ell, \hat{\alpha}_\ell)$ trained on $I_\ell^c$ as
$$R(\hat{\gamma}_\ell) = E[\{\hat{\gamma}_\ell(W) - \gamma_0(W)\}^2 \mid I_\ell^c], \qquad R(\hat{\alpha}_\ell) = E[\{\hat{\alpha}_\ell(W) - \alpha_0^{\min}(W)\}^2 \mid I_\ell^c].$$
We are now ready to state the first main result, which is a finite sample Gaussian approximation.

Theorem 4.1 (Gaussian approximation). Suppose
$$E[\{Y - \gamma_0(W)\}^2] \le \bar{\sigma}^2, \qquad E[m(W, \gamma)^2] \le \bar{L}_m\, E[\gamma(W)^2], \qquad \|\alpha_0^{\min}\|_\infty \le \bar{\alpha}.$$
Then with probability $1 - \epsilon$,
$$\sup_{z \in \mathbb{R}} \left| P\!\left( \frac{\sqrt{n}}{\sigma}(\hat{\theta} - \theta_0) \le z \right) - \Phi(z) \right| \le c^{BE}\left(\frac{\eta}{\sigma}\right)^3 n^{-1/2} + \frac{\Delta}{\sqrt{2\pi}} + \epsilon,$$
where $\Phi(z)$ is the standard Gaussian c.d.f. and
$$\Delta = \frac{3L}{\epsilon\,\sigma}\left\{ \left(\sqrt{\bar{L}_m} + \bar{\alpha}\right)\sqrt{R(\hat{\gamma}_\ell)} + \bar{\sigma}\sqrt{R(\hat{\alpha}_\ell)} + \sqrt{n}\,\sqrt{R(\hat{\gamma}_\ell)}\sqrt{R(\hat{\alpha}_\ell)} \right\}.$$

Theorem 4.1 is the first finite sample DML argument for black box machine learning. It is a finite sample refinement of the asymptotic black box theorem in [22]. It is a black box generalization of the main result in [24], which provides a finite sample analysis specific to the Dantzig selector. By Theorem 4.1, the neighborhood of Gaussian approximation scales as $\sigma/\sqrt{n}$. It is therefore helpful to bound the magnitude of the asymptotic variance $\sigma^2$.
Proposition 4.1 (Variance bound). Suppose the conditions of Theorem 4.1 hold. Then $\sigma^2 \le 3\left( \bar{L}_m\, E[\gamma_0(W)^2] + \bar{\alpha}^2\bar{\sigma}^2 + \theta_0^2 \right)$.

Observe that the finite sample Gaussian approximation is in terms of the true asymptotic variance $\sigma^2$. I now provide a guarantee for its estimator $\hat{\sigma}^2$.

Theorem 4.2 (Variance estimation). Suppose
$$E[\{Y - \gamma_0(W)\}^2] \le \bar{\sigma}^2, \qquad E[m(W, \gamma)^2] \le \bar{L}_m\, E[\gamma(W)^2], \qquad \|\hat{\alpha}_\ell\|_\infty \le \bar{\alpha}'.$$
Then with probability $1 - \epsilon'$,
$$|\hat{\sigma}^2 - \sigma^2| \le \Delta' + 2\sqrt{\Delta'}\left(\sqrt{\Delta''} + \sigma\right) + \Delta'',$$
where
$$\Delta' = \frac{24L}{\epsilon'}\left\{ (\hat{\theta} - \theta_0)^2 + \left[\bar{L}_m + (\bar{\alpha}')^2\right] R(\hat{\gamma}_\ell) + \bar{\sigma}^2 R(\hat{\alpha}_\ell) \right\}, \qquad \Delta'' = \sqrt{\frac{2}{\epsilon'}}\,\zeta^2\, n^{-1/2}.$$

Theorem 4.2 is both a black box generalization and a finite sample refinement of the asymptotic theorem in [26] specific to the Lasso. Theorems 4.1 and 4.2 immediately imply sufficient conditions for validity of the proposed confidence interval.
Corollary 4.1 (Confidence interval). Suppose the assumptions of Theorems 4.1 and 4.2 hold, as well as the regularity condition on moments $\left\{ (\eta/\sigma)^3 + \zeta^2 \right\} n^{-1/2} \to 0$. Assume
$$\left(\sqrt{\bar{L}_m} + \bar{\alpha} + \bar{\alpha}'\right)\sqrt{R(\hat{\gamma}_\ell)} \overset{p}{\to} 0, \qquad \bar{\sigma}\sqrt{R(\hat{\alpha}_\ell)} \overset{p}{\to} 0, \qquad \sqrt{n}\,\sqrt{R(\hat{\gamma}_\ell)}\sqrt{R(\hat{\alpha}_\ell)} \overset{p}{\to} 0.$$
Then the estimator $\hat{\theta}$ given in Algorithm 3.1 is consistent, i.e. $\hat{\theta} \overset{p}{\to} \theta_0$, and the confidence interval given in Algorithm 3.1 includes $\theta_0$ with probability approaching the nominal level, i.e. $\lim_{n \to \infty} P\left( \theta_0 \in \left[ \hat{\theta} \pm c_a \frac{\hat{\sigma}}{\sqrt{n}} \right] \right) = 1 - a$.

Corollary 4.1 summarizes simple sufficient conditions for inference in terms of learning rates. Verifying these conditions is the goal of the next section.
Validity of bias correction
To employ Corollary 4.1, I must verify several conditions. In terms of the general inference problem, I must show
$$E[\{Y - \gamma_0(W)\}^2] \le \bar{\sigma}^2, \qquad E[m(W, \gamma)^2] \le \bar{L}_m\, E[\gamma(W)^2], \qquad \|\alpha_0^{\min}\|_\infty \le \bar{\alpha}, \qquad \|\hat{\alpha}_\ell\|_\infty \le \bar{\alpha}'.$$
Proposition 2.1 already previewed some of the relevant analysis. In what follows, I articulate the mild requirements for these conditions to hold. The same mild requirements will also furnish learning rates to satisfy
$$\left(\sqrt{\bar{L}_m} + \bar{\alpha} + \bar{\alpha}'\right)\sqrt{R(\hat{\gamma}_\ell)} \overset{p}{\to} 0, \qquad \bar{\sigma}\sqrt{R(\hat{\alpha}_\ell)} \overset{p}{\to} 0, \qquad \sqrt{n}\,\sqrt{R(\hat{\gamma}_\ell)}\sqrt{R(\hat{\alpha}_\ell)} \overset{p}{\to} 0.$$
Note that the product condition $\sqrt{n}\sqrt{R(\hat{\gamma}_\ell)}\sqrt{R(\hat{\alpha}_\ell)} \overset{p}{\to} 0$ allows a tradeoff: one of the learning rates may be slow, as long as the other is sufficiently fast to compensate. While proving these results, I will abstract from sample splitting in order to improve readability. In other words, I will prove results for $(\hat{\gamma}, \hat{\alpha})$ trained on the full sample rather than $I_\ell^c$.

I place standard, weak assumptions on the supports of $(Y, W)$.

Assumption 5.1 (Original spaces). Assume
1. $W \in \mathcal{W}$ is a Polish space, i.e. a separable and completely metrizable topological space;
2. $Y \in \mathcal{Y} \subset \mathbb{R}$.

A Polish space may be low, high, or infinite dimensional. Random variables with support in a Polish space may be discrete or continuous. Next, I place standard, weak assumptions on the RKHS $\mathcal{H}$.

Assumption 5.2 (RKHS regularity). Assume
1. $k$ is bounded. Formally, $\sup_{w \in \mathcal{W}} \|\phi(w)\|_{\mathcal{H}} \le \sqrt{\kappa}$;
2. $\phi(w)$ is measurable;
3. $k$ is characteristic.

Commonly used kernels are bounded. Boundedness implies Bochner integrability, which permits the exchange of expectation and inner product. Measurability is a similarly weak condition. The characteristic property ensures injectivity of the kernel mean embedding $\mu$, and hence uniqueness of the RKHS representation [87, 88, 86]. See [86] for examples.

Proposition 2.1 verifies $E[m(W, \gamma)^2] \le \bar{L}_m E[\gamma(W)^2]$ for a broad class of important examples. I now verify $E[\{Y - \gamma_0(W)\}^2] \le \bar{\sigma}^2$, $\|\alpha_0^{\min}\|_\infty \le \bar{\alpha}$, and $\|\hat{\alpha}\|_\infty \le \bar{\alpha}'$ under Assumptions 5.1 and 5.2.

Proposition 5.1 (Bounded noise and bounded Riesz representer). Suppose $\gamma_0 \in \mathcal{H}$ and Assumptions 5.1 and 5.2 hold. Then
$$\bar{\sigma}^2 = 2\left( E[Y^2] + \kappa\|\gamma_0\|^2_{\mathcal{H}} \right), \qquad \bar{\alpha} = \sqrt{\kappa}\,\|\alpha_0^{\min}\|_{\mathcal{H}}, \qquad \bar{\alpha}' = \frac{\kappa\,\|\alpha_0^{\min}\|_{\mathcal{H}}}{\sqrt{\lambda}}.$$

The next two assumptions are the crux of the RKHS learning paradigm [15, 84]. The former is an assumption about the smoothness of $\gamma_0$, parametrized by a scalar $c \in [1, 2]$. The latter is an assumption about the effective dimension of the RKHS, parametrized by a scalar $b \in (1, \infty]$.

In preparation, define the convolution operator $L_k : L^2(P) \to L^2(P)$, $f \mapsto \int k(\cdot, w) f(w)\, dP(w)$. $L_k$ is a self adjoint, positive, compact operator, so by the spectral theorem we can denote its countable eigenvalues and eigenfunctions by $\{\lambda_j\}$ and $\{\varphi_j\}$, respectively: $L_k f = \sum_{j=1}^{\infty} \lambda_j \langle \varphi_j, f \rangle\, \varphi_j$. With this notation, I express $L^2(P)$ and the RKHS $\mathcal{H}$ in terms of the series $\{\varphi_j\}$. If $f \in L^2(P)$, then $f$ can be uniquely expressed as $f = \sum_{j=1}^{\infty} a_j \varphi_j$, and the partial sums $\sum_{j=1}^{J} a_j \varphi_j$ converge to $f$ in $L^2(P)$. Indeed,
$$L^2(P) = \left\{ f = \sum_{j=1}^{\infty} a_j \varphi_j : \sum_{j=1}^{\infty} a_j^2 < \infty \right\}, \qquad \langle f, g \rangle_2 = \sum_{j=1}^{\infty} a_j b_j$$
for $f = \sum_{j=1}^{\infty} a_j \varphi_j$ and $g = \sum_{j=1}^{\infty} b_j \varphi_j$. By [27, Theorem 4], the RKHS $\mathcal{H}$ corresponding to the kernel $k$ can be explicitly represented as
$$\mathcal{H} = \left\{ f = \sum_{j=1}^{\infty} a_j \varphi_j : \sum_{j=1}^{\infty} \frac{a_j^2}{\lambda_j} < \infty \right\}, \qquad \langle f, g \rangle_{\mathcal{H}} = \sum_{j=1}^{\infty} \frac{a_j b_j}{\lambda_j}.$$
Let us interpret this result, known as Picard's criterion [5]. Recall that $\{\lambda_j\}$ is a weakly decreasing sequence.
The RKHS $\mathcal{H}$ is the subset of $L^2(P)$ for which higher order terms in the series $\{\varphi_j\}$ have a smaller contribution. Finally, I define
$$\mathcal{H}_c = \left\{ f = \sum_{j=1}^{\infty} a_j \varphi_j : \sum_{j=1}^{\infty} \frac{a_j^2}{\lambda_j^c} < \infty \right\} \subset \mathcal{H}, \qquad c \in [1, 2],$$
which is interpretable as an even smoother RKHS. We are now ready to state the smoothness assumption.

Assumption 5.3 (Smoothness). Assume $\gamma_0 \in \mathcal{H}_c$ for some $c \in [1, 2]$.

In other words, $\gamma_0$ is a particularly smooth element of $\mathcal{H}$; it is well approximated by the leading terms in the series $\{\varphi_j\}$. This condition appears under the name of the source condition in both statistical learning [84, 15] and econometrics [17, 16, 19, 18]. An abstract, equivalent version of this assumption is helpful for analysis.

Proposition 5.2 (Consequences of smoothness). The following statements are equivalent: (i) $\gamma_0 \in \mathcal{H}_c$; (ii) there exists some witness function $f \in \mathcal{H}$ s.t. $\gamma_0 = T^{\frac{c-1}{2}} f$; (iii) there exists some witness function $\tilde{f} \in L^2(P)$ s.t. $\gamma_0 = L_k^{c/2} \tilde{f}$. Moreover, $\gamma_0 \in \mathcal{H}_c$ implies $\alpha_0^{\min} \in \mathcal{H}_c$.
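For intuition (my own illustration, not a formal statement from the paper), suppose the eigenvalues decay polynomially, $\lambda_j \asymp j^{-b}$, as in the effective dimension assumption stated next. Then membership in the three nested spaces corresponds to increasingly fast decay of the coefficients $\{a_j\}$:
\[
f = \sum_{j=1}^{\infty} a_j \varphi_j: \qquad
f \in L^2(P) \iff \sum_j a_j^2 < \infty, \qquad
f \in \mathcal{H} \iff \sum_j a_j^2\, j^{b} < \infty, \qquad
f \in \mathcal{H}_c \iff \sum_j a_j^2\, j^{bc} < \infty .
\]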
The effective dimension condition is also in terms of the spectrum.

Assumption 5.4 (Effective dimension). If the RKHS is finite dimensional, write its dimension as $J \le \beta < \infty$ and set $b = \infty$. If the RKHS is infinite dimensional, assume the eigenvalues $\{\lambda_j\}$ decay at a polynomial rate: $\iota \le j^b \cdot \lambda_j \le \beta$ for all $j \ge 1$, where $\iota, \beta > 0$ and $b \in (1, \infty)$.

Note that $\{\lambda_j\}$ are also the eigenvalues of the covariance operator $T$.

Proposition 5.3 (Consequences of effective dimension). If Assumption 5.4 holds then $\lambda_j = \Theta(j^{-b})$. Moreover, $L_k(\cdot) = \sum_{j=1}^{\infty} \lambda_j \langle \varphi_j, \cdot \rangle\, \varphi_j$ if and only if $T(\cdot) = \sum_{j=1}^{\infty} \lambda_j \langle e_j, \cdot \rangle\, e_j$ where $e_j = \sqrt{\lambda_j}\, \varphi_j$.

The spectrum of a random vector's covariance characterizes its dimension. In this sense, the assumption on spectral decay quantifies effective dimension. Following [15], I denote by $\mathcal{P}(b, c)$ the class of distributions that satisfy Assumptions 5.3 and 5.4.

Assumptions 5.1, 5.2, 5.3, and 5.4 are classic assumptions in RKHS learning theory. I quote a classic result on the mean square rate for kernel ridge regression. This result also requires a weak condition on the noise.
Assumption 5.5 (Noise). $Y = \gamma_0(W) + \varepsilon$ where $\varepsilon$ is bounded, Gaussian, or sub-Gaussian.

Theorem 5.1 (Kernel ridge regression: MSE [15]). Suppose Assumptions 5.1, 5.2, 5.3, 5.4, and 5.5 hold. Calibrate the ridge regularization sequence s.t.
$$\lambda = \begin{cases} n^{-1/2} & \text{if } b = \infty \\ n^{-\frac{b}{bc+1}} & \text{if } b \in (1, \infty),\ c \in (1, 2] \\ \log^{\frac{b}{b+1}}(n) \cdot n^{-\frac{b}{b+1}} & \text{if } b \in (1, \infty),\ c = 1. \end{cases}$$
Then kernel ridge regression $\hat{\gamma}$ trained on $\{W_i\}_{i \in [n]}$ satisfies
$$\lim_{\tau \to \infty} \lim_{n \to \infty} \sup_{\rho \in \mathcal{P}(b, c)} P_{\{W_i\} \sim \rho^n}\left( R(\hat{\gamma}) > \tau \cdot r_n \right) = 0,$$
where $R(\hat{\gamma}) = E[\{\hat{\gamma}(W) - \gamma_0(W)\}^2 \mid \{W_i\}_{i \in [n]}]$ and
$$r_n = \begin{cases} n^{-1} & \text{if } b = \infty \\ n^{-\frac{bc}{bc+1}} & \text{if } b \in (1, \infty),\ c \in (1, 2] \\ \log^{\frac{b}{b+1}}(n) \cdot n^{-\frac{b}{b+1}} & \text{if } b \in (1, \infty),\ c = 1. \end{cases}$$
Moreover, the rate is optimal when $b \in (1, \infty)$ and $c \in (1, 2]$. It is optimal up to a logarithmic factor when $b \in (1, \infty)$ and $c = 1$.

Recall that $b = \infty$ is the finite dimensional regime; $b \in (1, \infty)$, $c \in (1, 2]$ is the infinite dimensional regime with polynomial spectral decay and additional smoothness; $b \in (1, \infty)$, $c = 1$ is the infinite dimensional regime with polynomial spectral decay and no additional smoothness. A similar result is available for kernel instrumental variable regression by combining [81, Theorem 4] with [29, Lemma 11].

Remarkably, an almost identical result is possible for the kernel ridge Riesz representer. This is the second main result of the paper. The result is almost for free in the sense that it only requires replacing Assumption 5.5 with Assumptions 2.1 and 2.2. Indeed, the core techniques are the same. A similar result holds for the Riesz representer of functionals of kernel instrumental variable regression by generalizing [81] rather than [15].

Theorem 5.2 (Kernel ridge Riesz representer: MSE). Suppose Assumptions 2.1 and 2.2 as well as Assumptions 5.1, 5.2, 5.3, and 5.4 hold. Calibrate the ridge regularization sequence $\lambda$ as in Theorem 5.1. Then the kernel ridge Riesz representer $\hat{\alpha}$ trained on $\{W_i\}_{i \in [n]}$ satisfies
$$\lim_{\tau \to \infty} \lim_{n \to \infty} \sup_{\rho \in \mathcal{P}(b, c)} P_{\{W_i\} \sim \rho^n}\left( R(\hat{\alpha}) > \tau \cdot r_n \right) = 0,$$
where $R(\hat{\alpha}) = E[\{\hat{\alpha}(W) - \alpha_0^{\min}(W)\}^2 \mid \{W_i\}_{i \in [n]}]$ and $r_n$ is as in Theorem 5.1.

Corollary 5.1 (Confidence interval). Suppose the assumptions of Theorems 5.1 and 5.2 hold, as well as the regularity condition on moments $\left\{ (\eta/\sigma)^3 + \zeta^2 \right\} n^{-1/2} \to 0$. Then for any $c \in (1, 2]$ and $b \in (1, \infty]$, the estimator $\hat{\theta}$ is consistent, i.e. $\hat{\theta} \overset{p}{\to} \theta_0$, and the confidence interval includes $\theta_0$ with probability approaching the nominal level, i.e. $\lim_{n \to \infty} P\left( \theta_0 \in \left[ \hat{\theta} \pm c_a \frac{\hat{\sigma}}{\sqrt{n}} \right] \right) = 1 - a$.
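To see why these rates suffice (a sketch of the arithmetic behind Corollary 5.1 on my part, with $R(\hat{\gamma}), R(\hat{\alpha}) \asymp r_n$ and the bound $\bar{\alpha}' \asymp \lambda^{-1/2}$ from Proposition 5.1), take $b \in (1, \infty)$ and $c \in (1, 2]$:
\[
\sqrt{n}\,\sqrt{R(\hat{\gamma})}\,\sqrt{R(\hat{\alpha})} \asymp n^{\frac{1}{2} - \frac{bc}{bc+1}} \to 0 \quad\text{since } bc > 1;
\qquad
\bar{\alpha}'\,\sqrt{R(\hat{\gamma})} \asymp \lambda^{-1/2}\, n^{-\frac{bc}{2(bc+1)}} = n^{\frac{b}{2(bc+1)} - \frac{bc}{2(bc+1)}} = n^{-\frac{b(c-1)}{2(bc+1)}} \to 0 \quad\text{since } c > 1.
\]
The second display is exactly the condition that fails when $c = 1$, which motivates the trimming estimator below.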
Recall that in Assumption 5.3, $c = 1$ means $\gamma_0$ is correctly specified as an element of $\mathcal{H}$, while $c > 1$ means $\gamma_0$ is in the interior of $\mathcal{H}$. As long as $\gamma_0$ is in the interior of the RKHS, rather than at the boundary, we can construct valid confidence intervals by kernel matrix operations. Meanwhile, the RKHS may be finite or infinite dimensional, as long as the spectrum decays at some polynomial rate $\lambda_j = \Theta(j^{-b})$ with $b > 1$.

Examining the proof of Corollary 5.1, the only condition that fails to hold when $c = 1$ is $\bar{\alpha}'\sqrt{R(\hat{\gamma})} \overset{p}{\to} 0$. Specifically, the bound on $\bar{\alpha}'$ given in Proposition 5.1 diverges too quickly. A solution is to trim the kernel ridge Riesz representer using a known bound on the minimal Riesz representer $\alpha_0^{\min}$. In particular, consider the following estimator.

Algorithm 5.1 (Kernel ridge Riesz representer: Trimming). Given $\hat{\alpha}$ calculated from Algorithm 3.2, set
$$\tilde{\alpha}(w) = \begin{cases} -\bar{\alpha} & \text{if } \hat{\alpha}(w) < -\bar{\alpha} \\ \hat{\alpha}(w) & \text{if } \hat{\alpha}(w) \in [-\bar{\alpha}, \bar{\alpha}] \\ \bar{\alpha} & \text{if } \hat{\alpha}(w) > \bar{\alpha}. \end{cases}$$
Denote by $(\tilde{\theta}, \tilde{\sigma}^2)$ the estimators for $(\theta_0, \sigma^2)$ calculated from Algorithm 3.1 using $(\hat{\gamma}, \tilde{\alpha})$ rather than $(\hat{\gamma}, \hat{\alpha})$.

Corollary 5.2 (Confidence interval with trimming). Suppose the assumptions of Theorems 5.1 and 5.2 hold, as well as the regularity condition on moments $\left\{ (\eta/\sigma)^3 + \zeta^2 \right\} n^{-1/2} \to 0$. Then for any $c \in [1, 2]$ and $b \in (1, \infty]$, the estimator $\tilde{\theta}$ is consistent, i.e. $\tilde{\theta} \overset{p}{\to} \theta_0$, and the confidence interval includes $\theta_0$ with probability approaching the nominal level, i.e. $\lim_{n \to \infty} P\left( \theta_0 \in \left[ \tilde{\theta} \pm c_a \frac{\tilde{\sigma}}{\sqrt{n}} \right] \right) = 1 - a$.

Now the case $c = 1$ is allowed. Since $c = 1$ simply means $\gamma_0 \in \mathcal{H}$, the inference result allows us to actually relax Assumption 5.3. In other words, smoothness beyond correct specification hastens learning but is irrelevant to inference. Meanwhile, the RKHS may be finite or infinite dimensional, as long as the spectrum decays at some polynomial rate $\lambda_j = \Theta(j^{-b})$ with $b > 1$.

Conclusion

For any kernel method with a mean square learning rate, I propose a practical procedure based on bias correction and sample splitting to calculate confidence intervals for its functionals. The inferential results are almost free in the sense that classic learning theory assumptions, together with a mean square continuity condition, are enough to justify the procedure. By providing confidence intervals for functionals of kernel ridge regression and kernel instrumental variable regression, I facilitate their use in social science and epidemiology.
Acknowledgments and disclosure of funding
I am grateful to Alberto Abadie, Anish Agarwal, Victor Chernozhukov, Anna Mikusheva, Whitney Newey, Devavrat Shah, Vasilis Syrgkanis, Suhas Vijaykumar, and Yinchu Zhu for helpful discussions. I am grateful to the Jerry A Hausman Graduate Dissertation Fellowship for financial support.
References

[1] Chunrong Ai and Xiaohong Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions.
Econometrica , 71(6):1795–1843, 2003.[2] Chunrong Ai and Xiaohong Chen. Estimation of possibly misspecified semiparametric con-ditional moment restriction models with different conditioning variables.
Journal of Econo-metrics , 141(1):5–43, 2007.[3] Chunrong Ai and Xiaohong Chen. The semiparametric efficiency bound for models of sequen-tial moment restrictions containing unknown functions.
Journal of Econometrics , 170(2):442–457, 2012.[4] Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: Debiasedinference of average treatment effects in high dimensions.
Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) , 80(4):597–623, 2018.[5] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learn-ing theory.
Journal of Complexity , 23(1):52–72, 2007.[6] Alexandre Belloni, Victor Chernozhukov, Ivan Fernández-Val, and Christian Hansen. Pro-gram evaluation and causal inference with high-dimensional data.
Econometrica , 85(1):233–298, 2017.[7] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference for high-dimensional sparse econometric models. arXiv:1201.0220 , 2011.[8] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatmenteffects after selection among high-dimensional controls.
The Review of Economic Studies ,81(2):608–650, 2014.[9] Alexandre Belloni, Victor Chernozhukov, and Kengo Kato. Uniform post-selection infer-ence for least absolute deviation regression and other Z-estimation problems.
Biometrika ,102(1):77–94, 2014.[10] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Pivotal estimation via square-rootlasso in nonparametric regression.
The Annals of Statistics, 42(2):757–788, 2014. [11] Peter J Bickel. On adaptive estimation.
The Annals of Statistics , pages 647–671, 1982.[12] Peter J Bickel, Chris AJ Klaassen, Peter J Bickel, Ya’acov Ritov, J Klaassen, Jon A Wellner,and YA’Acov Ritov.
Efficient and Adaptive Estimation for Semiparametric Models , volume 4.Johns Hopkins University Press Baltimore, 1993.[13] Peter J Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: Sharpbest order of convergence estimates.
Sankhy¯a: The Indian Journal of Statistics, Series A ,pages 381–393, 1988.[14] Jelena Bradic and Mladen Kolar. Uniform inference for high-dimensional quantile regression:Linear functionals and regression rank scores. arXiv:1702.06209 , 2017.[15] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squaresalgorithm.
Foundations of Computational Mathematics , 7(3):331–368, 2007.[16] Marine Carrasco. A regularization approach to the many instruments problem.
Journal ofEconometrics , 170(2):383–398, 2012.[17] Marine Carrasco, Jean-Pierre Florens, and Eric Renault. Linear inverse problems in structuraleconometrics estimation based on spectral decomposition and regularization.
Handbook ofEconometrics , 6:5633–5751, 2007.[18] Marine Carrasco and Barbara Rossi. In-sample inference and forecasting in misspecifiedfactor models.
Journal of Business & Economic Statistics , 34(3):313–338, 2016.[19] Marine Carrasco and Guy Tchuente. Regularized LIML for many instruments.
Journal ofEconometrics , 186(2):427–442, 2015.[20] Gary Chamberlain. Panel data.
Handbook of Econometrics , 2:1247–1318, 1984.[21] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen,Whitney Newey, and James Robins. Double/debiased machine learning for treatment andstructural parameters: Double/debiased machine learning.
The Econometrics Journal , 21(1),2018.[22] Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, andJames M Robins. Locally robust semiparametric estimation. arXiv:1608.00033 , 2016.[23] Victor Chernozhukov, Christian Hansen, and Martin Spindler. Valid post-selection and post-regularization inference: An elementary, general approach.
Annual Review of Economics ,7(1):649–688, 2015.[24] Victor Chernozhukov, Whitney Newey, and Rahul Singh. De-biased machine learning ofglobal and local parameters using regularized Riesz representers. arXiv:1802.08667 , 2018.[25] Victor Chernozhukov, Whitney Newey, Rahul Singh, and Vasilis Syrgkanis. Adversarial esti-mation of Riesz representers. arXiv:2101.00009 , 2020.[26] Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Automatic debiased machinelearning of causal and structural effects. arXiv:1809.05224 , 2018.[27] Felipe Cucker and Steve Smale. On the mathematical foundations of learning.
Bulletin of theAmerican Mathematical Society , 39(1):1–49, 2002.[28] Iván Díaz and Mark J van der Laan. Targeted data adaptive estimation of the causal dose–response curve.
Journal of Causal Inference , 1(2):171–192, 2013.[29] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimationof conditional moment models. arXiv:2006.07201 , 2020.[30] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv:1901.09036 ,2019.[31] Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for super-vised learning with reproducing kernel hilbert spaces.
Journal of Machine Learning Research ,5(Jan):73–99, 2004.[32] Rafail Z Hasminskii and Ildar A Ibragimov. On the nonparametric estimation of functionals.In
Proceedings of the Second Prague Symposium on Asymptotic Statistics, 1979. [33] David A Hirshberg and Stefan Wager. Debiased inference of average partial effects in single-index models. arXiv:1811.02547, 2018. [34] David A Hirshberg and Stefan Wager. Augmented minimax linear estimation. arXiv:1712.00038v5, 2019. [35] I Ibragimov and R Has’minskii. Statistical estimation, vol. 16 of.
Applications of Mathemat-ics , 1981.[36] Hidehiko Ichimura and Whitney K Newey. The influence function of semiparametric estima-tors. arXiv:1508.01378 , 2015.[37] Guido W Imbens and Whitney K Newey. Identification and estimation of triangular simulta-neous equations models without additivity.
Econometrica , 77(5):1481–1512, 2009.[38] Jana Jankova and Sara Van De Geer. Confidence intervals for high-dimensional inverse co-variance estimation.
Electronic Journal of Statistics , 9(1):1205–1229, 2015.[39] Jana Jankova and Sara Van De Geer. Confidence regions for high-dimensional generalizedlinear models under sparsity. arXiv:1610.01353 , 2016.[40] Jana Jankova and Sara Van De Geer. Semiparametric efficiency bounds for high-dimensionalmodels.
The Annals of Statistics , 46(5):2336–2359, 2018.[41] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression.
The Journal of Machine Learning Research , 15(1):2869–2909, 2014.[42] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regressionunder the Gaussian random design model: Asymptotic theory.
IEEE Transactions on Infor-mation Theory , 60(10):6522–6554, 2014.[43] Adel Javanmard and Andrea Montanari. Debiasing the lasso: Optimal sample size for Gaus-sian designs.
The Annals of Statistics , 46(6A):2593–2622, 2018.[44] Palle Jorgensen and Feng Tian. Discrete reproducing kernel Hilbert spaces: Sampling anddistribution of Dirac-masses.
The Journal of Machine Learning Research , 16(1):3079–3114,2015.[45] Edward H Kennedy. Optimal doubly robust estimation of heterogeneous causal effects. arXiv:2004.14497 , 2020.[46] Edward H Kennedy, Zongming Ma, Matthew D McHugh, and Dylan S Small. Nonparametricmethods for doubly robust estimation of continuous treatment effects.
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , 79(4):1229, 2017.[47] George Kimeldorf and Grace Wahba. Some results on Tchebycheffian spline functions.
Jour-nal of Mathematical Analysis and Applications , 33(1):82–95, 1971.[48] Chris AJ Klaassen. Consistent estimation of the influence function of locally asymptoticallylinear estimators.
The Annals of Statistics , pages 1548–1562, 1987.[49] Michael R Kosorok.
Introduction to Empirical Processes and Semiparametric Inference .Springer Science & Business Media, 2007.[50] B Ya Levit. On the efficiency of a class of non-parametric estimates.
Theory of Probability &Its Applications , 20(4):723–740, 1976.[51] Alexander R Luedtke and Mark J Van Der Laan. Statistical inference for the mean outcomeunder a possibly non-unique optimal treatment strategy.
Annals of Statistics , 44(2):713, 2016.[52] Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. Identifying causal effects with proxyvariables of an unmeasured confounder.
Biometrika , 105(4):987–993, 2018.[53] Wang Miao and Eric Tchetgen Tchetgen. A confounding bridge approach for double negativecontrol inference on causal effects. arXiv:1808.04945 , 2018.[54] Krikamol Muandet, Motonobu Kanagawa, Sorawit Saengkyongam, and Sanparith Marukatat.Counterfactual mean embeddings. arXiv:1805.08845 , 2018.[55] Krikamol Muandet, Arash Mehrjou, Si Kai Lee, and Anant Raj. Dual iv: A single stageinstrumental variable regression. arXiv:1910.12358 , 2019.[56] Whitney K Newey. The asymptotic variance of semiparametric estimators.
Econometrica, pages 1349–1382, 1994. [57] Whitney K Newey, Fushing Hsieh, and James M Robins. Undersmoothing and bias corrected functional estimation. Technical report, MIT Department of Economics, 1998. [58] Whitney K Newey, Fushing Hsieh, and James M Robins. Twicing kernels and a small bias property of semiparametric estimators.
Econometrica , 72(3):947–962, 2004.[59] Whitney K Newey and James R Robins. Cross-fitting and fast remainder rates for semipara-metric estimation. arXiv:1801.09138 , 2018.[60] Matey Neykov, Yang Ning, Jun S Liu, and Han Liu. A unified theory of confidence regionsand testing for high-dimensional estimating equations.
Statistical Science , 33(3):427–443,2018.[61] Jerzy Neyman. Optimal asymptotic tests of composite statistical hypotheses. In
Probabilityand Statistics , page 416–444. Wiley, 1959.[62] Jerzy Neyman. C ( α ) tests and their use. Sankhy¯a: The Indian Journal of Statistics, Series A ,pages 1–21, 1979.[63] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv:1712.04912 , 2017.[64] Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions forsparse high dimensional models.
The Annals of Statistics , 45(1):158–195, 2017.[65] Johann Pfanzagl. Lecture notes in statistics.
Contributions to a General Asymptotic StatisticalTheory , 13, 1982.[66] Zhao Ren, Tingni Sun, Cun-Hui Zhang, and Harrison H Zhou. Asymptotic normality and opti-malities in estimation of large Gaussian graphical models.
The Annals of Statistics , 43(3):991–1026, 2015.[67] James Robins, Lingling Li, Eric Tchetgen, Aad van der Vaart, et al. Higher order influencefunctions and minimax estimation of nonlinear functionals. In
Probability and Statistics:Essays in Honor of David A. Freedman , pages 335–421. Institute of Mathematical Statistics,2008.[68] James Robins, Mariela Sued, Quanhong Lei-Gomez, and Andrea Rotnitzky. Comment on"performance of double-robust estimators when inverse probability weights are highly vari-able".
Statistical Science , 22(4):544–559, 2007.[69] James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regressionmodels with missing data.
Journal of the American Statistical Association , 90(429):122–129,1995.[70] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Analysis of semiparametric regres-sion models for repeated outcomes in the presence of missing data.
Journal of the AmericanStatistical Association , 90(429):106–121, 1995.[71] Peter M Robinson. Root-n-consistent semiparametric regression.
Econometrica , pages 931–954, 1988.[72] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observa-tional studies for causal effects.
Biometrika , 70(1):41–55, 1983.[73] Dominik Rothenhäusler and Bin Yu. Incremental causal effects. arXiv:1907.13258 , 2019.[74] Dan Rubin and Mark J van der Laan. A general imputation methodology for nonparametricregression with censored data. Technical report, UC Berkeley Division of Biostatistics, 2005.[75] Daniel Rubin and Mark J van der Laan. Extending marginal structural models through local,penalized, and additive learning. Technical report, UC Berkeley Division of Biostatistics,2006.[76] Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonignorabledrop-out using semiparametric nonresponse models.
Journal of the American Statistical As-sociation , 94(448):1096–1120, 1999.[77] Anton Schick. On asymptotically efficient estimation in semiparametric models.
The Annalsof Statistics , 14(3):1139–1151, 1986.[78] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In
International conference on computational learning theory, pages 416–426. Springer, 2001. [79] Irina Shevtsova. On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands. arXiv:1111.6554, 2011. [80] Rahul Singh. Kernel methods for unobserved confounding: Negative controls, proxies, and instruments. arXiv:2012.10315, 2020. [81] Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. In
Advances in Neural Information Processing Systems , pages 4595–4607, 2019.[82] Rahul Singh and Liyang Sun. De-biased machine learning in intrumental variable models fortreatment effects. arXiv:1909.05244 , 2019.[83] Rahul Singh, Liyuan Xu, and Arthur Gretton. Kernel methods for policy evaluation: Treat-ment effects, mediation analysis, and off-policy planning. arXiv:2010.04855 , 2020.[84] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and theirapproximations.
Constructive Approximation , 26(2):153–172, 2007.[85] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embeddingfor distributions. In
International Conference on Algorithmic Learning Theory , pages 13–31,2007.[86] Bharath Sriperumbudur. On the optimal estimation of probability measures in weak andstrong topologies.
Bernoulli , 22(3):1839–1893, 2016.[87] Bharath Sriperumbudur, Kenji Fukumizu, and Gert Lanckriet. On the relation between univer-sality, characteristic kernels and RKHS embedding of measures. In
International Conferenceon Artificial Intelligence and Statistics , pages 773–780, 2010.[88] Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet. Universality, charac-teristic kernels and RKHS embedding of measures.
Journal of Machine Learning Research ,12(7), 2011.[89] Dougal J Sutherland. Fixing an error in Caponnetto and de Vito (2007). arXiv:1702.02982 ,2017.[90] Eric J Tchetgen Tchetgen, Andrew Ying, Yifan Cui, Xu Shi, and Wang Miao. An introductionto proximal causal learning. arXiv:2009.10982 , 2020.[91] B Toth and MJ van der Laan. TMLE for marginal structural models based on an instrument.Technical report, UC Berkeley Division of Biostatistics, 2016.[92] Anastasios Tsiatis.
Semiparametric Theory and Missing Data . Springer Science & BusinessMedia, 2007.[93] Sara Van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. On asymptoticallyoptimal confidence regions and tests for high-dimensional models.
The Annals of Statistics ,42(3):1166–1202, 2014.[94] Mark J van der Laan and Alexander R Luedtke. Targeted learning of an optimal dynamictreatment, and statistical inference for its mean outcome. Technical report, UC BerkeleyDivision of Biostatistics, 2014.[95] Mark J Van der Laan and Sherri Rose.
Targeted Learning: Causal Inference for Observationaland Experimental Data . Springer Science & Business Media, 2011.[96] Mark J Van Der Laan and Daniel Rubin. Targeted maximum likelihood learning.
The Inter-national Journal of Biostatistics , 2(1), 2006.[97] Aad Van Der Vaart et al. On differentiable functionals.
The Annals of Statistics , 19(1):178–204, 1991.[98] Aad W Van der Vaart.
Asymptotic Statistics , volume 3. Cambridge University Press, 2000.[99] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional param-eters in high dimensional linear models.
Journal of the Royal Statistical Society: Series B(Statistical Methodology) , 76(1):217–242, 2014.[100] Wenjing Zheng and Mark J Van Der Laan. Asymptotic theory for cross-validated targetedmaximum likelihood estimation. 2010.[101] Yinchu Zhu and Jelena Bradic. Breaking the curse of dimensionality in regression. arXiv:1708.00430. , 2017.[102] Yinchu Zhu and Jelena Bradic. Linear hypothesis testing in dense high-dimensional linearmodels.
Journal of the American Statistical Association, 113(524):1583–1600, 2018.
A Framework
Proof of Proposition 2.1.
I consider each functional of kernel ridge regression. The arguments for the functionals of kernel instrumental variable regression are identical.

1. Evaluation.
$$E[\gamma(w)^2] = \{\gamma(w)\}^2 = \left\{ E\!\left[ \gamma(W)\, \frac{1\{W = w\}}{P(w)} \right] \right\}^2 \le \left\{ \frac{E|\gamma(W)|}{P(w)} \right\}^2 \le \frac{E[\gamma(W)^2]}{P(w)^2},$$
i.e. $\bar{L}_m = P(w)^{-2}$.
2. Treatment effects. For ATE,
$$\{E[\gamma(d, X)]\}^2 = \left\{ E\!\left[ \frac{1\{D = d\}}{P(d \mid X)}\, \gamma(D, X) \right] \right\}^2 \le \frac{E[\gamma(D, X)^2]}{\{\inf_x P(d \mid x)\}^2},$$
i.e. $\bar{L}_m = \{\inf_x P(d \mid x)\}^{-2}$. ATE-DS, ATT, and CATE are similar.

3. Incremental treatment effect.
$$\{E[S(U)\gamma(U, X)]\}^2 = \left\{ E\!\left[ \frac{\omega(D)}{f(D \mid X)}\, S(D)\, \gamma(D, X) \right] \right\}^2 \le \sup_u S(u)^2 \cdot \sup_{d,x} \frac{\omega(d)^2}{f(d \mid x)^2} \cdot E[\gamma(D, X)^2],$$
i.e. $\bar{L}_m = \sup_u S(u)^2 \cdot \sup_{d,x} \frac{\omega(d)^2}{f(d \mid x)^2}$.

Proof of Proposition 2.2.
As in the proof of Proposition 2.1, it is sufficient to show the results forfunctionals of kernel ridge regression.1. Evaluation.(a) M : H → H .For each constituent RKHS H j , the existence of c j implies that H j includes the con-stant function over W j . Hence if γ ∈ H then γ evaluated in its j -th argument is alsoan element of H (since it becomes constant with respect to its j -th argument).(b) k M k op < ∞ .Write M γ = γ ( w ) H where H is the constant function in H . The definition of anRKHS as a space where evaluation is a bounded linear functional can be seen from γ ( w ) = h γ, φ ( w ) i H ≤ k γ k H √ κ Hence k M γ k H k γ k H ≤ k γ k H √ κ k H k H k γ k H = √ κ k H k H i.e. k M k op = √ κ k H k H
2. ATE (and ATE-DS).(a) M : H → H .See the argument for evaluation above.(b) k M k op < ∞ .Denote φ ( w ) = φ ( d ) ⊗ φ ( x ) and γ = γ d ⊗ γ x . Write M γ = [ γ d ( d ) H d ] ⊗ γ x where H d is the constant function in H d . As before γ d ( d ) = h γ d , φ ( d ) i H d ≤ k γ d k H d √ κ d Hence k M γ k H k γ k H ≤ k γ d k H d √ κ d k H d k H d · k γ x k H x k γ d k H d · k γ x k H x = √ κ d k H d k H d i.e. k M k op = √ κ d k H d k H d
17. ATT (and CATE).(a) M : H → H .Consider the extension δ ( d , x, d ) of γ ( d , x ) with feature map φ ( d ) ⊗ φ ( x ) ⊗ φ ( d ) .Since H d contains the constant function, any γ can be trivially extended this way:take δ = δ d ⊗ δ x ⊗ δ d = γ d ⊗ γ x ⊗ H d . With this notation, write M δ =[ δ d ( d ′ ) H d ] ⊗ δ x ⊗ d = d . Note that δ d ( d ′ ) H d = γ d ( d ′ ) H d ∈ H d since H d contains constant functions. δ x = γ x ∈ H x by construction. Finally, d = d ∈ H d since H d contains Dirac functions.(b) k M k op < ∞ .As before δ d ( d ′ ) = h δ d , φ ( d ′ ) i H d ≤ k δ d k H d √ κ d Hence over functions of the form δ = δ d ⊗ δ x ⊗ H d k M δ k H k δ k H ≤ k δ d k H d √ κ d k H d k H d · k δ x k H x · k d = d k H d k δ d k H d · k δ x k H x · k H d k H d = √ κ d · k d = d k H d i.e. k M k op = √ κ d · k d = d k H d
4. Incremental treatment effects.(a) M : H → H . Consider the extension δ ( d , x, d ) of γ ( d , x ) with feature map φ ( d ) ⊗ φ ( x ) ⊗ φ ( d ) as above. Since H d contains the constant function, any γ canbe trivially extended this way: take δ = δ d ⊗ δ x ⊗ δ d = γ d ⊗ γ x ⊗ H d . Withthis notation, write M δ = δ d ⊗ δ x ⊗ S . Note that δ d = γ d ∈ H d by construction.Similarly δ x = γ x ∈ H x by construction. Finally, S ∈ H d is an indirect assumptionon the density ω .(b) k M k op < ∞ . Over functions of the form δ = δ d ⊗ δ x ⊗ H d k M δ k H k δ k H = k δ d k H d · k δ x k H x · k S k H d k δ d k H d · k δ x k H x · k H d k H d = k S k H d k H d k H d i.e. k M k op = k S k H d k H d k H d Proof of Proposition 2.3.
See [44, Theorem 2.12] for the first result. For the second result, recallthat a function f : W → R is an element of H iff k ( w, w ′ ) − cf ( w ) f ( w ′ ) is positive definite forsome c > . Then k f k H = c for the largest possible c . Proof of Proposition 2.4.
First, I show that the functional γ E [ m ( W, γ )] is bounded as a func-tional over L ( P ) . Write {| E [ m ( W, γ )] |} ≤ { E | m ( W, γ ) |} ≤ E [ m ( W, γ )] ≤ ¯ L m E [ γ ( W )] Therefore | E [ m ( W, γ )] |k γ k ≤ p ¯ L m k γ k k γ k ≤ p ¯ L m Hence Riesz representation theorem w.r.t L ( P ) implies existence of the RR α ∈ L ( P ) .[24, Lemma S3.1] guarantees existence of the unique minimal Riesz representer α min0 ∈ closure ( span (Γ)) . B Algorithm
Proof of Proposition 3.1.
Recall that γ ∈ Γ implies that the minimal representer α min0 ∈ Γ ⊂ H by Proposition 2.4. Trivially, α min0 ∈ argmin α ∈H R ( α ) , R ( α ) = E [ α min0 ( W ) − α ( W )] R ( α ) = E [ α min0 ( W ) − α ( W )] = E [ α min0 ( W )] − E [ α ( W ) α ( W )] + E [ α ( W )] = C − E [ m ( W, α )] + E [ α ( W )] Boundedness of the kernel implies Bochner integrability of φ ( · ) , which then allows us to exchangeexpectation and inner product. By the RKHS representation property and Bochner integrability of φ ( · ) E [ m ( W, α )] = E [ h α, M ∗ φ ( w ) i H ] = h α, M ∗ µ i H , µ = E [ φ ( W )] and E [ α ( W )] = E [ h α, φ ( W ) i H ] = h α, T α i H , T = E [ φ ( W ) ⊗ φ ( W )] Proof of Proposition 3.2.
In both cases K (1) ij = h φ ( W i ) , φ ( W j ) i H K (2) ij = h φ ( W i ) , φ ( m ) ( W j ) i H = h φ ( W i ) , M ∗ φ ( W j ) i H = h M φ ( W i ) , φ ( W j ) i H K (4) ij = h φ ( m ) ( W i ) , φ ( m ) ( W j ) i H = h M ∗ φ ( W i ) , M ∗ φ ( W j ) i H = h M φ ( W i ) , M φ ( W j ) i H where the final equality holds by opening up the definition of M . Formally, h M ∗ φ ( W i ) , M ∗ φ ( W j ) i H = h M [ M ∗ φ ( W i )] , φ ( W j ) i H = h m ( · , M ∗ k ( W i , · )) , φ ( W j ) i H = h M ∗ [ m ( · , k ( W i , · ))] , φ ( W j ) i H = h m ( · , k ( W i , · )) , M φ ( W j ) i H = h M φ ( W i ) , M φ ( W j ) i H I now specialize these expressions.1. Complete evaluation [ M γ ]( · ) = m ( · , γ ( · )) = γ ( w ) H ( · )[ M φ ( W )]( · ) = m ( · , k ( W, · )) = k ( W, w ) H ( · ) Hence h φ ( W i ) , φ ( W j ) i H = k ( W i , W j ) h M φ ( W i ) , φ ( W j ) i H = h k ( W i , w ) H , φ ( W j ) i H = k ( W i , w ) h M φ ( W i ) , M φ ( W j ) i H = h k ( W i , w ) H , k ( W j , w ) H i H = k ( W i , w ) k ( W j , w ) k H k H
2. Partial evaluation [ M γ ]( · ) = m ( · , γ ( · )) = [ γ d ( d ) H d ( · )] ⊗ γ x ( · )[ M φ ( W )]( · ) = m ( · , k ( W, · )) = [ k ( D, d ) H d ( · )] ⊗ k ( X, · ) Hence h φ ( W i ) , φ ( W j ) i H = h φ ( D i ) ⊗ φ ( X i ) , φ ( D j ) ⊗ φ ( X j ) i H = k ( D i , D j ) k ( X i , X j ) h M φ ( W i ) , φ ( W j ) i H = h [ k ( D i , d ) H d ] ⊗ φ ( X i ) , φ ( D j ) ⊗ φ ( X j ) i H = k ( D i , d ) k ( X i , X j ) h M φ ( W i ) , M φ ( W j ) i H = h [ k ( D i , d ) H d ] ⊗ φ ( X i ) , [ k ( D j , d ) H d ] ⊗ φ ( X j ) i H = k ( D i , d ) k ( D j , d ) k H d k H d · k ( X i , X j ) roof of Proposition 3.3. Note that M ∗ ˆ µ = 1 n n X i =1 M ∗ φ ( W i ) = 1 n n X i =1 φ ( m ) ( W i ) and ˆ T α = 1 n n X i =1 [ φ ( W i ) ⊗ φ ( W i )] α = 1 n n X i =1 h α, φ ( W i ) i H φ ( W i ) Write the objective as L nλ ( α ) = 1 n n X i =1 n − h α, φ ( m ) ( W i ) i H + h α, φ ( W i ) i H o + λ k α k H Recall that for an RKHS, evaluation is a continuous functional represented as the inner product withthe feature map. Due to the ridge penalty, the stated objective is coercive and strongly convex w.r.t α . Hence it has a unique minimizer ˆ α that obtains the minimum.Write ˆ α = ˆ α n + ˆ α ⊥ n where ˆ α n ∈ row (Ψ) and ˆ α ⊥ n ∈ null (Ψ) . Substituting this decomposition of ˆ α into the objective, we see that L nλ (ˆ α ) = L nλ (ˆ α n ) + λ k ˆ α ⊥ n k H Therefore L nλ (ˆ α ) ≥ L nλ (ˆ α n ) Since ˆ α is the unique minimizer, ˆ α = ˆ α n . Derivation of Algorithm 3.2.
By Proposition 3.3, ˆ α = Ψ ∗ ρ . Substituting this expression into L nλ ( α ) gives L nλ ( ρ ) . L nλ ( ρ ) = − ρ ⊤ Ψ 1 n n X i =1 φ ( m ) ( W i ) + ρ ⊤ Ψ ˆ T Ψ ∗ ρ + λρ ⊤ ΨΨ ∗ ρ = − n ρ ⊤ n X i =1 Ψ φ ( m ) ( W i ) + 1 n ρ ⊤ ΨΦ ∗ ΦΨ ∗ ρ + λρ ⊤ ΨΨ ∗ ρ = − n ρ ⊤ v + 1 n ρ ⊤ Ω ρ + λρ ⊤ Kρ where Ω = ΨΦ ∗ ΦΨ ∗ , v = n X i =1 Ψ φ ( m ) ( W i ) Differentiating, the FOC gives − n v + 2 n Ω ρ + 2 λKρ ⇐⇒ ρ = (Ω + nλK ) − v To evaluate the estimator at a test location w involves computing ˆ α ( w ) = h ˆ α, φ ( w ) i H = ρ ⊤ Ψ φ ( w ) = v ⊤ (Ω + nλK ) − Ψ φ ( w ) Finally note that h φ ( m ) ( w ) , φ ( w ′ ) i H = h M ∗ φ ( w ) , φ ( w ′ ) i H = h φ ( w ) , M φ ( w ′ ) i H Validity of confidence interval
C.1 Gateaux differentiation
For readability, I introduce the following notation for Gateaux differentiation.
Definition C.1 (Gateaux derivative) . Let u ( w ) , v ( w ) be functions and let τ, ζ ∈ R be scalars. TheGateaux derivative of ψ ( w, θ, γ, α ) with respect to its argument γ in the direction u is [ ∂ γ ψ ( w, θ, γ, α )]( u ) = ∂∂τ ψ ( w, θ, γ + τ u, α ) (cid:12)(cid:12)(cid:12)(cid:12) τ =0 The cross derivative of ψ ( w, θ, γ, α ) with respect to its argument ( γ, α ) in the directions ( u, v ) is [ ∂ γ,α ψ ( w, θ, γ, α )]( u, v ) = ∂ ∂τ ∂ζ ψ ( w, θ, γ + τ u, α + ζv ) (cid:12)(cid:12)(cid:12)(cid:12) τ =0 ,ζ =0 Proposition C.1 (Calculation of derivatives) . [ ∂ γ ψ ( w, θ, γ, α )]( u ) = m ( w, u ) − α ( w ) u ( w )[ ∂ α ψ ( w, θ, γ, α )]( v ) = v ( w )[ y − γ ( w )][ ∂ γ,α ψ ( w, θ, γ, α )]( u, v ) = − v ( w ) u ( w ) Proof.
For the first result, write ψ ( w, θ, γ + τ u, α ) = m ( w, γ ) + τ m ( w, u ) + α ( w )[ y − γ ( w ) − τ u ( w )] − θ For the second result, write ψ ( w, θ, γ, α + ζv ) = m ( w, γ ) + α ( w )[ y − γ ( w )] + ζv ( w )[ y − γ ( w )] − θ For the final result, write ψ ( w, θ, γ + τ u, α + ζv )= m ( w, γ ) + τ m ( w, u ) + α ( w )[ y − γ ( w ) − τ u ( w )] + ζv ( w )[ y − γ ( w ) − τ u ( w )] − θ Finally, take scalar derivatives with respect to ( τ, ζ ) .By using the doubly robust moment function, we have the following helpful property. Proposition C.2 (Mean zero derivatives) . For any ( u, v ) , E [ ∂ γ ψ ( W )]( u ) = 0 , E [ ∂ α ψ ( W )]( v ) = 0 Proof.
For the first result, write E [ ∂ γ ψ ( W )]( u ) = E [ m ( W, u ) − α ( W ) u ( W )] Then appeal to the definition of the Riesz representer. For the second result, write E [ ∂ α ψ ( W )]( v ) = E [ v ( W )[ y − γ ( W )]] In the case of nonparametric regression, γ ( w ) = E [ Y | W = w ] and we appeal to law of iteratedexpectations. In the case of nonparametric instrumental variable regression, γ ( w ) = h ( x ) s.t. E [ Y | Z = z ] = E [ h ( X ) | Z = z ] . The result then holds for v ( w ) = v ( z ) . C.2 Taylor expansion
Train (ˆ γ ℓ , ˆ α ℓ ) on observations in I cℓ . Let m = | I ℓ | = nL be the number of observations in I ℓ . Denoteby E ℓ [ · ] the average over observations in I ℓ . Definition C.2 (Foldwise target and oracle) . ˆ θ ℓ = E ℓ [ m ( W, ˆ γ ℓ ) + ˆ α ℓ ( W ) { Y − ˆ γ ℓ ( W ) } ]¯ θ ℓ = E ℓ [ m ( W, γ ) + α ( W ) { Y − γ ( W ) } ] roposition C.3 (Taylor expansion) . Let u = ˆ γ ℓ − γ and v = ˆ α ℓ − α . Then √ m (ˆ θ ℓ − ¯ θ ℓ ) = P j =1 ∆ jℓ where ∆ ℓ = √ m E ℓ [ m ( W, u ) − α ( W ) u ( W )]∆ ℓ = √ m E ℓ [ v ( W ) { Y − γ ( W ) } ]∆ ℓ = √ m E ℓ [ − u ( W ) v ( W )] Proof.
An exact Taylor expansion gives ψ ( w, θ , ˆ γ ℓ , ˆ α ℓ ) − ψ ( w ) = [ ∂ γ ψ ( w )]( u ) + [ ∂ α ψ ( w )]( v ) + [ ∂ γ,α ψ ( w )]( u, v ) Averaging over observations in I ℓ ˆ θ ℓ − ¯ θ ℓ = E ℓ [ ψ ( W, θ , ˆ γ ℓ , ˆ α ℓ )] − E ℓ [ ψ ( W )]= E ℓ [ ∂ γ ψ ( W )]( u ) + E ℓ [ ∂ α ψ ( W )]( v ) + E ℓ [ ∂ γ,α ψ ( W )]( u, v ) Finally appeal to Proposition C.1.
C.3 ResidualsDefinition C.3 (Mean square error) . Write the conditional mean square error of (ˆ γ ℓ , ˆ α ℓ ) trained on I cℓ as R (ˆ γ ℓ ) = E [ { ˆ γ ℓ ( W ) − γ ( W ) } | I cℓ ] R (ˆ α ℓ ) = E [ { ˆ α ℓ ( W ) − α ( W ) } | I cℓ ] Proposition C.4 (Residuals) . Suppose E [ Y − γ ( W )] ≤ ¯ σ , E [ m ( W, γ )] ≤ ¯ L m E [ γ ( W )] , k α k ∞ ≤ ¯ α Then w.p. − ǫL , | ∆ ℓ | ≤ t = r Lǫ p ¯ L m + ¯ α p R (ˆ γ ℓ ) | ∆ ℓ | ≤ t = r Lǫ ¯ σ p R (ˆ α ℓ ) | ∆ ℓ | ≤ t = 3 √ Lǫ √ n p R (ˆ γ ℓ ) p R (ˆ α ℓ ) Proof.
I proceed in steps1. Markov inequality P ( | ∆ ℓ | > t ) ≤ E [∆ ℓ ] t P ( | ∆ ℓ | > t ) ≤ E [∆ ℓ ] t P ( | ∆ ℓ | > t ) ≤ E | ∆ ℓ | t
2. Law of iterated expectations E [∆ ℓ ] = E [ E [∆ ℓ | I cℓ ]] E [∆ ℓ ] = E [ E [∆ ℓ | I cℓ ]] E | ∆ ℓ | = E [ E [ | ∆ ℓ || I cℓ ]]
22. Bounding conditional momentsConditional on I cℓ , ( u, v ) are nonrandom. Moreover, observations within fold I ℓ are inde-pendent. Hence E [∆ ℓ | I cℓ ] = E [ { m ( W, u ) − α ( W ) u ( W ) } | I cℓ ] ≤ E [ m ( W, u ) | I cℓ ] + 2 E [ { α ( W ) u ( W ) } | I cℓ ] ≤
2( ¯ L m + ¯ α ) R (ˆ γ ℓ ) Similarly E [∆ ℓ | I cℓ ] = E [ { v ( W )[ Y − γ ( W )] } | I cℓ ]= E [ v ( W ) E [ { Y − γ ( W ) } | W, I cℓ ] | I cℓ ] ≤ ¯ σ R (ˆ α ℓ ) Finally E [ | ∆ ℓ || I cℓ ] = √ m E [ | − u ( W ) v ( W ) || I cℓ ] ≤ √ m E [ u ( W ) | I cℓ ] E [ v ( W ) | I cℓ ]= √ m p R (ˆ γ ℓ ) p R (ˆ α ℓ )
4. Collecting results P ( | ∆ ℓ | > t ) ≤
2( ¯ L m + ¯ α ) R (ˆ γ ℓ ) t = ǫ L P ( | ∆ ℓ | > t ) ≤ ¯ σ R (ˆ α ℓ ) t = ǫ L P ( | ∆ ℓ | > t ) ≤ √ m p R (ˆ γ ℓ ) p R (ˆ α ℓ ) t = ǫ L Therefore w.p. − ǫL , the following inequalities hold | ∆ ℓ | ≤ t = r Lǫ p ¯ L m + ¯ α p R (ˆ γ ℓ ) | ∆ ℓ | ≤ t = r Lǫ ¯ σ p R (ˆ α ℓ ) | ∆ ℓ | ≤ t = 3 Lǫ √ m p R (ˆ γ ℓ ) p R (ˆ α ℓ ) Finally recall m = nL C.4 Main argumentDefinition C.4 (Overall target and oracle) . ˆ θ = 1 L L X ℓ =1 ˆ θ ℓ , ¯ θ = 1 L L X ℓ =1 ¯ θ ℓ Proposition C.5 (Oracle approximation) . Suppose the conditions of Proposition C.4 hold. Then w.p. − ǫ √ nσ | ˆ θ − ¯ θ | ≤ ∆ = 3 Lǫ · σ n ( p ¯ L m + ¯ α ) p R (ˆ γ ℓ ) + ¯ σ p R (ˆ α ℓ ) + √ n p R (ˆ γ ℓ ) p R (ˆ α ℓ ) o Proof.
I proceed in steps 23. DecompositionWrite √ n (ˆ θ − ¯ θ ) = √ n √ m L L X ℓ =1 √ m (ˆ θ ℓ − ¯ θ ℓ )= √ L L L X ℓ =1 3 X j =1 ∆ jℓ
2. Union boundDefine the events E ℓ = {∀ j ∈ [3] , | ∆ jℓ | ≤ t j } , E = ∩ Lℓ =1 E ℓ , E c = ∪ Lℓ =1 E cℓ Hence by the union bound and Proposition C.4, P ( E c ) ≤ L X ℓ =1 P ( E cℓ ) ≤ L ǫL = ǫ
3. Collecting resultsTherefore w.p. − ǫ , √ n | ˆ θ − ¯ θ | ≤ √ L L L X ℓ =1 3 X j =1 | ∆ jk |≤ √ L L L X ℓ =1 3 X j =1 t j = √ L X j =1 t j Finally I simplify { t j } . For a, b > , √ a + b ≤ √ a + √ b . Moreover, ≥ √ , √ . Finally,for ǫ ≤ , √ ǫ ≤ ǫ . In summary t ≤ √ Lǫ ( p ¯ L m + ¯ α ) p R (ˆ γ ℓ ) t ≤ √ Lǫ ¯ σ p R (ˆ α ℓ ) t ≤ √ Lǫ √ n p R (ˆ γ ℓ ) p R (ˆ α ℓ ) Proof of Theorem 4.1.
By Berry Esssen theorem, as in [24, Proof of Theorem 4.1, Step 3] P (cid:18) √ nσ (ˆ θ − θ ) ≤ z (cid:19) − Φ( z ) ≤ c BE (cid:16) ησ (cid:17) n − + ∆ √ π + ǫ and P (cid:18) √ nσ (ˆ θ − θ ) ≤ z (cid:19) − Φ( z ) ≥ c BE (cid:16) ησ (cid:17) n − − ∆ √ π − ǫ where Φ( z ) is the standard Gaussian c.d.f. and ∆ is defined in Proposition C.5.24 .5 Variance estimation Proof of Proposition 4.1.
Write σ = E [ m ( W, γ ) + α min0 ( W ) { Y − γ ( W ) } − θ ] ≤ (cid:0) E [ m ( W, γ )] + E [ α min0 ( W ) { Y − γ ( W ) } ] + θ (cid:1) Then note E [ m ( W, γ )] ≤ ¯ L m E [ γ ( W )] and E [ α min0 ( W ) { Y − γ ( W ) } ] ≤ ¯ α E [ Y − γ ( W )] ≤ ¯ α ¯ σ Definition C.5 (Shorter notation) . For i ∈ I ℓ , define ψ i = ψ ( W i , θ , γ , α )ˆ ψ i = ψ ( W i , ˆ θ, ˆ γ ℓ , ˆ α ℓ ) Proposition C.6 (Foldwise second moment) . E ℓ [ ˆ ψ i − ψ i ] ≤ [ˆ θ − θ ] + X j =4 ∆ jℓ where ∆ ℓ = E ℓ [ m ( W, u )] ∆ ℓ = E ℓ [ˆ α ℓ ( W ) u ( W )] ∆ ℓ = E ℓ [ v ( W ) { Y − γ ( W ) } ] Proof.
Write ψ i − ˆ ψ i = m ( W i , ˆ γ ℓ ) + ˆ α ℓ ( W i )[ Y i − ˆ γ ℓ ( W i )] − ˆ θ − { m ( W i , γ ) + α ( W i )[ Y i − γ ( W i )] − θ }± ˆ α ℓ [ Y − γ ( W i )]= [ θ − ˆ θ ] + m ( W i , u ) − ˆ α ℓ ( W i ) u ( W i ) + v ( W i )[ Y − γ ( W i )] Hence [ ψ i − ˆ ψ i ] ≤ n [ θ − ˆ θ ] + m ( W i , u ) + [ˆ α ℓ ( W i ) u ( W i )] + [ v ( W i ) { Y − γ ( W i ) } ] o Finally take E ℓ [ · ] of both sides. Proposition C.7 (Residuals) . Suppose E [ Y − γ ( W )] ≤ ¯ σ , E [ m ( W, γ )] ≤ ¯ L m E [ γ ( W )] , k ˆ α ℓ k ∞ ≤ ¯ α ′ Then w.p. − ǫ ′ L , ∆ ℓ ≤ t = 6 Lǫ ′ ¯ L m R (ˆ γ ℓ )∆ ℓ ≤ t = 6 Lǫ ′ (¯ α ′ ) R (ˆ γ ℓ )∆ ℓ ≤ t = 6 Lǫ ′ ¯ σ R (ˆ α ℓ ) Proof.
I proceed in steps analogous to Proposition C.4.25. Markov inequality P ( | ∆ ℓ | > t ) ≤ E [∆ ℓ ] t P ( | ∆ ℓ | > t ) ≤ E [∆ ℓ ] t P ( | ∆ ℓ | > t ) ≤ E [∆ ℓ ] t
2. Law of iterated expectations E [∆ ℓ ] = E [ E [∆ ℓ | I cℓ ]] E [∆ ℓ ] = E [ E [∆ ℓ | I cℓ ]] E [∆ ℓ ] = E [ E [∆ ℓ | I cℓ ]]
3. Bounding conditional momentsConditional on I cℓ , ( u, v ) are nonrandom. Moreover, observations within fold I ℓ are inde-pendent. Hence E [∆ ℓ | I cℓ ] = E [ { m ( W, u ) } | I cℓ ] ≤ ¯ L m R (ˆ γ ℓ ) Similarly E [∆ ℓ | I cℓ ] = E [ { ˆ α ℓ ( W ) u ( W ) } | I cℓ ] ≤ (¯ α ′ ) R (ˆ γ ℓ ) Finally E [∆ ℓ | I cℓ ] = E [ { v ( W )[ Y − γ ( W )] } | I cℓ ]= E [ v ( W ) E [ { Y − γ ( W ) } | W, I cℓ ] | I cℓ ] ≤ ¯ σ R (ˆ α ℓ )
4. Collecting results P ( | ∆ ℓ | > t ) ≤ ¯ L m R (ˆ γ ℓ ) t = ǫ ′ L P ( | ∆ ℓ | > t ) ≤ (¯ α ′ ) R (ˆ γ ℓ ) t = ǫ ′ L P ( | ∆ ℓ | > t ) ≤ ¯ σ R (ˆ α ℓ ) t = ǫ ′ L Therefore w.p. − ǫ ′ L , the following inequalities hold | ∆ ℓ | ≤ t = 6 Lǫ ′ ¯ L m R (ˆ γ ℓ ) | ∆ ℓ | ≤ t = 6 Lǫ ′ (¯ α ′ ) R (ˆ γ ℓ ) | ∆ ℓ | ≤ t = 6 Lǫ ′ ¯ σ R (ˆ α ℓ ) Proposition C.8 (Oracle approximation) . Suppose the conditions of Proposition C.7 hold. Then w.p. − ǫ ′ E n [ ˆ ψ i − ψ i ] ≤ ∆ ′ = 4 { ˆ θ − θ } + 24 Lǫ ′ { [ ¯ L m + (¯ α ′ ) ] R (ˆ γ ℓ ) + ¯ σ R (ˆ α ℓ ) } Proof.
I proceed in steps analogous to Proposition C.526. DecompositionBy Proposition C.6 E n [ ˆ ψ i − ψ i ] = 1 L L X ℓ =1 E ℓ [ ˆ ψ i − ψ i ] ≤ θ − θ ] + 4 L L X ℓ =1 6 X j =4 ∆ jℓ
2. Union boundDefine the events E ′ ℓ = {∀ j ∈ { , , } , | ∆ jℓ | ≤ t j } , E ′ = ∩ Lℓ =1 E ′ ℓ , ( E ′ ) c = ∪ Lℓ =1 ( E ′ ℓ ) c Hence by the union bound and Proposition C.7, P (( E ′ ) c ) ≤ L X ℓ =1 P (( E ′ ℓ ) c ) ≤ L ǫ ′ L = ǫ ′ L
3. Collecting resultsTherefore w.p. − ǫ ′ , E n [ ˆ ψ i − ψ i ] ≤ θ − θ ] + 4 L L X ℓ =1 6 X j =4 | ∆ jℓ |≤ θ − θ ] + 4 L L X ℓ =1 6 X j =4 t j = 4[ˆ θ − θ ] + 4 X j =4 t j Proposition C.9 (Markov inequality) . Suppose E [ ψ ( W )] < ∞ . Then w.p. − ǫ ′ | E n [ ψ i ] − σ | ≤ ∆ ′′ = r ǫ ′ p E [ ψ ( W )] √ n Proof.
Let Z i = ψ i , ¯ Z = E n [ Z i ] , E [ Z ] = σ By Markov inequality P ( | ¯ Z − E [ Z ] | > t ) ≤ V [ ¯ Z ] t = ǫ ′ Note that V [ ¯ Z ] = V ( Z ) n = E [ ψ ( W ) ] n = E [ ψ ( W )] n Solving, E [ ψ ( W )] nt = ǫ ′ ⇐⇒ t = r ǫ ′ p E [ ψ ( W )] √ n Proof of Theorem 4.2.
I proceed in steps. 27. Decomposition of variance estimatorWrite ˆ σ = E n [ ˆ ψ i ] = E n [ { ˆ ψ i − ψ i } + ψ i ] = E n [ ˆ ψ i − ψ i ] + 2 E n [ { ˆ ψ i − ψ i } ψ i ] + E n [ ψ i ] Hence ˆ σ − E n [ ψ i ] = E n [ ˆ ψ i − ψ i ] + 2 E n [ { ˆ ψ i − ψ i } ψ i ]
2. Decomposition of differenceNext write ˆ σ − σ = { ˆ σ − E n [ ψ i ] } + { E n [ ψ i ] − σ } Focusing on the former term ˆ σ − E n [ ψ i ] = E n [ ˆ ψ i − ψ i ] + 2 E n [ { ˆ ψ i − ψ i } ψ i ] Moreover E n [ { ˆ ψ i − ψ i } ψ i ] ≤ q E n [ ˆ ψ i − ψ i ] p E n [ ψ i ] ≤ q E n [ ˆ ψ i − ψ i ] p | E n [ ψ i ] − σ | + σ ≤ q E n [ ˆ ψ i − ψ i ] (cid:16)p | E n [ ψ i ] − σ | + σ (cid:17)
3. High probability eventsFrom the previous step, we see that to control | ˆ σ − σ | , it is sufficient to control twoexpressions: E n [ ˆ ψ i − ψ i ] and | E n [ ψ i ] − σ | . These are controlled in Propositions C.8and C.9, respectively. Therefore w.p. − ǫ ′ , | ˆ σ − σ | ≤ n ∆ ′ + 2 √ ∆ ′ ( √ ∆ ′′ + σ ) o + ∆ ′′ C.6 Corollary
Proof of Corollary 4.1.
Immediately from ∆ in Theorem 4.1, ˆ θ p → θ and lim n →∞ P (cid:18) θ ∈ (cid:20) ˆ θ ± σ √ n (cid:21)(cid:19) = 1 − a For the desired result, it is sufficient that ˆ σ p → σ , which follows from ∆ ′ and ∆ ′′ in Theorem 4.2. D Validity of bias correction
D.1 PreliminariesProposition D.1 (First order conditions) . Define g = M ∗ µ and ˆ g = M ∗ ˆ µ . Then T α min0 = g, α λ = ( T + λ ) − g, ˆ α = ( ˆ T + λ ) − ˆ g Proof.
Take derivatives of L , L λ , and L nλ Proposition D.2 (Relation of norms) . R ( α ) = E [ α ( W ) − α min0 ( W )] = k T ( α − α min0 ) k H Proof. [15, Proposition 1.iii], using the fact that mean square error equals excess risk.28 roposition D.3 (Concentration) . Let (Ω , F , ρ ) be a probability space. Let ξ be a random variableon Ω taking value in a real separable Hilbert space K . Assume there exists ( a, b ) > s.t. k ξ ( ω ) k K ≤ a , E k ξ k K ≤ b Then for all n ∈ N and η ∈ (0 , , P { ω i }∼ ρ n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ξ ( ω i ) − E [ ξ ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) K ≤ /η ) (cid:20) an + b √ n (cid:21)! ≥ η Proof.
A simplification of [15, Proposition 2]
D.2 Learning theory quantities
Proof of Proposition 5.1.
Write E [ Y − γ ( W )] ≤ E [ Y ] + 2 E [ γ ( W )] Since γ ∈ H , | γ ( w ) | = |h γ , φ ( w ) i H | ≤ √ κ k γ k H γ ∈ H implies α min0 ∈ H , so | α min0 ( w ) | = |h α min0 , φ ( w ) i H | ≤ √ κ k α min0 k H and likewise | ˆ α ( w ) | = |h ˆ α, φ ( w ) i H | ≤ √ κ k ˆ α k H Finally, I bound k ˆ α k H . By identical arguments to Proposition 3.1, ˆ α = argmin α ∈H R nλ ( α ) , R nλ ( α ) = E n [ α min0 ( W ) − α ( W )] + λ k α k H Hence λ k ˆ α k H ≤ R nλ (ˆ α ) ≤ R nλ (0) = E n [ α min0 ( W )] ≤ κ k α min0 k H Therefore k ˆ α k H ≤ √ κ k α min0 k H √ λ , | ˆ α ( w ) | ≤ κ k α min0 k H √ λ Proof of Proposition 5.2.
For the first result, write f = P ∞ k =1 a k ϕ k and appeal to [15, Remark 2].The second result follows from Proposition 2.4. Proof of Proposition 5.3.
The result follows from [15, Remark 2].
Proof of Theorem 5.1.
Immediate from [15, Theorems 1 and 2].
Definition D.1 (Learning theory quantities) . Define A ( λ ) = R ( α λ ) B ( λ ) = k α λ − α min0 k H N ( λ ) = T r (( T + λ ) − T ) Proposition D.4 (Bounds on learning theory quantities) . Under Assumption 5.3 A ( λ ) ≤ λ c R, B ( λ ) ≤ λ c − R, R = k T − c γ k H Under Assumption 5.4 N ( λ ) = T r (( T + λ ) − T ) ≤ ( β b π/b sin( π/b ) λ − b if b < ∞ J if b = ∞ Proof.
See [15, Proposition 3] for A ( λ ) and B ( λ ) . See [89, Section 1] for N ( λ ) .29 .3 High probability events In preparation for the main argument, I prove high probability events by concentration, appealing toProposition D.3.
Definition D.2 (High probability events) . Define E = ( k ( T + λ ) − ( ˆ T − T ) k L ( H , H ) ≤ /η ) " κλn + r κ N ( λ ) λn E = ( k ( T − ˆ T )( α λ − α min0 ) k H ≤ /η ) " κ p B ( λ ) n + r κ A ( λ ) n E = ( k ( T + λ ) − (ˆ g − ˆ T α min0 ) k H ≤ /η ) " n r Γ κλ + r Σ N ( λ ) n where Σ = k M k op + √ κ k α min0 k H , Γ = 2Σ
I design E to resemble [15, eq. 48]. Proposition D.5 (High probability events) . Suppose Assumptions 2.1 and 2.2 as well Assump-tions 5.1 and 5.2 hold. Then P ( E cj ) ≤ η , j ∈ { , , } Proof.
The argument for E immediately precedes [15, eq. 41]. The argument for E is identical to[15, eq. 43]. I verify E appealing to Proposition D.3. Define ξ i = ( T + λ ) − { M ∗ φ ( w i ) − [ φ ( w i ) ⊗ φ ( w i )] α min0 } Note that E [ ξ i ] = ( T + λ ) − { g − T α min0 } = 0 Towards concentration, I analyze k ξ k H and E k ξ k H
1. Bound on k ξ k H Write k ξ i k H ≤ k ( T + λ ) − k op k M ∗ φ ( w i ) − [ φ ( w i ) ⊗ φ ( w i )] α min0 k H ≤ √ λ ( √ κ k M k op + κ k α min0 k H )= a
2. Bound on E k ξ k H Write ξ i = ( T + λ ) − Qφ ( w i ) , Q = { M ∗ − α min0 ( w i ) I } Then by properties of trace, k ξ i k H = h ( T + λ ) − Qφ ( w i ) , Qφ ( w i ) i H = T r ( Q ∗ ( T + λ ) − Q [ φ ( w i ) ⊗ φ ( w i )]) ≤ k Q k op T r (( T + λ ) − Q [ φ ( w i ) ⊗ φ ( w i )])= k Q k op T r ( Q [ φ ( w i ) ⊗ φ ( w i )]( T + λ ) − ) ≤ k Q k op T r ([ φ ( w i ) ⊗ φ ( w i )]( T + λ ) − )= k Q k op T r (( T + λ ) − [ φ ( w i ) ⊗ φ ( w i )]) Clearly k Q k op ≤ k M k op + √ κ k α min0 k H Hence E k ξ k H ≤ (cid:8) k M k op + √ κ k α min0 k H (cid:9) · N ( λ ) = b
30. ConcentrationIn summary a = 2 √ λ ( √ κ k M k op + κ k α min0 k H ) , b = (cid:8) k M k op + √ κ k α min0 k H (cid:9) p N ( λ ) Therefore w.p. − η/ (cid:13)(cid:13)(cid:13) ( T + λ ) − (ˆ g − ˆ T α min0 ) (cid:13)(cid:13)(cid:13) H ≤ /η ) (cid:18) an + b √ n (cid:19) = 2 ln(6 /η ) √ κ k M k op + κ k α min0 k H ) n √ λ + (cid:8) k M k op + √ κ k α min0 k H (cid:9) p N ( λ ) √ n ! = 2 ln(6 /η ) n r Γ κλ + r Σ N ( λ ) n ! D.4 Main argumentProposition D.6 (Abstract rate) . Suppose Assumptions 2.1 and 2.2 as well Assumptions 5.1 and 5.2hold. If C η = 4 ·
96 ln (6 /η ) , n ≥ C η κ N ( λ ) λ , λ ≤ k T k op , Σ = k M k op + √ κ k α min0 k H then w.p. − η R (ˆ α ) ≤ C η (cid:18) A ( λ ) + κ B ( λ ) n λ + κ A ( λ ) nλ + κ Σ nλ + Σ N ( λ ) n (cid:19) Proof.
I proceed in steps, following the proof structure of [15, Theorem 4].1. DecompositionBy [15, eq. 36], R (ˆ α ) ≤ {A ( λ ) + S ( λ, { w i } ) + S ( λ, { w i } ) } where S ( λ, { w i } ) = (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − (ˆ g − ˆ T α min0 ) (cid:13)(cid:13)(cid:13) H S ( λ, { w i } ) = (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − ( T − ˆ T )( α λ − α min0 ) (cid:13)(cid:13)(cid:13) H
2. Bound on S ( λ, { w i } ) Write S ( λ, { w i } ) ≤ (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ( T − ˆ T )( α λ − α min0 ) (cid:13)(cid:13)(cid:13) H (a) Bound on (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − (cid:13)(cid:13)(cid:13) op By [15, eq. 39], under E (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − (cid:13)(cid:13)(cid:13) op ≤ √ λ (cid:13)(cid:13)(cid:13) ( T − ˆ T )( α λ − α min0 ) (cid:13)(cid:13)(cid:13) H Under E (cid:13)(cid:13)(cid:13) ( T − ˆ T )( α λ − α min0 ) (cid:13)(cid:13)(cid:13) H ≤ /η ) κ p B ( λ ) n + r κ A ( λ ) n ! In summary, under E and E S ( λ, { w i } ) ≤ (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ( T − ˆ T )( α λ − α min0 ) (cid:13)(cid:13)(cid:13) H ≤ λ · { /η ) } · ( κ p B ( λ ) n ) + (r κ A ( λ ) n ) = 8 ln (6 /η ) (cid:18) κ B ( λ ) n λ + κ A ( λ ) nλ (cid:19)
3. Bound on S ( λ, { w i } ) Write S ( λ, { w i } ) ≤ (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − ( T + λ ) (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ( T + λ ) − (ˆ g − ˆ T α min0 ) (cid:13)(cid:13)(cid:13) H (a) Bound on (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − ( T + λ ) (cid:13)(cid:13)(cid:13) op By [15, eq. 47], under E (cid:13)(cid:13)(cid:13) T ( ˆ T + λ ) − ( T + λ ) (cid:13)(cid:13)(cid:13) op ≤ (b) Bound on (cid:13)(cid:13)(cid:13) ( T + λ ) − (ˆ g − ˆ T α min0 ) (cid:13)(cid:13)(cid:13) H Under E k ( T + λ ) − (ˆ g − ˆ T α min0 ) k H ≤ /η ) " n r Γ κλ + r Σ N ( λ ) n Therefore by [15, eq. 49], under E and E S ( λ, { w i } ) ≤
32 ln (6 /η ) (cid:18) κ Γ n λ + Σ N ( λ ) n (cid:19) Proof of Theorem 5.2.
Finally, I combine the abstract rate in Proposition D.6 with the bounds ofProposition D.4, following [89, Section 2]. The absolute constant depends only on ( R, κ, Σ , β, b, c ) .Note that Σ introduces dependence on ( k M k op , k α min0 k H ) . Proof of Corollary 5.1.
I verify the conditions of Corollary 4.1 appealing to Theorems 5.1 and 5.2.In particular, by the symmetry of Theorems 5.1 and 5.2, I must show ( p ¯ L m + ¯ α + ¯ α ′ + ¯ σ ) √ r n → , √ n · r n → By Proposition 2.1, p ¯ L m < ∞ is a constant. By Proposition 5.1, ¯ α < ∞ and ¯ σ < ∞ are alsoconstants. However, ¯ α ′ scales as λ − . Therefore the remaining conditions to verify are λ − √ r n → , √ n · r n → I consider each case1. b = ∞ . 32a) n · n − → (b) √ n · n − → b ∈ (1 , ∞ ) , c ∈ (1 , .(a) n bbc +1 · n − bcbc +1 → ⇐⇒ c > (b) √ n · n − bcbc +1 → ⇐⇒ bc > b ∈ (1 , ∞ ) , c = 1 .(a) ln − bb +1 ( n ) n bb +1 · ln bb +1 ( n ) n − bb +1 = 1 (b) √ n · ln bb +1 ( n ) · n − bb +1 → ⇐⇒ b > Proof of Corollary 5.2.
Since k α min0 k ∞ ≤ ¯ α , the trimming can only possibly improve mean squareerror, and the rates in Theorem 5.2 continue to hold for the trimmed estimator ˜ α defined in Algo-rithm 5.1.As in Corollary 5.1, I must show ( p ¯ L m + ¯ α + ¯ α ′ + ¯ σ ) √ r n → , √ n · r n → By Proposition 2.1, p ¯ L m < ∞ is a constant. By Proposition 5.1, ¯ α < ∞ and ¯ σ < ∞ are alsoconstants. Moreover, trimming implies that ¯ α ′ = ¯ α < is a constant. Therefore the remainingcondition to verify is √ n · r n → In the proof of Corollary 5.1, I verify this condition for across regimes.
E Tuning
E.1 Ridge penalty