Bias-Aware Inference in Regularized Regression Models
Timothy B. Armstrong (Yale University), Michal Kolesár (Princeton University), Soonwoo Kwon (Yale University)
December 29, 2020
Abstract
We consider inference on a regression coefficient under a constraint on the magnitude of the control coefficients. We show that a class of estimators based on an auxiliary regularized regression of the regressor of interest on control variables exactly solves a tradeoff between worst-case bias and variance. We derive "bias-aware" confidence intervals (CIs) based on these estimators, which take into account possible bias when forming the critical value. We show that these estimators and CIs are near-optimal in finite samples for mean squared error and CI length. Our finite-sample results are based on an idealized setting with normal regression errors with known homoskedastic variance, and we provide conditions for asymptotic validity with unknown and possibly heteroskedastic error distribution. Focusing on the case where the constraint on the magnitude of control coefficients is based on an ℓ_p norm (p ≥ 1), we derive the rate at which the optimal CI shrinks under high-dimensional asymptotics.

∗ Parts of this paper include material from Section 4 of the working paper Armstrong and Kolesár (2016), which was taken out in the final published version (Armstrong and Kolesár, 2018). An earlier version of this paper was circulated under the title "Optimal Inference in Regularized Regression Models." We thank Mark Li and Ulrich Müller for sharing their code. Kolesár acknowledges support by the Sloan Research Fellowship.
† email: [email protected]
‡ email: [email protected]
§ email: [email protected]

1 Introduction
We are interested in estimation and inference on a scalar coefficient β in a linear regression model
Y_i = w_i β + z_i'γ + ε_i,  i = 1, ..., n,  (1)
where the k-vector of controls z_i may be large. In such settings, the classic ordinary least squares (OLS) estimator exhibits variance that is too large to yield informative results, and it is not even defined when k > n. To ameliorate this, the regularized regression literature has considered modifying the OLS objective function to penalize large values of γ, thereby lowering variance at the cost of increased bias.

The most popular of these approaches is to use the lasso (Tibshirani, 1996) or other variants of ℓ_1 penalization (e.g. Candès and Tao, 2007; Belloni et al., 2011). There is a large literature (see, e.g. Bühlmann and van de Geer, 2011, for a review) showing favorable mean squared error (MSE) properties of these estimates under the assumption of sparsity on γ. For inference, several papers have proposed CIs based on "double lasso" estimators (see, among others, Belloni et al., 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014; Zhang and Zhang, 2014), with asymptotic justification relying on rate conditions for the sparsity of γ. However, in many applications in economics, the sparsity assumption may not be compelling. Furthermore, it is unclear what sparsity bound this approach implicitly imposes in a given finite sample.

In this paper, we take a different approach. Our approach is based on imposing an a priori bound on the magnitude of the control coefficients, formalized using a penalty function Pen(·): we assume Pen(γ) ≤ C. In our leading specifications, we take the penalty to be an ℓ_p norm, but our framework can incorporate any restrictions on γ that place it in a convex symmetric set. For example, if z_i'γ is a basis approximation to some smooth function, we can define Pen(γ) to incorporate bounds on the derivatives of this function. The regularity parameter C plays a role analogous to a sparsity bound. (While we rule out sparsity constraints, which are non-convex, our results have implications for this case as well; see Section 5 for a discussion and comparison.)

In this setting, we obtain sharp finite-sample results deriving near-optimal estimators and CIs under the idealized assumption that the regression errors ε_i are Gaussian with a known homoskedastic variance. We also study the optimal rates of convergence under high-dimensional asymptotics when k ≫ n. Finally, we discuss the use of heteroskedasticity-robust variance estimators to form feasible versions of our CIs, and give conditions for their asymptotic validity.

Our main finite-sample result shows that the class of estimators that exactly resolves the tradeoff between worst-case bias and variance can be obtained in two steps: (1) running a regularized regression of w_i on z_i using Pen(·) as the penalty function with weight λ, and then (2) using the residuals from this regression as the instrument in the regression of Y_i on w_i. CIs based on these estimators can be constructed by using a critical value that incorporates the worst-case bias of the estimator, which we show can be obtained automatically as a byproduct of the regularized regression in step (1). These CIs are "bias-aware" as they account for the potential finite-sample bias of the estimator, and they are consequently valid in finite samples in the idealized Gaussian setting.
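As a concrete illustration of this two-step construction, the following minimal Python sketch (ours, not the authors'; the function `two_step_estimate` and its arguments are illustrative) uses a ridge regression for step (1), which corresponds to an ℓ_2 choice of Pen(·):

```python
import numpy as np

def two_step_estimate(Y, w, Z, lam):
    """Two-step sketch: (1) ridge (ell_2-penalized) regression of w on Z,
    (2) use the residuals as an instrument for w in the regression of Y on w."""
    k = Z.shape[1]
    # Step (1): regularized "propensity score" regression of w on Z with weight lam.
    pi_lam = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ w)
    resid = w - Z @ pi_lam
    # Step (2): residuals as instrument in the regression of Y on w.
    beta_hat = (resid @ Y) / (resid @ w)
    return beta_hat, resid, pi_lam
```

Increasing `lam` places more weight on the penalty, moving the estimator toward the short regression that omits the penalized controls; decreasing it moves toward the long regression.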
We show how to choose the tuning parameter λ to optimize the MSE of the resultingestimator, or the length of the resulting CI.We also consider the behavior of the bias-aware CI under high dimensional asymptoticswhere k ≫ n , and ( w i , z ′ i ) ′ are independent over i with eigenvalues of the variance matrixbounded away from zero and infinity. We derive the rate at which the optimal CI shrinksin the case where Pen( γ ) is an ℓ p norm. We show that, in the case where k ≫ n and C does not shrink with n , the optimal CI shrinks more slowly than n − / , so that the bias termasymptotically dominates. Furthermore, we show that, in the ℓ case, this rate cannot beimproved even if one imposes the same ℓ bound in the regression of w i on z i , as well as acertain degree of sparsity in both of these regressions.As a key input for our approach, we require the researcher to explicitly specify theregularity parameter C bounding the magnitude of Pen( γ ). Our efficiency bounds show thatit is impossible to automate the choice of C when forming CIs. We therefore recommendvarying C as a form of sensitivity analysis and reporting a “breakdown” value given by thelargest value of C such that a given finding, such as rejecting a particular null hypothesis,holds. We discuss how the choice of C can be guided by relating it to regression R , and wepresent a lower CI for C that can be used as a specification check to ensure that the chosenvalue is not too low. As we discuss further in Section 5.2, CIs that do not choose regularityconstants (such as C or, for sparsity-based approaches, the sparsity bound) explicitly involveimplicit choices of these parameters. Our finite-sample approach has the advantage of makingsuch choices explicit. This ensures that our coverage guarantees and efficiency bounds arenot merely based on “asymptotic promises” about tuning parameters that may be hard toevaluate in a particular sample.Our results relate to several strands of literature. Our procedures and efficiency boundsapply the general theory for linear functionals in convex Gaussian models developed inIbragimov and Khas’minskii (1985), Donoho (1994), Low (1995) and Armstrong and Koles´ar(2018). In particular, the optimal estimators are linear in outcomes, and the CIs are “bias-aware” fixed-length confidence interval (FLCI) centered at such estimators. Our results addto a growing recent literature applying this approach to various settings, including Armstrong3nd Koles´ar (2020a,b), Koles´ar and Rothe (2018), Imbens and Wager (2019), Rambachanand Roth (2019), Noack and Rothe (2020), and Kwon and Kwon (2020). Muralidharan et al.(2020) apply the approach in the present paper to experiments with factorial designs andbounds on interaction effects.The class of estimators we derive and, in particular, the idea of incorporating a regressionof w i on z i to estimate β , is related to various estimators proposed for this problem, goingback at least to the work of Robinson (1988) on the partly linear model. Our results providea novel finite-sample justification for this idea, as well as an exact result giving the optimalform of this regression and the optimal estimator that incorporates it. Our results allow for ageneral form of Pen( · ), which reduce to existing estimators in a few special cases: our resultscan in such cases be used to derive novel bias-aware CIs to accompany these estimators.The results of Li (1982) imply that the optimal estimator uses ridge regression when thepenalty corresponds to an ℓ norm. 
Li and M¨uller (2020) consider the weighted ℓ normPen( γ ) = ( P ni =1 ( z ′ i γ ) ) / . They take a somewhat different approach, which leverages theparticular invariance properties of this penalty function. Heckman (1988) derives optimallinear estimators in the partly linear model, where the penalty function bounds the first orsecond derivative of a univariate nonparametric regression function.The problem of estimation and CI construction for β is distinct from the problem ofestimation of the regression function itself or the entire parameter vector, using global loss.For the latter problem, see Zhang (2013) for the case where p ≤ γ that we consider when p = 1) and Shao and Deng (2012) for p = 2.These papers also differ from the present paper in focusing on asymptotic results.The rest of this paper is organized as follows. Section 2 presents finite-sample results inthe idealized model with Gaussian errors. Section 3 discusses implementation in the morerealistic setting with unknown error distribution. Section 4 presents asymptotic character-izations of the efficiency bounds in the high dimensional setting under bounds on an ℓ p norm. Section 5 compares our approach to CIs designed for sparsity constraints. Proofs andauxiliary results are in appendices. This section sets up an idealized version of our model with Gaussian homoskedastic errors.We then show how to construct estimators and CIs in this model that are near-optimal infinite samples. 4 .1 Setup
We write the model in eq. (1) in vector form as Y = wβ + Zγ + ε, (2)where w = ( w , . . . , w n ) ′ ∈ R n is the variable of interest with coefficient β ∈ R and Z =( z ′ , . . . , z ′ n ) ′ ∈ R n × k is a matrix of control variables. The design matrix X = ( w, Z ) is treatedas fixed. To obtain finite-sample results, we further assume that the errors are normal andhomoskedastic, ε ∼ N (0 , σ I n ) , (3)with σ known. We discuss implementation with possibly heteroskedastic and non-Gaussianerrors in Section 3. To make inference on β informative in settings where k is large relativeto n (including the case where k > n ), the researcher needs to make a priori restrictions onthe control coefficients γ . We assume that these restrictions can be formalized by restrictingthe parameter space for ( β, γ ′ ) ′ to be R × Γ where, for some linear subspace G of R k ,Γ = Γ( C ) = { γ ∈ G : Pen( γ ) ≤ C } , where Pen( · ) is a seminorm on G . (4)The requirement that Pen( · ) be a seminorm means that it satisfies the triangle inequality(Pen( γ + ˜ γ ) ≤ Pen( γ ) + Pen(˜ γ )), and homogeneity (Pen( cγ ) = | c | Pen( γ ) for any scalar c ),but, unlike a norm, it is not necessarily positive definite (Pen( γ ) = 0 does not imply γ = 0).This allows us to cover settings where only a subset of the control coefficients is restricted.To illustrate our methods, we focus on the case where Pen( · ) corresponds to a weighted ℓ p norm. To this end, partition the controls into a set of k ≥ k = k − k additional controls, Z = ( Z , Z ). Partition γ = ( γ ′ , γ ′ ) ′ accordingly.Let H A denote the projection matrix onto the column space of a matrix A . Example 2.1 ( ℓ penalty) . We specify the penalty asPen( γ ) = k M γ k = p γ ′ M ′ M γ, (5)where the k × k matrix M incorporates scaling the variables and picking out which variablesare to be constrained. If M = (0 , I k ), then Pen( γ ) = k γ k , with γ unconstrained. Setting M = (0 , ( Z ′ ( I − H Z ) Z /n ) / ) corresponds to the specification considered in Li and M¨uller(2020), which restricts the average of the squared mean effects z ′ i γ on Y i , after controllingfor the baseline controls Z . 5 xample 2.2 ( ℓ penalty) . We replace the norm in eq. (5) with an ℓ norm. For simplicity,we focus on the unweighted case, setting M = (0 , I k ). This leads to Pen( γ ) = k γ k = P kj = k +1 | γ j | .In addition to the choice of the penalty, the specification of Γ also requires the researcherto choose the regularity parameter C ; here we take it as given, and defer a discussion of thischoice to Section 3.While we have formulated the parameter space Γ in terms of a seminorm, this formulationis not restrictive in the sense that essentially any convex set Γ that is symmetric ( γ ∈ Γimplies − γ ∈ Γ) can be defined in this way (see Yosida, 1995, Proposition 5, p. 26). Whilewe rule out non-convex constraints on Γ, such as sparsity, our results nonetheless haveimplications for such settings, as we discuss in Section 5.Our goal is to construct estimators and CIs for β . To evaluate estimators ˆ β of β , we con-sider their worst-case performance over the parameter space R × Γ under the MSE criterion,R
MSE ( ˆ β ; Γ) = sup β ∈ R ,γ ∈ Γ E β,γ [( ˆ β − β ) ] , where E β,γ denotes expectation under ( β, γ ′ ) ′ . To evaluate CIs, we first require that theysatisfy a coverage requirement. A 100 · (1 − α )% CI with half-length ˆ χ = ˆ χ ( Y, X ) is aninterval { ˆ β ± ˆ χ } that satisfiesinf β ∈ R ,γ ∈ Γ P β,γ (cid:16) β ∈ { ˆ β ± ˆ χ } (cid:17) ≥ − α, where P β,γ denotes probability under ( β, γ ′ ) ′ . To compare two CIs under a particular pa-rameter vector ( β, γ ′ ) ′ , we prefer the one with shorter expected length E β,γ [2 ˆ χ ]. Note thatoptimizing expected length will not necessarily lead to CIs centered at an estimator ˆ β thatis optimal under the MSE criterion. We start by considering estimators that are linear in the outcomes Y , ˆ β = a ′ Y , and we showhow to construct CIs based on such estimators. The n -vector of weights a may depend onthe design matrix X or the known variance σ . In Section 2.3 below, we show how to choosethe weights a optimally, and in Section 2.4 we show that when a is optimally chosen, theresulting estimators and CIs are optimal among all procedures, not just linear ones.Under a given parameter vector ( β, γ ′ ) ′ , the bias of ˆ β = a ′ Y is given by a ′ ( wβ + Zγ ) − β . As ( β, γ ′ ) ′ ranges over the parameter space R × Γ, the bias ranges over the6et [ − bias Γ ( ˆ β ) , bias Γ ( ˆ β )], wherebias Γ ( ˆ β ) = sup β ∈ R ,γ ∈ Γ a ′ ( wβ + Zγ ) − β (6)denotes the worst-case bias. The variance of ˆ β does not depend on ( β, γ ′ ) ′ , and is given byvar( ˆ β ) = σ a ′ a .To form a CI centered at ˆ β , note that the t -statistic ( ˆ β − β ) / var( ˆ β ) / follows a N ( b, | b | ≤ bias Γ ( ˆ β ) / var( ˆ β ) / . Thus, denoting the 1 − α quantile of a |N ( B, | distribution by cv α ( B ), a two-sided CI can be formed as ˆ β ± χ, where χ = var( ˆ β ) / · cv α (cid:16) bias Γ ( ˆ β ) / var( ˆ β ) / (cid:17) . (7)We refer to this as a fixed-length confidence interval (FLCI), following the terminology inDonoho (1994), since its length 2 χ is fixed: it depends only on the non-random design matrix X , and known variance σ , but not on Y or the parameter vector ( β, γ ′ ) ′ . Both the MSE R ( ˆ β ; Γ) = bias Γ ( ˆ β ) + var( ˆ β ) and the CI half-length χ given in eq. (7) areincreasing in the variance of ˆ β and in its worst-case bias bias Γ ( ˆ β ). Therefore, to find theoptimal weights, it suffices to minimize variance subject to a bound B on worst-case bias,which we can write as min a ∈ R a ′ a s.t. sup β ∈ R ,γ ∈ Γ a ′ ( wβ + Zγ ) − β ≤ B. (8)We can then vary the bound B to find the optimal tradeoff for the given criterion (MSE orCI length). Since this optimization does not depend on the outcome data Y , optimizing theweights in this way does not affect the coverage properties of the resulting CI.Our main computational result shows that the optimization problem in eq. (8) can becomputed using regularized regression of w on Z . With slight abuse of terminology, we referto this regression as a propensity score regression (even though we do not require w i to bebinary). To state the result, let π ∗ λ denote the coefficient estimate on Z in a regularizedpropensity score regression with penalty Pen( π ),min π k w − Zπ k s.t. Pen( π ) ≤ t λ , (9) The critical value cv − α ( B ) can be computed as the square root of the 1 − α quantile of a non-central χ distribution with 1 degree of freedom and non-centrality parameter B . t λ is a bound on the penalty term. Here, λ indexes the weight placed on the constraintin eq. (9). 
It can be the Lagrange multiplier in a Lagrangian formulation of (9), or we can solve (9) directly and take t_λ = λ.
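For concreteness, the bias-aware critical value cv_α(B) described in the footnote above can be computed as follows (a minimal sketch assuming SciPy is available; not part of the paper):

```python
import numpy as np
from scipy.stats import ncx2

def cv(B, alpha=0.05):
    """Critical value cv_alpha(B): 1-alpha quantile of |N(B,1)|, computed as the square
    root of the 1-alpha quantile of a noncentral chi^2_1 with noncentrality B^2."""
    return np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=B**2))

# cv(0) matches the usual two-sided normal critical value (about 1.96 at alpha = 0.05);
# as the bias-to-standard-deviation ratio B grows, cv(B) approaches B + z_{1-alpha}.
```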
Let π*_λ be a solution to (9), and suppose that ‖w − Zπ*_λ‖ > 0. Then
a*_λ = (w − Zπ*_λ) / ((w − Zπ*_λ)'w)
solves (8) with the bound given by B = C t_λ^{-1} · (w − Zπ*_λ)'Zπ*_λ / ((w − Zπ*_λ)'w). Consequently, the worst-case bias and variance of the estimator
β̂_λ = a*_λ'Y = (w − Zπ*_λ)'Y / ((w − Zπ*_λ)'w)   (10)
are given by bias_Γ(β̂_λ) = C B_λ and V_λ = σ²‖w − Zπ*_λ‖² / [(w − Zπ*_λ)'w]², where
B_λ = (1/Pen(π*_λ)) · (w − Zπ*_λ)'Zπ*_λ / ((w − Zπ*_λ)'w).   (11)

The result follows by applying the general theory of Ibragimov and Khas'minskii (1985), Donoho (1994), Low (1995), and Armstrong and Kolesár (2018) to our setting, which allows us to rewrite (8) as a convex optimization problem. Solving this convex problem then yields the result. Theorem 2.1 shows that the class of linear estimators that optimally trade off bias and variance (i.e. they solve eq. (8) for some B) can be obtained by a simple two-step procedure. In the first step, we estimate a penalized propensity score regression (9), indexed by the penalty term λ, with the penalty function Pen(·) that determines Γ. In the second step, we use the residuals w − Zπ*_λ from this regression as instruments in a regression of Y on w. The penalties λ*_MSE and λ*_FLCI that yield linear estimators β̂_{λ*_MSE} and β̂_{λ*_FLCI} that optimize the MSE criterion, and yield the shortest CI length (which for linear estimators is fixed; see eq. (7)), correspond to the solutions to the univariate optimization problems
λ*_MSE = argmin_λ { V_λ + (C B_λ)² },   λ*_FLCI = argmin_λ cv_α(C B_λ / √V_λ) · √V_λ,   (12)
respectively, where V_λ and B_λ are given in (11).

As t_λ →
0, then, provided that Pen( · ) is a norm on Z , ˆ β λ converges to the short regressionestimate ˆ β short = w ′ ( I − H Z ) Yw ′ ( I − H Z ) w that only includes the unrestricted controls Z . This estimatorminimizes variance among all linear estimators with finite worst-case bias. In the otherdirection, as t λ → ∞ , ˆ β λ converges to the long regression estimate ˆ β long = w ′ ( I − H Z ) Yw ′ ( I − H Z ) w , providedthat w is not in the column space of Z (which ensures that the condition k w − Zπ ∗ λ k > λ ). This estimator minimizes variance among all linear estimatorsthat are unbiased, so Theorem 2.1 reduces to the Gauss-Markov theorem in this case. Inother words, the short and long regressions are corner solutions of the bias-variance tradeoff,8n which weight is entirely placed on variance, or on bias. Example 2.1 ( ℓ penalty, continued) . In this case, a convenient Lagrangian formulationfor (9) is π ∗ λ = argmin π k w − Zπ k + λ k M π k , If Z ′ Z + λM ′ M is invertible , taking first order conditions immediately leads to the closedform solution π ∗ λ = ( Z ′ Z + λM ′ M ) − Z ′ w which is a (generalized) ridge regression estimator of the propensity score. Simple algebrashows that ˆ β λ = ( w − Zπ ∗ λ ) ′ Y ( w − Zπ ∗ λ ) ′ w = e ′ X ′ X + λ M ′ M − X ′ Y, (13)where e = (1 , , . . . , ′ is the first standard basis vector. Thus, the optimal estimate canalso be obtained from a generalized ridge regression of Y onto X . The optimality of ridgeregression in this setting was shown by Li (1982), and the above derivation gives this resultas a special case of Theorem 2.1. If M = (0 , ( Z ′ ( I − H Z ) Z /n ) / ), then the estimatorfurther simplifies to a weighted average of the short and long regression estimates,ˆ β λ = ω ( λ ) ˆ β short + (1 − ω ( λ )) ˆ β long , with weights ω ( λ ) = λ/nλ/n + ς , ς = w ′ ( I − H Z ) ww ′ ( I − H Z ) w = var( ˆ β short )var( ˆ β long ) . The weight on the short regression increases with λ (as the relative weight on variance inthe bias-variance tradeoff increases), and decreases with ς . Example 2.2 ( ℓ penalty, continued) . In this case, the solution to (9) is given by a variantof the lasso estimate (Tibshirani, 1996) that only penalizes γ .The resulting estimator ˆ β λ is related to estimators recently proposed for constructingCIs using the lasso (see, among others, Zhang and Zhang, 2014; Javanmard and Montanari,2014; van de Geer et al., 2014; Belloni et al., 2014). These papers propose estimators for β that combine lasso estimates from the outcome regression of Y onto X with lasso estimates This holds so long as no element π = 0 satisfies Zπ = 0 and M π = 0 simultaneously. Intuitively, if Z has rank less than k , then the data is not informative about certain directions π , and we require the matrix M to place sufficient restrictions on π in these directions. The term “ridge regression” is sometimes reserved for the case where M ′ M = I k . Here, we use the termto include generalizations such as this one. Y . Incontrast, our estimator only uses lasso estimates for the propensity score regression, and islinear in Y . We give a detailed comparison between our estimator and this “double lasso”approach in Section 5. Example 2.3 (Partly linear model) . 
To flexibly control for a low-dimensional set of covari-ates ˜ z i , one may specify a semiparametric model y i = w i β + h (˜ z i ) + ε i , g Pen( h ) ≤ e C, where the penalty g Pen( h ) is a seminorm on functions h ( · ) that penalizes the “roughness”of h , such as the H¨older or Sobolev seminorm of order q . Minimax linear estimation inthis model for particular choices of g Pen( h ) has been considered in Heckman (1988). Thissetting also falls directly into our setup by defining Z = I n , γ i = h (˜ z i ), and Pen( γ ) =min h : h (˜ z i )= γ i , i =1 ,...n g Pen( h ) (assuming the minimum is taken). Theorem 2.1 then impliesthat the optimal estimator takes the formˆ β λ = P ni =1 ( w i − g ∗ λ (˜ z i )) Y i P ni =1 ( w i − g ∗ λ (˜ z i )) w i , where g ∗ λ ( · ) is analogous to the regularized regression estimate π ∗ λ in (9): it solvesmin g n X i =1 ( w i − g (˜ z i )) s.t. g Pen( g ) ≤ t λ . When g Pen is the Sobolev seminorm, this yields a spline estimate g ∗ λ (see, for example Wahba,1990). The partly linear model was treated in an influential paper by Robinson (1988), aswell as earlier papers cited therein. Interestingly, the estimator proposed by Robinson (1988)takes a similar form to the estimator ˆ β λ , involving residuals from a nonparametric regressionof w on ˜ z i . While the analysis in Robinson (1988) is asymptotic, our results imply that aversion of this estimator has sharp finite-sample optimality properties. So, far we have restricted attention to procedures that are linear in the outcomes Y . Wenow show that the estimator ˆ β λ ∗ MSE , and CIs based on the estimator ˆ β λ ∗ FLCI are in fact highlyefficient among all procedures, not just linear ones. This is due to the fact that the parameterspace Γ is convex and symmetric, and follows from the general results in Donoho (1994),Low (1995) and Armstrong and Koles´ar (2018) for estimation of linear functionals in normal10odels with convex parameter spaces.
Corollary 2.1.
Let λ*_MSE and λ*_FLCI be given in eq. (12), where the optimization is over all λ with t_λ > 0 such that ‖w − Zπ*_λ‖ > 0. Let β̂_λ, B_λ and V_λ be given in (11). Let β̃ and β̃ ± χ̃ denote some other (possibly non-linear) estimator and some other (possibly non-linear, variable-length) CI.

(i) For any λ, sup_{β∈R, γ∈Γ} var_{β,γ}(β̃) ≤ V_λ implies bias_Γ(β̃) ≥ CB_λ, and bias_Γ(β̃) ≤ CB_λ implies sup_{β∈R, γ∈Γ} var_{β,γ}(β̃) ≥ V_λ.

(ii) The worst-case MSE improvement of β̃ over β̂_{λ*_MSE} is bounded by
R_MSE(β̃; Γ) / R_MSE(β̂_{λ*_MSE}; Γ) ≥ κ*_MSE(X, σ, Γ) ≥ 0.8,
where κ*_MSE(X, σ, Γ) is given in Appendix A.2.

(iii) The improvement of the expected length of the CI β̃ ± χ̃ over the optimal linear FLCI β̂_{λ*_FLCI} ± cv_α(CB_{λ*_FLCI}/V^{1/2}_{λ*_FLCI}) V^{1/2}_{λ*_FLCI} at γ = 0 and any β is bounded by
E_{β,0}[χ̃] / [cv_α(CB_{λ*_FLCI}/V^{1/2}_{λ*_FLCI}) V^{1/2}_{λ*_FLCI}] ≥ κ*_FLCI(X, σ, Γ),
where κ*_FLCI(X, σ, Γ) is given in Appendix A.2 and is at least 0.717 when α = 0.05.

By construction, the estimator β̂_λ minimizes variance among all linear estimators with a bound CB_λ on the bias (or equivalently, it minimizes bias among all linear estimators with a bound V_λ on the variance). Corollary 2.1(i) shows that this optimality property is retained if we enlarge the class of estimators to all estimators, including non-linear ones. As a result, the minimax linear estimator β̂_{λ*_MSE} (i.e. the estimator attaining the lowest worst-case MSE in the class of linear estimators) continues to perform well among all estimators, including non-linear ones: by Corollary 2.1(ii), its worst-case MSE efficiency is at least 80%. The exact efficiency bound κ*_MSE(X, σ,
Γ) depends on the design matrix, noise level, and particularchoice of the parameter space, and can be computed explicitly in particular applications.We have found that typically the efficiency is considerably higher.Finally, Corollary 2.1(iii) shows that it is not possible to substantively improve upon theFLCI based on ˆ β λ ∗ FLCI in terms of expected length when γ = 0, even if we consider variablelength CIs that “direct power” at γ = 0 (potentially at the expense of longer expectedlength when γ = 0). The construction of the FLCI may appear conservative: its lengthdepends on the worst-case bias over the parameter space for ( β, γ ′ ) ′ , which, as the proof of11heorem 2.1 shows, attains at γ = Ct − λ ∗ FLCI π ∗ λ ∗ FLCI , with Pen( γ ) = C . Therefore, one maybe concerned that when the magnitude of γ is much smaller than C , the FLCI is too long.Corollary 2.1(iii) shows that this is not the case, and the efficiency of the FLCI is at least71.7% relative to variable-length CIs that optimize their expected length when γ = 0. Theexact efficiency bound κ ∗ MSE ( X, σ,
Γ) can be computed explicitly in particular applications,and we have found that it is typically considerably higher than 71 . C that bounds Pen( γ ). In the present setting, anadaptive CI would have length that automatically reflects the true regularity Pen( γ ) whilemaintaining coverage under a conservative a priori bound on Pen( γ ). However, according toCorollary 2.1(iii), any CI must have expected width that reflects the conservative a prioribound C rather than the true regularity Pen( γ ), even when Pen( γ ) is much smaller than theconservative a priori bound C . In particular, it is impossible to automate the choice of theregularity parameter C when forming a CI. We therefore recommend varying C as a formof sensitivity analysis, or using auxiliary information to choose C ; see Remark 3.3. We now discuss practical implementation issues, allowing ε to be non-Gaussian and het-eroskedastic. As a baseline, we propose the following implementation: Algorithm 3.1 (Baseline implementation) .Input
Data (Y, X), penalty Pen(·), regularity parameter C, and initial estimates of residuals ε̂_{init,1}, ..., ε̂_{init,n}.
Output:
Estimator and CI for β.
1. Compute an initial variance estimator, σ̂² = n^{-1} Σ_{i=1}^n ε̂²_{init,i}, assuming homoskedasticity.
2. Compute the solution path {π*_λ}_{λ>0} for the regularized propensity score regression in eq. (9), indexed by the penalty weight λ. For each λ, compute β̂_λ as in eq. (10), and B_λ and V_λ as in eq. (11), with σ̂² in place of σ² in the formula for V_λ.
3. Compute λ*_MSE and λ*_FLCI as in eq. (12), and compute the robust variance estimate V̂_{λ,rob} = Σ_{i=1}^n a*²_{λ,i} ε̂²_{init,i}, where a*_λ = (w − Zπ*_λ)/((w − Zπ*_λ)'w).
Return the estimator β̂_{λ*_MSE} and the CI β̂_{λ*_FLCI} ± cv_α(C B_{λ*_FLCI} / V̂^{1/2}_{λ*_FLCI,rob}) · V̂^{1/2}_{λ*_FLCI,rob}.

Let us now discuss the implementation choices and the optimality and validity properties of the procedure in a series of remarks.

Remark 3.1 (Validity). As the initial residual estimates ε̂_{init,i}, we can take residuals from a regularized outcome regression of Y on X. We give conditions for asymptotic validity of the resulting CIs in Appendix B.2. The key requirement is that the maximal Lindeberg weight Lind(a*_λ) = max_{1≤i≤n} a*²_{λ,i} / Σ_{j=1}^n a*²_{λ,j} associated with the estimator β̂_λ shrink quickly enough relative to the error in the estimator used to form the residuals. Ensuring that Lind(a*_λ) is small prevents the estimator from putting too much weight on a particular observation, so that the Lindeberg condition for the central limit theorem holds. Whether these conditions hold for the optimal estimator will in general depend on the form of Pen(γ) and on the magnitude of C relative to n. To ensure that Lind(a*_λ) is small enough in a particular sample for a normal approximation to work well, one may impose a bound on this term by only minimizing eq. (12) over λ such that Lind(a*_λ) is small enough when computing λ*_FLCI. This is similar to proposals by Noack and Rothe (2020), and Javanmard and Montanari (2014) in other settings. See Appendix B.2 for further discussion.
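For concreteness, the following minimal Python sketch implements the baseline algorithm for an ℓ_2 penalty on all control coefficients (so the ridge regression of Example 2.1, without an unpenalized block), searching over a finite grid of λ values. The function `bias_aware_ridge`, the grid, and the reuse of the `cv` helper from above (repeated here for self-containment) are our illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.stats import ncx2

def cv(B, alpha=0.05):
    # Critical value for |N(B,1)|: sqrt of the 1-alpha quantile of a noncentral chi^2_1.
    return np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=B**2))

def bias_aware_ridge(Y, w, Z, C, eps_init, alpha=0.05, lam_grid=None):
    """Sketch of the baseline implementation with Pen(gamma) = ||gamma||_2."""
    n, k = Z.shape
    sigma2 = np.mean(eps_init**2)                 # step 1: initial (homoskedastic) variance
    if lam_grid is None:
        lam_grid = np.geomspace(1e-4, 1e4, 200)
    results = []
    for lam in lam_grid:                          # step 2: path of ridge propensity scores
        pi = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ w)
        resid = w - Z @ pi
        denom = resid @ w
        if denom <= 0:
            continue
        a = resid / denom                         # weights a*_lambda
        beta = a @ Y                              # estimator, eq. (10)
        worst_bias = C * np.linalg.norm(Z.T @ a)  # C * sup_{||gamma||_2 <= 1} a'Z gamma
        V = sigma2 * (a @ a)                      # homoskedastic variance
        V_rob = np.sum(a**2 * eps_init**2)        # robust variance estimate
        results.append((beta, worst_bias, V, V_rob))
    # step 3: choose lambda for MSE and for CI length, using homoskedastic V as in eq. (12)
    mse = [wb**2 + V for (_, wb, V, _) in results]
    flci = [cv(wb / np.sqrt(V), alpha) * np.sqrt(V) for (_, wb, V, _) in results]
    beta_mse = results[int(np.argmin(mse))][0]
    beta_ci, wb_ci, _, Vrob_ci = results[int(np.argmin(flci))]
    half = cv(wb_ci / np.sqrt(Vrob_ci), alpha) * np.sqrt(Vrob_ci)
    return beta_mse, (beta_ci - half, beta_ci + half)
```

Note that λ is selected using the homoskedastic variance, while the reported interval uses the robust variance estimate, mirroring the EHW-style approach described in Remark 3.2 below.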
Remark 3.2 (Efficiency) . The weights a ∗ λ ∗ FLCI and a ∗ λ ∗ MSE are not optimal under heteroskedas-ticity. One could in principle generalize the feasible generalized least squares (FGLS) ap-proach used for unconstrained estimation by deriving optimal weights under the assumption ε ∼ N (0 , Σ) (which simply follows the above analysis after pre-multiplying by Σ − / ), andderive conditions under which the estimator and CI that plug in an estimate of Σ are optimalasymptotically when the assumption of known variance and Gaussian errors is dropped. Weinstead generalize the common approach of reporting OLS with Eicker-Huber-White (EHW)standard errors in the unconstrained setting. The optimal weights a ∗ λ are computed underthe assumption of homoskedasticity, but we use a robust standard error to compute the CIto ensure its validity when this assumption is violated. Remark 3.3 (Choice of C ) . By Corollary 2.1(iii), one cannot use a data-driven rule toautomate the choice of C when forming a CI. We therefore recommend varying C as a formof sensitivity analysis, and reporting a “breakdown value” C ∗ as the largest value of C suchthat some empirical finding holds.In settings where the plausible values of γ cannot be assessed using prior knowledge, onecan relate the magnitude of Pen( γ ) to other quantities. One possibility is to run a regularizedoutcome regression of Y on X , with the constraint Pen( γ ) ≤ C and report R ( C ) = 1 − P ni =1 ˆ ε i,C P ni =1 ( Y i − ¯ Y ) as a function of C , where { ˆ ε i,C } are the residuals from this regression, and Y = n P ni =1 Y i . The quantity R (0) corresponds to the R in the regression with only the baselinecontrols Z included. One can then examine how R ( C ) varies with C to relate bounds on13en( γ ) to R . This mirrors the common practice in empirical applications in economics ofexamining how the magnitude of regression estimates and R change when regressors areadded (see Oster, 2019, for further discussion and references). We note, however, that due tothe impossibility result described above, additional assumptions would be needed to formallyjustify choosing C based on such a procedure.Finally, one can form a lower CI [ ˆ C, ∞ ) for C to assess the plausibility of a given boundon Pen( γ ). We present such a CI in Appendix B.3 for the case where Pen( γ ) imposes an ℓ p constraint. Such CIs can be useful as a specification check to ensure that the chosen valueof the regularity parameter C is not too small. Remark 3.4 (Computational issues) . Step 2 involves computing the solution path of aregularized regression estimator. Efficient algorithms exist for computing these paths under ℓ penalties and its variants (Efron et al., 2004; Rosset and Zhu, 2007). Under ℓ penalty, theregularized regression has a closed form, so that our algorithm can again be implemented in acomputationally efficient manner. For other types of penalties, the convexity of optimizationproblem in eq. (9) can be exploited to yield efficient implementation. We also note that sincethe solution path π ∗ λ does not depend on C , it only needs to be computed once, even whenmultiple choice of C are considered in a sensitivity analysis. We now consider the asymptotic behavior of CIs and efficiency bounds as n → ∞ . For easeof notation, we assume all coefficients are constrained, and focus on the case Pen( γ ) = k γ k p for some p ≥
1, and the case Pen( γ ) = k Zγ/ √ n k (see Example 2.1). We allow for sequences C = C n for the bound on Pen( γ ), which may go to 0 or ∞ with the sample size, as well ashigh dimensional asymptotics where k = k n ≫ n . We consider a standard “high dimensional”setting, placing conditions on the design matrix X that hold with high probability when w i , z i are drawn i.i.d. over i , with the eigenvalues of var(( w i , z ′ i ) ′ ) bounded away from zeroand infinity.Let q ∈ [0 , ∞ ] denote the H¨older conjugate of p , satisfying 1 /p + 1 /q = 1. We will showthat when Pen( γ ) = k γ k p , the optimal linear FLCI shrinks at the rate n − / + Cr q ( k, n ) where r q ( k, n ) = k /q / √ n if q < ∞ , √ log k/ √ n if q = ∞ . . (14)Furthermore, for p = 1 and p = 2, we will show that no other CI can shrink at a faster rate.For p = 1, we will in fact prove a stronger result showing that imposing sparsity bounds on14he outcome and propensity score regressions, in addition to the bound on Pen( γ ), does nothelp achieve a faster rate, unless one assumes sparsity of order greater than C n p n/ log( k )(termed the “ultra sparse” case in Cai and Guo (2017)). For the case Pen( γ ) = k Zγ/ √ n k ,we will show that the optimal rate is given by n − / + C when k > n .In the case where C = C n does not decrease to zero with n , these rates require p < q >
2 (equivalently, q > 2) for consistent estimation when k/n → ∞. In the case where p = 1, we can then allow k to grow exponentially with n, whereas the cases 1 < p < 2 allow k/n → ∞ only with k growing at a polynomial rate in n that depends on p. Since taking C_n → 0 amounts to a stronger a priori restriction on γ, these comparisons favor smaller values of p, with p = 1 offering the best rate conditions. It also follows from these rate results that if C_n = C does not decrease to zero with n, the bias term can dominate asymptotically, making it necessary to explicitly account for bias in CI construction even in large samples. To state the result, given η >
0, let E_n(η) denote the set of design matrices X for which there exists δ ∈ R^k such that
(1/n)‖w − Zδ‖² ≤ η^{-1},  (1/n)w'(w − Zδ) ≥ η,  (1/n)‖Z'(w − Zδ)‖_q ≤ r_q(k, n) η^{-1}.
Let R*_FLCI(X, C) = 2 cv_α(C B_{λ*_FLCI}/V^{1/2}_{λ*_FLCI}) · V^{1/2}_{λ*_FLCI} denote the length of the optimal linear FLCI.
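As a small numerical illustration of the benchmark in eq. (14), the following helper (an illustrative sketch, not from the paper) computes r_q(k, n) and the implied order n^{-1/2} + C·r_q(k, n) of the optimal FLCI length:

```python
import numpy as np

def r_q(k, n, q):
    """r_q(k, n) from eq. (14): k^(1/q)/sqrt(n) for q < infinity, sqrt(log k)/sqrt(n) for q = infinity."""
    if np.isinf(q):
        return np.sqrt(np.log(k)) / np.sqrt(n)
    return k ** (1.0 / q) / np.sqrt(n)

def flci_rate(k, n, C, p):
    """Order n^(-1/2) + C * r_q(k, n) of the optimal FLCI length, with q the Hoelder conjugate of p."""
    q = np.inf if p == 1 else p / (p - 1.0)
    return 1.0 / np.sqrt(n) + C * r_q(k, n, q)
```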
Theorem 4.1. (i) Suppose
Pen(γ) = ‖γ‖_p. There exists a finite constant K_η depending only on η such that R*_FLCI(X, C) ≤ K_η n^{-1/2}(1 + Ck^{1/q}) for p > 1, and R*_FLCI(X, C) ≤ K_η n^{-1/2}(1 + C√(log k)) for p = 1, for any X ∈ E_n(η). (ii) Suppose Pen(γ) = ‖Zγ/√n‖_2. There exists a finite constant K_η depending only on η such that R*_FLCI(X, C) ≤ K_η(n^{-1/2} + C) for any X such that η ≤ w'w/n.

The second part of the theorem follows since the short regression without any controls achieves a bias that is of the order C. The first part shows that the upper bounds on the rate of convergence match those in eq. (14) if the high-level condition X ∈ E_n(η) holds. The next lemma shows that this high-level condition holds with high probability when w_i, z_i are drawn i.i.d. from a distribution satisfying mild conditions on moments and covariances.

Lemma 4.1.
Suppose w i , z i are drawn i.i.d. over i , and let δ = argmin b E [( w i − z ′ i b ) ] sothat z ′ i δ is the population best linear predictor error of w i . Suppose that the linear prediction rror E [( w i − z ′ i δ ) ] is bounded away from zero as k → ∞ , E [ w i ] < ∞ , and that sup j E [ | ( w i − z ′ i δ ) z ij | max { ,q } ] < ∞ when p > , and, for some c > , P ( | ( w i − z ′ i δ ) z ij | ≥ t ) ≤ − ct ) for all j when p = 1 . Then, for any ˜ η > , there exists η such that X ∈ E n ( η ) with probabilityat least − ˜ η for large enough n . We now show that the rates in eq. (14) are sharp when p = 2, or p = 1. p = 2As with the upper bound in Section 4.1, we derive a bound that holds when the designmatrix X is in some set, and then show that this set has high probability when w i , z i aredrawn i.i.d. from a sequence of distributions satisfying certain conditions. We focus on thecase k ≥ n . Let e E n ( η ) denote the set of design matrices X such that η ≤ n w ′ w ≤ η − , min eig( ZZ ′ /k ) ≥ η, where eig( A ) denotes the set of eigenvalues of a square matrix A . Theorem 4.2.
Let β̂ ± χ̂ be a CI with coverage at least 1 − α under Pen(γ) ≤ C. (i) If Pen(γ) = ‖γ‖_2, there exists a constant c_η > 0 depending only on η such that the expected length under β = 0, γ = 0 satisfies E_{0,0}[χ̂] ≥ c_η n^{-1/2}(1 + Ck^{1/2}) for any X ∈ Ẽ_n(η). (ii) If Pen(γ) = ‖Zγ/√n‖_2, there exists a constant c_η > 0 depending only on η such that the expected length under β = 0, γ = 0 satisfies E_{0,0}[χ̂] ≥ c_η(n^{-1/2} + C) for any X ∈ Ẽ_n(η).

If z_i is i.i.d. over i, then E[ZZ'/k] is equal to the n × n identity matrix times the scalar k^{-1} Σ_{j=1}^k E[z²_ij]. Thus, the condition on the minimum eigenvalue of ZZ'/k will hold under concentration conditions on the matrix Z'Z so long as the second moments of the covariates are bounded from below. Here, we state a result for a special case where the z_ij's are i.i.d. normal, which is immediate from Donoho (2006, Lemma 3.4).

Lemma 4.2.
Suppose that w i are i.i.d. over i and that z ij are i.i.d. normal over i and j .Then, for any ˜ η > , there exists η > such that X ∈ e E n ( η ) with probability at least − ˜ η once n and k/n are large enough. p = 1We now consider the case where p = 1, as in Example 2.2. Rather than imposing condi-tions on X in a fixed design setting that hold with high probability (as in Section 4.1 and16ection 4.2.1), we directly consider a random design setting, and we do not condition on X when requiring coverage of CIs. This allows us to strengthen the conclusion of our theoremby showing that the rate in Theorem 4.1 is sharp even if one imposes a linear model for w i given z i along with sparsity and ℓ bounds on the coefficients in this model.We introduce some additional notation to cover the random design setting, which we useonly in this section. We consider a random design model Y = wβ + Zγ + ε, ε | Z, w ∼ N (0 , σ I n ) ,w = Zδ + v, v | Z ∼ N (0 , σ v I n ) ,z ij ∼ N (0 ,
1) i.i.d. over i, j.
We use P ϑ and E ϑ for probability and expectation when Y, X follow this model with parame-ters ϑ = ( β, γ ′ , δ ′ , σ , σ v ) ′ . Let σ > σ v, > C, s, η ) denote the setof parameters ϑ = ( β, γ ′ , δ ′ , σ , σ v ) where | σ − σ | ≤ η , | σ v − σ v, | ≤ η , k γ k ≤ C , k δ k ≤ C , k γ k ≤ s and k δ k ≤ s . Theorem 4.3.
Let ˆ β ± ˆ χ be a CI satisfying P ϑ ( β ∈ { ˆ β ± ˆ χ } ) ≥ − α for all ϑ in Θ( C n , C n · K p n/ log k, η n ) where α < / . Suppose k → ∞ , C n √ log k/n → and C n ≤ p k/n · k − ˜ η for some ˜ η > . Then, there exists c such that, if K is large enough and η n → slowlyenough, the expected length of this CI under the parameter vector ϑ ∗ given by β = 0 , γ = 0 , δ = 0 , σ = σ , σ v = σ v, satisfies E ϑ ∗ [ ˆ χ ] ≥ c · n − / (1 + C n √ log k ) once n is large enough. Theorem 4.3 follows from similar arguments to Cai and Guo (2017) and Javanmard andMontanari (2018), who provide similar bounds for the case where only a sparsity boundis imposed. According to Theorem 4.3, imposing sparsity does not allow one to improveupon the CIs that uses only the ℓ bound k γ k ≤ C n (thereby attaining the rate in Theo-rem 4.1), unless one imposes sparsity of order greater than C n p n/ log k . We provide furthercomparison with CIs that impose sparsity in the next section. Several authors have considered CIs for β using “double lasso” estimators (see, among others,Belloni et al., 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014; Zhang andZhang, 2014). These CIs are valid under the parameter space e Γ( s ) = { γ : k γ k ≤ s } , (15)17here k γ k = { j : γ j = 0 } is the ℓ “norm,” which indexes the sparsity of γ , and with s increasing slowly enough relative to n and k . Since k γ k is not a true norm or seminorm(it is non-convex), this falls outside our setting. Here, we discuss some connections withthe optimal estimators we derive under ℓ constraints with these double lasso estimators(Section 5.1), and we provide a discussion comparing our approach to CIs based on theseestimators (Section 5.2). ℓ constraints In the case where Pen( γ ) = k γ k (example 2.2), the solution π ∗ λ to (9) is the lasso estimatein the propensity score regression of w on Z , and our estimator (10) uses residuals from thislasso regression. This is related to recently proposed “double lasso” estimators used to formCIs for β under sparsity constraints on γ (see, among others, Belloni et al., 2014; Javanmardand Montanari, 2014; van de Geer et al., 2014; Zhang and Zhang, 2014). For concreteness,we focus on the estimator in Zhang and Zhang (2014), which is given byˆ β ZZ = ˆ β lasso + ( w − Zπ ∗ λ ) ′ ( Y − w ˆ β lasso − Z ˆ γ lasso )( w − Zπ ∗ λ ) w , where ˆ β lasso , ˆ γ lasso are the lasso estimates from regressing Y on X :ˆ β lasso , ˆ γ lasso = argmin β,γ k Y − wβ − Zγ k + ˜ λ ( | β | + k γ k )for some penalty parameter ˜ λ > Remark 5.1.
Note that ˆ β ZZ is non-linear in Y , due to non-linearity of the lasso estimatesˆ β lasso , ˆ γ lasso , which is consistent with the goal of efficiency in the non-convex parameter space(15). In contrast, Corollary 2.1 shows that under the convex parameter space Γ = { γ : k γ k ≤ C } , the estimator ˆ β λ in (10) which only uses lasso in the propensity score regression of w on Z , is already highly efficient among all estimators, so that there is no further role forsubstantive efficiency gains from the lasso regression of Y on X , or from the use of othernon-linear estimators.To further understand the connection between these estimators, we note that Zhang andZhang (2014) motivate their approach by bounds of the form k ˆ γ lasso − γ k ≤ ˜ C where ˜ C = const. · s p log k/ √ n, (16)18hich hold with high probability with the constant depending on certain “compatibilityconstants” that describe the regularity of the design matrix X (see B¨uhlmann and vande Geer, 2011, Theorem 6.1, and references in the surrounding discussion). This suggestscorrecting the initial estimate ˆ β lasso by estimating ˜ β = β − ˆ β lasso in the regression˜ Y = w ( β − ˆ β lasso ) + Z ( γ − ˆ γ lasso ) + ε = w ˜ β + Z ˜ γ + ε, where ˜ Y = Y − ˆ β lasso − Z ˆ γ lasso . Heuristically, we can treat the bound (16) as a constraint k ˜ γ k ≤ ˜ C on the unknown parameter ˜ γ = γ − ˆ γ lasso and search for an optimal estimator of˜ β = ( β − ˆ β lasso ) under this constraint. Applying the optimal estimator derived in Theorem 2.1then suggests estimating β − ˆ β lasso with ( w − Zπ ∗ λ ) ′ ˜ Y ( w − Zπ ∗ λ ) w . Adding this estimate to ˆ β lasso gives theestimate ˆ β ZZ proposed by Zhang and Zhang (2014). Whereas Zhang and Zhang (2014)motivate their approach as one possible way of correcting the initial estimate ˆ β lasso using thebound (16), the above analysis shows that their correction is in fact identical to an approachin which one optimizes this correction numerically. Using the bound (16) it follows that ˆ β ZZ − β = ˜ b + a ∗ λ ′ ε where a ∗ λ = ( w − Zπ ∗ λ )( w − Zπ ∗ λ ) ′ w are theoptimal weights under the ℓ constraint k ˜ γ k ≤ ˜ C , given in Theorem 2.1. Furthermore, | ˜ b | ≤ ˜ CB λ , with B λ given in Theorem 2.1 and ˜ C given in (16), and the variance of therandom term a ∗ λ ′ ε is given by V λ in Theorem 2.1. Using arguments similar to those used toprove Theorem 4.1, it follows that ˜ CB λ / √ V λ is bounded by a constant times s (log k ) / √ n ,so that one can ignore bias in large samples as long as this term converges to zero. Thisleads to the CI proposed by Zhang and Zhang (2014), which takes the form { ˆ β ZZ ± z − α/ ˆ V / λ } , (17)where ˆ V λ is an estimate of the variance V λ . We use the term “double lasso CI” to refer tothis CI, and to related CIs such as those proposed in Belloni et al. (2014); Javanmard andMontanari (2014); van de Geer et al. (2014). Remark 5.2.
To avoid having to assume that s (log k ) / √ n → The estimator proposed by Javanmard and Montanari (2014) performs a numerical optimization ofthis form, but with the constraint (16) replaced by a constraint on | ˆ β lasso − β | + k ˆ γ lasso − γ k . Thus,Theorem 2.1 shows that a modification of the constraint used in Javanmard and Montanari (2014) yieldsthe same estimator as Zhang and Zhang (2014). { ˆ β ZZ ± [ ˜ CB λ + z − α/ ˆ V / λ ] } Unfortunately, finding a computable constant ˜ C in (16) that is sharp enough to yield usefulbounds in practice appears to be difficult, although it is an interesting area for future research. When should one use a double lasso CI, and when should one use the approach in thepresent paper? In principle, this depends on the a priori assumptions one is willing to make,and whether they are best captured by a sparsity bound or a bound on convex penaltyfunction, such as the ℓ or ℓ norm. In many settings, it may be difficult to motivate theassumption that a regression function has a sparse approximation, whereas upper boundson the magnitude of the coefficients may be more plausible.A key advantage of the CIs and estimators we propose is that they have sharp finite-sample optimality properties and coverage guarantees in the fixed design Gaussian modelwith known error variance. While this is an idealized setting, the worst-case bias calcula-tions do not depend on the error distribution, and remain the same under non-Gaussian,heteroskedastic errors. Our approach directly accounts for the potential finite-sample biasof the estimator, rather than relying on “asymptotic promises” about rates at which certainconstants involved in bias terms converge to zero.A flip side of this approach is that our CIs require an explicit choice of the regularityparameter C in order to form a “bias-aware” CI. In contrast, CIs based on double lassoestimators do not require explicitly choosing the regularity (in this case, the sparsity s ),since they ignore bias. This is justified under asymptotics in which s increases more slowlythan √ n/ log k , which lead to the bias of ˆ β ZZ decreasing more quickly than its standarddeviation. Thus, one can say that the CI in eq. (17) is “asymptotically valid” withoutexplicitly specifying the sparsity index s : one need only make an “asymptotic promise” that s increases slowly enough. However, such asymptotic promises are difficult to evaluate in agiven finite-sample setting. Indeed, shown in, for example, Li and M¨uller (2020), the doublelasso CI leads to undercoverage in finite samples even in relatively sparse settings. To ensuregood finite-sample coverage of the CI in eq. (17), one needs to ensure that the actual finite- We use the slightly more conservative approach of adding and subtracting the bound ˜ CB λ rather thanusing the critical value cv α ( ˜ CB λ / ˆ V / λ ) as in eq. (7), since the “bias” term for ˆ β ZZ is correlated with ε through the first step estimates ˆ β lasso , ˆ γ lasso . s (as in the bound in eq. (16)), this gets us back tohaving to specify s .Thus, CIs that ignore bias such as conventional CIs based on double lasso estimatorsdo not avoid the problem of specifying s or C : they merely make such choices implicit inasymptotic promises. These issues show up formally in the asymptotic analysis of such CIs.In particular, double lasso CIs require the “ultra sparse” asymptotic regime s = o ( √ n/ log k ),and they undercover asymptotically in the “moderately sparse” regime where s increasesmore slowly than n with s ≫ √ n/ log k . 
Indeed, Theorem 4.3 above, as well as the resultsof Cai and Guo (2017) and Javanmard and Montanari (2018) show that it is impossible toavoid explicitly specifying s if one allows for the moderately sparse regime. On the otherend of the spectrum, in the “low dimensional” regime where k ≪ n , the double lasso CI isasymptotically equivalent to the usual CI based on the long regression. Thus, the doublelasso CI cannot be used when the goal is to use a priori information on γ to improve uponthe CI based on the long regression (as in, for example, Muralidharan et al., 2020), even if s is small enough that such improvements would be warranted with prior knowledge of s .In contrast, our approach optimally incorporates the bound C regardless of the asymptoticregime. 21 ppendix A Proofs This Appendix gives proofs for all results in the main text.
A.1 Proof of Theorem 2.1
To prove Theorem 2.1, we first explain how our results fall into the general setup used inDonoho (1994), Low (1995) and Armstrong and Koles´ar (2018). In the notation of Armstrongand Koles´ar (2018), ( β, γ ′ ) ′ plays the role of the parameter f , the functional of interest is givenby L ( β, γ ′ ) ′ = β and K ( β, γ ′ ) ′ = wβ + Zγ . The parameter space R × Γ is centrosymmetric,so that the modulus of continuity (eq. (25) in Armstrong and Koles´ar, 2018) is given by ω ( δ ) = sup β,γ β s.t. k wβ + Zγ k ≤ δ/ , Pen( γ ) ≤ C. Using the substitution π = − γ/β , we can write this as ω ( δ ) = sup β,π β s.t. β k w − Zπ k ≤ δ/ , β Pen( π ) ≤ C. (18)Let β mod δ , γ mod δ and π mod δ = − γ mod δ /β mod δ denote a solution to this problem when it exists. Inthe notation of Armstrong and Koles´ar (2018), ( β mod δ , γ mod δ ′ ) ′ plays the role of g ∗ δ , and thesolution ( f ∗ δ , g ∗ δ ) satisfies f ∗ δ = − g ∗ δ = − ( β mod δ , γ mod δ ′ ) ′ by centrosymmetry.This optimization problem is clearly related to the problem in eq. (9): we want to make k w − Zπ k and Pen( π ) small so that large values of β satisfy the constraint in (18). Thefollowing lemma formalizes the connection. Lemma A.1.
If there exists π ∈ G such that w = Zπ and Pen( π ) = 0 , then ω ( δ ) = ∞ for all δ ≥ . Otherwise, (i) for any δ > , the modulus problem (18) has a solution β mod δ , π mod δ with β mod δ > . For t λ = C/β mod δ = 2 C/ω ( δ ) , this solution π mod δ is also a solution to the penalizedregression (9) with optimized objective k w − Zπ mod δ k = δ/ (2 β mod δ ) = δ/ω ( δ ) > ; and (ii) forany t λ > , the penalized regression problem (9) has a solution π ∗ λ . Setting β ∗ λ = C/t λ and δ λ = 2 β ∗ λ k w − Zπ ∗ λ k = (2 C/t λ ) k w − Zπ ∗ λ k , the pair β ∗ λ , π ∗ λ solves the modulus problem (18) at δ = δ λ , with optimized objective ω ( δ λ ) = 2 C/t λ , so long as k w − Zπ ∗ λ k > .Proof. If there exists π ∈ G such that w = Zπ and Pen( π ) = 0, then the result is immediate.Suppose there does not exist such a π .First, we show that the problem (9) has a solution. Let G (0) denote the linear subspaceof vectors π ∈ G such that Zπ = 0 and Pen( π ) = 0, and let G (1) be a subspace such that G = G (0) ⊕ G (1) , so that we can write π ∈ G uniquely as π = π (0) + π (1) where π (0) ∈ (0) and π (1) ∈ G (1) . Note that Zπ = Zπ (1) and, applying the triangle inequality twice,Pen( π (1) ) = Pen( π (1) ) − Pen( − π (0) ) ≤ Pen( π ) ≤ Pen( π (0) ) + Pen( π (1) ) = Pen( π (1) ) so thatPen( π ) = Pen( π (1) ). Thus, the problem (9) can be written in terms of π (1) ∈ G (1) only.The level sets of this optimization problem are bounded and are closed by continuity ofthe seminorm Pen( · ) (Goldberg, 2017), and so it has a solution, which is also a solutionin the original problem. Similarly, to show that the problem (18) has a solution, notethat feasible values of β are bounded by a constant times the inverse of the minimum ofmax {k w − Zπ k , Pen( π ) } over π , which is strictly positive by continuity of Pen( π ) and thefact that there does not exist π with max {k w − Zπ k , Pen( π ) } = 0. Thus, we can restrict β, ˜ π (1) to a compact set without changing the optimization problem.To show the first statement in the lemma, note that β mod δ >
0, since it is feasible to set π = 0 and β = δ/ (2 k w k ), and that k w − Zπ mod δ k >
0, since otherwise a strictly larger valueof β could be achieved by multiplying π mod δ by 1 − η for η > π with Pen(˜ π ) ≤ C/β mod δ such that k w − Z ˜ π k ≤k w − Zπ mod δ k − ν for small enough ν >
0. Then, letting ˜ π η = (1 − η )˜ π , we would have k w − Z ˜ π η k ≤ k w − Z ˜ π k + η k Z ˜ π k ≤ k w − Zπ mod δ k − ν + η k Z ˜ π k ≤ δ/ (2 β mod δ ) − ν + η k Z ˜ π k .Thus, for small enough η , k w − Z ˜ π η k will be strictly less than δ/ (2 β mod δ ) for small enough η and Pen(˜ π η ) ≤ (1 − η ) C/β mod δ < C/β mod δ . This is a contradiction, since it would allow astrictly larger value of β by setting π = ˜ π η .The second statement follows immediately, since any pair ˜ β, ˜ π satisfying the constraintsin the modulus (18) for δ = δ λ with ˜ β > β ∗ λ would have to have k w − Z ˜ π k < k w − Zπ ∗ λ k while maintaining the constraint Pen( π ∗ λ ) ≤ t λ .We now prove Theorem 2.1. The class of bias-variance optimizing estimators, ˆ L δ inthe notation of Armstrong and Koles´ar (2018), is given by ( wβ mod δ + Zγ mod δ ) ′ Y ( wβ mod δ + Zγ mod δ ) ′ w , where we useeq. (26) in Armstrong and Koles´ar (2018) to compute the form of this estimator undercentrosymmetry, and Lemma D.1 in Armstrong and Koles´ar (2018) to calculate the derivative ω ′ ( δ ), since the problem is translation invariant with ι given by the parameter β = 1, γ = 0.Given λ with k w − Zπ ∗ λ k >
0, it follows from Lemma A.1 that, for δ λ given in the lemma,this estimator ˆ L δ λ is equal to ˆ β λ = a ∗ λ ′ Y where a ∗ λ = w − Zπ ∗ δ ( w − Zπ ∗ δ ) ′ w , as defined in Theorem 2.1.The worst-case bias formula in Theorem 2.1 then follows from the fact that the maximumbias is attained at γ = − γ mod δ λ = Ct − λ π ∗ λ by Lemma A.1 in Armstrong and Koles´ar (2018)(or Lemma 4 in Donoho, 1994). 23 .2 Proof of Corollary 2.1 Part (i) of Corollary 2.1 follows from Low (1995). In particular, consider the one-dimensionalsubmodel β ∈ [ − C/t λ , C/t λ ], γ = − π ∗ λ β . Let b λ = ( w − Zπ ∗ λ ) / k w − Zπ ∗ λ k , and let B ∈ R ( n − × n be an orthogonal matrix that’s orthogonal to b λ . Note that in this submodel, B ′ Y = B ′ ( w − Zπ ∗ λ ) β + B ′ ε = B ′ ε , which does not depend on the unknown parameter β ,and is independent of b ′ λ Y . Therefore, b ′ λ Y ∼ N ( β, k b λ k σ ) is a sufficient statistic in thissubmodel. By Theorem 1 in Low (1995), in this submodel, the estimator ˆ β λ = a ∗ λ ′ Y = κb ′ λ Y ,where κ = k w − Zπ ∗ λ k / ( w − Zπ ∗ λ ) ′ w minimizes sup β var( δ ( Y )) among all estimators δ ( Y ) withsup β | E β [ δ ( Y )] − β | ≤ (1 − κ ) C/t λ = CB λ , and, likewise, it minimizes sup β | E β [ δ ( Y )] − β | among all estimators with sup β var( δ ( Y )) ≤ κ σ k b λ k = V λ . Since the worst-case biasbias Γ ( ˆ β λ ) ≤ CB λ and variance ( ˆ β λ ) = V λ are the same in the full model by Theorem 2.1, theresult follows.Part (ii) of Corollary 2.1 is immediate from Donoho (1994). In particular, it holds with κ ∗ MSE ( X, σ,
Γ) = sup δ> ( ω ( δ ) /δ ) ρ N ( δ/ , σ )sup δ> ( ω ( δ ) /δ ) ρ A ( δ/ , σ ) ≥ . , where ω ( δ ) is defined in eq. (18), and ρ A and ρ N are the minimax risk among affine estimators,and among all estimators, respectively, in the bounded normal means problem Y ∼ N ( θ, σ ), | θ | ≤ τ , defined in Donoho (1994), and the last inequality follows from eq. (4) in Donoho(1994).Finally, Part (iii) of Corollary 2.1 follows from Corollary 3.3 in Armstrong and Koles´ar(2018), with κ ∗ FLCI ( X, σ,
Γ) = (1 − α ) E [ ω (2( z − α − Z )) | Z ≤ z − α ]2 min δ cv α (cid:16) ω ( δ )2 ω ′ ( δ ) − δ (cid:17) ω ′ ( δ ) , where Z ∼ N (0 , ω ( δ ) is given in eq. (18), and by Lemma D.1 in Armstrong and Koles´ar,since the problem is translation invariant with ι given by the parameter β = 1, γ = 0, ω ′ ( δ ) = δ/ [ w ′ ( w − Zπ mod δ ) · ω ( δ )]. The universal lower bound 0.717 when α = 0 .
05 followsfrom Theorem 4.1 in Armstrong and Koles´ar (2020b).
A.3 Proof of Theorem 4.1
To prove that the claimed upper bound holds for X ∈ E n ( η ), we first note that, since the FLCIbased on ˆ β λ ∗ FLCI is shorter than the FLCI based on any linear estimator a ′ Y , it suffices to showthat there exists a sequence of weight vectors a such that the worst-case bias and standarddeviation are bounded by constants times n − / (1 + Ck /q ) when p > n − / (1 + C √ log k )when p = 1. We consider the weights ˜ a i = v i P nj =1 v j w j , where v i = w i − z ′ i δ , with δ given in the24efinition of E ( η ). The variance of the estimator ˜ a ′ Y is P ni =1 v i ( P ni =1 v i w i ) ≤ η − /n . The worst-casebias is sup γ : k γ k p ≤ C ˜ a ′ Zγ = C k Z ′ ˜ a k q = n − / C n − / k Z ′ ( w − Zδ ) k q n − | w ′ ( w − Zδ ) | ≤ C r q ( k, n ) η , where the first equality follows by H¨older’s inequality, and the last quality follows by def-inition of E n ( η ). This yields the convergence rate n − / + Cr q ( k, n ), as claimed. Forpart (ii), by analogous reasoning, it suffices to consider the short regression estimatorˆ β = w ′ Y /w ′ w . The variance of this estimator is σ /w ′ w ≤ η − σ /n . The bias of theestimator is w ′ Zγ/w ′ w . By the Cauchy-Schwarz inequality, this quantity is bounded in ab-solute value by k w/w ′ w k k Zγ k = k Zγ/ √ n k / p w ′ w/n ≤ η − / C . This yields the desiredconvergence rate. A.4 Proof of Lemma 4.1
A.4 Proof of Lemma 4.1

By the orthogonality condition for the best linear predictor, we have E[w_i v_i] = E[v_i²], where v_i = w_i − z′_i δ, which is bounded from below uniformly over k by assumption. Since E[w_i v_i] is bounded from above by E[w_i²] < ∞, it follows from the law of large numbers for triangular arrays that n^{−1}Σ_{i=1}^n w_i v_i ≥ η with probability approaching one once η is small enough. Similarly, n^{−1}Σ_{i=1}^n v_i² ≤ 1/η for small enough η by the law of large numbers for triangular arrays.

For the last inequality in the definition of E_n(η), first consider the case p > 1, so that q < ∞. We then have

E‖n^{−1/2}Σ_{i=1}^n z_i v_i‖_q^q = E Σ_{j=1}^k |Σ_{i=1}^n v_i z_{ij}/√n|^q ≤ k·K

by von Bahr (1965), where K is a constant that depends only on an upper bound for max_j E[|v_i z_{ij}|^{max{q,2}}]. Applying Markov's inequality gives the required bound. When p = 1, then q = ∞, so that

P( ‖n^{−1/2}Σ_{i=1}^n z_i v_i‖_∞ ≥ η^{−1}√(log k) ) ≤ Σ_{j=1}^k P( |n^{−1/2}Σ_{i=1}^n v_i z_{ij}| > η^{−1}√(log k) ),

which is bounded by 2k exp(−K·η^{−2} log k) = 2k^{1−Kη^{−2}} for some constant K by Hoeffding's inequality for sub-Gaussian random variables (Vershynin, 2018, Theorem 2.6.3). This can be made arbitrarily small uniformly in k by making η small, as required.
A.5 Proof of Theorem 4.2

By Corollary 2.1(iii), it suffices to show the bound for R*_FLCI(X, C). We first note that any estimator a′Y that does not have infinite worst-case bias must satisfy a′w = 1, which implies 1 ≤ ‖a‖·‖w‖ by the Cauchy–Schwarz inequality, so that the variance σ²a′a is bounded from below by σ²/‖w‖² ≥ σ²η/n. It therefore suffices to show that the worst-case bias is bounded from below by a constant times C(k/n)^{1/2} (for (i)), or a constant times C (for (ii)).

For part (i), let γ̃ = −Cη(k/n)^{1/2}(Z′Z)⁻¹Z′w. Observe that Pen(γ̃) = C(k/n)^{1/2}η‖(Z′Z)⁻¹Z′w‖ ≤ Cη·(max eig((Z′Z/k)⁻¹))^{1/2}·(w′w/n)^{1/2} ≤ C. Let β̃ = Cη(k/n)^{1/2}. Then wβ̃ + Zγ̃ = 0. Thus, (β̃, γ̃) is observationally equivalent to the parameter vector β = 0, γ = 0, which implies that the length of any CI must be at least Cη(k/n)^{1/2}.

Part (ii) follows by an analogous argument, with γ̃ = −Z′(ZZ′)⁻¹w·Cη^{1/2} and β̃ = Cη^{1/2}.
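The observational-equivalence step in part (ii) is easy to verify numerically when k ≥ n, so that ZZ′ is invertible. The following toy check (the design and the value of η are our own choices, for illustration only) confirms that wβ̃ + Zγ̃ = 0 up to rounding error, so that no CI can distinguish (β̃, γ̃) from (0, 0):

```python
# Sketch: numerical check of the observational-equivalence construction in
# part (ii): (beta_tilde, gamma_tilde) generates exactly the same regression
# function as (0, 0). Toy design with k >= n; eta is an arbitrary scale here.
import numpy as np

rng = np.random.default_rng(1)
n, k, C, eta = 30, 60, 1.0, 0.5
Z = rng.standard_normal((n, k))
w = rng.standard_normal(n)

beta_tilde = C * eta ** 0.5
gamma_tilde = -beta_tilde * Z.T @ np.linalg.solve(Z @ Z.T, w)

print(np.max(np.abs(w * beta_tilde + Z @ gamma_tilde)))  # ~1e-14: numerically zero
```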
A.6 Proof of Theorem 4.3

Since the lower bound c·n^{−1/2} follows from standard efficiency bounds with finite-dimensional parameters (e.g. taking the submodel where δ = γ = 0), we show the lower bound E_{ϑ*}χ̂ ≥ C_n·c·√(log k)/√n. To show this, we follow essentially the same arguments as Cai and Guo (2017, Theorem 3) and Javanmard and Montanari (2018, Proposition 4.2), noting that the required bounds on ‖δ‖ and ‖γ‖ hold for the distributions used in the lower bound. Under a given parameter vector ϑ = (β, γ′, δ′, σ², σ_v²), the data (Y_i, w_i, z′_i)′ are i.i.d. normal with mean zero and variance matrix

Σ_ϑ = ( σ² + β²(σ_v² + ‖δ‖²) + 2βδ′γ + ‖γ‖²   β(σ_v² + ‖δ‖²) + γ′δ   βδ′ + γ′
        β(σ_v² + ‖δ‖²) + γ′δ                 σ_v² + ‖δ‖²           δ′
        βδ + γ                               δ                     I_k ).

Let f_π denote the distribution of the data {Y_i, w_i, z_i}_{i=1}^n when the parameters follow a prior distribution π, and let χ²(f_{π₁}, f_{π₀}) denote the chi-square distance between these distributions for prior distributions π₁ and π₀. By Lemma 1 in Cai and Guo (2017), it suffices to find a prior distribution π₁ over the parameter space Θ(C_n, C_n·K√(n/log k), η_n) such that π₁ places probability one on β = β_{1,n} for some sequence with |β_{1,n}| bounded from below by a constant times C_n√(log k)/√n, and such that χ²(f_{π₁}, f_{π₀}) → 0, where π₀ is the distribution that places probability one on ϑ* given in the statement of the theorem.

To this end, we first note that we can assume σ₀² = σ_{v,0}² = 1 without loss of generality, since dividing Y_i and w_i by σ₀ and σ_{v,0} leads to the same model with parameters multiplied by constants that depend only on σ₀ and σ_{v,0}.

Let π₁ be defined by a uniform prior for δ over the set of vectors with ‖δ‖₀ = s and each element δ_j ∈ {0, ν}, where s and ν will be determined below. We then set the remaining parameters as deterministic functions of δ: β = −‖δ‖²/(1 − ‖δ‖²), γ = (1 − β)δ, σ_v² = 1 − ‖δ‖² and σ² = (1 − 2‖δ‖²)/(1 − ‖δ‖²). We note that ‖δ‖² is constant under this prior, so that β is a point mass as required. This leads to the variance matrix

Σ_ϑ = ( 1   0   δ′
        0   1   δ′
        δ   δ   I_k )

for ϑ in the support of π₁, and Σ_{ϑ*} = I_{k+2} under the point mass π₀. It now follows from eqs. (118) and (119) in Javanmard and Montanari (2018) (which are applications of Lemmas 2 and 3 in Cai and Guo (2017)) that

χ²(f_{π₁}, f_{π₀}) ≤ e^{s²/(k−s)}·(1 + (s/k)(e^{nν²} − 1))^s − 1.

We set ν² = (c_ν/4)·log k/n for some c_ν > 0, so that e^{nν²} = k^{c_ν/4}. We then set s to be the greatest integer less than C_n/ν = (2C_n/√c_ν)·(√n/√(log k)). The condition that C_n ≤ √(k/n)·k^{−η̃} for some η̃ > 0 implies that s ≤ k^ψ for some ψ < 1/2, so that the above display is bounded by

e^{k^{2ψ−1}(1 − k^{ψ−1})^{−1}}·(1 + k^{ψ−1}(k^{c_ν/4} − 1))^s − 1.

This converges to zero as required if c_ν is chosen small enough so that 2ψ + c_ν/4 < 1. Under π₁, ‖δ‖² = (1 + o(1))sν² = (1 + o(1))C_n ν = (1 + o(1))·C_n(√c_ν/2)·√(log k)/√n and |β| = ‖δ‖²(1 + o(1)) = (1 + o(1))·C_n(√c_ν/2)·√(log k)/√n. Thus, we obtain a lower bound of C_n·c·√(log k)/√n as required.

Appendix B Additional results
We present some additional results that are useful for practical implementation with unknown error variance, and for assessing the plausibility of the assumption Pen(γ) ≤ C. Appendix B.1 considers the problem of estimating the regression function globally, and derives properties of a regularized regression estimator in this problem. Appendix B.2 applies this estimator as an initial estimator used to construct residuals for standard errors with unknown error variance. Appendix B.3 presents a lower CI for C that can be used to assess the plausibility of the assumption Pen(γ) ≤ C.

Throughout most of this section, we focus on the case where Pen(γ) = ‖γ₂‖_p, with k₂ → ∞ and k₁/n → 0. We use the following notation. Let θ = (β, γ′)′, and let X = (X₁, X₂), where X₁ = (w, Z₁) and X₂ = Z₂ for a partition Z = (Z₁, Z₂) of the controls into k₁ and k₂ columns, respectively. We partition θ accordingly, with θ₁ = (β, γ₁′)′ and θ₂ = γ₂. Let H_{X₁} = X₁(X₁′X₁)⁻¹X₁′ and M_{X₁} = I − H_{X₁} denote the projections onto the column space of X₁ and its orthogonal complement.

For some results in this section, we allow for the possibility that the distribution of ε_i is unknown and possibly non-Gaussian, which requires some additional notation. We consider coverage over a class Q_n of distributions for ε, and we use P_{θ,Q} and E_{θ,Q} to denote probability and expectation when the data Y are drawn according to Q ∈ Q_n and θ = (β, γ′)′ ∈ R × Γ, and we use the notation P_Q and E_Q for expressions that depend on ε only and not on θ. We assume throughout that ε_i is independent across i under each Q ∈ Q_n.

B.1 Estimating the regression function globally
Consider the regularized regression estimate of θ, given by

θ̂ = argmin_ϑ ‖Y − Xϑ‖²/n + λ₁‖ϑ₂‖_p.  (19)

We first give an elementary property of θ̂, following standard arguments (see Bühlmann and van de Geer (2011, Section 6.2) and van de Geer (2000, Chapter 10.1)), and we derive rates of convergence for this estimator. In the remainder of the appendix, we use the estimator to construct feasible CIs with unknown error distribution, and to construct a lower CI for the regularity parameter C.

Lemma B.1. If 2‖X₂′M_{X₁}ε‖_q/n ≤ λ₀, then ‖M_{X₁}X₂(θ̂₂ − θ₂)‖²/n + (λ₁ − λ₀)‖θ̂₂‖_p ≤ (λ₁ + λ₀)‖θ₂‖_p.

Proof. We can write the objective function as

‖H_{X₁}Y − X₁ϑ₁ − H_{X₁}X₂ϑ₂‖²/n + ‖M_{X₁}Y − M_{X₁}X₂ϑ₂‖²/n + λ₁‖ϑ₂‖_p.

The first part of the objective can be set to zero for any ϑ₂ by taking ϑ₁ = (X₁′X₁)⁻¹X₁′Y − (X₁′X₁)⁻¹X₁′X₂ϑ₂. Therefore,

θ̂₂ = argmin_{ϑ₂} ‖M_{X₁}Y − M_{X₁}X₂ϑ₂‖²/n + λ₁‖ϑ₂‖_p,

with θ̂₁ = (X₁′X₁)⁻¹X₁′Y − (X₁′X₁)⁻¹X₁′X₂θ̂₂. This implies H_{X₁}X(θ̂ − θ) = H_{X₁}ε, so that

‖X(θ̂ − θ)‖²/n = ‖H_{X₁}ε‖²/n + ‖M_{X₁}X₂(θ̂₂ − θ₂)‖²/n.  (20)

Using the fact that θ̂₂ attains a lower value of the objective than the true parameter value θ₂, we obtain an ℓ_p version of what in the ℓ₁ case Bühlmann and van de Geer (2011, Lemma 6.1) term "the Basic Inequality":

‖M_{X₁}X₂(θ̂₂ − θ₂)‖²/n + λ₁‖θ̂₂‖_p ≤ 2ε′M_{X₁}X₂(θ̂₂ − θ₂)/n + λ₁‖θ₂‖_p.

By Hölder's inequality, we have 2ε′M_{X₁}X₂(θ̂₂ − θ₂) ≤ 2‖X₂′M_{X₁}ε‖_q‖θ̂₂ − θ₂‖_p, so that, on the event 2‖X₂′M_{X₁}ε‖_q/n ≤ λ₀, we have

‖M_{X₁}X₂(θ̂₂ − θ₂)‖²/n + λ₁‖θ̂₂‖_p ≤ λ₀‖θ̂₂ − θ₂‖_p + λ₁‖θ₂‖_p ≤ λ₀‖θ̂₂‖_p + (λ₀ + λ₁)‖θ₂‖_p,

which implies the result.

We now use this result to derive rates of convergence for the regularized regression estimator in eq. (19) for estimating the regression function in ℓ₂ loss. For simplicity, we use a fixed sequence for the penalty parameter satisfying certain sufficient conditions. In practice, data-driven methods such as cross-validation may be appealing. We discuss another possible choice based on moderate deviations bounds in Remark B.1 in Appendix B.3 below. Our aim here is to present simple sufficient conditions that allow this estimator to be used for auxiliary purposes such as standard error construction, and we leave the analysis of such extensions to future research.
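Since the objective in eq. (19) is convex for any p ≥ 1, θ̂ can be computed with off-the-shelf convex solvers. A minimal sketch (assuming the cvxpy package; function and variable names are ours):

```python
# Sketch: the regularized regression estimator of eq. (19), penalizing only
# the subvector theta_2 = gamma_2. Convex for p >= 1; assumes cvxpy.
import cvxpy as cp

def theta_hat(X1, X2, Y, lam1, p=1):
    n = Y.shape[0]
    t1 = cp.Variable(X1.shape[1])  # unpenalized coefficients theta_1
    t2 = cp.Variable(X2.shape[1])  # penalized coefficients theta_2
    obj = cp.sum_squares(Y - X1 @ t1 - X2 @ t2) / n + lam1 * cp.pnorm(t2, p)
    cp.Problem(cp.Minimize(obj)).solve()
    return t1.value, t2.value
```

For p = 1, this is a lasso with (β, γ₁) left unpenalized; as the proof of Lemma B.1 shows, only θ₂ needs to be solved for numerically once X₁ is projected out.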
Theorem B.1. Suppose that, for some η > 0, and for all n and all Q ∈ Q_n, we have E_Q[|ε_i|^{max{1/η, q}}] < 1/η when p > 1, and P_Q(|ε_i| > t) ≤ 2 exp(−ηt²) when p = 1. Suppose the elements of M_{X₁}X₂ are bounded by some constant K_X uniformly over n. Let θ̂ be the penalized regression estimator defined in eq. (19) with λ₁ = K_n r_q(k₂, n), where K_n → ∞ and r_q(k, n) is given in eq. (14). Then

sup_{θ∈R^{k+1}} sup_{Q∈Q_n} P_{θ,Q}( ‖X(θ̂ − θ)‖²/n > K_n(k₁/n + 2‖θ₂‖_p r_q(k₂, n)) ) → 0.
Proof. By Lemma B.1 and Lemma B.2 below, we have ‖M_{X₁}X₂(θ̂₂ − θ₂)‖²/n ≤ 2K_n‖θ₂‖_p r_q(k₂, n) with probability approaching one uniformly over θ ∈ R^{k+1} and Q ∈ Q_n. In addition, since H_{X₁} is idempotent with rank k₁ + 1 and E_Q[εε′] is diagonal with elements bounded uniformly over Q ∈ Q_n, we have E_Q‖H_{X₁}ε‖²/n ≤ K̃k₁/n for some constant K̃. The result follows by Markov's inequality and eq. (20).
Lemma B.2. Under the conditions of Theorem B.1, for any sequence K_n → ∞, we have inf_{Q∈Q_n} P_Q( 2‖X₂′M_{X₁}ε‖_q/n ≤ K_n r_q(k₂, n) ) → 1.

Proof. Let x̃_{ij} = (2M_{X₁}X₂)_{ij}. For q < ∞, we have

E_Q‖2X₂′M_{X₁}ε‖_q^q = E_Q Σ_{j=1}^{k₂} |Σ_{i=1}^n x̃_{ij}ε_i|^q ≤ k₂·K·n^{q/2}

for some constant K that depends only on η, q and K_X, by von Bahr (1965). The result then follows by Markov's inequality. For q = ∞, we have

P_Q( 2‖X₂′M_{X₁}ε‖_∞/n > K_n√(log k₂)/√n ) = P_Q( max_j |Σ_{i=1}^n x̃_{ij}ε_i|/n > K_n√(log k₂)/√n ),

which, for some K̃ > 0, is bounded by 2k₂ exp(−K̃·K_n² log k₂) = 2k₂^{1−K̃K_n²} → 0.

B.2 Standard errors
We consider standard errors for linear estimators of the form β̂ = a′Y considered in the main text. We assume that the weights a are nonrandom: they can depend on X but not on Y. Let θ̂ be an estimate of θ, and let ε̂ = Y − Xθ̂. Consider the estimator V̂ = Σ_{i=1}^n a_i²ε̂_i² of V_Q = var_Q(a′Y) = Σ_{i=1}^n a_i²E_Q[ε_i²]. The weights a are allowed to depend on n, so that a₁, …, a_n is a triangular array rather than a sequence, but we leave this implicit in the notation. We consider coverage of the feasible bias-aware CI

β̂ ± cv_α(bias_Γ(β̂)/√V̂)·√V̂,

where bias_Γ(β̂) is the worst-case bias, given in (6), for the parameter space Θ = R × Γ for the parameter θ = (β, γ′)′. We first present a general result for an arbitrary parameter space Θ. We then specialize to the case where Θ = R × R^{k₁} × {γ₂: ‖γ₂‖_p ≤ C_n} and the residuals ε̂_i are formed using the regularized regression from Appendix B.1.
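In code, forming V̂ and the feasible CI is immediate once the weights, worst-case bias, and residuals are in hand. A sketch (reusing the hypothetical cv_alpha helper sketched earlier in the appendix; all inputs are assumed computed elsewhere):

```python
# Sketch: feasible bias-aware CI with the heteroskedasticity-robust variance
# estimate V_hat = sum_i a_i^2 epshat_i^2. Inputs a, Y, epshat, and worst_bias
# are assumed given; cv_alpha is the helper sketched earlier.
import numpy as np

def bias_aware_ci(a, Y, epshat, worst_bias, alpha=0.05):
    betahat = a @ Y
    V_hat = np.sum(a ** 2 * epshat ** 2)   # estimate of V_Q = var_Q(a'Y)
    se = V_hat ** 0.5
    half = cv_alpha(worst_bias / se, alpha) * se
    return betahat - half, betahat + half
```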
Theorem B.2. Suppose that, for some η > 0, η ≤ E_Q[ε_i²] and E_Q[|ε_i|^{2+2η}] ≤ 1/η for all i and all Q ∈ Q_n, and that √n c_n max_{1≤i≤n} a_i²/Σ_{j=1}^n a_j² → 0 and inf_{θ∈Θ,Q∈Q_n} P_{θ,Q}(‖X(θ̂ − θ)‖ ≤ c_n) → 1 for some sequence c_n such that c_n/√n is bounded from above. Then, for any δ > 0,

inf_{θ∈Θ,Q∈Q_n} P_Q( |(V̂ − V_Q)/V_Q| < δ ) → 1.

Furthermore,

lim inf_n inf_{θ∈Θ,Q∈Q_n} P_{θ,Q}( β ∈ {β̂ ± cv_α(bias_Γ(β̂)/√V̂)·√V̂} ) ≥ 1 − α.  (21)

Proof. We have

(V̂ − V_Q)/V_Q = Σ_{i=1}^n a_i²(ε̂_i² − ε_i²)/V_Q + Σ_{i=1}^n a_i²(ε_i² − E_Q[ε_i²])/V_Q.

Let b̃_i = a_i²/Σ_{j=1}^n a_j². The second term is bounded by |Σ_{i=1}^n b̃_i(ε_i² − E_Q[ε_i²])|/η. The absolute 1 + η moment of this quantity is bounded by a constant times Σ_{i=1}^n b̃_i^{1+η}/η^{1+η} by von Bahr and Esseen (1965). This is bounded by max_{1≤i≤n} b̃_i^η·Σ_{i=1}^n b̃_i/η^{1+η} = max_{1≤i≤n} b̃_i^η/η^{1+η} → 0, so that the second term converges in probability to zero, uniformly over Q ∈ Q_n. The first term is bounded in absolute value by max_{1≤i≤n} b̃_i/η times

Σ_{i=1}^n |ε̂_i² − ε_i²| = Σ_{i=1}^n |ε̂_i + ε_i|·|ε̂_i − ε_i| ≤ ‖ε̂ + ε‖·‖ε̂ − ε‖ ≤ (‖ε̂ − ε‖ + 2‖ε‖)·‖ε̂ − ε‖.

For some constant K that depends only on η, we have 2‖ε‖ ≤ K√n with probability approaching one uniformly over Q ∈ Q_n. Since ‖ε̂ − ε‖ = ‖X(θ̂ − θ)‖ ≤ c_n with probability approaching one, it follows that the above display is bounded by (K√n + c_n)·c_n with probability approaching one uniformly over θ ∈ Θ, Q ∈ Q_n. Plugging in the conditions on c_n, it follows that for any δ > 0, inf_{θ∈Θ,Q∈Q_n} P_Q(|(V̂ − V_Q)/V_Q| < δ) → 1. Coverage of the CI then follows from Theorem F.1 in Armstrong and Kolesár (2018), with the central limit theorem condition following by using the weights and moment bounds to verify the Lindeberg condition (see Lemma F.1 in Armstrong and Kolesár (2018)).

The condition on Lind(a) = max_{1≤i≤n} a_i²/Σ_{j=1}^n a_j² can be checked on a case-by-case basis, or it can be imposed directly by incorporating a bound on Lind(a) in the optimization problem that defines the optimal weights. In the latter case, we note that, by the proof of Theorem 4.1, the optimal rate under ℓ_p constraints can be achieved by a linear FLCI with weights a proportional to w_i − z′_i δ, where Σ_{i=1}^n (w_i − z′_i δ)² is bounded from below by a constant times n. If w_i − z′_i δ is bounded, we can thus obtain the optimal rate of convergence if we impose the upper bound n·Lind(a) ≤ K̃ for some large constant K̃ (or, more generally, we can allow the bound K̃ to increase with n under appropriate conditions on C_n and p). When the design matrix is drawn such that w_i − z′_i δ is the population best linear predictor error for predicting w_i with z_i, the conditions of this theorem hold with high probability, so this essentially requires tail conditions on this best linear predictor error.

For the setting in Theorem B.1, we can take c_n = [K_n n(k₁/n + C_n r_q(k₂, n))]^{1/2} for a slowly increasing sequence K_n, so long as [K_n(k₁/n + C_n r_q(k₂, n))]^{1/2}·n·Lind(a) → 0. This gives the following result.
Corollary B.1.
Suppose that, for some η > 0, η ≤ E_Q[ε_i²] and E_Q[|ε_i|^{2+2η}] ≤ 1/η for all i and Q ∈ Q_n, let ε̂ be the residuals from the regularized regression in (19), with λ₁ given in Theorem B.1 for some K_n → ∞, and suppose the conditions from Theorem B.1 hold. Then, if [K_n(k₁/n + C_n r_q(k₂, n))]^{1/2}·n·Lind(a) → 0, the coverage result (21) holds with Θ = R × R^{k₁} × {γ₂: ‖γ₂‖_p ≤ C_n}.

To interpret the conditions on Lind(a), consider the case where k₁ is fixed, k₂/n → ∞ and C_n is bounded away from zero. Then the condition on Lind(a) reduces to [K_n·C_n r_q(k₂, n)]^{1/2}·n·Lind(a) → 0. Note also that, in this case, C_n r_q(k₂, n) is the optimal rate of convergence for β̂ from Theorem 4.1. Thus, in this case, we require n·Lind(a) to increase more slowly than the inverse of the square root of the optimal rate of convergence for β̂. In particular, if n·Lind(a) is bounded as n → ∞, then we can always construct a feasible bias-aware CI that is asymptotically valid. As described above, this bound can be imposed on the weights without affecting the rate of convergence of the width of the CI.

B.3 Lower CIs for C

We present a lower CI for the regularity parameter C, which can be used to assess the plausibility of the assumption Pen(γ) ≤ C. Let θ̂₂(λ) denote the regularized regression estimator of γ₂, given in (19), with penalty λ. Let λ*_α denote an upper bound for the 1 − α quantile of 2‖X₂′M_{X₁}ε‖_q/n. Let

Ĉ = sup_{λ>λ*_α} [(λ − λ*_α)/(λ + λ*_α)]·‖θ̂₂(λ)‖_p.  (22)

In the idealized finite-sample setting with ε ∼ N(0, σ²I_n) and σ² known, λ*_α can be computed exactly, so that Ĉ is feasible.
Theorem B.3. Let Ĉ be given in (22) with λ*_α given by the 1 − α quantile of 2‖X₂′M_{X₁}ε‖_q/n. Then, for any β, γ₁, γ₂ with ‖γ₂‖_p ≤ C, we have P_{β,γ₁,γ₂}(C ∈ [Ĉ, ∞)) ≥ 1 − α.

Proof. It follows from Lemma B.1 that, on the event 2‖X₂′M_{X₁}ε‖_q/n ≤ λ*_α (which holds with probability at least 1 − α by assumption), we have [(λ − λ*_α)/(λ + λ*_α)]·‖θ̂₂(λ)‖_p ≤ ‖γ₂‖_p ≤ C for all λ > λ*_α. Thus, the supremum of this quantity over λ in this set is also no greater than C on this event.

We now present a feasible version of this CI when the error distribution is unknown and possibly heteroskedastic, in the case where p = 1. Let x̃_{ij} = (M_{X₁}X₂)_{ij}. Since q = ∞ in this case, we need to choose λ̂*_α such that

2‖X₂′M_{X₁}ε‖_∞/n = max_{1≤j≤k₂} |Σ_{i=1}^n 2x̃_{ij}ε_i/n| ≤ λ̂*_α

with probability at least 1 − α asymptotically. Let V̂_j = Σ_{i=1}^n (2x̃_{ij}/n)²ε̂_i², where ε̂_i is the residual from an initial regularized regression with λ₁ chosen as in Theorem B.1 for some slowly increasing K_n. This leads to the moderate deviations critical value λ̂*_α, which sets

α = Σ_{j=1}^{k₂} 2Φ(−λ̂*_α/V̂_j^{1/2}).  (23)
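Both λ̂*_α in (23) and Ĉ in (22) are one-dimensional computations: the left-hand side of (23) is strictly decreasing in λ̂*_α, so a standard root-finder applies, and (22) can be evaluated on a grid of penalty values. A sketch (theta2_hat stands for any routine computing θ̂₂(λ) from (19), such as the one sketched in Appendix B.1; the grid endpoints are our own choices):

```python
# Sketch: the moderate-deviations critical value of eq. (23), found by root
# finding, and the lower CI endpoint C_hat of eq. (22), evaluated on a grid.
# V_hat is the vector (V_hat_1, ..., V_hat_k2); theta2_hat(lam) is assumed to
# return the penalized estimate theta_2_hat(lam) from eq. (19).
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def lambda_star(V_hat, alpha=0.05):
    eq23 = lambda lam: 2 * norm.cdf(-lam / np.sqrt(V_hat)).sum() - alpha
    return brentq(eq23, 1e-12, 100 * np.sqrt(V_hat.max()))  # root of eq. (23)

def C_hat(theta2_hat, lam_star, p=1, grid=None):
    if grid is None:
        grid = lam_star * np.linspace(1.01, 50.0, 100)  # values lambda > lambda*_alpha
    vals = [(lam - lam_star) / (lam + lam_star) * np.linalg.norm(theta2_hat(lam), ord=p)
            for lam in grid]
    return max(vals)
```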
Remark B.1. The analysis in Theorem B.1 of the regularized regression estimator (19) relies on choosing a penalty parameter that is greater than 2‖X₂′M_{X₁}ε‖_∞/n with high probability, which is precisely the goal of the critical value λ̂*_α given in (23). This suggests an iterative procedure in which one uses λ̂*_α (perhaps with some sequence α_n converging slowly to zero) as a data-driven penalty parameter in the regression (19), after using some initial penalty choice satisfying the conditions of Theorem B.1 to form the residuals used to compute λ̂*_α. The penalty choice λ̂*_α is related to data-driven choices of the lasso penalty in the case with unknown error distribution. Belloni et al. (2012) use similar ideas to choose the penalty parameter in this setting under ℓ₁ constraints, although our implementation is somewhat different, since our parameter space constrains the penalty loadings we place on each parameter. While λ̂*_α does not take into account correlations between the moments, one could take these correlations into account using a bootstrap implementation, as suggested by Chernozhukov et al. (2013).
Theorem B.4. Suppose that, for some η > 0, the conditions of Theorem B.1 hold with p = 1, and that n⁻¹Σ_{i=1}^n x̃_{ij}² ≥ η for j = 1, …, k₂ and all n, where x̃_{ij} = (M_{X₁}X₂)_{ij}. Let λ̂*_α be given in (23), with V̂_j formed using residuals ε̂ from the regularized regression (19) with penalty λ₁ chosen as in Theorem B.1 for some K_n → ∞ with K_n(k₁/n + (C_n + 1)√(log k₂)/√n)·(log k₂)² → 0. Then,

lim sup_n sup_{β,γ: ‖γ₂‖₁≤C_n} sup_{Q∈Q_n} P_{θ,Q}( max_{1≤j≤k₂} |Σ_{i=1}^n 2x̃_{ij}ε_i/n| > λ̂*_α ) ≤ α.

In particular, letting Ĉ be given in (22) with λ*_α given by λ̂*_α, we have

lim inf_n inf_{β,γ: ‖γ₂‖₁≤C_n} inf_{Q∈Q_n} P_{θ,Q}( C_n ∈ [Ĉ, ∞) ) ≥ 1 − α.
Proof. Let Ṽ_j = Σ_{i=1}^n (2x̃_{ij}/n)²ε_i² and let V_{Q,j} = Σ_{i=1}^n (2x̃_{ij}/n)²E_Q[ε_i²]. Note that

|V̂_j − Ṽ_j| = |Σ_{i=1}^n (2x̃_{ij}/n)²(ε̂_i² − ε_i²)| = |Σ_{i=1}^n (2x̃_{ij}/n)²(ε̂_i + ε_i)(ε̂_i − ε_i)| ≤ (2K_X/n)²‖ε̂ + ε‖·‖ε̂ − ε‖ ≤ (2K_X/n)²(2‖ε‖ + ‖ε̂ − ε‖)·‖ε̂ − ε‖.

On the event that 2‖ε‖ ≤ √n K̃ and ‖ε̂ − ε‖ = ‖X(θ̂ − θ)‖ ≤ [nK_n(k₁/n + 2C_n√(log k₂)/√n)]^{1/2}, which holds with probability approaching one uniformly over Q ∈ Q_n when K̃ is large enough, this is bounded by

(2K_X/n)²·(K̃√n + [nK_n(k₁/n + 2C_n√(log k₂)/√n)]^{1/2})·[nK_n(k₁/n + 2C_n√(log k₂)/√n)]^{1/2}.

Since V_{Q,j} ≥ η̃/n uniformly over j and over n for some η̃ > 0, this implies that, on this event, max_{1≤j≤k₂} |V̂_j − Ṽ_j|/V_{Q,j} is bounded by

4η̃⁻¹K_X² n⁻¹·(K̃√n + [nK_n(k₁/n + 2C_n√(log k₂)/√n)]^{1/2})·[nK_n(k₁/n + 2C_n√(log k₂)/√n)]^{1/2},

which in turn is bounded by a constant times K_n^{1/2}(k₁/n + 2C_n√(log k₂)/√n)^{1/2}, so long as this quantity converges to zero.

In addition, note that (Ṽ_j − V_{Q,j})/V_{Q,j} = Σ_{i=1}^n ã_{ij}(ε_i² − E_Q[ε_i²])/n, where ã_{ij} = 4x̃_{ij}²/(nV_{Q,j}) ≤ 4K_X²η̃⁻¹ and η̃ is a lower bound for nV_{Q,j}. Using this bound on ã_{ij} and the tail bound on ε_i, it follows from Bernstein's inequality for sub-exponential random variables that, for δ < 1, P_Q(|Ṽ_j − V_{Q,j}|/V_{Q,j} ≥ δ) is bounded from above by 2 exp(−cnδ²) for some constant c that depends only on K_X, η̃ and η. Thus, for any sequence δ_n, we have P_Q(max_{1≤j≤k₂} |Ṽ_j − V_{Q,j}|/V_{Q,j} ≥ δ_n) ≤ 2k₂ exp(−cnδ_n²), which converges to zero so long as δ_n is bounded from below by a large enough constant times √(log k₂)/√n.

This gives the rate of convergence of V̂_j/V_{Q,j} to one which, by continuous differentiability of t ↦ √t at t = 1, gives the same rate for V̂_j^{1/2}/V_{Q,j}^{1/2}. In particular, letting c_n be given by a large enough constant times K_n^{1/2}(k₁/n + (C_n + 1)√(log k₂)/√n)^{1/2}, the event

max_{1≤j≤k₂} |V̂_j^{1/2}/V_{Q,j}^{1/2} − 1| ≤ c_n

holds with probability approaching one uniformly over Q ∈ Q_n and β, γ with ‖γ₂‖₁ ≤ C_n. On this event, we have

α = Σ_{j=1}^{k₂} 2Φ(−λ̂*_α/V̂_j^{1/2}) ≥ Σ_{j=1}^{k₂} 2Φ(−λ̂*_α/(V_{Q_n,j}^{1/2}(1 − c_n))).

Thus, letting λ_{α,n} solve α = Σ_{j=1}^{k₂} 2Φ(−λ_{α,n}/V_{Q_n,j}^{1/2}), we have λ̂*_α/(1 − c_n) = λ_{α̃,n} for some α̃ ≤ α, so that λ̂*_α/(1 − c_n) ≥ λ_{α,n}. It follows that the non-coverage probability under any sequence of parameters with ‖γ₂‖₁ ≤ C_n and any sequence Q_n ∈ Q_n is bounded by a term that converges to zero plus

P_{Q_n}( max_{1≤j≤k₂} |Σ_{i=1}^n 2x̃_{ij}ε_i/n| > (1 − c_n)λ_{α,n} ) ≤ Σ_{j=1}^{k₂} F_{n,j}((1 − c_n)λ_{α,n}/V_{Q_n,j}^{1/2}) = Σ_{j=1}^{k₂} 2Φ(−λ_{α,n}/V_{Q_n,j}^{1/2})·A_{n,j}·B_{n,j},

where F_{n,j}(t) = P_{Q_n}( |Σ_{i=1}^n 2x̃_{ij}ε_i/n|/V_{Q_n,j}^{1/2} > t ),

A_{n,j} = Φ(−(1 − c_n)λ_{α,n}/V_{Q_n,j}^{1/2}) / Φ(−λ_{α,n}/V_{Q_n,j}^{1/2})

and

B_{n,j} = F_{n,j}((1 − c_n)λ_{α,n}/V_{Q_n,j}^{1/2}) / (2Φ(−(1 − c_n)λ_{α,n}/V_{Q_n,j}^{1/2})).

Since Σ_{j=1}^{k₂} 2Φ(−λ_{α,n}/V_{Q_n,j}^{1/2}) = α by definition, it suffices to show that lim sup_{n→∞} max_{1≤j≤k₂} max{A_{n,j}, B_{n,j}} ≤ 1.

For A_{n,j}, we use the bound Φ(−s)/Φ(−t) ≤ [s⁻¹/(t⁻¹ − t⁻³)]·exp((t² − s²)/2) (this follows from the bound (t⁻¹ − t⁻³)exp(−t²/2)/√(2π) ≤ Φ(−t) ≤ t⁻¹exp(−t²/2)/√(2π) given in Lemma 2, Section 7.1 in Feller (1968)), which gives

A_{n,j} ≤ (1 − c_n)⁻¹·(1 − (λ_{α,n}/V_{Q_n,j}^{1/2})⁻²)⁻¹·exp( [1 − (1 − c_n)²]·λ_{α,n}²/(2V_{Q_n,j}) ).

Using standard calculations and the fact that nV_{Q_n,j} is uniformly bounded from above and below, we have (log k₂)/K ≤ λ_{α,n}²/V_{Q_n,j} ≤ K log k₂ for some constant K. Thus, the right-hand side of the above display converges to 1 uniformly over n and 1 ≤ j ≤ k₂ so long as c_n log k₂ → 0, which is guaranteed by the assumptions of the theorem.

For B_{n,j}, we use a moderate deviations bound as in Feller (1971, Chapter 16.7). In particular, the bound |F_{n,j}(t)/(2Φ(−t)) − 1| ≤ K̃t³/√n holds for all 1 ≤ t < t_n, where t_n is any sequence with t_n/n^{1/6} → 0, and K̃ depends only on t_n and the moment conditions and tail bounds on ε_i (Armstrong and Chan, 2016, Lemma B.5). Using the fact that λ_{α,n}/V_{Q_n,j}^{1/2} is bounded by a constant times √(log k₂), it follows that lim sup_{n→∞} max_{1≤j≤k₂} B_{n,j} ≤ 1 so long as (log k₂)^{3/2}/√n → 0, which is guaranteed by the conditions of the theorem.
References
Armstrong, T. B. and Chan, H. P. (2016). Multiscale adaptive inference on conditional moment inequalities. Journal of Econometrics, 194(1):24–43.
Armstrong, T. B. and Kolesár, M. (2016). Optimal inference in a class of regression models. arXiv:1511.06028v2.
Armstrong, T. B. and Kolesár, M. (2018). Optimal inference in a class of regression models. Econometrica, 86(2):655–683.
Armstrong, T. B. and Kolesár, M. (2020a). Finite-sample optimal estimation and inference on average treatment effects under unconfoundedness. arXiv:1712.04594.
Armstrong, T. B. and Kolesár, M. (2020b). Sensitivity analysis using approximate moment condition models. Quantitative Economics, forthcoming.
Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.
Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.
Bühlmann, P. and van de Geer, S. A. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin, Heidelberg.
Cai, T. T. and Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2):615–646.
Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
Chernozhukov, V., Chetverikov, D., and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819.
Donoho, D. L. (1994). Statistical estimation and optimal recovery. The Annals of Statistics, 22(1):238–270.
Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal ℓ₁-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829.
Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. J. (2004). Least angle regression. The Annals of Statistics, 32(2):407–451.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, volume 1. Wiley, New York, NY, third edition.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, volume 2. Wiley, New York, NY.
Goldberg, M. (2017). Continuity of seminorms on finite-dimensional vector spaces. Linear Algebra and its Applications, 515:175–179.
Heckman, N. E. (1988). Minimax estimates in a semiparametric model. Journal of the American Statistical Association, 83(404):1090–1096.
Ibragimov, I. A. and Khas'minskii, R. Z. (1985). On nonparametric estimation of the value of a linear functional in Gaussian white noise. Theory of Probability & Its Applications, 29(1):18–32.
Imbens, G. and Wager, S. (2019). Optimized regression discontinuity designs. Review of Economics and Statistics, 101(2):264–278.
Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(82):2869–2909.
Javanmard, A. and Montanari, A. (2018). Debiasing the lasso: Optimal sample size for Gaussian designs. The Annals of Statistics, 46(6A):2593–2622.
Kolesár, M. and Rothe, C. (2018). Inference in regression discontinuity designs with a discrete running variable. American Economic Review, 108(8):2277–2304.
Kwon, K. and Kwon, S. (2020). Inference in regression discontinuity designs under monotonicity. arXiv:2011.14216.
Li, C. M. and Müller, U. K. (2020). Linear regression with many controls of limited explanatory power. Quantitative Economics, forthcoming.
Li, K.-C. (1982). Minimaxity of the method of regularization of stochastic processes. The Annals of Statistics, 10(3):937–942.
Low, M. G. (1995). Bias-variance tradeoffs in functional estimation problems. The Annals of Statistics, 23(3):824–835.
Muralidharan, K., Romero, M., and Wüthrich, K. (2020). Factorial designs, model selection, and (incorrect) inference in randomized experiments. Working Paper 26562, National Bureau of Economic Research.
Noack, C. and Rothe, C. (2020). Bias-aware inference in fuzzy regression discontinuity designs. arXiv:1906.04631.
Oster, E. (2019). Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2):187–204.
Rambachan, A. and Roth, J. (2019). An honest approach to parallel trends. Unpublished manuscript, Harvard University.
Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56(4):931–954.
Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030.
Shao, J. and Deng, X. (2012). Estimation in high-dimensional linear models with deterministic design matrices. The Annals of Statistics, 40(2):812–831.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.
van de Geer, S. A. (2000). Empirical Processes in M-Estimation. Cambridge University Press, Cambridge, UK.
van de Geer, S. A., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge, UK, first edition.
von Bahr, B. (1965). On the convergence of moments in the central limit theorem. The Annals of Mathematical Statistics, 36(3):808–818.
von Bahr, B. and Esseen, C.-G. (1965). Inequalities for the rth absolute moment of a sum of random variables, 1 ≤ r ≤ 2. The Annals of Mathematical Statistics, 36(1):299–303.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia, PA.
Yosida, K. (1995). Functional Analysis. Springer-Verlag, Berlin, reprint of 6th edition.
Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
Zhang, L. (2013). Nearly optimal minimax estimator for high-dimensional sparse linear regression.