Bootstrapping ℓp-Statistics in High Dimensions
Alexander Giessing∗ and Jianqing Fan†‡

August 18, 2020
Abstract
This paper considers a new bootstrap procedure to estimate the distribution of high-dimensional $\ell_p$-statistics, i.e. the $\ell_p$-norms of the sum of $n$ independent $d$-dimensional random vectors with $d \gg n$ and $p \in [1, \infty]$. We provide a non-asymptotic characterization of the sampling distribution of $\ell_p$-statistics based on Gaussian approximation and show that the bootstrap procedure is consistent in the Kolmogorov-Smirnov distance under mild conditions on the covariance structure of the data. As an application of the general theory we propose a bootstrap hypothesis test for simultaneous inference on high-dimensional mean vectors. We establish its asymptotic correctness and consistency under high-dimensional alternatives, and discuss the power of the test as well as the size of associated confidence sets. We illustrate the bootstrap and testing procedure numerically on simulated data.

Keywords:
Bootstrap; high-dimensional inference; Berry-Esseen bound; anti-concentration; Gaussian approximation; Gaussian comparison inequality.
1 Introduction

Let $X = \{X_i\}_{i=1}^n$ be a random sample of independent and centered random vectors in $\mathbb{R}^d$, where the dimension $d = d_n$ may grow with the sample size $n$. Consider the re-scaled sum
$$S_n^X = \big(S_{n,1}^X, \ldots, S_{n,d}^X\big)' := \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i,$$
and define the $\ell_p$-statistic $T_{n,p}$ by
$$T_{n,p} := \|S_n^X\|_p = \begin{cases} \big(\sum_{k=1}^d |S_{n,k}^X|^p\big)^{1/p}, & p \in [1, \infty), \\ \max_{1\le k\le d} |S_{n,k}^X|, & p = \infty. \end{cases} \quad (1)$$

∗Department of ORFE, Princeton University, Princeton, NJ 08544, USA. E-mail: [email protected].
†Department of ORFE, Princeton University, Princeton, NJ 08544, USA. E-mail: [email protected].
‡The project is supported by DMS-1662139 and DMS-1712591, NIH grant 2R01-GM072611-13, and ONR grant N00014-19-1-2120.

In this paper we study the sampling distribution of $\ell_p$-statistics when the dimension $d$ exceeds the sample size $n$. This distribution is of interest in many statistical applications. In particular, the $\ell_2$-statistic $T_{n,2}$ and the maximum statistic $T_{n,\infty}$ are frequently applied to a broad spectrum of statistical problems such as testing of multiple means, construction of simultaneous confidence regions, and model selection (Bai and Saranadasa, 1996; Chen et al., 2010; Fan et al., 2015).

In low dimensions, when the dimension $d$ is fixed, the asymptotic properties of $\ell_p$-statistics are well-understood: If the data are i.i.d. with finite second moments, the central limit theorem (CLT) applied to the re-scaled sum $S_n^X$ and the continuous mapping theorem guarantee that $T_{n,p} \overset{d}{\to} \|Z\|_p$, where $Z \sim N(0, \mathrm{E}[X_1X_1'])$. Thus, the limiting distribution of $T_{n,p}$ depends on the data only through the first two moments. Closed-form expressions of the limiting distribution of $\ell_p$-statistics remain somewhat elusive, but for $T_{n,2}$ and $T_{n,\infty}$ tractable characterizations exist under additional assumptions on the covariance structure.

The situation is very different in high dimensions. If the dimension $d$ grows faster than $\sqrt{n}$, the classical CLT no longer applies to the re-scaled sum $S_n^X$. So, to approximate the distribution of $T_{n,p}$ one has to target directly the scalar random variable $\|S_n^X\|_p$. Since $\|S_n^X\|_p$ is a highly non-linear function of the random sample $X$, this calls for a non-parametric approach. In this direction, Chernozhukov et al. (2013, 2015, 2017a) have made important progress by developing a non-parametric multiplier bootstrap procedure to approximate the distribution of the maximum statistic $T_{n,\infty}$. In this paper, we further develop this line of research. While there exist specialized results for high-dimensional sum-of-squares type $T_{n,2}$-statistics (Bai and Saranadasa, 1996; Bentkus, 2003; Chen et al., 2010; Fan et al., 2015; Pouzo, 2015; Xu et al., 2019), a unified investigation of the weak convergence of general $\ell_p$-statistics remains highly challenging due to the lack of smoothness of the $\ell_p$-norm. In this sense, our work solves a long-standing open problem initiated by the aforementioned pioneering work.

The primary methodological contribution of this paper is a bootstrap procedure for $\ell_p$-statistics with $p \in [1, \infty]$. Our bootstrap procedure draws inspiration from the above observation that in low dimensions the limiting distribution of $\ell_p$-statistics depends only on the first two moments of the data. Specifically, the proposed algorithm involves sampling bootstrap data from a Gaussian distribution that is parameterized by an estimate of the covariance matrix. The algorithm works with any estimate of the covariance matrix; it is easy to implement and very versatile.
In particular, it can be combined with estimates of the covariance matrix that leverage special structures such as low rank, (approximate) sparsity or bandedness. The algorithm is best understood as a hybridization of non-parametric and parametric bootstrap, and we call it the Gaussian parametric bootstrap.

A secondary methodological contribution is a bootstrap hypothesis test for testing many linear restrictions on high-dimensional mean vectors. This hypothesis test is based on the Gaussian parametric bootstrap for $\ell_p$-statistics and it is asymptotically correct and consistent under certain high-dimensional alternatives. We give precise recommendations on how to choose the exponent $p \in [1, \infty]$ based on characteristics of the random sample (tails and covariance structure) and to maximize the power for given alternative hypotheses. For small exponents $p$, the test is useful when the goal is to identify significant subsets from a large collection of means, e.g. sets of genes in micro-array and genetic sequence studies. Whereas for large exponents $p$, the test is powerful when the purpose is to detect significant singletons, e.g. anomaly detection in materials science and medical imaging.

The two main theoretical contributions of this paper are a non-asymptotic characterization of the sampling distribution of $\ell_p$-statistics $T_{n,p}$ in high dimensions and the consistency of the Gaussian parametric bootstrap. The non-asymptotic characterization is based on a Gaussian approximation, i.e. a proxy statistic constructed from Gaussian random vectors. The quality of the Gaussian and bootstrap approximation improves as the sample size $n$ increases and shows a subtle interplay between dimension $d$, exponent $p$, and the tail distribution of the data. Among other things, we demonstrate that if the data has light tails the approximation errors vanish for $\log d = o(n)$ and all $p \in [1, \infty]$; whereas if the data is heavy-tailed with at most $s \ge 2$ finite moments they vanish for $d \log d = o(n^{s/2})$ and all $p \in [1, s]$. These theoretical results provide a comprehensive view on the asymptotic distribution theory of $\ell_p$-statistics and are relevant in guiding practitioners in choosing between different $\ell_p$-statistics given the properties of the random sample at hand. Qualitatively, our numerical experiments lend further support to these theoretical findings.

Establishing the Gaussian approximation and consistency of the bootstrap is non-trivial and we develop a significant amount of new technical tools. The following three technical results are of interest beyond the scope of this paper: First, we derive an abstract Berry-Esseen-type CLT for $\ell_p$-statistics in high dimensions, which extends and improves the known Berry-Esseen-type CLTs for $p = 2$ (Bentkus, 2003) and $p = \infty$ (Chernozhukov et al., 2017a). Second, we establish an anti-concentration inequality for $\ell_p$-norms of random vectors with log-concave probability measure. For $p \in \{2, \infty\}$ this inequality is sharper than related inequalities by Götze et al. (2019) and Chernozhukov et al. (2017b). Third, we develop a Gaussian comparison inequality to compare the distributions of $\ell_p$-norms of different Gaussian random vectors in Kolmogorov-Smirnov distance. For $p = \infty$ this inequality improves the corresponding result in Chernozhukov et al. (2015).

Organization.
The paper is organized as follows. We introduce the Gaussian parametric bootstrap in Section 2 and present our main theoretical results on the Gaussian approximation of $\ell_p$-statistics and the consistency of the Gaussian parametric bootstrap in Section 3. We develop applications to testing high-dimensional mean vectors in Section 4 and report results from several numerical experiments in Section 5. In Appendix A we discuss technical results, including the abstract Berry-Esseen-type CLT, the anti-concentration inequalities for $\ell_p$-statistics, and the new Gaussian comparison theorems. Appendix B contains proofs of all our results.

Notation.
For non-negative real-valued sequences $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$, the relation $a_n \lesssim b_n$ means that there exists an absolute constant $c > 0$, independent of $n, d, p$, and an integer $n_0 \in \mathbb{N}$ such that $a_n \le c\,b_n$ for all $n \ge n_0$. We write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We define $a_n \vee b_n = \max\{a_n, b_n\}$ and $a_n \wedge b_n = \min\{a_n, b_n\}$. For a vector $a \in \mathbb{R}^d$ and $p \in [1, \infty)$ we write $\|a\|_p = (\sum_{k=1}^d |a_k|^p)^{1/p}$. Also, we write $\|a\|_\infty = \max_{1\le k\le d} |a_k|$. For a scalar random variable $\xi$ and $\alpha \in (0, 2]$ we define the $\psi_\alpha$-Orlicz norm by $\|\xi\|_{\psi_\alpha} = \inf\{t > 0 : \mathrm{E}[\exp(|\xi|^\alpha/t^\alpha)] \le 2\}$.
For a sequence of scalar random variables $\{\xi_n\}_{n\ge1}$ we write $\xi_n = O_p(a_n)$ if $\xi_n/a_n$ is stochastically bounded. For any symmetric real-valued matrix $M \in \mathbb{R}^{d\times d}$ we denote its largest and smallest eigenvalue by $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$, respectively. We denote its operator norm by $\|M\|_{op}$ (its largest singular value) and set $\|M\|_{2\to p} = \sup_{\|u\|_2\le1}\|Mu\|_p$. We write $M \succeq 0$ if $M$ is positive semi-definite. For any convex body $K \subset \mathbb{R}^d$ we write $\mathrm{Vol}(K) = \int_K d\lambda^d$, where $\lambda^d$ is the Lebesgue measure in $d$ dimensions.

2 The Gaussian parametric bootstrap

We introduce the new Gaussian parametric bootstrap for $\ell_p$-statistics and discuss its relation to the non-parametric Gaussian multiplier bootstrap.

2.1 The Gaussian parametric bootstrap

Let $X = \{X_i\}_{i=1}^n$ be a random sample of independent and centered random vectors. The Gaussian parametric bootstrap algorithm requires as input a consistent and positive semi-definite estimate $\widehat\Sigma_n$ of the (averaged) population covariance matrix,
$$\Sigma_n := \mathrm{E}\bigg[\frac{1}{n}\sum_{i=1}^n X_iX_i'\bigg].$$
We will discuss candidates for $\widehat\Sigma_n$ in subsequent sections. Let $V^X \mid X \sim N(0, \widehat\Sigma_n)$ and define the Gaussian parametric bootstrap estimate of the $\ell_p$-statistic $T_{n,p}$ by
$$T^*_{n,p} := \|V^X\|_p = \begin{cases} \big(\sum_{k=1}^d |V^X_k|^p\big)^{1/p}, & p \in [1, \infty), \\ \max_{1\le k\le d} |V^X_k|, & p = \infty. \end{cases} \quad (2)$$
The rationale for this bootstrap statistic is easiest to understand in low dimensions: If the dimension $d$ is fixed and the data $X = \{X_i\}_{i=1}^n$ is i.i.d. with finite second moments, the CLT and the continuous mapping theorem imply that $T_{n,p} \overset{d}{\to} \|Z\|_p$, where $Z \sim N(0, \mathrm{E}[X_1X_1'])$. Hence, in this scenario, the bootstrap statistic $T^*_{n,p}$ is just the parametric bootstrap estimate of the limiting random variable $\|Z\|_p$. Of course, if $d \ge \sqrt{n}$ and the data is non-identically distributed, the CLT does not apply and the limiting random variable $Z$ need not exist. The gist of the theoretical results in Sections 3.2 and 3.3 is that we do not need the CLT to hold for the distributions of $T^*_{n,p}$ and $T_{n,p}$ to be close. For $\ell_p$-statistics this result is new, but it is in line with similar results on linear regression functions, empirical processes in infinite-dimensional Banach spaces, as well as maximum and spectral statistics in high dimensions (e.g. Bickel and Freedman, 1983; Radulović, 1998; Chernozhukov et al., 2013; Röllin, 2013; Lopes et al., 2019).

2.2 The Gaussian multiplier bootstrap

The Gaussian multiplier bootstrap was first proposed by Chernozhukov et al. (2013) in the context of the maximum statistic $T_{n,\infty}$. It is a special case of the wild bootstrap method (Wu, 1986; Liu, 1988; Mammen, 1993) and its adaptation to general $\ell_p$-statistics $T_{n,p}$ is straightforward: Let $g = \{g_i\}_{i=1}^n$ be a sequence of i.i.d. standard normal random variables independent of the random sample $X = \{X_i\}_{i=1}^n$. The Gaussian multiplier bootstrap algorithm builds on the centered random sample $X_1 - \bar X_n, \ldots, X_n - \bar X_n$, where $\bar X_n := n^{-1}\sum_{i=1}^n X_i$. We set
$$S_n^{gX} := \big(S_{n,1}^{gX}, \ldots, S_{n,d}^{gX}\big)' := \frac{1}{\sqrt{n}}\sum_{i=1}^n g_i(X_i - \bar X_n),$$
and define the Gaussian multiplier bootstrap estimate of the $\ell_p$-statistic $T_{n,p}$ by
$$T^g_{n,p} := \|S_n^{gX}\|_p. \quad (3)$$
Since $S_n^{gX} \mid X \sim N(0, \widehat\Sigma_{naive})$ with $\widehat\Sigma_{naive} = n^{-1}\sum_{i=1}^n (X_i - \bar X_n)(X_i - \bar X_n)'$, the Gaussian multiplier bootstrap statistic is in fact equivalent to a Gaussian parametric bootstrap statistic based on the sample covariance matrix $\widehat\Sigma_{naive}$.
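To make the two procedures concrete, the following minimal Python sketch implements the statistics (1)-(3) with plain numpy. It is an illustration rather than a reference implementation; the function names are our own, and the naive sample covariance merely stands in for a generic estimate $\widehat\Sigma_n$.

```python
import numpy as np

def lp_statistic(v, p):
    """l_p-norm of a vector; p = np.inf gives the maximum statistic."""
    return np.linalg.norm(v, ord=p)

def gaussian_parametric_bootstrap(X, Sigma_hat, p, B=1000, rng=None):
    """Draw B replicates of T*_{n,p} = ||V||_p with V | X ~ N(0, Sigma_hat)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    V = rng.multivariate_normal(np.zeros(d), Sigma_hat, size=B)
    return np.array([lp_statistic(v, p) for v in V])

def gaussian_multiplier_bootstrap(X, p, B=1000, rng=None):
    """Draw B replicates of T^g_{n,p} = ||n^{-1/2} sum_i g_i (X_i - Xbar)||_p."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    out = np.empty(B)
    for b in range(B):
        g = rng.standard_normal(n)
        out[b] = lp_statistic(g @ Xc / np.sqrt(n), p)
    return out

# Example: bootstrap approximation of the distribution of T_{n,p}.
rng = np.random.default_rng(0)
n, d, p = 200, 1000, 2
X = rng.standard_normal((n, d))
T_np = lp_statistic(X.sum(axis=0) / np.sqrt(n), p)
Xc = X - X.mean(axis=0)
T_star = gaussian_parametric_bootstrap(X, Xc.T @ Xc / n, p, B=500, rng=1)
print(T_np, np.quantile(T_star, 0.95))
```

Replacing the naive covariance by a structured estimate (Section 3.4) changes only the input to `gaussian_parametric_bootstrap`; the multiplier bootstrap offers no such flexibility.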
The key advantage of the Gaussian parametric over the Gaussian multiplier bootstrap is that it allows for more refined estimates of the population covariance matrix $\Sigma_n$ that leverage additional structure such as low-rank, (approximate) sparsity, and bandedness. This is particularly important in high dimensions where the sample covariance matrix $\widehat\Sigma_{naive}$ is a poor estimate of the population covariance matrix.

3 Main results

We present a non-asymptotic characterization of $\ell_p$-statistics via Gaussian approximation and establish the consistency of the Gaussian parametric bootstrap procedure.

3.1 Assumptions

Unless otherwise stated, $X = \{X_i\}_{i=1}^n$ denotes a random sample of independent and centered random vectors in dimension $d$, where $d = d_n$ grows with the sample size $n$. We analyze the theoretical properties of $\ell_p$-statistics and the Gaussian parametric bootstrap under the following three different assumptions on the tails of the random vectors.

Assumption 1 (Sub-Gaussian). Let $X = \{X_i\}_{i=1}^n$ be a sequence of independent and centered random vectors in $\mathbb{R}^d$ such that for all $1 \le i \le n$,
$$\forall u \in \mathbb{R}^d: \quad \|u'X_i\|_{\psi_2} \lesssim \mathrm{E}\big[(u'X_i)^2\big]^{1/2}.$$

Assumption 2 (Sub-Exponential). Let $X = \{X_i\}_{i=1}^n$ be a sequence of independent and centered random vectors in $\mathbb{R}^d$ such that for all $1 \le i \le n$,
$$\forall u \in \mathbb{R}^d: \quad \|u'X_i\|_{\psi_1} \lesssim \mathrm{E}\big[(u'X_i)^2\big]^{1/2}.$$

Assumption 3 (Finite $s$th moments). Let $X = \{X_i\}_{i=1}^n$ be a sequence of independent and centered random vectors in $\mathbb{R}^d$ such that for some $s \ge 2$ and all $1 \le i \le n$,
$$\forall u \in \mathbb{R}^d: \quad \mathrm{E}[|u'X_i|^s]^{1/s} \lesssim K_s\,\mathrm{E}\big[(u'X_i)^2\big]^{1/2}.$$

These assumptions restrict only the tails of the one-dimensional projections $u'X_i$; in particular, $\mathrm{E}[(u'X_i)^2]^{1/2} \le \|\mathrm{E}[X_iX_i']\|_{op}^{1/2}\|u\|_2$. Hence, we can easily incorporate characteristics of the covariance matrix such as sparsity, bandedness, low-rank, etc. Assumptions 2 and 3 relax and generalize Assumption 1 in an obvious way. Most importantly, if $X$ satisfies Assumption 3 for all $s \ge 2$ with $K_s = \sqrt{s}$ ($K_s = s$) then $X$ is sub-Gaussian (sub-Exponential) and also satisfies Assumption 1 (Assumption 2).

3.2 Gaussian approximation

In this section we show that the distribution of the $\ell_p$-statistic $T_{n,p}$ can be approximated by the distribution of a proxy statistic based on Gaussian random vectors. This result rationalizes the Gaussian parametric bootstrap procedure in high dimensions. It is also relevant for establishing bootstrap consistency in the next section.

Let $Z = \{Z_i\}_{i=1}^n$ be a sequence of independent multivariate Gaussian random vectors $Z_i \sim N(0, \mathrm{E}[X_iX_i'])$ which are independent of $X = \{X_i\}_{i=1}^n$. We define the Gaussian proxy statistic of the $\ell_p$-statistic $T_{n,p}$ as
$$\widetilde T_{n,p} = \|S_n^Z\|_p, \quad \text{where } S_n^Z := \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i. \quad (4)$$
To state the Gaussian approximation result we need to define the following additional quantities: the rank of the (averaged) covariance matrix of the $X_i$'s,
$$r_n := \mathrm{rank}\,\mathrm{E}\bigg[\frac{1}{n}\sum_{i=1}^n X_iX_i'\bigg], \quad (5)$$
the smallest and largest variances of the $X_i$'s,
$$\sigma_{n,\min}^2 := \min_{1\le k\le d}\min_{1\le i\le n}\mathrm{E}\big[X_{ik}^2\big] \quad \text{and} \quad \sigma_{n,\max}^2 := \max_{1\le k\le d}\max_{1\le i\le n}\mathrm{E}\big[X_{ik}^2\big], \quad (6)$$
and the largest ratio of the variances of the $X_i$'s,
$$\kappa_n := \max_{1\le k\le d}\Big(\max_{1\le i\le n}\mathrm{E}[X_{ik}^2]\,\Big/\min_{1\le i\le n}\mathrm{E}[X_{ik}^2]\Big). \quad (7)$$
Our first theorem shows that the distribution of $\widetilde T_{n,p}$ can approximate the distribution of $T_{n,p}$ in Kolmogorov-Smirnov distance uniformly over all $p \in [1, \infty]$.

Theorem 1 (Gaussian approximation).
(i) For all $p \in [1, \infty)$ and $X$ satisfying Assumption 1,
$$\sup_{t\ge0}\Big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(\widetilde T_{n,p}\le t)\Big| \lesssim \sqrt{\frac{p\,(\log d)\,r_n^{2/p}}{n^{1/2}}}\;\frac{\sigma_{n,\max}}{\sigma_{n,\min}}. \quad (8)$$
(ii) For all $p \in [1, \infty)$ and $X$ satisfying Assumption 2,
$$\sup_{t\ge0}\Big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(\widetilde T_{n,p}\le t)\Big| \lesssim \sqrt{\frac{p\,(\log d)^2\,r_n^{2/p}}{n^{1/2}}}\;\frac{\sigma_{n,\max}}{\sigma_{n,\min}}. \quad (9)$$

(iii) For all $p \in [\log d, \infty]$ and $X$ satisfying either Assumption 1 or 2,
$$\sup_{t\ge0}\Big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(\widetilde T_{n,p}\le t)\Big| \lesssim \bigg(\frac{\kappa_n\,\log d}{n}\bigg)^{1/2}. \quad (10)$$

(iv) For $X$ satisfying Assumption 3 with $s \ge 2$ and all $p \in [1, s]$,
$$\sup_{t\ge0}\Big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(\widetilde T_{n,p}\le t)\Big| \lesssim (K_s \vee \sqrt{s})^s\,\sqrt{\frac{p\,d^{2/(3s)}\,r_n^{2/p}}{n^{1/2}}}\;\frac{\sigma_{n,\max}}{\sigma_{n,\min}}. \quad (11)$$

Remark 1.
This result is a special case of an abstract Berry-Esseen-type CLT for $\ell_p$-norms of sums of high-dimensional random vectors. We present this more general result together with a discussion of the related literature in Appendix A.1.

Remark 2.
The dependence of the bounds on $\sigma_{n,\max}$, $\sigma_{n,\min}$, and $\kappa_n$ is not necessarily optimal; e.g., if the $X_i$'s exhibit variance decay in the sense of Lopes et al. (2020), directly applying the abstract Berry-Esseen-type CLT in Appendix A.1 can yield better bounds. Moreover, we can always replace $\sigma_{n,\min}$ by the larger quantity $\min_{1\le k\le d}\sqrt{n^{-1}\sum_{i=1}^n \mathrm{E}[X_{ik}^2]}$.

The theorem reveals that even in high dimensions the distribution of $T_{n,p}$ depends on the data mostly through the first and second moments, i.e. mean zero and covariance matrix $\Sigma_n$. This insight significantly simplifies the task of estimating the distribution of $T_{n,p}$ and is the rationale for the Gaussian parametric bootstrap procedure.

Another striking aspect of this result is the dependence on the exponent $p \in [1, \infty]$. Namely, as the exponent $p$ crosses the threshold $\log d$, the upper bounds in (i)-(iv) undergo a phase transition from polynomial in $r_n$ to logarithmic in $d$. This phase transition is directly related to similar behavior of the variance of $\ell_p$-norms of Gaussian random vectors (Paouris and Valettas, 2018). We discuss this technical aspect in greater detail in Appendix A.2.

Since this Gaussian approximation result is non-asymptotic we can take limits (with respect to $n, d, p$) in any order. Given the scope of the paper, we are most interested in the high-dimensional setting with $n, d \to \infty$ and $p \in [1, \infty)$ fixed. For this asymptotic regime we note the following: The bounds in cases (i), (ii), and (iv) imply that the larger the exponent $p$ and the stronger the moment conditions on the $X_i$'s, the faster $d$ can grow (relative to $n$) while still guaranteeing that the distributions of $T_{n,p}$ and $\widetilde T_{n,p}$ are close. Case (iii) (with $p = \infty$) covers the case of the max-statistic considered in Chernozhukov et al. (2013, 2015, 2017a) and improves their bound by removing the dependence on the inverse of $\sigma_{n,\min}$. If the $X_i$'s are identically distributed then $\kappa_n = 1$ and the bound is independent of any characteristic of the covariance matrix of the data (rank, eigenvalues, or diagonal values).

Since the $T_{n,2}$ statistic is of particular interest in many statistical applications, we provide the following easy corollary with a short discussion.

Corollary 1 (Gaussian approximation of $T_{n,2}$).
(i) If $X$ is sub-Gaussian (satisfies Assumption 1), then
$$\sup_{t\ge0}\Big|\mathrm{P}(T_{n,2}\le t) - \mathrm{P}(\widetilde T_{n,2}\le t)\Big| \lesssim \sqrt{\frac{(\log d)\,r_n}{n^{1/2}}}\;\frac{\sigma_{n,\max}}{\sigma_{n,\min}}. \quad (12)$$

(ii) If $X$ has finite $s \ge 2$ moments (satisfies Assumption 3 with $s \ge 2$), then
$$\sup_{t\ge0}\Big|\mathrm{P}(T_{n,2}\le t) - \mathrm{P}(\widetilde T_{n,2}\le t)\Big| \lesssim (K_s \vee \sqrt{s})^s\,\sqrt{\frac{d^{2/(3s)}\,r_n}{n^{1/2}}}\;\frac{\sigma_{n,\max}}{\sigma_{n,\min}}. \quad (13)$$

Remark 3.
A similar result holds for sub-Exponential random variables satisfying Assumption 2.
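To see the Gaussian approximation at work, the following numerical check compares the Monte Carlo distribution of $T_{n,2}$ for skewed, low-rank data with that of the Gaussian proxy statistic (4). The data-generating design and all constants below are our own illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p, reps = 200, 500, 2, 2000

# Low-rank covariance Sigma = A A' with rank r << d, as allowed by r_n in (5).
r = 20
A = rng.standard_normal((d, r)) / np.sqrt(r)
Sigma = A @ A.T

def sample_T(reps, gaussian):
    """Monte Carlo draws of T_{n,p} (centered-exponential data) or of the
    Gaussian proxy statistic (4) with matching covariance."""
    out = np.empty(reps)
    for m in range(reps):
        if gaussian:
            S = rng.multivariate_normal(np.zeros(d), Sigma)  # S_n^Z ~ N(0, Sigma)
        else:
            U = rng.exponential(1.0, size=(n, r)) - 1.0      # mean zero, skewed
            S = (U @ A.T).sum(axis=0) / np.sqrt(n)           # X_i = A u_i
        out[m] = np.linalg.norm(S, ord=p)
    return out

T = sample_T(reps, gaussian=False)
T_tilde = sample_T(reps, gaussian=True)

# Empirical Kolmogorov-Smirnov distance between the two Monte Carlo samples.
grid = np.sort(np.concatenate([T, T_tilde]))
ks = np.max(np.abs(np.searchsorted(np.sort(T), grid, side="right") / reps
                   - np.searchsorted(np.sort(T_tilde), grid, side="right") / reps))
print(f"empirical KS distance: {ks:.3f}")
```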
The result that is most closely related to Corollary 1 is the dimension-dependent Berry-Esseen bound by Bentkus (2003). Bentkus (2003) addresses a slightly more general problem than we do: He derives a Berry-Esseen-type CLT for $S_n^X$ that holds uniformly over the class of Euclidean balls with arbitrary radii and arbitrary centers. In contrast, our Corollary 1 corresponds to a Berry-Esseen-type CLT for $S_n^X$ that holds uniformly over the class of localized Euclidean balls with arbitrary radii but center fixed to the origin. The upper bound in Theorem 1.1 in Bentkus (2003) is at least of order $d^{1/4}n^{-1/2}$. It appears that part of the reason why we obtain a better dependence on dimension $d$ (relative to $n$) is that we consider only localized Euclidean balls.

There is a rich literature on the closely related problem of Gaussian approximations of quadratic forms (e.g. Bentkus and Götze, 1997; Götze and Zaitsev, 2014; Pouzo, 2015; Spokoiny and Zhilova, 2015; Götze et al., 2019; Xu et al., 2019). The Berry-Esseen-type bounds in this literature often feature a better dependence on the sample size $n$, but either have a worse dependence on dimension $d$ relative to $n$, leave the dependence on $d$ wholly unaddressed, or do not apply to degenerate distributions (i.e. low-rank covariance matrix). In general, the existing bounds appear to be less useful for applications to high-dimensional statistics than our results in this section.

3.3 Bootstrap consistency

In this section we provide non-asymptotic bounds on the Kolmogorov-Smirnov distance between the distributions of the $\ell_p$-statistic $T_{n,p}$ and the Gaussian parametric bootstrap statistic $T^*_{n,p}$. As a corollary we also show the consistency of the Gaussian parametric bootstrap.

Recall from Section 2.1 that the Gaussian parametric bootstrap requires a positive semi-definite estimate $\widehat\Sigma_n$ of the (averaged) population covariance matrix $\Sigma_n$. The non-asymptotic bounds in this section depend on the following quantities:
$$\widehat\Delta_{op} := \|\widehat\Sigma_n - \Sigma_n\|_{op} \quad \text{and} \quad \widehat\Delta_p := \|\mathrm{vec}(\widehat\Sigma_n - \Sigma_n)\|_p, \quad p \in [1, \infty]. \quad (14)$$
Note that $\widehat\Delta_p$ corresponds to the entry-wise $\ell_p$-norm of $\widehat\Sigma_n - \Sigma_n$, with $\widehat\Delta_2$ being the Frobenius norm. To establish the bootstrap consistency, we use
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p}\le t\mid X)\big| \le \sup_{t\ge0}\Big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(\widetilde T_{n,p}\le t)\Big| + \sup_{t\ge0}\Big|\mathrm{P}(\widetilde T_{n,p}\le t) - \mathrm{P}(T^*_{n,p}\le t\mid X)\Big|.$$
The first term on the right hand side in the above display is deterministic and can be bounded using Theorem 1. The second term is stochastic and can be handled by the Gaussian comparison inequality in Appendix A.3.

The following theorem shows that the distributions of $T_{n,p}$ and $T^*_{n,p}$ are close in Kolmogorov-Smirnov distance uniformly over all $p \in [1, \infty]$ and for generic estimates $\widehat\Sigma_n$.

Theorem 2 (Consistency of the Gaussian parametric bootstrap).
(i) For all $p \in [1, \infty)$ and $X$ satisfying Assumption 1,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p}\le t\mid X)\big| \lesssim \sqrt{\frac{p\,(\log d)\,r_n^{2/p}}{n^{1/2}}}\,\frac{\sigma_{n,\max}}{\sigma_{n,\min}} + \sqrt{\frac{p\,r_n^{2/p}\,d^{2/p}\,\widehat\Delta_p}{\sigma_{n,\min}^2}}. \quad (15)$$

(ii) For all $p \in [\log d, \infty]$ and $X$ satisfying Assumption 1,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p}\le t\mid X)\big| \lesssim \bigg(\frac{\kappa_n\,\log d}{n}\bigg)^{1/2} + \kappa_n(\log d)\,\sqrt{\frac{\widehat\Delta_{op}\wedge\widehat\Delta_\infty}{\sigma_{n,\max}^2}}. \quad (16)$$

(iii) For $X$ satisfying Assumption 3 with $s \ge 2$ and all $p \in [1, s]$,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p}\le t\mid X)\big| \lesssim (K_s \vee \sqrt{s})^s\,\sqrt{\frac{p\,d^{2/(3s)}\,r_n^{2/p}}{n^{1/2}}}\,\frac{\sigma_{n,\max}}{\sigma_{n,\min}} + \sqrt{\frac{p\,r_n^{2/p}\,d^{2/p}\,\widehat\Delta_p}{\sigma_{n,\min}^2}}. \quad (17)$$

Remark 4.
The second term on the right hand side of inequalities (15)-(17) reflects the difference between $\widetilde T_{n,p}$ and $T^*_{n,p}$ in the Kolmogorov-Smirnov distance. Theorem 2 (i) and (ii) also hold for sub-Exponential random variables with the obvious modifications.

Theorem 2 is only practically relevant in combination with estimates $\widehat\Sigma_n$ for which the stochastic estimation errors $\widehat\Delta_p$ and $\widehat\Delta_{op}\wedge\widehat\Delta_\infty$ are small. In Appendix A.5 we provide bounds on these quantities for several different estimates $\widehat\Sigma_n$. For the remainder of this section we consider the special case $\widehat\Sigma_n = \widehat\Sigma_{naive} := n^{-1}\sum_{i=1}^n (X_i - \bar X_n)(X_i - \bar X_n)'$. We define the naive Gaussian parametric bootstrap estimate based on the sample covariance matrix $\widehat\Sigma_{naive}$ by
$$T^*_{n,p,naive} := \|V^{naive}\|_p, \quad V^{naive} \mid X \sim N(0, \widehat\Sigma_{naive}).$$
Since $T^*_{n,p,naive}$ is equivalent to the Gaussian multiplier statistic $T^g_{n,p}$, the following result is also a statement about the Gaussian multiplier bootstrap.

Corollary 2 (Consistency of the naive Gaussian parametric bootstrap). Suppose that $X$ satisfies Assumption 1. Let $\zeta \in (0, 1)$ be arbitrary and set $\lambda_n \asymp \sqrt{\frac{\log d + \log(2/\zeta)}{n}} \vee \frac{\log d + \log(2/\zeta)}{n}$.
(i) For all $p \in [1, \infty)$, with probability at least $1 - \zeta$,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p,naive}\le t\mid X)\big| \lesssim \sqrt{\frac{p\,(\log d)\,r_n^{2/p}}{n^{1/2}}}\,\frac{\sigma_{n,\max}}{\sigma_{n,\min}} + \sqrt{p\,\lambda_n\,d^{2/p}\,r_n^{2/p}}\;\frac{\sigma_{n,\max}}{\sigma_{n,\min}}. \quad (18)$$

(ii) For all $p \in [\log d, \infty]$, with probability at least $1 - \zeta$,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p,naive}\le t\mid X)\big| \lesssim \bigg(\frac{\kappa_n\,\log d}{n}\bigg)^{1/2} + \sqrt{\lambda_n\,\kappa_n\,\log d}. \quad (19)$$

Remark 5.
The bound in case (ii) depends on the covariance matrix only through the ratio $\kappa_n \ge 1$. If the $X_i$'s are identically distributed then $\kappa_n = 1$. For $p = \infty$ this is a useful improvement over the bounds in Theorem 4.1 and Proposition 4.1 in Chernozhukov et al. (2017a).

The main message of this corollary is that in high dimensions the naive Gaussian parametric and the Gaussian multiplier bootstrap can be consistent for large exponents $p \ge \log d$ but may fail to be consistent for small exponents $p \in [1, \log d)$. More precisely, cases (i) and (ii) imply that the naive Gaussian parametric and the Gaussian multiplier bootstrap are consistent in probability for small $p \in [1, \log d)$ if $d^{2/p}\log d = o(n)$ and for large $p \in [\log d, \infty]$ if $\log d = o(n)$. Using the Borel-Cantelli lemma, one can easily turn this into sufficient conditions for "almost sure" bootstrap consistency.

3.4 Refined bootstrap consistency under structural assumptions

We establish two refined consistency results for the Gaussian parametric bootstrap in high dimensions. In particular, we significantly improve the rates of bootstrap consistency for small exponents $p \in [1, \log d)$ (cf. Corollary 2 (i)) by exploiting certain sparsity and bandedness properties of the covariance matrix. We do not present results for large exponents $p \in [\log d, \infty]$ because in this regime sparsity and bandedness properties cannot be leveraged (and are also not needed) to further improve the rates given in Corollary 2 (ii).

To keep the discussion simple, we now assume that $X = \{X_i\}_{i=1}^n$ is a random sample of i.i.d. random vectors in $\mathbb{R}^d$ with mean zero and covariance matrix $\Sigma = (\sigma_{jk})_{j,k=1}^d$. We will drop the subscript $n$ in $r_n$, $\sigma_{n,\min}$, and $\sigma_{n,\max}$.

Assumption 4 (Approximately sparse covariance matrix). There exist constants $\gamma \in [0, 1)$, $\theta \in [1, \infty]$ and $R_{\gamma,\theta} > 0$ such that
$$\max_{1\le j\le d}\bigg(\sum_{k=1}^d |\sigma_{jk}|^{\gamma\theta}\bigg)^{1/\theta} \le R_{\gamma,\theta}. \quad (20)$$
For $\gamma = 0$ this assumption is most restrictive and implies that the covariance matrix is sparse with at most $R_{0,\theta}^\theta$ non-zero entries in each row. The covariance matrix of an AR-process is a prominent example satisfying this assumption for some positive $\gamma > 0$.

Assumption 5 (Approximately bandable covariance matrix). There exist constants $\alpha \in (0, \infty]$ and $\theta \in [1, \infty]$ such that for all $1 \le \ell \le d - 1$,
$$\max_{1\le k\le d}\bigg(\sum_{j=1}^d \big\{|\sigma_{jk}|^\theta : |j - k| > \ell\big\}\bigg)^{1/\theta} \le B_\theta\,\ell^{-\alpha}, \quad (21)$$
for some $B_\theta > 0$.
The larger $\alpha > 0$, the more the covariance matrix $\Sigma$ resembles a diagonal matrix. Covariance matrices of MA-processes satisfy this assumption for some finite $\alpha > 0$. For $\theta = 1$, Assumptions 4 and 5 reduce to two frequently adopted assumptions in the literature on high-dimensional covariance estimation (e.g. Bickel and Levina, 2008a,b; Cai et al., 2010; Cai and Liu, 2011; Avella-Medina et al., 2018, and references therein). The larger $\theta$, the milder are the restrictions imposed on the covariance matrix.

Under Assumption 4 it is natural to estimate the covariance matrix via thresholding of the naive sample covariance (e.g. Bickel and Levina, 2008a; Lam and Fan, 2009). For simplicity, here we only consider the hard-thresholding operator; Appendix A.5 contains results for more general thresholding operators.
For a matrix $M = (m_{jk})_{j,k=1}^d$ and $\lambda > 0$, we define the hard-thresholding operator by
$$\mathcal{T}_\lambda(M) := \big(m_{jk}\,\mathbf{1}\{|m_{jk}| > \lambda\}\big)_{j,k=1}^d. \quad (22)$$
Under Assumption 5 it is common to estimate the covariance matrix via banding of the naive sample covariance (e.g. Bickel and Levina, 2008b): For a given $\ell > 0$, define
$$\mathcal{B}_\ell(M) := \big(m_{jk}\,\mathbf{1}\{|j - k| \le \ell\}\big)_{j,k=1}^d. \quad (23)$$
Recall that the Gaussian parametric bootstrap procedure requires a positive semi-definite estimate of the covariance matrix. If $\lambda_{\min}(\Sigma)$ and the sample size $n$ are sufficiently large, Bickel and Levina (2008a) and Bickel and Levina (2008b) show that $\mathcal{T}_\lambda(\widehat\Sigma_{naive})$ and $\mathcal{B}_\ell(\widehat\Sigma_{naive})$ are positive definite with probability one. If the sample size is small we suggest projecting these estimates onto the cone of positive semi-definite matrices. Since the resulting positive semi-definite projections $\mathcal{T}^+_\lambda(\widehat\Sigma_{naive})$ and $\mathcal{B}^+_\ell(\widehat\Sigma_{naive})$ maintain the same order of $\ell_p$-error as the original estimates, this projection step does not add any additional theoretical challenges. Indeed, define
$$\mathcal{T}^+_\lambda(\widehat\Sigma_{naive}) := \arg\min_{S\succeq0}\|\mathrm{vec}(\mathcal{T}_\lambda(\widehat\Sigma_{naive}) - S)\|_p, \quad (24)$$
and observe that by the triangle inequality and the contraction property of projections,
$$\|\mathrm{vec}(\mathcal{T}^+_\lambda(\widehat\Sigma_{naive}) - \Sigma)\|_p \le 2\,\|\mathrm{vec}(\mathcal{T}_\lambda(\widehat\Sigma_{naive}) - \Sigma)\|_p. \quad (25)$$
The same reasoning applies to $\mathcal{B}^+_\ell(\widehat\Sigma_{naive})$. In the following, we therefore tacitly assume that this projection step has been applied and drop the superscript "+".
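In code, the operators (22) and (23) and a projection step are straightforward. The sketch below uses eigenvalue clipping, i.e. the Frobenius-norm projection onto the positive semi-definite cone; this is the projection used in the numerical experiments of Section 5.2, whereas the $\ell_p$-projection (24) would in general require a different solver.

```python
import numpy as np

def hard_threshold(M, lam):
    """Hard-thresholding operator T_lam from (22)."""
    return np.where(np.abs(M) > lam, M, 0.0)

def band(M, ell):
    """Banding operator B_ell from (23): keep entries with |j - k| <= ell."""
    d = M.shape[0]
    j, k = np.indices((d, d))
    return np.where(np.abs(j - k) <= ell, M, 0.0)

def psd_project(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative
    eigenvalues at zero (Frobenius-norm projection, cf. Section 5.2)."""
    w, U = np.linalg.eigh((M + M.T) / 2.0)
    return (U * np.clip(w, 0.0, None)) @ U.T

# Usage: plug the projected estimate into the parametric bootstrap of Section 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
S = np.cov(X, rowvar=False, bias=True)   # naive sample covariance
Sigma_thr = psd_project(hard_threshold(S, lam=0.1))
Sigma_band = psd_project(band(S, ell=5))
```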
We define the Gaussian parametric bootstrap statistics based on $\mathcal{T}_\lambda(\widehat\Sigma_{naive})$ and $\mathcal{B}_\ell(\widehat\Sigma_{naive})$, respectively, by
$$T^*_{n,p,\lambda} := \|V^\lambda\|_p, \quad V^\lambda \mid X \sim N\big(0, \mathcal{T}_\lambda(\widehat\Sigma_{naive})\big), \quad (26)$$
and
$$T^*_{n,p,\ell} := \|V^\ell\|_p, \quad V^\ell \mid X \sim N\big(0, \mathcal{B}_\ell(\widehat\Sigma_{naive})\big), \quad (27)$$
where the thresholding level $\lambda > 0$ and the bandwidth $\ell > 0$ remain to be chosen.

Corollary 3 (Consistency of the Gaussian parametric bootstrap under approximate sparsity). Let $X = \{X_i\}_{i=1}^n$ be a random sample of i.i.d. random vectors in $\mathbb{R}^d$ with mean zero and covariance matrix $\Sigma$. Suppose that $\Sigma$ satisfies Assumption 4.
(i) Set $\lambda_n \asymp \sqrt{\frac{\log d + \log(2/\zeta)}{n}} \vee \frac{\log d + \log(2/\zeta)}{n}$ with $\zeta \in (0, 1)$ arbitrary. If in addition Assumption 1 holds, then for all $p \in [\theta, \infty)$, with probability at least $1 - \zeta$,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p,\lambda_n}\le t\mid X)\big| \lesssim \sqrt{\frac{p\,(\log d)\,r^{2/p}}{n^{1/2}}}\,\frac{\sigma_{\max}}{\sigma_{\min}} + \sqrt{p\,\lambda_n^{1-\gamma}\,r^{2/p}\,R_{\gamma,p}\,\sigma_{\max}^{2\gamma}}\;\frac{\sigma_{\max}}{\sigma_{\min}}. \quad (28)$$

(ii) Set $\lambda_n \asymp \sqrt{\frac{d^{2/s}\wedge\log d}{n}}$. If in addition Assumption 3 holds with $s \ge 2 \vee \theta$, then for all $p \in [2 \vee \theta, s]$,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p,\lambda_n}\le t\mid X)\big| \lesssim (K_s \vee \sqrt{s})^s\,\sqrt{\frac{p\,d^{2/(3s)}\,r^{2/p}}{n^{1/2}}}\,\frac{\sigma_{\max}}{\sigma_{\min}} + O_p\bigg(K_s^{1-\gamma}\sqrt{p\,\lambda_n d^{2/s}\,r^{2/p}\,(\lambda_n d^{2/s})^{-\gamma}\,R_{\gamma,p}\,\sigma_{\max}^{2\gamma}}\;\frac{\sigma_{\max}}{\sigma_{\min}}\bigg). \quad (29)$$

Corollary 4 (Consistency of the Gaussian parametric bootstrap under approximate bandedness). Let $X = \{X_i\}_{i=1}^n$ be a random sample of i.i.d. random vectors in $\mathbb{R}^d$ with mean zero and covariance matrix $\Sigma$. Suppose that $\Sigma$ satisfies Assumption 5.
(i) Set $\ell_n = B_p^{p/(1+p\alpha)}\,\sigma_{\max}^{-2p/(1+p\alpha)}\,\lambda_n^{-p/(1+p\alpha)}$, where $\lambda_n \asymp \sqrt{\frac{\log d + \log(2/\zeta)}{n}} \vee \frac{\log d + \log(2/\zeta)}{n}$ and $\zeta \in (0, 1)$ is arbitrary. If in addition Assumption 1 holds, then for all $p \in [\theta, \infty)$, with probability at least $1 - \zeta$,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p,\ell_n}\le t\mid X)\big| \lesssim \sqrt{\frac{p\,(\log d)\,r^{2/p}}{n^{1/2}}}\,\frac{\sigma_{\max}}{\sigma_{\min}} + \sqrt{p\,\lambda_n\,r^{2/p}\,\lambda_n^{-1/(1+p\alpha)}\,B_p^{1/(1+p\alpha)}\,\sigma_{\max}^{2/(1+p\alpha)}}\;\frac{\sigma_{\max}}{\sigma_{\min}}. \quad (30)$$

(ii) Set $\ell_n = B_p^{p/(1+p\alpha)}\,\sigma_{\max}^{-2p/(1+p\alpha)}\,\lambda_n^{-p/(1+p\alpha)}$, where $\lambda_n \asymp \sqrt{\frac{d^{2/s}\wedge\log d}{n}}$. If in addition Assumption 3 holds with $s \ge 2 \vee \theta$, then for all $p \in [2 \vee \theta, s]$,
$$\sup_{t\ge0}\big|\mathrm{P}(T_{n,p}\le t) - \mathrm{P}(T^*_{n,p,\ell_n}\le t\mid X)\big| \lesssim (K_s \vee \sqrt{s})^s\,\sqrt{\frac{p\,d^{2/(3s)}\,r^{2/p}}{n^{1/2}}}\,\frac{\sigma_{\max}}{\sigma_{\min}} + O_p\bigg(K_s^{\frac{p\alpha}{1+p\alpha}}\sqrt{p\,\lambda_n d^{2/s}\,r^{2/p}\,(\lambda_n d^{2/s})^{-\frac{1}{1+p\alpha}}\,B_p^{\frac{1}{1+p\alpha}}\,\sigma_{\max}^{\frac{2}{1+p\alpha}}}\;\frac{\sigma_{\max}}{\sigma_{\min}}\bigg). \quad (31)$$

Remark 6.
For large exponents $p \in [\log d, \infty]$ the bootstrap statistics $T^*_{n,p,\lambda_n}$ and $T^*_{n,p,\ell_n}$ satisfy the upper bounds in Corollary 2 (ii).

The main takeaway from these two corollaries is that under reasonable assumptions on the covariance structure and the tails of the data there exist Gaussian parametric bootstrap statistics $T^*_{n,p}$ that are consistent in high dimensions for any fixed $p \in [1, \infty)$. In particular, inequality (28) (inequality (30)) implies that if the data is sub-Gaussian and the population covariance matrix is approximately sparse (approximately bandable) the Gaussian parametric bootstrap based on the thresholded covariance matrix (the banded covariance matrix) is consistent in probability for all $p \in [1, \infty)$ provided that $\log d = o(n)$.

4 Testing high-dimensional mean vectors

As an application of the Gaussian parametric bootstrap we now present a bootstrap hypothesis test based on $\ell_p$-statistics for testing linear restrictions on high-dimensional mean vectors. We show that this test is asymptotically correct and consistent. Moreover, we discuss the effect of the exponent $p$ on the size of simultaneous confidence sets and the power of the test. Lastly, we discuss an extension of the generic testing framework to simultaneous inference on high-dimensional linear models.

4.1 The testing problem and the bootstrap test

Given a random sample $X = \{X_i\}_{i=1}^n$ of i.i.d. random vectors in $\mathbb{R}^d$ with unknown mean $\mu_0$ and unknown covariance matrix $\Sigma$ we are interested in testing the high-dimensional linear restrictions
$$H_0: M\mu_0 = m_0 \quad vs. \quad H_1: M\mu_0 \ne m_0, \quad (32)$$
for some $M \in \mathbb{R}^{d'\times d}$ and $m_0 \in \mathbb{R}^{d'}$ when the dimension $d$ and the number of restrictions $d'$ may exceed the sample size $n$.

We propose to test hypothesis (32) on the basis of the $\ell_p$-statistic
$$S_{n,p} := \bigg\|\frac{1}{\sqrt{n}}\sum_{i=1}^n (MX_i - m_0)\bigg\|_p, \quad p \ge 1, \quad (33)$$
and, given a nominal level $\alpha \in (0, 1)$, to reject the null hypothesis whenever
$$S_{n,p} \ge c^*_{n,p}(1 - \alpha), \quad (34)$$
where $c^*_{n,p}(\alpha)$ is the $\alpha$-quantile of the Gaussian parametric bootstrap estimate
$$S^*_{n,p} := \|V\|_p, \quad V \mid X \sim N(0, \widehat\Omega_n), \quad (35)$$
and $\widehat\Omega_n$ is a positive semi-definite estimate of $\Omega = M\Sigma M'$.

A distinguishing feature of this bootstrap hypothesis test is the exponent $p \in [1, \infty]$ and we show that the exponent $p$ has significant impact on the asymptotic correctness and the power of the test. In practice, tests based on $\ell_p$-statistics $S_{n,p}$ with exponents 1, 2, and $\infty$ are of particular interest. For one, the $\ell_1$-statistic $S_{n,1}$ and the maximum statistic $S_{n,\infty}$ lie at opposite ends of the spectrum of possible exponents $p$ and therefore have power functions that are complementary in a sense to be made precise below. For another, the maximum statistic $S_{n,\infty}$ can also be applied to the problem of multiple hypothesis testing. Since the bootstrap test based on $S_{n,\infty}$ accounts for the dependence between the multiple tests, it is (asymptotically) less conservative than the Bonferroni adjustment. Lastly, the sum-of-squares type statistic $S_{n,2}$ is essentially a feasible version of Hotelling's $T^2$-statistic in high dimensions and as such interesting in its own right (Fan et al., 2015).

Let $\mathcal{H}_0 = \{\mu \in \mathbb{R}^d : M\mu = m_0\}$ and $\mathcal{H}_1 = \mathcal{H}_0^c$. Write $\Omega = (\omega_{jk})_{j,k=1}^{d'}$, and
$$r_\omega := \mathrm{rank}(\Omega), \quad \omega := (\omega_{kk})_{k=1}^{d'}, \quad \underline\omega^2 := \min_{1\le k\le d'}\omega_{kk}, \quad \bar\omega^2 := \max_{1\le k\le d'}\omega_{kk}. \quad (36)$$
Let $\widehat\Omega_n$ be a positive semi-definite estimate of $\Omega$ and define
$$\widehat\Gamma_{op} := \|\widehat\Omega_n - \Omega\|_{op} \quad \text{and} \quad \widehat\Gamma_p := \|\mathrm{vec}(\widehat\Omega_n - \Omega)\|_p, \quad p \in [1, \infty]. \quad (37)$$
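Before turning to the formal guarantees, here is a minimal numpy sketch of the test (33)-(35); the helper name and the default plug-in estimate of $\Omega$ are our own illustrative choices.

```python
import numpy as np

def lp_mean_test(X, M, m0, p=2, alpha=0.05, B=2000, Omega_hat=None, rng=None):
    """Bootstrap test of H0: M mu = m0 based on S_{n,p} from (33)-(35).
    Omega_hat defaults to the naive plug-in estimate of M Sigma M'; any PSD
    estimate may be supplied instead."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    W = X @ M.T - m0                        # rows: M X_i - m0
    S_np = np.linalg.norm(W.sum(axis=0) / np.sqrt(n), ord=p)
    if Omega_hat is None:
        Wc = W - W.mean(axis=0)
        Omega_hat = Wc.T @ Wc / n
    V = rng.multivariate_normal(np.zeros(M.shape[0]), Omega_hat, size=B)
    S_star = np.linalg.norm(V, ord=p, axis=1)
    crit = np.quantile(S_star, 1 - alpha)   # bootstrap critical value c*_{n,p}(1-alpha)
    return S_np, crit, S_np >= crit

# Example: test H0: mu = 0 (M = I, m0 = 0).
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
print(lp_mean_test(X, np.eye(d), np.zeros(d), p=np.inf, rng=1))
```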
We also introduce the following high-level assumption.

Assumption 6 (Asymptotic sufficient conditions). At least one of the following statements holds true.
(i) Assumption 1 holds, $p \in [1, \infty)$, $(\log d')^2\,r_\omega^{2/p}\,\bar\omega^2\underline\omega^{-2} = o(n^{1/2})$, and $\widehat\Gamma_p = o_p\big(r_\omega^{-2/p}\,d'^{-2/p}\,\underline\omega^2\big)$.
(ii) Assumption 1 holds, $p \in [\log d', \infty]$, $\log d' = o(n)$, and $\widehat\Gamma_{op}\wedge\widehat\Gamma_\infty = o_p\big((\log d')^{-2}\,\bar\omega^2\big)$.
(iii) Assumption 3 holds with $s \ge 2$, $p \in [1, s]$, $(K_s \vee \sqrt{s})^{2s}\,d'^{2/(3s)}\,r_\omega^{2/p}\,\bar\omega^2\underline\omega^{-2} = o(n^{1/2})$, and $\widehat\Gamma_p = o_p\big(r_\omega^{-2/p}\,d'^{-2/p}\,\underline\omega^2\big)$.

We emphasize that under rather mild conditions there exist estimates $\widehat\Omega_n$ such that $\widehat\Gamma_p$ and $\widehat\Gamma_{op}\wedge\widehat\Gamma_\infty$ satisfy the conditions in Assumption 6; see Appendix A.5 for details.

4.2 Asymptotic correctness

In this section we show that the bootstrap hypothesis test has asymptotically correct size. We state the theorem in a non-asymptotic fashion to match the results from previous sections.
Theorem 3 (Asymptotic size $\alpha$ test). Let $\xi$ be an arbitrary real-valued random variable, whose role will be discussed afterwards.
(i) For all $p \in [1, \infty)$ and $X$ satisfying Assumption 1,
$$\sup_{\alpha\in(0,1)}\sup_{\mu\in\mathcal{H}_0}\big|\mathrm{P}_\mu\big(S_{n,p} + \xi \le c^*_{n,p}(\alpha)\big) - \alpha\big| \lesssim \sqrt{\frac{p\,(\log d')\,r_\omega^{2/p}}{n^{1/2}}}\,\frac{\bar\omega}{\underline\omega} + \inf_{\delta>0}\bigg\{\sqrt{\frac{p\,r_\omega^{2/p}\,d'^{2/p}\,\delta}{\underline\omega^2}} + \mathrm{P}\big(\widehat\Gamma_p > \delta\big)\bigg\} + \inf_{\eta>0}\bigg\{\sqrt{\frac{p\,r_\omega^{1/p}\,d'^{1/p}\,\eta}{\underline\omega}} + \mathrm{P}(|\xi| > \eta)\bigg\}. \quad (38)$$

(ii) For all $p \in [\log d', \infty]$ and $X$ satisfying Assumption 1,
$$\sup_{\alpha\in(0,1)}\sup_{\mu\in\mathcal{H}_0}\big|\mathrm{P}_\mu\big(S_{n,p} + \xi \le c^*_{n,p}(\alpha)\big) - \alpha\big| \lesssim \bigg(\frac{\log d'}{n}\bigg)^{1/2} + \inf_{\delta>0}\bigg\{(\log d')\sqrt{\frac{\delta}{\bar\omega^2}} + \mathrm{P}\big(\widehat\Gamma_{op}\wedge\widehat\Gamma_\infty > \delta\big)\bigg\} + \inf_{\eta>0}\bigg\{(\log d')\sqrt{\frac{\eta}{\bar\omega}} + \mathrm{P}(|\xi| > \eta)\bigg\}. \quad (39)$$

(iii) For $X$ satisfying Assumption 3 with $s \ge 2$ and all $p \in [1, s]$,
$$\sup_{\alpha\in(0,1)}\sup_{\mu\in\mathcal{H}_0}\big|\mathrm{P}_\mu\big(S_{n,p} + \xi \le c^*_{n,p}(\alpha)\big) - \alpha\big| \lesssim (K_s \vee \sqrt{s})^s\sqrt{\frac{p\,d'^{2/(3s)}\,r_\omega^{2/p}}{n^{1/2}}}\,\frac{\bar\omega}{\underline\omega} + \inf_{\delta>0}\bigg\{\sqrt{\frac{p\,r_\omega^{2/p}\,d'^{2/p}\,\delta}{\underline\omega^2}} + \mathrm{P}\big(\widehat\Gamma_p > \delta\big)\bigg\} + \inf_{\eta>0}\bigg\{\sqrt{\frac{p\,r_\omega^{1/p}\,d'^{1/p}\,\eta}{\underline\omega}} + \mathrm{P}(|\xi| > \eta)\bigg\}. \quad (40)$$

Remark 7.
For $\widehat\Omega_n = \widehat\Omega_{naive} := n^{-1}\sum_{i=1}^n M(X_i - \bar X_n)(X_i - \bar X_n)'M'$ these bounds also hold for quantiles $c^{*g}_{n,p}(\alpha)$ obtained via the Gaussian multiplier bootstrap procedure.

A special feature of this result is the real-valued random variable $\xi$. For now, assume that $\xi \equiv 0$; then, letting $\eta \downarrow 0$, the last term vanishes and Theorem 3 quantifies the size of the bootstrap test based on the $\ell_p$-statistic $S_{n,p}$.

Next, consider the case in which $\xi$ is not identical to zero. Then, Theorem 3 is a statement about the test statistic $R_{n,p} := S_{n,p} + \xi$, where $\xi$ may be interpreted as an approximation error. This is particularly useful if we want to test hypotheses about a parameter $\beta_0 \in \mathbb{R}^d$ for which there exists an estimator $\hat\beta$ that admits the expansion
$$\sqrt{n}\,M(\hat\beta - \beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n (MX_i - m_0) + r_n. \quad (41)$$
In this case, the triangle inequality yields $|\xi| \le \|r_n\|_p$. The primary example that we have in mind is the de-biased lasso estimator for linear models (e.g. van de Geer et al., 2014; Zhang and Zhang, 2014). We elaborate on this idea in detail in Section 4.6.

4.3 Simultaneous confidence sets

We can use Theorem 3 to construct consistent confidence sets $\mathcal{C}_{n,p} \subset \mathbb{R}^d$ for a high-dimensional parameter $\mu_0 \in \mathbb{R}^d$. To this end, set $M = I_d$, $m_0 = \mu$, and define
$$\mathcal{C}_{n,p} := \bigg\{\mu \in \mathbb{R}^d : \bigg\|\frac{1}{n}\sum_{i=1}^n X_i - \mu\bigg\|_p \le \frac{c^*_{n,p}(1-\alpha)}{\sqrt{n}}\bigg\}, \quad (42)$$
for a given nominal level $\alpha \in (0, 1)$. By Theorem 3, under Assumption 6,
$$\mathrm{P}_\mu\big(\mu \in \mathcal{C}_{n,p}\big) \to 1 - \alpha. \quad (43)$$
Given the collection $\{\mathcal{C}_{n,p}\}_{p\ge1}$ a practitioner will be most interested in knowing which of these confidence sets is "smallest".
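A short sketch of the construction (42), assuming the naive parametric bootstrap quantile is used for $c^*_{n,p}(1-\alpha)$; since $\mathcal{C}_{n,p}$ is an $\ell_p$-ball, it suffices to return its center and radius.

```python
import numpy as np

def lp_confidence_set(X, p=2, alpha=0.05, B=2000, rng=None):
    """Center and radius of the set C_{n,p} in (42)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    V = rng.multivariate_normal(np.zeros(d), Xc.T @ Xc / n, size=B)
    crit = np.quantile(np.linalg.norm(V, ord=p, axis=1), 1 - alpha)
    return X.mean(axis=0), crit / np.sqrt(n)  # mu in C_{n,p} iff ||center - mu||_p <= radius

center, radius = lp_confidence_set(np.random.default_rng(0).standard_normal((200, 50)))
print(radius, np.linalg.norm(center, ord=2) <= radius)  # does mu = 0 lie in the set?
```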
To answer this question, we study how $p \in [1, \infty]$ affects the volume of $\mathcal{C}_{n,p}$ as $d, n \to \infty$. To simplify matters, we only consider $\Sigma = \sigma^2 I_d$. Obviously, the confidence sets $\mathcal{C}_{n,p}$ are just $\ell_p$-norm balls with center $\bar X_n$ and radii $c^*_{n,p}(1-\alpha)/\sqrt{n}$. Recall that the volume of the centered $d$-dimensional $\ell_p$-ball with radius $r > 0$, say $B_p^d(r)$, is given by
$$\mathrm{Vol}\big(B_p^d(r)\big) = (2r)^d\,\frac{\Gamma(1 + 1/p)^d}{\Gamma(1 + d/p)}. \quad (44)$$
Also, by Lemma 10, Remark 16, and Lemma 2 in Schechtman and Zinn (1990), with probability approaching one, for all $\alpha \in (0, 1/2)$,
$$c^*_{n,p}(1-\alpha)/\sqrt{n} \asymp \begin{cases} \sigma\,d^{1/p}\sqrt{p/n}, & p < \log d, \\ \sigma\sqrt{(\log d)/n}, & p \ge \log d. \end{cases} \quad (45)$$
Whence, by (44), (45), and Stirling's formula we have
$$\mathrm{Vol}(\mathcal{C}_{n,p}) \asymp \begin{cases} \big(e\,p\,c_p\big)^{d/p}\Big(\dfrac{p}{d}\Big)^{1/2}\Big(\dfrac{\sigma^2 p}{n}\Big)^{d/2}, & p < \log d, \\[1ex] \Big(\dfrac{e\,p\,c_p}{d}\wedge e\,c_p\Big)^{d/p}\Big(\dfrac{p}{d}\wedge 1\Big)^{1/2}\Big(\dfrac{\sigma^2\log d}{n}\Big)^{d/2}, & p \ge \log d, \end{cases} \quad (46)$$
where $c_p^{1/p} \in (0.88, 1)$. It follows that the volume of $\mathcal{C}_{n,p}$ is a monotonically increasing function of the exponent $p$. In other words, confidence sets based on $\ell_p$-statistics $S_{n,p}$ with small exponents are less conservative than confidence sets based on, say, the maximum statistic $S_{n,\infty}$. Asymptotically, $\mathcal{C}_{n,1}$ is the smallest confidence set.

4.4 Consistency under high-dimensional alternatives

We now analyze the consistency of the bootstrap hypothesis test under high-dimensional alternatives. Let $Z \sim N(0, I_{d'})$ and define
$$\mathcal{A}_p := \bigg\{(\mu_n)_{n\in\mathbb{N}},\ \mu_n \in \mathbb{R}^{d_n} : \frac{\mathrm{E}\|\Omega^{1/2}Z\|_p \vee \sqrt{\mathrm{Var}\|\Omega^{1/2}Z\|_p}}{\sqrt{n}\,\|M\mu_n - m_0\|_p} = o(1)\bigg\}, \quad (47)$$
and its "complement"
$$\mathcal{Z}_p := \bigg\{(\mu_n)_{n\in\mathbb{N}},\ \mu_n \in \mathbb{R}^{d_n} : \frac{\sqrt{n}\,\|M\mu_n - m_0\|_p}{\mathrm{E}\|\Omega^{1/2}Z\|_p \vee \sqrt{\mathrm{Var}\|\Omega^{1/2}Z\|_p}} = o(1)\bigg\}. \quad (48)$$
In words, $\mathcal{A}_p$ contains alternatives $(\mu_n)_{n\in\mathbb{N}}$ whose signals $\sqrt{n}\,\|M\mu_n - m_0\|_p$ asymptotically dominate the mean and standard deviation of the Gaussian proxy statistic $\|\Omega^{1/2}Z\|_p$; whereas $\mathcal{Z}_p$ consists of alternatives whose signals are asymptotically negligible compared to mean and standard deviation of $\|\Omega^{1/2}Z\|_p$.

The following result shows that the bootstrap hypothesis test is consistent for all $(\mu_n)_{n\in\mathbb{N}} \in \mathcal{A}_p$ and inconsistent for all $(\mu_n)_{n\in\mathbb{N}} \in \mathcal{Z}_p$.

Theorem 4 (Consistency under high-dimensional alternatives). Suppose that Assumption 6 holds and $\sqrt{\mathrm{Var}\|\Omega^{1/2}Z\|_p} = o\big(\mathrm{E}\|\Omega^{1/2}Z\|_p\big)$.
(i) For $\alpha \in (0, 1)$ and all $(\mu_n)_{n\in\mathbb{N}} \in \mathcal{A}_p$,
$$\lim_{n\to\infty}\mathrm{P}_{\mu_n}\big(S_{n,p} > c^*_{n,p}(1-\alpha)\big) = 1.$$
(ii) For $\alpha \in (0, 1/2)$ and all $(\mu_n)_{n\in\mathbb{N}} \in \mathcal{Z}_p$,
$$\lim_{n\to\infty}\mathrm{P}_{\mu_n}\big(S_{n,p} > c^*_{n,p}(1-\alpha)\big) < 1.$$

Remark 8.
Under mild moment conditions the "relative standard deviation" $\sqrt{\mathrm{Var}\|\Omega^{1/2}Z\|_p}\,/\,\mathrm{E}\|\Omega^{1/2}Z\|_p$ tends to zero as the dimension $d \to \infty$ grows (Boucheron et al., 2013; Biau and Mason, 2015). In particular, by the Gaussian Poincaré inequality, $\sqrt{\mathrm{Var}\|\Omega^{1/2}Z\|_p} \le \|\Omega^{1/2}\|_{2\to p} \le \|\Omega^{1/2}\|_{op}$, where the first inequality holds for all $p \in [1, \infty]$ and the second for at least all $p \ge 2$.

4.5 Power and the role of the exponent p

It is part of statistical folklore that sum-of-squares type statistics have good power against "dense" alternatives, i.e. alternatives whose signals in $M\mu$ are spread out over a large number of coordinates, whereas maximum type statistics are more powerful against "sparse" alternatives, i.e. alternatives with only a few strong signals in $M\mu$ (Fan et al., 2015). Theorem 4 allows us to verify this statement more formally. Let $M = I_d$, $m_0 = 0$, $\Sigma = \sigma^2 I_d$, and define the set of alternatives
$$\mathcal{D}_{\delta,s} := \big\{\mu \in \mathbb{R}^s \times \{0\}^{d-s} : \delta/c_0 \le \mu_k/\sigma \le \delta c_0,\ 1 \le k \le s\big\}, \quad (49)$$
where $c_0 \ge 1$ is an absolute constant, $\delta > 0$ measures the signal strength, and $s \in \{1, \ldots, d\}$ controls the sparsity.

Given this setup, we ask the following question: What is the minimum signal strength $\delta \equiv \delta(n, d, s, p)$ needed for the bootstrap test based on $S_{n,p}$ to reject the null hypothesis $H_0: \mu_0 = 0$ at significance level $\alpha \in (0, 1/2)$ when $\mu_0 \in \mathcal{D}_{\delta,s}$?
By Remark 8, $\sqrt{\mathrm{Var}\|\Omega^{1/2}Z\|_p} \le \sigma$ and by Lemma 2 in Schechtman and Zinn (1990), $\mathrm{E}\|\Omega^{1/2}Z\|_p \asymp \sigma\sqrt{p}\,d^{1/p}$ for $p < \log d$ and $\mathrm{E}\|\Omega^{1/2}Z\|_p \asymp \sigma\sqrt{\log d}$ for $p \ge \log d$. Thus, by Theorem 4 (ii), a necessary condition for correctly rejecting the null hypothesis (with probability approaching one) is
$$\sqrt{n}\,\|\mu_0\|_p \gtrsim \begin{cases} \sigma\sqrt{p}\,d^{1/p}, & p < \log d, \\ \sigma\sqrt{\log d}, & p \ge \log d. \end{cases} \quad (50)$$
Now, suppose that $s \asymp d$, i.e. $\mathcal{D}_{\delta,s}$ contains only dense alternatives. Then, for $p \in [1, \log d)$ (50) holds if $\delta \gtrsim \sqrt{p/n}$, whereas for $p \in [\log d, \infty]$ (50) holds only if $\delta \gtrsim \sqrt{(\log d)/n}$. Thus, bootstrap tests based on $\ell_p$-statistics with small exponents are more powerful in detecting dense alternatives than those based on $\ell_p$-statistics with large exponents.

Next, assume that $s \ll d$, i.e. $\mathcal{D}_{\delta,s}$ contains only sparse alternatives. Then, for $p \in [1, \log d)$ (50) holds if $\delta \gtrsim \sqrt{p\,(d/s)^{2/p}/n}$, whereas for $p \in [\log d, \infty]$ (50) holds already if $\delta \gtrsim \sqrt{(\log d)/n}$. Therefore, tests based on $\ell_p$-statistics with large exponents are more responsive to sparse alternatives than those based on $\ell_p$-statistics with small exponents.

4.6 Simultaneous inference on high-dimensional linear models

The bootstrap hypothesis test based on the $\ell_p$-statistic $S_{n,p}$ can be combined with the de-biased lasso estimator (van de Geer et al., 2014; Zhang and Zhang, 2014) to conduct simultaneous inference on high-dimensional linear models. This approach extends the one by Zhang and Cheng (2017), who propose a bootstrap test for the de-biased lasso estimator based on the Gaussian multiplier bootstrap for the maximum statistic $S_{n,\infty}$.

Consider the high-dimensional sparse model
$$Y_i = X_i'\beta_0 + \varepsilon_i, \quad i = 1, \ldots, n, \quad (51)$$
with response $Y_i \in \mathbb{R}$, i.i.d. predictors $X_i \in \mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$, i.i.d. errors $\varepsilon_i$ (independent of $X_i$) with mean 0 and variance $\sigma^2_\varepsilon$, and sparse regression vector $\beta_0$. We are interested in testing the linear hypothesis
$$H_0: M\beta_0 = m_0 \quad vs. \quad H_1: M\beta_0 \ne m_0. \quad (52)$$
Write $Y = (Y_1, \ldots, Y_n) \in \mathbb{R}^n$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \in \mathbb{R}^n$, and $\mathbb{X} = [X_1, \ldots, X_n]' \in \mathbb{R}^{n\times d}$. For $\lambda > 0$ define the lasso estimate by
$$\hat\beta_\lambda := \arg\min_{\beta\in\mathbb{R}^d}\|Y - \mathbb{X}\beta\|_2^2/n + 2\lambda\|\beta\|_1, \quad (53)$$
and the de-biased lasso estimate by
$$\breve\beta := \hat\beta_\lambda + \widehat\Theta\,\mathbb{X}'(Y - \mathbb{X}\hat\beta_\lambda)/n, \quad (54)$$
where $\widehat\Theta$ is a suitable approximation of the inverse of the Gram matrix $\widehat\Sigma = \mathbb{X}'\mathbb{X}/n$. Define the $\ell_p$-statistic
$$R_{n,p} := \sqrt{n}\,\|M\breve\beta - m_0\|_p, \quad (55)$$
and observe that
$$\sqrt{n}\,M(\breve\beta - \beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n M\Sigma^{-1}X_i\varepsilon_i + \underbrace{M(\widehat\Theta - \Sigma^{-1})\mathbb{X}'\varepsilon/\sqrt{n}}_{=:\,r_1} - \underbrace{\sqrt{n}\,M(\widehat\Theta\widehat\Sigma - I_d)(\hat\beta_\lambda - \beta_0)}_{=:\,r_2}. \quad (56)$$
Further, note that the first term on the right hand side in the above display is the re-scaled sum of $n$ i.i.d. random vectors with mean zero and covariance matrix $\sigma^2_\varepsilon M\Sigma^{-1}M'$. Hence, $R_{n,p} = S_{n,p} + \xi$, where $S_{n,p} = \|n^{-1/2}\sum_{i=1}^n M\Sigma^{-1}X_i\varepsilon_i\|_p$ and $|\xi| \le \|r_1\|_p + \|r_2\|_p$. Under mild assumptions, $\|r_1\|_p \le \|M(\widehat\Theta - \Sigma^{-1})\|_{q\to p}\,\|\mathbb{X}'\varepsilon\|_q/\sqrt{n}$ and $\|r_2\|_p \le \sqrt{n}\,\|M(\widehat\Theta\widehat\Sigma - I_d)\|_{q\to p}\,\|\hat\beta_\lambda - \beta_0\|_q$, $q \ge 1$, are negligible (van de Geer et al., 2014).
Thus, based on the expansion (56) and the discussion in Section 4.2 we can approximate the distribution of $R_{n,p}$ under the null hypothesis by the distribution of the Gaussian parametric bootstrap estimate
$$S^*_{n,p} := \|V^{debias}\|_p, \quad \text{where } V^{debias} \mid \{Y, \mathbb{X}\} \sim N(0, \hat\sigma^2_\varepsilon M\widehat\Theta M'), \quad (57)$$
and $\hat\sigma^2_\varepsilon$ is a consistent estimate of the error variance $\sigma^2_\varepsilon$ (Fan et al., 2012). We can now use the quantiles of $S^*_{n,p}$ to compute (bootstrap) critical values for the $\ell_p$-statistic $S_{n,p}$ and to construct confidence sets for $M\beta_0$.
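The following sketch assembles the pieces (53)-(57). It leans on scikit-learn's Lasso (whose penalty differs from (53) by a scaling constant) and replaces the nodewise-lasso construction of $\widehat\Theta$ from van de Geer et al. (2014) with a crude ridge-regularized inverse, so it should be read as an illustration of the data flow only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso_lp_test(Xmat, Y, M, m0, p=2, lam=0.1, alpha=0.05, B=2000, rng=None):
    """Sketch of the test of Section 4.6; Theta_hat below is a crude
    stand-in for a nodewise-lasso estimate of the precision matrix."""
    rng = np.random.default_rng(rng)
    n, d = Xmat.shape
    beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(Xmat, Y).coef_
    Sigma_hat = Xmat.T @ Xmat / n
    Theta_hat = np.linalg.inv(Sigma_hat + 0.1 * np.eye(d))       # stand-in for Theta
    beta_deb = beta_lasso + Theta_hat @ Xmat.T @ (Y - Xmat @ beta_lasso) / n  # (54)
    R_np = np.sqrt(n) * np.linalg.norm(M @ beta_deb - m0, ord=p)              # (55)
    resid = Y - Xmat @ beta_lasso
    sigma2_eps = resid @ resid / n                               # naive variance estimate
    Omega_hat = sigma2_eps * M @ Theta_hat @ M.T                 # covariance in (57)
    V = rng.multivariate_normal(np.zeros(M.shape[0]), (Omega_hat + Omega_hat.T) / 2, size=B)
    crit = np.quantile(np.linalg.norm(V, ord=p, axis=1), 1 - alpha)
    return R_np, crit, R_np >= crit
```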
5 Numerical experiments

The purpose of the numerical experiments in this section is threefold. First, they show that for small exponents $p \in [1, \log d)$ the Gaussian parametric bootstrap outperforms the Gaussian multiplier bootstrap, while for large exponents $p \in [\log d, \infty)$ both bootstrap procedures perform similarly. Second, they confirm the theoretical claims from Section 3 that for heavy-tailed data the accuracy of the Gaussian parametric and multiplier bootstrap suffers as the exponent $p$ increases. Third, they show that the exponent $p$ affects the power of the bootstrap hypothesis test as described in Section 4.

5.1 Data generation

We generate vectors $X_1, \ldots, X_n \in \mathbb{R}^d$ via a Gaussian copula model
$$X_{ij} = F^{-1}(\Phi(Y_{ij})), \quad 1 \le i \le n,\ 1 \le j \le d, \quad (58)$$
where the random vectors $Y_1, \ldots, Y_n \in \mathbb{R}^d$ are sampled independently and identically from a centered Gaussian distribution with sparse covariance matrix $\Sigma$, $\Phi$ is the cdf of the $N(0,1)$ distribution, and $F$ is the distribution function of either the uniform distribution on $[-1, 1]$ ("light-tailed") or the $t$-distribution with 4 degrees of freedom ("heavy-tailed"). We create the sparse and low-rank covariance matrix $\Sigma$ in two steps: First, define the block diagonal matrix $\widetilde\Sigma = \mathrm{diag}(\Lambda, \ldots, \Lambda) \in \mathbb{R}^{d\times d}$, where $\Lambda = (\Lambda_{jk})_{j,k=1}^{d/100}$ with $\Lambda_{jk} = \rho^{\,j+k-2}$ for all $1 \le j, k \le d/100$ and a fixed $\rho \in (0, 1)$, so that $\Lambda$ is a rank-one matrix. Then, (randomly) generate a permutation matrix $P$ and set $\Sigma = P\widetilde\Sigma P'$. The matrix $\Sigma$ is positive semi-definite, sparse with $d/100$ non-zero elements in each row, and has rank 100. The permutation matrix $P$ is generated only once and is the same throughout all Monte Carlo simulations.
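A sketch of the data-generating process (58) under stated assumptions: we take $\rho = 0.9$ for the rank-one blocks (our own choice for illustration) and standardize $Y_{ij}$ by its marginal standard deviation before applying $\Phi$, since the construction of $\Sigma$ gives non-unit diagonal entries.

```python
import numpy as np
from scipy import stats

def make_covariance(d, rho=0.9, rng=None):
    """Sparse, low-rank Sigma = P diag(Lambda,...,Lambda) P' with rank-one
    blocks Lambda_jk = rho^(j+k-2) of size d/100, so rank(Sigma) = 100."""
    rng = np.random.default_rng(rng)
    block = d // 100
    v = rho ** np.arange(block)
    Lam = np.outer(v, v)                       # rank-one block
    Sigma = np.kron(np.eye(d // block), Lam)   # block diagonal
    perm = rng.permutation(d)
    return Sigma[np.ix_(perm, perm)]           # P Sigma P'

def gaussian_copula_sample(n, Sigma, heavy_tails=False, rng=None):
    """Draw X via (58): uniform on [-1,1] (light) or t_4 (heavy) marginals,
    both of which are mean zero."""
    rng = np.random.default_rng(rng)
    d = Sigma.shape[0]
    Y = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    U = stats.norm.cdf(Y / np.sqrt(np.diag(Sigma)))  # Phi of standardized Y
    return stats.t.ppf(U, df=4) if heavy_tails else stats.uniform.ppf(U, loc=-1, scale=2)

X = gaussian_copula_sample(200, make_covariance(1000, rng=0), rng=1)
```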
5.2 Estimating the covariance matrix

The Gaussian parametric bootstrap procedure requires as input a positive semi-definite estimate of the population covariance matrix $\Sigma$. To exploit the sparsity of $\Sigma$ while also ensuring positive semi-definiteness of the estimate, we propose the following two-step procedure: First, compute a pilot estimate via correlation thresholding (Fan et al., 2011) of the sample covariance matrix $\widehat\Sigma_{naive} = (\hat\sigma_{jk})_{j,k=1}^d$, i.e. for $\lambda > 0$,
$$\widehat\Sigma_n(\lambda) := \mathcal{T}^{cor}_\lambda(\widehat\Sigma_{naive}) = \bigg(\hat\sigma_{jk}\,\mathbf{1}\Big\{\frac{|\hat\sigma_{jk}|}{\sqrt{\hat\sigma_{jj}\hat\sigma_{kk}}} \ge \lambda\Big\}\bigg)_{j,k=1}^d. \quad (59)$$
Then, project the pilot estimate $\widehat\Sigma_n(\lambda)$ onto the cone of positive semi-definite matrices by setting all negative eigenvalues equal to 0. Denote the resulting estimate by $\widehat\Sigma^+_n(\lambda)$.
It remains to choose the thresholding level $\lambda > 0$. We proceed as in Bickel and Levina (2008a,b) and select $\lambda$ by cross-validation: At each fold $\nu \in \{1, \ldots, N\}$, randomly split the sample $X = \{X_i\}_{i=1}^n$ into two sub-samples $X_1$ and $X_2$ of sizes $n_1 = \lceil n/2 \rceil$ and $n_2 = n - n_1$, respectively. Denote by $\widehat\Sigma_{1,\nu}$ and $\widehat\Sigma_{2,\nu}$ the sample covariance matrices of the $\nu$th split based on $X_1$ and $X_2$. Let $\widehat\Sigma^+_{1,\nu}(\lambda)$ be the correlation-thresholded and projected estimate based on $\widehat\Sigma_{1,\nu}$. Define the cross-validated risk at level $\lambda > 0$ as
$$\widehat R(\lambda) := \frac{1}{N}\sum_{\nu=1}^N \Big\|\mathrm{vec}\Big(\widehat\Sigma^+_{1,\nu}(\lambda) - \widehat\Sigma_{2,\nu}\Big)\Big\|_2^2, \quad (60)$$
and select the "optimal" thresholding level as
$$\hat\lambda := \arg\min_{\lambda\in[0,1]}\widehat R(\lambda). \quad (61)$$
In practice, we set $N = 10$ and minimize the risk $\widehat R(\lambda)$ over a grid $\mathcal{G} \subset [0, 1]$ with $|\mathcal{G}| = 40$ equally spaced points. The algorithm is sensitive to the number of folds and grid points; increasing $N$ and $|\mathcal{G}|$ beyond 10 and 40, respectively, can improve the accuracy of the bootstrap approximation (at the cost of additional computational complexity).
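A compact implementation of the selection rule (59)-(61); $N$, the grid, and the split sizes follow the text, while the eigenvalue clipping mirrors the projection step described above.

```python
import numpy as np

def corr_threshold(S, lam):
    """Correlation thresholding (59) followed by eigenvalue clipping."""
    sd = np.sqrt(np.diag(S))
    keep = np.abs(S) / np.outer(sd, sd) >= lam
    w, U = np.linalg.eigh(np.where(keep, S, 0.0))
    return (U * np.clip(w, 0.0, None)) @ U.T

def cv_threshold_level(X, N=10, grid=np.linspace(0.0, 1.0, 40), rng=None):
    """Select lambda-hat by the split-sample risk (60)-(61)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    n1 = int(np.ceil(n / 2))
    risk = np.zeros(len(grid))
    for _ in range(N):
        idx = rng.permutation(n)
        S1 = np.cov(X[idx[:n1]], rowvar=False, bias=True)
        S2 = np.cov(X[idx[n1:]], rowvar=False, bias=True)
        for g, lam in enumerate(grid):
            risk[g] += np.sum((corr_threshold(S1, lam) - S2) ** 2) / N
    return grid[np.argmin(risk)]
```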
5.3 Performance of Gaussian parametric and multiplier bootstrap

To assess the performance of the Gaussian parametric and the multiplier bootstrap in finite samples, we provide two types of plots:

• Kolmogorov-Smirnov distance. We plot side-by-side boxplots of the Kolmogorov-Smirnov distances between the estimated distributions of the $\ell_p$-statistic $T_{n,p}$ and (a) the Gaussian proxy statistic $\widetilde T_{n,p}$, (b) the Gaussian parametric bootstrap statistic based on the naive sample covariance, $T^*_{n,p,naive}$, (c) the Gaussian parametric bootstrap statistic based on the thresholding estimate $\widehat\Sigma^+_n(\hat\lambda)$, $T^*_{n,p,\lambda}$, and (d) the Gaussian multiplier bootstrap, $T^g_{n,p}$. These boxplots give insight into the overall quality of the bootstrap procedures. Note that (b) and (d) are the same, but are implemented by two different algorithms. Since the true distribution of the $\ell_p$-statistic $T_{n,p}$ is unknown, we evaluate it based on 5000 Monte Carlo samples. To estimate the distributions of $\widetilde T_{n,p}$, $T^*_{n,p,naive}$, $T^*_{n,p,\lambda}$, and $T^g_{n,p}$ we generate 1000 Monte Carlo samples of $X = \{X_i \in \mathbb{R}^d, 1 \le i \le n\}$ and 1000 bootstrap samples for each Monte Carlo sample. We report results for sample size $n = 200$, dimension $d = 1000$, and exponents $p \in \{1, 2, \log d, \infty\}$.
• Lower tail probabilities. We plot point estimates of $\mathrm{P}(T_{n,p} \le q_{0.95})$, where $q_{0.95}$ is the 95% quantile of the distribution of $\widetilde T_{n,p}$, $T^*_{n,p,naive}$, $T^*_{n,p,\lambda}$, and $T^g_{n,p}$, respectively. These point estimates clarify the pointwise accuracy of the bootstrap procedures. They can also be interpreted as the relative frequencies of the coverage of 95% simultaneous confidence sets for the parameter $\mu = \mathrm{E}[X_1]$ under $H_0: \mu = 0$. We estimate these probabilities as follows. First, we draw 1000 Monte Carlo samples $X^{(1)}, \ldots, X^{(1000)}$, where $X^{(m)} = \{X^{(m)}_i \in \mathbb{R}^d, 1 \le i \le n\}$, and compute the associated $\ell_p$-statistics $T^{(m)}_{n,p}$, $1 \le m \le 1000$. Then, for each $X^{(m)}$, we generate 1000 bootstrap samples and construct bootstrap estimates $\hat q^{(m)}_{0.95}$ of the 95% quantile of the distributions of $\widetilde T_{n,p}$, $T^*_{n,p,naive}$, $T^*_{n,p,\lambda}$, and $T^g_{n,p}$, respectively. Then, we estimate $\mathrm{P}(T_{n,p} \le q_{0.95})$ as $1000^{-1}\sum_{m=1}^{1000}\mathbf{1}\{T^{(m)}_{n,p} \le \hat q^{(m)}_{0.95}\}$. Again we report results for sample size $n = 200$, dimension $d = 1000$, and exponents $p \in \{1, 2, \log d, \infty\}$.

In the following discussion the Gaussian proxy statistic $\widetilde T_{n,p}$ serves as an oracle estimator. It tells us how good the bootstrap procedures could be if we knew the true covariance matrix. Any difference between $\widetilde T_{n,p}$ and the other statistics solely arises from the different estimates of the covariance matrix.

Figure 1 shows that if the data has light tails, the distribution of the Gaussian proxy statistic $\widetilde T_{n,p}$ provides an excellent approximation of the distribution of the $\ell_p$-statistic $T_{n,p}$ for all $p \in \{1, 2, \log d, \infty\}$. Moreover, the distribution of the Gaussian parametric bootstrap statistic $T^*_{n,p,\lambda}$ based on $\widehat\Sigma^+_n(\hat\lambda)$ yields a comparably good approximation to the truth. In contrast, the distributions of the Gaussian multiplier statistic and the naive Gaussian parametric bootstrap are significantly poorer approximations to the truth. For large exponents $p \in \{\log d, \infty\}$ all four bootstrap approximations perform similarly. Thus, this plot fully supports every aspect of the theoretical results derived in Sections 3.3 and 3.4.
Figure 1: Boxplots of 1000 Kolmogorov-Smirnov distances between the distribution of the $\ell_p$-statistic and its bootstrap estimates based on the Gaussian Proxy (Gauss. Proxy), Gaussian Multiplier Bootstrap (GMB), Naive Gaussian Parametric Bootstrap (naive GPB), and Gaussian Parametric Bootstrap based on $\widehat\Sigma^+_n(\hat\lambda)$ (opt. thr. GPB). Sample size $n = 200$, dimension $d = 1000$, $F$ cdf of Uniform$(-1, 1)$.

Figure 2 shows that if the data has heavy tails, the Gaussian proxy statistic $\widetilde T_{n,p}$ and the bootstrap procedures yield poorer approximations to the truth. In particular, we see that the quality of the approximation worsens substantially as the exponent $p$ increases. This further corroborates the theoretical results derived in Sections 3.3 and 3.4.

Figures 3 and 4 tell a similar, but more nuanced, story. From Figure 3 we infer that if the data has light tails, the 95% quantiles of the distributions of the Gaussian proxy statistic $\widetilde T_{n,p}$ and the Gaussian parametric bootstrap statistic $T^*_{n,p,\lambda}$ based on $\widehat\Sigma^+_n(\hat\lambda)$ yield good approximations to the 95% quantile of the true distribution for all exponents $p \in \{1, 2, \log d, \infty\}$. From Figure 4 we learn that if the data has heavy tails, the approximations are fairly good for small exponents $p \in \{1, 2\}$, but fail spectacularly for large exponents $p \in \{\log d, \infty\}$. Moreover, the Gaussian multiplier and the naive Gaussian parametric bootstrap yield accurate estimates of the 95% quantiles of the target distribution only for large exponents $p \in \{\log d, \infty\}$ and only when the data has light tails. This again supports the theoretical results from Sections 3.3 and 3.4.
Figure 2: Boxplots of 1000 Kolmogorov-Smirnov distances between the distribution of the $\ell_p$-statistic and its bootstrap estimates based on the Gaussian Proxy (Gauss. Proxy), Gaussian Multiplier Bootstrap (GMB), Naive Gaussian Parametric Bootstrap (naive GPB), and Gaussian Parametric Bootstrap based on $\widehat\Sigma^+_n(\hat\lambda)$ (opt. thr. GPB). Sample size $n = 200$, dimension $d = 1000$, $F$ cdf of $t_4$.

5.4 Power of the bootstrap hypothesis test

To illustrate the effect of the exponent $p$ on size and power of the bootstrap hypothesis test we consider its power function in the following two high-dimensional testing scenarios:
• Dense alternatives. We test $H_0: \mu_0 = 0$ vs. $H_1: \mu_0 = \mu(\delta) \equiv \delta(1, \ldots, 1)' \in \mathbb{R}^d$ at a 5% significance level. The signal strength $\delta$ is of order $O\big(1/\sqrt{nd}\big)$.
• Sparse alternatives. We test $H_0: \mu_0 = 0$ vs. $H_1: \mu_0 = \mu(\delta) \equiv \delta(1, \ldots, 1, 0, \ldots, 0)' \in \mathbb{R}^d$ at a 5% significance level. The alternative has $2\lceil\sqrt{\log d}/2\rceil$ non-zero entries and the signal strength $\delta$ is of order $O\big(\sqrt{(\log d)/n}\big)$.

In Figures 5 and 6 we plot Monte Carlo estimates of the power function $\beta(\delta) = \mathrm{P}_{\mu(\delta)}\big(T_{n,p} > c^*_{n,p}(0.95)\big)$,
where $c^*_{n,p}(0.95) = \inf\big\{t \in \mathbb{R} : \mathrm{P}(S^*_{n,p} \le t \mid X) \ge 0.95\big\}$ and $S^*_{n,p}$ is the Gaussian parametric bootstrap test statistic based on $\widehat\Sigma^+_n(\hat\lambda)$. The Monte Carlo estimate of $\beta(\delta)$ is based on 1000 Monte Carlo samples of $X = \{X_i \in \mathbb{R}^d, 1 \le i \le n\}$ and 1000 bootstrap samples for each observed $X = \{X_i \in \mathbb{R}^d, 1 \le i \le n\}$. The specific estimation procedure is identical to the one used to compute the lower tail probabilities in Section 5.3. We report results for sample size $n = 200$, dimension $d = 400$, exponents $p \in \{1, 2, \log d, \infty\}$, and light-tailed data.
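The power curves can be reproduced (up to Monte Carlo error) along the following lines; `mc` and `B` are set smaller than in the text to keep the sketch cheap, and the standard-normal design is a simplification of the copula data of Section 5.1.

```python
import numpy as np

def power_curve(mu_fn, deltas, n=200, d=400, p=2, mc=200, B=500, rng=None):
    """Monte Carlo estimate of beta(delta) = P_{mu(delta)}(S_{n,p} > c*_{n,p}(0.95))."""
    rng = np.random.default_rng(rng)
    power = []
    for delta in deltas:
        rejections = 0
        for _ in range(mc):
            X = rng.standard_normal((n, d)) + mu_fn(delta, d)
            S = np.linalg.norm(X.sum(axis=0) / np.sqrt(n), ord=p)
            Xc = X - X.mean(axis=0)
            V = rng.multivariate_normal(np.zeros(d), Xc.T @ Xc / n, size=B)
            rejections += S > np.quantile(np.linalg.norm(V, ord=p, axis=1), 0.95)
        power.append(rejections / mc)
    return np.array(power)

dense = lambda delta, d: delta * np.ones(d)           # dense alternative
sparse = lambda delta, d: delta * (np.arange(d) < 4)  # 4 non-zero entries
```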
Figure 3: Relative frequencies of the simultaneous coverage of 1000 95% confidence sets under $H_0: \mu = 0$ of the Gaussian Proxy (Gauss. Proxy), Gaussian Multiplier Bootstrap (GMB), Naive Gaussian Parametric Bootstrap (naive GPB), and Gaussian Parametric Bootstrap based on $\widehat\Sigma^+_n(\hat\lambda)$ (opt. thr. GPB). The vertical bars indicate Monte Carlo standard errors. Sample size $n = 200$, dimension $d = 1000$, $F$ cdf of Uniform$(-1, 1)$.

Figure 5 shows the power function for dense alternatives. We observe that tests based on $S_{n,1}$ and $S_{n,2}$ (they are nearly indistinguishable in the figure) are more powerful than those based on $S_{n,\log d}$ and $S_{n,\infty}$. This fully matches the theoretical predictions from Section 4.5. Figure 6 displays the power function for sparse alternatives. In this case the bootstrap tests based on $S_{n,\log d}$ and $S_{n,\infty}$ are more powerful than those based on $S_{n,1}$ and $S_{n,2}$. The power functions associated with $S_{n,\log d}$ and $S_{n,\infty}$ are essentially the same, with $S_{n,\log d}$ being slightly more powerful because of a larger constant (note that $\mu(\delta)$ has $2\lceil\sqrt{\log d}/2\rceil = 4$ non-zero entries). Again, these findings fit well into the discussion in Section 4.5.

Lastly, at $\delta = 0$ the power functions of all four tests are about 0.05 in both Figures 5 and 6. Thus, all four tests successfully control the type I error at the 5% significance level. For large values of $\delta$ all four tests unanimously reject the null hypothesis with probability (close to) one. This confirms the results from Sections 4.2 and 4.4.
[Figure: relative coverage frequencies for the four ℓ_p-statistics, grouped by method.]

Figure 4: Relative frequencies of the simultaneous coverage of 1000 95% confidence sets under H_0: μ = 0 of the Gaussian Proxy (Gauss. Proxy), Gaussian Multiplier Bootstrap (GMB), Naive Gaussian Parametric Bootstrap (naive GPB), and Gaussian Parametric Bootstrap based on Σ̂_n^+(λ̂) (opt. thr. GPB). The vertical bars indicate Monte Carlo standard errors. Sample size n = 200, dimension d = 1000, F the cdf of a t-distribution.

[Figure: power functions β(δ) of the four ℓ_p-statistics under dense alternatives.]

Figure 5: Power functions under dense alternatives based on 1000 Monte Carlo samples. The gray bands indicate the Monte Carlo standard errors. Sample size n = 200, dimension d = 400, F the cdf of a centered uniform distribution.

[Figure: power functions β(δ) of the four ℓ_p-statistics under sparse alternatives.]

Figure 6: Power functions under sparse alternatives based on 1000 Monte Carlo samples. The gray bands indicate the Monte Carlo standard errors. Sample size n = 200, dimension d = 400, F the cdf of a centered uniform distribution.

6 Conclusion

In this paper we have introduced the Gaussian parametric bootstrap to estimate the distribution of ℓ_p-statistics of high-dimensional random vectors. The procedure is versatile and user-friendly, since its implementation requires only a positive semi-definite estimate of the population covariance matrix. The main theoretical contributions state the consistency of the Gaussian parametric bootstrap under various conditions on the covariance structure of the data. To showcase the applicability of the Gaussian parametric bootstrap we propose a bootstrap hypothesis test for simultaneous inference on high-dimensional mean vectors. We discuss in detail asymptotic correctness, confidence sets, consistency under high-dimensional alternatives, and power of the test.

One of the current challenges in theoretical statistics is to understand when bootstrap procedures work in high-dimensional problems. At least for bootstrapping ℓ_p-statistics of high-dimensional random vectors we can now give a definitive answer. The technical results in the appendix to this paper clarify that the success of bootstrapping ℓ_p-statistics hinges on three factors: (a) ℓ_p-norms of high-dimensional random vectors satisfy a Berry-Esseen-type central limit theorem under relatively mild moment conditions; (b) ℓ_p-norms of Gaussian random vectors satisfy suitable anti-concentration inequalities; and (c) the distributions of ℓ_p-statistics under centered Gaussian distributions vary smoothly over their covariance matrices.
Appendices

Organization.
The appendices are divided into two parts. In Appendix A we present additional results and technical lemmas. These include an abstract Berry-Esseen-type CLT for ℓ_p-statistics (Theorem 5), Gaussian anti-concentration inequalities (Theorems 6 and 7), Gaussian comparison inequalities (Theorem 8), smoothing inequalities (Section A.4), and auxiliary results for proving bootstrap consistency (Section A.5), for testing high-dimensional mean vectors (Section A.6), and concerning the partial derivatives of ℓ_p-norms (Section A.7). In Appendix B we provide proofs of all results from the main text and the appendix.

Additional Notation.
We denote by C^k(R^d) the class of k times continuously differentiable functions from R^d to R, and by C_b^k(R^d) the class of all functions f ∈ C^k(R^d) with bounded support. For a real-valued matrix M ∈ R^{d×d} we write ‖M‖_* to denote its nuclear norm (the ℓ_1-norm of its singular values). For ε > 0 and A ∈ B(R) we define the ε-enlargement of A as A^ε := { t ∈ R : inf_{s∈A} |t − s| ≤ ε }. Moreover, we write A^{−ε} to denote a set B ∈ B(R) for which B^ε = A; for example, if A = [0, t] with t ≥ 2ε, then A^ε = [−ε, t + ε] and we may take A^{−ε} = [ε, t − ε].

A Additional results and technical lemmas
A.1 Abstract Berry-Esseen-type CLT
Recall the setup from Section 3 in the main text. Consider a sequence X = {X_i}_{i=1}^n of independent and centered random vectors in R^d. Let Z = {Z_i}_{i=1}^n be a sequence of independent multivariate Gaussian random vectors Z_i ∼ N(0, E[X_i X_i′]) which are independent of X. Define the scaled averages

    S_n^X := (1/√n) Σ_{i=1}^n X_i   and   S_n^Z := (1/√n) Σ_{i=1}^n Z_i,   (62)

and the Kolmogorov-Smirnov distance

    ̺_{n,p} := sup_{t≥0} | P(‖S_n^X‖_p ≤ t) − P(‖S_n^Z‖_p ≤ t) |.   (63)

The upper bound in the original univariate Berry-Esseen inequality depends on the third moments of the X_i's. In the multivariate case, the concept of third moments is less clear cut. For example, E[‖X‖³] (for some norm ‖·‖) and Σ_{k=1}^d E[|X_k|³] are both sensible generalizations of the univariate third moment. The bound in our Berry-Esseen-type CLT depends on the following generalized third moments: For a, b ≥ 0,

    M_{n,b}(a) := E[ (1/n) Σ_{i=1}^n ( ‖X_i‖_b³ 1{‖X_i‖_b³ > a} + ‖Z_i‖_b³ 1{‖Z_i‖_b³ > a} ) ],   (64)

    L_{n,b} := E[ (1/n) Σ_{i=1}^n ( ‖X_i‖_b³ + ‖Z_i‖_b³ ) ].   (65)

We write L̄_{n,b} for an upper bound on L_{n,b}. Furthermore, we need two quantities based on the (average) covariance matrix of the X_i's: the vector of its diagonal elements and its rank, i.e.

    σ_n := ( E[(1/n) Σ_{i=1}^n X_{ik}²] )_{k=1}^d   and   r_n := rank( E[(1/n) Σ_{i=1}^n X_i X_i′] ).   (66)

The following Berry-Esseen-type CLT for ℓ_p-norms is our main theoretical contribution and central to all other results in this paper.

Theorem 5 (Berry-Esseen-type CLT for ℓ_p-norms). (i) For all p ∈ [1, ∞) and all τ ∈ [1, ∞],

    ̺_{n,p} ≲ M_{n,τp}( p^{1−1/(3τ)} n^{1/2} L̄_{n,τp}^{1/3} ) / ( p^{1−1/τ} L̄_{n,τp} ) + (p d^{1/p})^{1−1/(3τ)} ( L̄_{n,τp}^{1/3} / n^{1/6} ) · p^{3/2} r_n^{1/(2p)} / ‖σ_n‖_p.   (67)

(ii) For all p ∈ [log d, ∞],

    ̺_{n,p} ≲ M_{n,∞}( n^{1/2} (log d)^{−1/2} L̄_{n,∞}^{1/3} ) / L̄_{n,∞} + (log d)^{7/6} ( L̄_{n,∞}^{1/3} / n^{1/6} ) / ‖σ_n‖_∞.   (68)

Observe that the upper bound on the Kolmogorov-Smirnov distance exhibits qualitatively different behavior depending on the magnitude of the exponent p and the tails of the distribution of the X_i's. What is of interest here is that the upper bound undergoes a phase transition from polynomial dependence on d (or r_n) to logarithmic dependence on d as the exponent p crosses the threshold log d. It is easy to verify that for p ≍ log d and τ = 1 the bounds in (67) and (68) are of the same order, and that for p ≳ log d one may restrict attention to p ≍ log d, since for p ≥ log d the ℓ_p- and ℓ_∞-norms differ by at most a factor d^{1/p} ≤ e. In this case, the tails of the distribution of the X_i's come into play and the nuisance parameter τ ∈ [1, ∞] can be used to trade off moment conditions against fractional powers of the dimension d. To illustrate the basic idea of how to use τ, let us consider the two boundary cases τ ∈ {1, ∞}. Denote by σ̄_{n,min} := min_{1≤k≤d} σ_{n,k} the smallest diagonal element of the (averaged) covariance matrix of the X_i's. For τ = 1 eq. (67) simplifies to

    ̺_{n,p} ≲ M_{n,p}( p^{2/3} n^{1/2} L̄_{n,p}^{1/3} ) / L̄_{n,p} + ( p^{13/6} r_n^{1/(2p)} / (n^{1/6} σ̄_{n,min}) ) ( L̄_{n,p} d^{−1/p} )^{1/3},

while for τ = ∞ eq. (67) reduces to

    ̺_{n,p} ≲ M_{n,∞}( p n^{1/2} L̄_{n,∞}^{1/3} ) / ( p L̄_{n,∞} ) + ( p^{5/2} r_n^{1/(2p)} / (n^{1/6} σ̄_{n,min}) ) L̄_{n,∞}^{1/3}.

Typically, the first term on the right hand side of each of the two displays can be bounded independently of d and is negligible as n → ∞. Therefore, to decide which one of the two bounds is (asymptotically) tighter, we need to determine whether d^{−1/p} L_{n,p} ≶ L_{n,∞}. Clearly, this depends on the tails of the distribution of the X_i's. For concreteness, if the X_i's are sub-Gaussian, then L_{n,∞} ≍ (log d)^{3/2} while d^{−1/p} L_{n,p} ≍ d^{2/p}.
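Before turning to the remaining discussion, note that ̺_{n,p} in (63) is straightforward to approximate by simulation, which gives a useful sanity check on the bounds above. The sketch below is ours; the centered-exponential design and all sizes are arbitrary choices.

    import numpy as np

    def ks_distance_mc(p, n=50, d=200, reps=1000, seed=0):
        """Monte Carlo estimate of the KS distance (63) between ||S_n^X||_p and
        ||S_n^Z||_p; here X_ik are centered unit-variance exponentials, so the
        Gaussian counterpart S_n^Z is exactly N(0, I_d)."""
        rng = np.random.default_rng(seed)
        TX = np.empty(reps)
        TZ = np.empty(reps)
        for r in range(reps):
            X = rng.exponential(size=(n, d)) - 1.0                 # E X = 0, Var X = 1
            TX[r] = np.linalg.norm(X.sum(axis=0) / np.sqrt(n), ord=p)
            TZ[r] = np.linalg.norm(rng.standard_normal(d), ord=p)
        grid = np.sort(np.concatenate([TX, TZ]))
        FX = np.searchsorted(np.sort(TX), grid, side="right") / reps
        FZ = np.searchsorted(np.sort(TZ), grid, side="right") / reps
        return np.abs(FX - FZ).max()

    # The estimated distance tends to grow with the exponent p (cf. Figures 1 and 2).
    for p in (2.0, 3.0, np.log(200), np.inf):
        print(p, ks_distance_mc(p))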
In the main text in Section 3.2, we provide simplified and ready-to-use bounds under various conditions on the tails of the distribution of the X_i's.

Theorem 5 is non-asymptotic and holds for all n, d, p. However, it is only relevant in high-dimensional settings since in low-dimensional settings, in which the dimension d is fixed or grows much slower than the square root of the sample size n, there exist sharper results (e.g. Bhattacharya, 1977; Götze, 1991; Bentkus, 2003; Raič, 2019, and references therein). Since n^{−1/6} is the minimax optimal rate for CLTs in infinite dimensional Banach spaces, it is likely that in high-dimensional settings the bound in Theorem 5 is nearly optimal in terms of dependence on n (see discussion in Chernozhukov et al., 2017a; Bentkus, 1985).

Theorem 5 is related to Theorem 2.1 in Chernozhukov et al. (2017a), which is a Berry-Esseen-type CLT for hyper-rectangles. In fact, our proof builds on their idea of combining Stein's leave-one-out approach with Slepian's smart-path interpolation and iterative arguments due to Bolthausen (1984). All major technical differences between their and our proof can be traced back to the specific behavior of our new anti-concentration, Gaussian comparison, and smoothing inequalities in the regime p ≤ log d. If one is interested in results for large exponents p ≥ log d only, one can simply combine the original proof from Chernozhukov et al. (2017a) with our new anti-concentration and smoothing inequalities. Without further modifications of their arguments one then obtains the following slight improvement of (68).

Proposition 1 (Refined Berry-Esseen-type CLT for ℓ_p-norms with large exponents). For all p ∈ [log d, ∞],

    ̺_{n,p} ≲ M_{n,∞}( n^{1/2} (log d)^{−1/2} L̄_{n,max}^{1/3} ) / L̄_{n,max} + (log d)^{7/6} ( L̄_{n,max}^{1/3} / n^{1/6} ) / ‖σ_n‖_∞,

where L̄_{n,max} ≥ max_{1≤k≤d} (1/n) Σ_{i=1}^n E[|X_{ik}|³].

The second term on the right hand side in the bound of Proposition 1 is clearly smaller than the corresponding term in the bound (68). However, under the primitive conditions in Section 3.2 and for p ≥ log d, the terms L̄_{n,max} and L̄_{n,p} will only differ by a factor of order o(log d).

A.2 Anti-concentration inequalities
We begin with the following basic result for ℓ_p-norms of random vectors with log-concave probability measure when p ∈ 2N is an even integer. For a random variable X ∈ R^d with law ν and A ∈ B(R^d) we define P_ν(X ∈ A) := ∫_A dν.

Theorem 6.
Let X, X′ ∈ R^d be i.i.d. random vectors with law ν. For ε > 0 arbitrary,

    sup_ν sup_{p∈2N} sup_{t≥0} P_ν( t ≤ ‖X‖_p ≤ t + ε ‖ ‖X‖_p − ‖X′‖_p ‖_{ψ_1} ) ≲ ε,

where the supremum in ν is taken over all log-concave probability measures on R^d.

Note that the upper bound depends on the dimension d only through the quantity ‖ ‖X‖_p − ‖X′‖_p ‖_{ψ_1}. Interestingly, the assumption that ν belongs to the class of log-concave probability measures is indeed necessary: For one, it is easy to see that if d = 1 and the class of probability measures contains measures ν whose densities dν have multiple modes (or a point mass), there exists an ε > 0 for which the inequality fails; hence the densities dν have to be continuous and unimodal. For another, the term ‖ ‖X‖_p − ‖X′‖_p ‖_{ψ_1} is finite for all p ∈ 2N only if ν has at least sub-exponential tails. Together these two facts imply that ν has to be log-concave.

The key idea behind the proof of Theorem 6 is that for p ∈ 2N we may interpret ‖X‖_p^p as a (multivariate) polynomial and invoke the distributional version of the Carbery-Wright inequality for random polynomials over convex bodies (Carbery and Wright, 2001, Theorem 8). We defer the detailed proof to Appendix B.

Specializing to a Gaussian random vector and the squared ℓ_2-norm, Theorem 6 yields the following important result:

Corollary 5.
Let X ∈ R^d be a Gaussian random vector with mean μ ∈ R^d and positive semi-definite covariance matrix Σ. For ε > 0 arbitrary,

    sup_{t≥0} P( t ≤ ‖X‖₂² ≤ t + ε (tr(Σ²) + μ′Σμ)^{1/2} ) ≲ ε.

Note that this corollary holds for any fixed mean μ ∈ R^d of X. In this sense it is an anti-concentration inequality about ellipsoids with arbitrary radii but fixed center μ ∈ R^d. In contrast, the well-known result by Nazarov (2003) is an anti-concentration inequality over ellipsoids with arbitrary radii and arbitrary centers. However, this stronger result requires the additional assumption that Σ is positive definite and comes with a smaller standard deviation proxy, namely (tr(Σ^{−1}))^{−1/2}. Our Corollary 5 sharpens the finite-dimensional analogue of Theorem 2.7 in Götze et al. (2019) by introducing the quadratic term μ′Σμ to the inequality.

Combining Theorem 6 with an interpolation argument and fine properties of Gaussian measures (i.e. Plancherel's identity and careful truncation) yields the following theorem for general ℓ_p-norms with exponent p ∈ [1, ∞].

Theorem 7.
Let X ∈ R^d be a centered Gaussian random vector with positive semi-definite covariance matrix Σ = (σ_kj)_{k,j=1}^d of rank r ≥ 1. Set σ = (σ_kk)_{k=1}^d. For ε > 0 arbitrary,

    sup_{p∈[1,∞]} sup_{t≥0} P( t ≤ ‖X‖_p ≤ t + ε ‖σ‖_p / ω_p(d, r) ) ≲ ε,

where

    ω_p(d, r) = p^{3/2} r^{1/(2p)}  if p ∈ [1, ∞),   √(log d)  if p ≥ log d.

Remark 9.
We will use the following refined inequality to prove Theorem 5. Let p_+ = 2⌈p/2⌉ be the smallest even integer larger than p. Then,

    sup_{p∈[1,∞)} sup_{t≥0} sup_{q∈{p,p_+}} P( t ≤ ‖X‖_q ≤ t + ε ‖σ‖_p / (p^{3/2} r^{1/(2p)}) ) ≲ ε.

We only need to show the validity of the inequality for q = p_+. To this end, invoke Theorem 6 and, as in the proof of Theorem 7, lower bound ‖ ‖X‖_{p_+} − ‖X′‖_{p_+} ‖_{ψ_1} by c ‖σ‖_p / (p^{3/2} r^{1/(2p)}), where c > 0 is an absolute constant.

Several comments are in order: First, the term ω_p(d, r) undergoes a phase transition as p crosses the threshold log d. This phenomenon matches well-known phase transitions of the expected value and the variance of ℓ_p-norms of isotropic Gaussian random vectors (e.g. Schechtman and Zinn, 1990; Paouris and Valettas, 2018). Second, for p = ∞ our theorem improves Nazarov's inequality (i.e. Chernozhukov et al., 2017b, Theorem 1) in a crucial detail: the band width in our inequality depends on the largest diagonal element of Σ, whereas the corresponding quantity in Nazarov's inequality depends on the smallest diagonal element of Σ. Third, if X is isotropic then ‖σ‖₂² r^{−1/2} ≍ tr(Σ²)^{1/2} and Corollary 5 and Theorem 7 are asymptotically equivalent. However, in general, the inequality in Corollary 5 is tighter.

In special cases, it is possible to obtain explicit constants for the inequalities in this section. Notably, if the Gaussian random vector X has a spherical distribution, the bounds on variance and moments of ℓ_p-norms of Gaussian random vectors in Section 3 of Paouris and Valettas (2018) are directly applicable.
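Theorem 7 is also easy to probe numerically. The following sketch (ours; the equicorrelated design and all constants are arbitrary choices) estimates the left-hand side with the band ε ‖σ‖_p / ω_p(d, r) and checks that the result is of order ε:

    import numpy as np

    def levy_concentration(p, eps, d=500, rho=0.5, reps=20_000, seed=0):
        """Estimate sup_t P(t <= ||X||_p <= t + eps*||sigma||_p/omega_p(d, r)) for
        an equicorrelated Gaussian X; by Theorem 7 the result should be O(eps)."""
        rng = np.random.default_rng(seed)
        # Sigma = (1 - rho) I + rho 11' has unit diagonal and full rank r = d.
        X = (np.sqrt(1 - rho) * rng.standard_normal((reps, d))
             + np.sqrt(rho) * rng.standard_normal((reps, 1)))
        T = np.sort(np.linalg.norm(X, ord=p, axis=1))
        omega = np.sqrt(np.log(d)) if p >= np.log(d) else p ** 1.5 * d ** (1 / (2 * p))
        band = eps * np.linalg.norm(np.ones(d), ord=p) / omega   # eps*||sigma||_p/omega
        counts = (np.searchsorted(T, np.quantile(T, q) + band) -
                  np.searchsorted(T, np.quantile(T, q))
                  for q in np.linspace(0.0, 1.0, 201))
        return max(counts) / reps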
A.3 Gaussian comparison inequalities

The following result allows us to compare the distributions of ℓ_p-norms of two centered Gaussian random vectors with (potentially) different covariance matrices.

Theorem 8.
Let X and Y be two independent Gaussian random vectors in R^d with mean zero and covariance matrices Σ^X = (Σ^X_{jk})_{j,k=1}^d and Σ^Y = (Σ^Y_{jk})_{j,k=1}^d, respectively. Define σ^X = (Σ^X_{kk})_{k=1}^d, σ^Y = (Σ^Y_{kk})_{k=1}^d, r_X = rank(Σ^X), and r_Y = rank(Σ^Y). Set Δ_op = ‖Σ^X − Σ^Y‖_op and Δ_p = ‖vec(Σ^X − Σ^Y)‖_p.

(i) For all p ∈ [1, ∞),

    sup_{t≥0} | P(‖X‖_p ≤ t) − P(‖Y‖_p ≤ t) | ≲ √( p³ r_X^{1/p} Δ_p / ‖σ^X‖_p ) ∧ √( p³ r_Y^{1/p} Δ_p / ‖σ^Y‖_p ).   (69)

(ii) For all p ∈ [log d, ∞],

    sup_{t≥0} | P(‖X‖_p ≤ t) − P(‖Y‖_p ≤ t) | ≲ (log d) √( (Δ_op ∧ Δ_∞) / ( ‖σ^X‖_∞ ∨ ‖σ^Y‖_∞ ) ).   (70)

Remark 10.
In order to derive minimax lower bounds for the Gaussian parametric and Gaussian multiplier bootstrap it would be extremely useful to have complementary lower bounds on the Kolmogorov-Smirnov distance between the distributions of ‖X‖_p and ‖Y‖_p.

Remark 11.
Using Corollary 5 instead of Theorem 7 in the proof of the above theorem we obtain the following alternative bound for p = 2:

    sup_{t≥0} | P(‖X‖₂ ≤ t) − P(‖Y‖₂ ≤ t) | ≲ √( d^{1/2} Δ₂ / ( ‖vec(Σ^X)‖₂ ∨ ‖vec(Σ^Y)‖₂ ) ).   (71)

The most interesting aspect of this result is that the upper bound on the Kolmogorov-Smirnov distance shows qualitatively different behavior depending on the magnitude of the exponent p. Let σ_{X,min} := min_{1≤k≤d} Σ^X_{kk} and σ_{Y,min} := min_{1≤k≤d} Σ^Y_{kk}. We can now further simplify (69) to

    sup_{t≥0} | P(‖X‖_p ≤ t) − P(‖Y‖_p ≤ t) | ≲ √( p³ (r_X/d)^{1/p} Δ_p / σ_{X,min} ) ∧ √( p³ (r_Y/d)^{1/p} Δ_p / σ_{Y,min} ).

This bound is useful because it depends on the dimension d only via the difference Δ_p (note that r_X d^{−1}, r_Y d^{−1} ≤ 1). For p = 2, Götze et al. (2019) (Theorem 2.1 and corollaries) have derived bounds similar to (71) but based on a completely different approach that involves bounding the density functions of ‖X‖₂ and ‖Y‖₂. In the one- and two-dimensional cases their bounds are strictly tighter than (71). In the d-dimensional case with d ≥
3, their bound is (roughly) of the order of O( ‖Σ^X − Σ^Y‖_* ( ‖vec(Σ^X)‖₂ ∨ ‖vec(Σ^Y)‖₂ )^{−1} ), which is just slightly smaller than the square of (71). In general, neither their nor our bound is clearly better or worse.

For p = ∞ and (log d)² Δ_∞ = o(1), inequality (70) improves Theorem 2 in Chernozhukov et al. (2015) in two ways: First, we improve the rate from (log d)^{2/3} Δ_∞^{1/3} to (log d) Δ_∞^{1/2}. Second, our bound depends only on the inverse of ‖σ^X‖_∞ or ‖σ^Y‖_∞, whereas the inequality by Chernozhukov et al. (2015) depends also on the inverse of either σ_{X,min} or σ_{Y,min}. The second improvement can be ascribed to our improved anti-concentration inequality.

The proofs of Theorems 5 and 8 are conceptually very similar and rely on the same anti-concentration and smoothing inequalities. The main difference between the two proofs is that Theorem 5 uses a second-order Taylor approximation to expand the smoothed Kolmogorov-Smirnov distance and matches the first two moments of S_n^X and S_n^Z, whereas Theorem 8 uses only a first-order Taylor approximation and matches only the first moments (because the second moments Σ^X and Σ^Y differ). Along the way, the proof of Theorem 8 also makes heavy use of X and Y being Gaussian.
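Theorem 8 can likewise be probed by simulation. The sketch below (ours; the covariance pair Σ^X = I_d, Σ^Y = (1 − τ)I_d is an arbitrary choice) compares the empirical Kolmogorov-Smirnov distance between ‖X‖_∞ and ‖Y‖_∞ with the quantity (log d)√Δ_∞ appearing in (70):

    import numpy as np

    def comparison_demo(d=500, tau=0.05, reps=50_000, seed=0):
        """Empirical KS distance between ||X||_inf and ||Y||_inf for Sigma_X = I_d
        and Sigma_Y = (1 - tau) I_d, next to the rate (log d) sqrt(Delta_inf) of
        (70); here Delta_inf = tau and ||sigma^X||_inf = 1."""
        rng = np.random.default_rng(seed)
        TX = np.abs(rng.standard_normal((reps, d))).max(axis=1)
        TY = np.sqrt(1.0 - tau) * np.abs(rng.standard_normal((reps, d))).max(axis=1)
        grid = np.sort(np.concatenate([TX, TY]))
        FX = np.searchsorted(np.sort(TX), grid, side="right") / reps
        FY = np.searchsorted(np.sort(TY), grid, side="right") / reps
        return np.abs(FX - FY).max(), np.log(d) * np.sqrt(tau)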
A.4 Smoothing inequalities

Smoothing inequalities allow us to replace probabilities like P(‖X‖_p ∈ A) = E[1{‖X‖_p ∈ A}], which are expectations of non-differentiable indicator functions of the non-differentiable maps x
↦ ‖x‖_p, by expectations of smooth functions. This enables us to approximate these probabilities via first- or second-order Taylor approximations, which is the first step in establishing the abstract Berry-Esseen-type and Gaussian comparison inequalities in Sections A.1 and A.3.

Lemma 1 (C_b^∞(R^d)-Approximation of ℓ_p-Norms). Let X ∈ R^d be an arbitrary random vector. There exists a family of smooth functions H = { h_{p,d,β,δ,A} ∈ C_b^∞(R^d) : p ∈ 2N, d, β, δ > 0, A ∈ B(R) } which satisfies the following:

(i) For A ∈ B(R), p ∈ 2N, τ ∈ [1, ∞], and κ_p = 3β^{−1} p d^{1/(τp)},

    P(‖X‖_p ∈ A) ≤ E[ h_{p,d,β,δ,A^{κ_p}}(X) ] ≤ P( ‖X‖_p ∈ A^{δ+2κ_p} ).   (72)
1. The reason for this is that ℓ p -norms with largeexponents (relative to dimension d ) behave essentially like the non-differentiable maximumnorm ( ℓ ∞ -norm). We therefore have to smooth ℓ p -norms with large exponents p ≥ log d differently. This is content of the next result. Lemma 2 ( C ∞ b ( R d )-Approximation of ℓ p -Norms for p ≥ log d ) . Let X ∈ R d be an arbitraryrandom vector. There exists a family of smooth functions H = { h p,d,β,δ,A ∈ C ∞ b ( R d ) : p ∈ [log d, ∞ ] , d, β, δ > , A ∈ B ( R ) } which satisfies the following:(i) For A ∈ B ( R ) , p ∈ [log d, ∞ ] , and κ = eβ − log(2 d ) , P ( k X k p ∈ A ) ≤ E [ h p,d,β,δ,A κ ( X )] ≤ P (cid:0) k X k p ∈ A δ +2 κ (cid:1) . (75) (ii) The functions in H have support set (cid:8) x ∈ R d : F β ( x ) ∈ A δ \ A (cid:9) , where F β is thesmooth-max function and defined in eq. (149) .(iii) For p ∈ [log d, ∞ ] , sup A ∈B ( R ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =2 | D α h p,d,β,δ,A | (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . δ + βδ , sup A ∈B ( R ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =3 | D α h p,d,β,δ,A | (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . δ + βδ + β δ . (76)33he smooth approximation in this lemma is based on the smooth-max function thatwas first introduced by Chernozhukov et al. (2013). Thus, the modest novelty of this resultis that the smooth-max function can be used to approximate not just the maximum-norm( ℓ ∞ -norm) but also all ℓ p -norms with exponent p ≥ log d . Smooth approximations basedon the smooth-max function satisfy other useful stability properties beyond the bounds onthe second and third derivative in (76). While these other stability properties are crucialto the proofs of the Berry-Esseen-type CLTs in Chernozhukov et al. (2013, 2015, 2017a);Deng and Zhang (2020); Koike (2019), these properties are not essential to our proofs. A.5 Auxiliary results I (Bootstrap consistency)
A.5 Auxiliary results I (Bootstrap consistency)

In this section we collect bounds on tail probabilities and moments of vectors and covariance matrices in ℓ_p-norms. These results are used in Sections 3.3 and 3.4. Throughout this section we write X̄_n = n^{−1} Σ_{i=1}^n X_i for the sample mean and Σ̂_naive = (σ̂_kj)_{k,j=1}^d = n^{−1} Σ_{i=1}^n (X_i − X̄_n)(X_i − X̄_n)′ for the sample covariance matrix.

Lemma 3.
Let X ∈ R^d be a random vector that satisfies Assumption 3 with s ≥ t ∨ p for some t ≥ 1. Then,

    ( E‖X‖_p^t )^{1/t} ≲ K_s ‖σ‖_p,

where σ = (σ_k)_{k=1}^d and σ_k² = E[X_k²] for 1 ≤ k ≤ d.

Lemma 4 (Product of sub-gaussian random variables). Let X_1, …, X_K ∈ R be sub-gaussian random variables. Then, for K ≥ 1,

    ‖ Π_{k=1}^K X_k ‖_{ψ_{2/K}} ≤ Π_{k=1}^K ‖X_k‖_{ψ_2}.

Lemma 5 (Sub-Gaussian). Let X = {X_i}_{i=1}^n be a sequence of i.i.d. mean zero random vectors in R^d with covariance matrix Σ = (σ_jk)_{j,k=1}^d which satisfies Assumption 1. Set σ = (σ_kk)_{k=1}^d and σ̂ = (σ̂_kk)_{k=1}^d. Let ζ ∈ (0, 1) be arbitrary.

(i) With probability at least 1 − ζ, for all p ∈ [1, ∞],

    ‖vec(Σ̂_naive − Σ)‖_p ≲ ‖σ‖_p ( √( (log d + log(2/ζ))/n ) ∨ ( (log d + log(2/ζ))/n ) ).

(ii) Let r(Σ) = tr(Σ)/‖Σ‖_op be the effective rank of Σ. With probability at least 1 − ζ,

    ‖Σ̂_naive − Σ‖_op ≲ ‖Σ‖_op ( √( (r(Σ) log d + log(2/ζ))/n ) ∨ ( (r(Σ) log d + log(2/ζ))/n ) ).

(iii) With probability at least 1 − ζ,

    max_{1≤k≤d} | σ̂_k²/σ_k² − 1 | ≲ √( (log d + log(2/ζ))/n ) ∨ ( (log d + log(2/ζ))/n ).

Remark 12. For p = 2, case (i) is qualitatively (i.e. up to log-factors) identical to Theorem 2.1 in Bunea and Xiao (2015). Cases (ii) and (iii) are folklore. Note that these results are usually given for the matrix of second moments n^{−1} Σ_{i=1}^n X_i X_i′ only, whereas we provide bounds for the sample covariance matrix Σ̂_naive.

Lemma 6 (Finite Moments). Let X = {X_i}_{i=1}^n be a sequence of i.i.d. mean zero random vectors in R^d with covariance matrix Σ = (σ_jk)_{j,k=1}^d. Set σ = (σ_kk)_{k=1}^d and σ̂ = (σ̂_kk)_{k=1}^d.

(i) Suppose Assumption 3 holds with s ≥ p ∨ 2. For p ∈ [1, ∞],

    ‖vec(Σ̂_naive − Σ)‖_p = O_p( K_s ‖σ‖_p √( (p ∧ log d)/n ) ).

(ii) Suppose Assumption 3 holds with s = 2. Set m(Σ) = E[max_{1≤i≤n} ‖X_i‖₂²]/‖Σ‖_op. Then

    ‖Σ̂_naive − Σ‖_op = O_p( ‖Σ‖_op ( √( m(Σ) log(d ∧ n)/n ) ∨ ( m(Σ) log(d ∧ n)/n ) ) ).

(iii) Suppose Assumption 3 holds for s ≥ 2. Set m̃(Σ) = E[max_{1≤i≤n} ‖diag(Σ)^{−1/2} X_i‖₂²]. Then

    max_{1≤k≤d} | σ̂_k²/σ_k² − 1 | = O_p( K_s² d^{2/s} √( (s ∧ log d)/n ) ) for s ≥ 4, and O_p( √( m̃(Σ) log(d ∧ n)/n ) ∨ ( m̃(Σ) log(d ∧ n)/n ) ) for s ≥ 2.

Covariance matrix estimators that can exploit Assumption 4 are the so-called thresholding estimators. In the following, we consider a generic thresholding operator T_λ : R → R with thresholding parameter λ, which satisfies

    (i) |T_λ(u)| ≤ |u|;   (ii) T_λ(u) = 0 for |u| ≤ λ;   (iii) |T_λ(u) − u| ≤ λ.

Thresholding operators satisfying these three properties include the hard-thresholding operator T_λ(u) = u 1{|u| > λ} (from Section 3) as well as the soft-thresholding operator T_λ(u) = sign(u)((|u| − λ) ∨ 0).

Lemma 7 (Thresholded covariance estimators). Let X = {X_i}_{i=1}^n be a sequence of i.i.d. mean zero random vectors in R^d with covariance matrix Σ = (σ_jk)_{j,k=1}^d. Set σ = (σ_kk)_{k=1}^d and σ̂ = (σ̂_kk)_{k=1}^d. Let ζ ∈ (0, 1) be arbitrary and set λ_n ≍ √( (log d + log(2/ζ))/n ) ∨ ( (log d + log(2/ζ))/n ).

(i) Suppose Assumption 1 holds. Let A ∈ R^{d×d} be the adjacency matrix of Σ, i.e. A_jk = 1{σ_jk ≠ 0}. With probability at least 1 − ζ, for all p ∈ [1, ∞],

    ‖vec( T_{λ_n}(Σ̂_naive) − Σ )‖_p ≲ ‖vec(A)‖_p ‖σ‖_∞ λ_n,
    ‖ T_{λ_n}(Σ̂_naive) − Σ ‖_op ≲ ‖A‖_op ‖σ‖_∞ λ_n.
(ii) Suppose Assumptions 1 and 4 hold. With probability at least 1 − ζ, for all p ∈ [θ, ∞],

    ‖vec( T_{λ_n}(Σ̂_naive) − Σ )‖_p ≲ d^{1/p} R_{γ,p} ‖σ‖_∞^{1−γ} λ_n^{1−γ},
    ‖ T_{λ_n}(Σ̂_naive) − Σ ‖_op ≲ R_{γ,θ} ‖σ‖_∞^{1−γ} λ_n^{1−γ}.

(iii) Suppose Assumption 3 holds with s ≥ (p ∧ log d) ∨ 4. Then, for all p ∈ [2 ∨ θ, ∞],

    ‖vec( T_{λ_n}(Σ̂) − Σ )‖_p = O_p( ‖vec(A)‖_p K_s ‖σ‖_s √( (s ∧ log d)/n ) ),
    ‖ T_{λ_n}(Σ̂) − Σ ‖_op = O_p( ‖A‖_op K_s ‖σ‖_s √( (s ∧ log d)/n ) ).

(iv) Suppose Assumptions 3 and 4 hold with s ≥ (p ∧ log d) ∨ 4. Then, for all p ∈ [2 ∨ θ, ∞],

    ‖vec( T_{λ_n}(Σ̂) − Σ )‖_p = O_p( d^{1/p} R_{γ,p} K_s^{1−γ} ‖σ‖_s^{1−γ} ( (s ∧ log d)/n )^{(1−γ)/2} ),
    ‖ T_{λ_n}(Σ̂) − Σ ‖_op = O_p( R_{γ,θ} K_s^{1−γ} ‖σ‖_s^{1−γ} ( (s ∧ log d)/n )^{(1−γ)/2} ).

Remark 13.
Cases (i) and (ii) generalize Theorems 6.23 and 6.27 in Wainwright (2019) to the sample covariance matrix Σ̂_naive and the vectorized ℓ_p-norm with p ∈ [θ, ∞]. Note that for γ = 0 case (ii) reduces to case (i). The bounds in cases (iii) and (iv) are only better than the naive bounds from Lemma 6 for p < s. For s ≍ log d they match the bounds of cases (i) and (ii).

Remark 14.
For max_{1≤j≤d} Σ_{k=1}^d |σ_jk|^γ ≤ R_{γ,1} and p = 2 we obtain the same rate as Theorem 2 in Bickel and Levina (2008b).

Lemma 8 (Banded covariance estimators). Let X = {X_i}_{i=1}^n be a sequence of i.i.d. mean zero random vectors in R^d with covariance matrix Σ = (σ_jk)_{j,k=1}^d. Set σ = (σ_kk)_{k=1}^d and σ̂ = (σ̂_kk)_{k=1}^d. Let ζ ∈ (0, 1) be arbitrary and set ℓ_n = B_p^{p/(1+pα)} ‖σ‖_∞^{−p/(1+pα)} λ_n^{−p/(1+pα)} for λ_n, α > 0.

(i) Suppose Assumptions 1 and 5 hold. Set λ_n ≍ √( (log d + log(2/ζ))/n ) ∨ ( (log d + log(2/ζ))/n ). With probability at least 1 − ζ, for all p ∈ [θ, ∞],

    ‖vec( B_{ℓ_n}(Σ̂_naive) − Σ )‖_p ≲ d^{1/p} B_p^{1/(1+pα)} ‖σ‖_∞^{pα/(1+pα)} λ_n^{pα/(1+pα)},
    ‖ B_{ℓ_n}(Σ̂_naive) − Σ ‖_op ≲ B_θ^{1/(1+αθ)} ‖σ‖_∞^{αθ/(1+αθ)} λ_n^{αθ/(1+αθ)}.

(ii) Suppose Assumptions 3 and 5 hold with s ≥ (p ∧ log d) ∨ 4. Set λ_n ≍ √( (s ∧ log d)/n ). For all p ∈ [2 ∨ θ, ∞],

    ‖vec( B_{ℓ_n}(Σ̂_naive) − Σ )‖_p = O_p( d^{1/p} B_p^{1/(1+pα)} K_s^{pα/(1+pα)} ‖σ‖_s^{pα/(1+pα)} λ_n^{pα/(1+pα)} ),
    ‖ B_{ℓ_n}(Σ̂_naive) − Σ ‖_op = O_p( B_θ^{1/(1+αθ)} K_s^{αθ/(1+αθ)} ‖σ‖_s^{αθ/(1+αθ)} λ_n^{αθ/(1+αθ)} ).

Remark 15. Analogous results also hold for covariance estimates based on the tapering operator as defined in Bickel and Levina (2008a).
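For concreteness, hard- and soft-thresholding of the sample covariance matrix take only a few lines. The sketch is ours; in particular, leaving the diagonal untouched is our choice, and the constant in λ_n is set to one.

    import numpy as np

    def threshold_cov(X, lam, kind="hard"):
        """Thresholded sample covariance: T_lam applied entrywise off the diagonal
        (keeping the diagonal intact is our choice, not dictated by the text)."""
        S = np.cov(X, rowvar=False)
        D = np.diag(np.diag(S))
        off = S - D
        if kind == "hard":
            off = off * (np.abs(off) > lam)                          # T_lam(u) = u 1{|u| > lam}
        else:
            off = np.sign(off) * np.maximum(np.abs(off) - lam, 0.0)  # soft thresholding
        return D + off

    n, d = 200, 1000
    X = np.random.default_rng(0).standard_normal((n, d))
    Sigma_hat = threshold_cov(X, lam=np.sqrt(np.log(d) / n))   # lambda_n-scale choice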
A.6 Auxiliary results II (Testing high-dimensional mean vectors)
In this section we present results that are used in Section 4. Recall that the Gaussian parametric bootstrap estimate of the α-quantile of the test statistic S_{n,p} is

    c*_{n,p}(α) := inf{ t ∈ R : P(S*_{n,p} ≤ t | X) ≥ α }.

We now introduce the Gaussian approximation of the α-quantile of the test statistic S_{n,p},

    c̃_p(α) := inf{ t ∈ R : P(S̃_p ≤ t) ≥ α },

where S̃_p := ‖V‖_p with V ∼ N(0, Ω). Given the theoretical results in Section 3 we can expect that these two quantiles are close. The next lemma formalizes this intuition. It is a straightforward adaptation of Lemma 3.2 in Chernozhukov et al. (2013) to our setup.
Lemma 9 (Comparison of quantiles). For all p ∈ [1, ∞) and all δ > 0,

    sup_{α∈(0,1)} P( c*_{n,p}(α) ≤ c̃_p( π_p(δ) + α ) ) ≥ 1 − P(Π_p > δ),
    sup_{α∈(0,1)} P( c̃_p(α) ≤ c*_{n,p}( π_p(δ) + α ) ) ≥ 1 − P(Π_p > δ),   (77)

where

    π_p(δ) = √( p³ r_ω^{1/p} d′^{1/p} δ / ‖ω‖_p )  if p ∈ [1, ∞),   (log d′) √( δ / ‖ω‖_∞ )  if p ≥ log d′,

and

    Π_p = Γ̂_p  if p ∈ [1, ∞),   Γ̂_op ∧ Γ̂_∞  if p ≥ log d′.

The following lemma provides bounds on the upper quantiles of the Gaussian proxy statistic S̃_p in terms of its expected value and the covariance matrix.

Lemma 10 (Bounds on (upper) quantiles of S̃_p). For all α ∈ (0, 1/2],

    E[S̃_p] − ( √2 ‖Ω^{1/2}‖_{2→p} ∧ √(2 Var[S̃_p]) ) ≤ c̃_p(1 − α) ≤ E[S̃_p] + ( √(2 log(1/α)) ‖Ω^{1/2}‖_{2→p} ∧ √( (1/α) Var[S̃_p] ) ).

In fact, the upper bound holds for all α ∈ (0, 1).
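In practice, c*_{n,p}(α) is computed as an empirical quantile of bootstrap draws, conditional on the data. A minimal sketch (ours), for a generic positive semi-definite covariance estimate Sigma_hat:

    import numpy as np

    def bootstrap_quantile(Sigma_hat, p, alpha=0.95, n_boot=1000, seed=0):
        """Empirical alpha-quantile of ||G||_p over draws G ~ N(0, Sigma_hat),
        i.e. the Gaussian parametric bootstrap critical value c*_{n,p}(alpha)."""
        rng = np.random.default_rng(seed)
        d = Sigma_hat.shape[0]
        L = np.linalg.cholesky(Sigma_hat + 1e-10 * np.eye(d))   # PSD safeguard
        G = rng.standard_normal((n_boot, d)) @ L.T
        return np.quantile(np.linalg.norm(G, ord=p, axis=1), alpha)

By Lemma 9, this quantity is close to the Gaussian-approximation quantile c̃_p(α) whenever the covariance estimation error Π_p is small.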
Remark 16. Note that one may combine this lemma with Lemma 9 to obtain bounds on the bootstrap critical values c*_{n,p}(1 − α) for α ∈ (0, 1/2].

A.7 Auxiliary results III (Partial derivatives of ℓ_p-norms)

Here we collect three lemmata on the partial derivatives of ℓ_p-norms.

Lemma 11 (Partial Derivatives). Let p > 1 and x ∈ { z ∈ R^d : z_i ≥ 0, i = 1, …, d } \ {0}. Set M_p(x) = ( Σ_{j=1}^d x_j^p )^{1/p}. The following are the partial derivatives of M_p up to order three:

    ∂M_p(x)/∂x_k = x_k^{p−1} / M_p(x)^{p−1},

    ∂²M_p(x)/∂x_k² = (p−1) x_k^{p−2} / M_p(x)^{p−1} − (p−1) x_k^{2p−2} / M_p(x)^{2p−1},

    ∂²M_p(x)/∂x_k∂x_ℓ = −(p−1) x_k^{p−1} x_ℓ^{p−1} / M_p(x)^{2p−1},

    ∂³M_p(x)/∂x_k∂x_ℓ∂x_m = (2p−1)(p−1) x_k^{p−1} x_ℓ^{p−1} x_m^{p−1} / M_p(x)^{3p−1},

    ∂³M_p(x)/∂x_k²∂x_ℓ = −(p−1)² x_k^{p−2} x_ℓ^{p−1} / M_p(x)^{2p−1} + (2p−1)(p−1) x_k^{2p−2} x_ℓ^{p−1} / M_p(x)^{3p−1},

    ∂³M_p(x)/∂x_k³ = (p−1)(p−2) x_k^{p−3} / M_p(x)^{p−1} − 3(p−1)² x_k^{2p−3} / M_p(x)^{2p−1} + (2p−1)(p−1) x_k^{3p−3} / M_p(x)^{3p−1}.

Lemma 12 (Stability of Partial Derivatives (Conjugate Norm)). Let p, q > 1 be conjugate exponents such that 1/p + 1/q = 1. Let x ∈ { z ∈ R^d : z_i ≥ 0, i = 1, …, d } \ {0} and set M_p(x) = ( Σ_{j=1}^d x_j^p )^{1/p}. We have the following:

    ( Σ_{k=1}^d |∂M_p(x)/∂x_k|^q )^{1/q} = 1,

    ( Σ_{k=1}^d |∂²M_p(x)/∂x_k²|^q )^{1/q} ≤ (p−1) d^{1/p} / M_p(x) + 2(p−1) / M_p(x)   ∀ p ≥ 2,

    ( Σ_{k,ℓ} |∂²M_p(x)/∂x_k∂x_ℓ|^q )^{1/q} ≤ (p−1) / M_p(x),

    ( Σ_{k,ℓ,m} |∂³M_p(x)/∂x_k∂x_ℓ∂x_m|^q )^{1/q} ≤ (2p−1)(p−1) / M_p(x)²,

    ( Σ_{k,ℓ} |∂³M_p(x)/∂x_k²∂x_ℓ|^q )^{1/q} ≤ (p−1)² d^{1/p} / M_p(x)² + 2(2p−1)(p−1) / M_p(x)²   ∀ p ≥ 2,

    ( Σ_{k=1}^d |∂³M_p(x)/∂x_k³|^q )^{1/q} ≤ (p−1)(p−2) d^{2/p} / M_p(x)² + 12(p−1)² / M_p(x)² + 4(2p−1)(p−1) / M_p(x)²   ∀ p ≥ 3,

    ( Σ_{k=1}^d |∂³M₂(x)/∂x_k³|² )^{1/2} ≲ 1 / M₂(x)².

Lemma 13 (Stability of Partial Derivatives (Transformed Conjugate Norm)). Let p > 1, τ ≥ 1, and q′ = τp/(τp − 1). Let x ∈ { z ∈ R^d : z_i ≥ 0, i = 1, …, d } \ {0} and set M_p(x) = ( Σ_{j=1}^d x_j^p )^{1/p}. We have the following:

    ( Σ_{k=1}^d |∂M_p(x)/∂x_k|^{q′} )^{1/q′} ≤ d^{(τ−1)/(τp)},

    ( Σ_{k=1}^d |∂²M_p(x)/∂x_k²|^{q′} )^{1/q′} ≤ (p−1) d^{(2τ−1)/(τp)} / M_p(x) + 2(p−1) d^{(τ−1)/(τp)} / M_p(x)   ∀ p ≥ 2,

    ( Σ_{k,ℓ} |∂²M_p(x)/∂x_k∂x_ℓ|^{q′} )^{1/q′} ≤ (p−1) d^{2(τ−1)/(τp)} / M_p(x),

    ( Σ_{k,ℓ,m} |∂³M_p(x)/∂x_k∂x_ℓ∂x_m|^{q′} )^{1/q′} ≤ (2p−1)(p−1) d^{3(τ−1)/(τp)} / M_p(x)²,

    ( Σ_{k,ℓ} |∂³M_p(x)/∂x_k²∂x_ℓ|^{q′} )^{1/q′} ≤ (p−1)² d^{(3τ−2)/(τp)} / M_p(x)² + 2(2p−1)(p−1) d^{2(τ−1)/(τp)} / M_p(x)²   ∀ p ≥ 2,
    ( Σ_{k=1}^d |∂³M_p(x)/∂x_k³|^{q′} )^{1/q′} ≤ (p−1)(p−2) d^{(3τ−1)/(τp)} / M_p(x)² + 12(p−1)² d^{(τ−1)/(τp)} / M_p(x)² + 4(2p−1)(p−1) d^{(τ−1)/(τp)} / M_p(x)²   ∀ p ≥ 3,

    ( Σ_{k=1}^d |∂³M₂(x)/∂x_k³|^{2τ/(2τ−1)} )^{(2τ−1)/(2τ)} ≲ d^{(τ−1)/(2τ)} / M₂(x)².

B Proofs
B.1 Proofs for Section 3
Proof of Theorem 1. The proof of this theorem follows from Theorem 5 and Proposition 1. We first establish Case (iv). Cases (i) and (ii) are special cases of (iv). Case (iii) has a standalone proof.
Proof of Case (iv).
For s ≥ 3 define

    L̄_{n,τp}(s) := ( (1/n) Σ_{i=1}^n E[ ‖X_i‖_{τp}^s + ‖Z_i‖_{τp}^s ] )^{3/s}

and note that by two applications of Hölder's inequality L̄_{n,τp}(s) ≥ L_{n,τp}. Recall that Gaussian random vectors Z ∈ R^d satisfy Assumption 3 with K_s = √s for all s ≥
1. Therefore, by Assumption 3, Lemma 3, and Hölder's inequality there exists an absolute constant
C > s ≥ L / n,τp ( s ) ≤ C (cid:0) K τp ∨ s ∨ √ τ p ∨ s (cid:1) d /τp σ n, max . Hence,( pd /p ) − / (3 τ ) L / n,τp ( s ) n / ω − p ( d, r n ) k σ n k p ≤ C (cid:0) K τp ∨ s ∨ √ τ p ∨ s (cid:1) ( pd /p ) − / (3 τ ) ω p ( d, r n ) n / d /τp σ n, max d /p σ n, min . (78)Moreover, observe that for any real-valued random variable Z and any t > s ≥ | Z | {| Z | > t } ] ≤ E[ | Z | ( | Z | /t ) s − {| Z | > t } ] ≤ t − s E[ | Z | s ]. Hence, Assumption 3 andLemma 3 imply that for all s ≥ M n,τp (cid:0) p − / (3 τ ) n / L / n,τp ( s ) (cid:1) p − /τ L n,τp ( s ) ≤ p − /τ nL n,τp ( s ) p s − s/ (3 τ ) n s/ L s/ n,τp ( s ) n − P ni =1 E (cid:2) k X i k sτp + k Z i k sτp (cid:3) p − /τ L n,τp ( s )= p − s (1 − /τ ) / ( p n ) ( s − / . (79)Combining eq. (78) and (79) with Theorem 5 we conclude that for s ≥ p ∈ [1 , ∞ ) and τ ∈ [1 , ∞ ], ̺ n,p . (cid:0) K τp ∨ s ∨ √ τ p ∨ s (cid:1) ( pd /p ) − / (3 τ ) ω p ( d, r n ) n / d /τp σ n, max d /p σ n, min + p − s (1 − /τ ) / ( p n ) ( s − / . (cid:0) K τp ∨ s ∨ √ τ p ∨ s (cid:1)s p d / (3 τp ) r /pn p / (3 τ ) n / σ n, max σ n, min + p − s (1 − /τ ) / ( p n ) ( s − / . (80)Since the X ’s have only s ≥ K τp ∨ s < ∞ only if τ p ≤ s . Since τ ≥
1, this impliesthat K τp ∨ s < ∞ only if p ≤ s . Hence, we deduce from (80) that for all p ∈ [1 , s ] and s ≥ ̺ n,p . (cid:0) K s ∨ √ s (cid:1)s p d / (3 s ) r /pn n / σ n, max σ n, min + 1( p n ) ( s − / . For s ≥ Proof of Case (i).
This is a special case of statement (iv); more precisely eq. (80).Recall that that if the X ’s are sub-gaussian, then Assumption 3 holds for all τ p, s ≥ K τp ∨ s = √ τ p ∨ s . Take τ = (log d ) /p ∨ s = 6 to obtain (cid:0) K τp ∨ s ∨ √ τ p ∨ s (cid:1)s p d / (3 τp ) r /pn p / (3 τ ) n / σ n, max σ n, min . p log d s p d / (3 log d ) r /pn p p/ (3 log d ) n / σ n, max σ n, min . s p (log d ) r /pn n / σ n, max σ n, min , p − s (1 − /τ ) / ( p n ) ( s − / . p n . Thus, by eq. (80), ̺ n,p . s p (log d ) r /pn n / σ n, max σ n, min . This completes the proof of case (i).
Proof of Case (ii).
This is again a special case of statement (iv). Note that if the X ’s are sub-exponential, then Assumption 3 holds for all τ p, s ≥ K τp ∨ s = τ p ∨ s .Therefore, in eq. (80) take τ = (log d ) /p ∨ s = 6 to obtain ̺ n,p . s p (log d ) r /pn n / σ n, max σ n, min , This completes the proof of case (ii).
Proof of Case (iii).
For s ≥ L n, max ( s ) := max ≤ k ≤ d n n X i =1 E [ | X ik | s + | Z ik | s ] ! /s , and observe that by two applications of Jensen’s inequality L n, max ( s ) ≥ max ≤ k ≤ d n n X i =1 E (cid:2) | X ik | (cid:3) . As in the proof of case (iv), we have M n, ∞ (cid:0) n / (log d ) − / L / n, max ( s ) (cid:1) L n, max ( s ) ≤ n (log d ) − L n, max ( s ) n s/ (log d ) − s/ L s/ n, max ( s ) n − P ni =1 E [ k X i k s ∞ + k Z i k s ∞ ] L n, max ( s ) . (log d ) (4 s − / n ( s − / , (81)where the last inequality follows from Lemma 2.2.2 in van der Vaart and Wellner (1996) andAssumption 1 or 2. By Lemma 3, Jensen’s inequality, and Assumption 1 or 2 there existsan absolute constant C ≥ s, n, d ) such that L / n, max ( s ) ≤ C max ≤ k ≤ d n n X i =1 E (cid:2) X ik (cid:3) s/ ! /s ≤ C max ≤ k ≤ d n n X i =1 E (cid:2) X ik (cid:3)! /s max ≤ i ≤ n E[ X ik ] ( s − / (2 s ) , k σ n k ∞ = max ≤ k ≤ d n n X i =1 E (cid:2) X ik (cid:3)! / ≥ max ≤ k ≤ d n n X i =1 E (cid:2) X ik (cid:3)! /s min ≤ i ≤ n E[ X ik ] ( s − / (2 s ) . Combine the preceding two inequalities to obtain(log d ) / n / L / n, max k σ n k ∞ ≤ C (log d ) / n / κ ( s − / (2 s ) n (82)Set s = 6, combine eq. (81) and (82) with Proposition 1 and conclude that ̺ n,p . log dn + (cid:18) κ n log dn (cid:19) / . (cid:18) κ n log dn (cid:19) / , where the last inequality follows since without loss of generality we may assume that Cκ n (log d ) /n < Proof of Theorem 2 . The result follows from the triangle inequality and Theorem 1 andTheorem 8 as described in the main text.
Proof of Corollary 3 . Combine Theorem 2 and Lemma 7.
Proof of Corollary 4 . Combine Theorem 2 and Lemma 8.
B.2 Proofs for Section 4
Proof of Theorem 3 . The proof is an adaptation of the proof of Theorem 3.1 in Chernozhukov et al.(2013) to our setup. Note thatsup α ∈ (0 , sup µ ∈H (cid:12)(cid:12) P µ (cid:0) S n,p + ξ ≤ c ∗ n,p ( α ) (cid:1) − α (cid:12)(cid:12) ≤ sup α ∈ (0 , sup µ ∈H (cid:12)(cid:12) P µ (cid:0) S n,p + ξ ≤ c ∗ n,p ( α ) (cid:1) − P µ ( S n,p + ξ ≤ ˜ c p ( α )) (cid:12)(cid:12) + sup α ∈ (0 , sup µ ∈H (cid:12)(cid:12)(cid:12) P µ ( S n,p + ξ ≤ ˜ c p ( α )) − P µ (cid:16) e S n,p ≤ ˜ c p ( α ) (cid:17)(cid:12)(cid:12)(cid:12) . (83)For δ > α ∈ (0 , sup µ ∈H P µ (cid:16) ˜ c p (cid:0) α − π p ( δ ) (cid:1) < S n,p + ξ ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17) + 2P µ (Π p > δ ) ≤ sup α ∈ (0 , sup µ ∈H P µ (cid:16) ˜ c p (cid:0) α − π p ( δ ) (cid:1) < e S p + ξ ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17) + 2 sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ (cid:0) S n,p ≤ t (cid:1) − P µ (cid:0) e S p ≤ t (cid:1)(cid:12)(cid:12)(cid:12) + 2P µ (Π p > δ )42 sup α ∈ (0 , sup µ ∈H n P µ (cid:16) ˜ c p (cid:0) α − π p ( δ ) (cid:1) < e S p + ξ ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17) − P µ (cid:16) ˜ c p (cid:0) α − π p ( δ ) (cid:1) < e S p ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17)o + 2 π p ( δ ) + 2 sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ (cid:0) S n,p ≤ t (cid:1) − P µ (cid:0) e S p ≤ t (cid:1)(cid:12)(cid:12)(cid:12) + 2 sup µ ∈H P µ (Π p > δ ) , (84)where the second inequality follows by definition of quantiles and because e S p has no pointmasses. Let η > α ∈ (0 , sup µ ∈H n P µ (cid:16) ˜ c p (cid:0) α − π p ( δ ) (cid:1) < e S p + ξ ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17) − P µ (cid:16) ˜ c p (cid:0) α − π p ( δ ) (cid:1) < e S p ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17)o ≤ sup α ∈ (0 , sup µ ∈H (cid:12)(cid:12)(cid:12) P µ (cid:16) e S p + ξ ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17) − P µ (cid:16) e S p ≤ ˜ c p (cid:0) α + π p ( δ ) (cid:1)(cid:17)(cid:12)(cid:12)(cid:12) + sup α ∈ (0 , sup µ ∈H (cid:12)(cid:12)(cid:12) P µ (cid:16) e S p ≤ ˜ c p (cid:0) α − π p ( δ ) (cid:1)(cid:17) − P µ (cid:16) e S p + ξ ≤ ˜ c p (cid:0) α − π p ( δ ) (cid:1)(cid:17)(cid:12)(cid:12)(cid:12) . sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ (cid:16) e S p ≤ t (cid:17) − P µ (cid:16) e S p + ξ ≤ t (cid:17)(cid:12)(cid:12)(cid:12) . sup µ ∈H P µ ( | ξ | > η ) + sup t ≥ sup µ ∈H P µ (cid:16) t − η ≤ e S p ≤ t + η (cid:17) . sup µ ∈H P µ ( | ξ | > η ) + η ω p ( d ′ , r ω ) k ω k p , (85)where the last inequality follows from Theorem 7.We now bound the second term on the right hand side of eq. (83) bysup α ∈ (0 , sup µ ∈H | P µ ( S n,p + ξ ≤ ˜ c p ( α )) − P µ ( S n,p ≤ ˜ c p ( α )) | + sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ ( S n,p ≤ t ) − P µ (cid:16) e S p ≤ t (cid:17)(cid:12)(cid:12)(cid:12) . sup µ ∈H P µ ( | ξ | > η ) + sup t ≥ sup µ ∈H P µ ( t − η ≤ S n,p ≤ t + η )+ sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ ( S n,p ≤ t ) − P µ (cid:16) e S p ≤ t (cid:17)(cid:12)(cid:12)(cid:12) . sup µ ∈H P µ ( | ξ | > η ) + sup t ≥ sup µ ∈H P µ (cid:16) t − η ≤ e S p ≤ t + η (cid:17) + sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ ( S n,p ≤ t ) − P µ (cid:16) e S p ≤ t (cid:17)(cid:12)(cid:12)(cid:12) . sup µ ∈H P µ ( | ξ | > η ) + η ω p ( d ′ , r ω ) k ω k p + sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ ( S n,p ≤ t ) − P µ (cid:16) e S p ≤ t (cid:17)(cid:12)(cid:12)(cid:12) , (86)where the third inequality follows from Theorem 7.43ombine eq. (83)–(86) to obtainsup α ∈ (0 , sup µ ∈H (cid:12)(cid:12) P µ (cid:0) S n,p + ξ ≤ c ∗ n,p ( α ) (cid:1) − α (cid:12)(cid:12) . 
sup t ≥ sup µ ∈H (cid:12)(cid:12)(cid:12) P µ ( S n,p ≤ t ) − P µ (cid:16) e S p ≤ t (cid:17)(cid:12)(cid:12)(cid:12) + inf δ> (cid:26) π p ( δ ) + sup µ ∈H P µ (Π p > δ ) (cid:27) + inf η> (cid:26) η ω p ( d ′ , r ω ) k ω k p + sup µ ∈H P µ ( | ξ | > η ) (cid:27) . To complete the proof bound the first term on the right hand side by Theorem 1.
Proof of Theorem 4 . We prove a slightly sharper result than the one given in the maintext. Let
L ⊆ R_+ be arbitrary and define the collection of alternatives

    A_p(L) := { (μ_n)_{n∈N}, μ_n ∈ R^{d_n} : lim_{n→∞} ( E‖Ω^{1/2}Z‖_p ∨ √(Var‖Ω^{1/2}Z‖_p) ) / ( √n ‖Mμ_n − m‖_p ) ∈ L },   (87)

where Z ∼ N(0, I_{d′}). Note that A_p = A_p({0}) and Z_p ⊂ A_p((1, ∞]).

Proof of Case (i).
Recall π p ( · ) and Π p from Lemma 9. Let δ α > / ( α − π p ( δ α ) ≤ /α . Fix a sequence ( µ n ) n ∈ N ∈ A p ( { } ). Then, by Lemma 9 and Assumption 6,P µ n (cid:0) S n,p > c ∗ n,p (1 − α ) (cid:1) ≥ P µ n (cid:0) S n,p > c ∗ n,p (1 − α ) , c ∗ n,p (1 − α ) ≤ ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1)(cid:1) ≥ P µ n (cid:0) S n,p > ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1)(cid:1) + P (cid:0) c ∗ n,p (1 − α ) ≤ ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1)(cid:1) − ≥ P µ n (cid:0) S n,p > ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1)(cid:1) − P (cid:0) Π p > δ α (cid:1) ≥ P µ n (cid:0) S n,p > ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1)(cid:1) + o (1) . (88)We now lower bound the first factor on the far right hand side in above display. By thereverse triangle inequality and Lemma 10,P µ n (cid:0) S n,p > ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1)(cid:1) = P µ n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) √ n n X i =1 M ( X i − µ n ) + √ n ( M µ n − m ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p > ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1) ≥ P µ n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) √ n n X i =1 M ( X i − µ n ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p < √ n k M µ n − m k p − ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1) ≥ P (cid:16) e S p < √ n k M µ n − m k p − ˜ c p (cid:0) π p ( δ α ) + 1 − α (cid:1)(cid:17) − sup t ≥ (cid:12)(cid:12)(cid:12) P ( S n,p ≤ t ) − P (cid:16) e S p ≤ t (cid:17)(cid:12)(cid:12)(cid:12) ≥ P (cid:18) e S p < √ n k M µ n − m k p − E[ e S p ] − q / (cid:0) α − π p ( δ α ) (cid:1) Var[ e S p ] (cid:19) − sup t ≥ (cid:12)(cid:12)(cid:12) P ( S n,p ≤ t ) − P (cid:16) e S p ≤ t (cid:17)(cid:12)(cid:12)(cid:12) . (89)44y Theorem 1 and Assumption 6 the second term in the last line is of order o (1). ByMarkov’s inequality the first term can be bounded byinf µ n ∈A p P (cid:18) e S p < √ n k M µ n − m k p − E[ e S p ] − q / (cid:0) α − π p ( δ α ) (cid:1) Var[ e S p ] (cid:19) ≥ − e S p ] + q (2 /α )Var[ e S p ] √ n k M µ n − m k p = 1 + o (1) , (90)where we have used that ( µ n ) n ∈ N ∈ A p ( { } ).To conclude the proof combine eq. (88)–(90). Prove of Case (ii).
Recall π p ( · ) and Π p from Lemma 9. Let δ α > / ( α − π p ( δ α ) ≤ /α . Let ( µ n ) n ∈ N ∈ A p ((1 , ∞ ]), and computeP µ n (cid:0) S n,p > c ∗ n,p (1 − α ) (cid:1) ≤ P µ n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) √ n n X i =1 M ( X i − µ n ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p + √ n k M µ n − m k p > c ∗ n,p (1 − α ) ≤ P (cid:16) e S p + √ n k M µ n − m k p > ˜ c p (1 − α ) (cid:17) + n P (cid:16) S n,p + √ n k M µ n − m k p ≤ c ∗ n,p (1 − α ) (cid:17) − P (cid:16) S n,p + √ n k M µ n − m k p ≤ ˜ c p (1 − α ) (cid:17)o . (91)The second term can be bounded as the first term in eq. (83): Set ξ = √ n k M µ n − m k p .Then, as in eq. (84) and by the equivariance of the quantile function,P (cid:0) S n,p + ξ ≤ c ∗ n,p (1 − α ) (cid:1) − P ( S n,p + ξ ≤ ˜ c p (1 − α )) ≤ P (cid:16) ˜ c p (cid:0) − α − π p ( δ α ) (cid:1) < e S p + ξ ≤ ˜ c p (cid:0) − α + π p ( δ α ) (cid:1)(cid:17) − P (cid:16) ˜ c p (cid:0) − α − π p ( δ α ) (cid:1) + ξ < e S p + ξ ≤ ˜ c p (cid:0) − α + π p ( δ α ) (cid:1) + ξ (cid:17) + 2 π p ( δ α ) + 2 sup t ≥ (cid:12)(cid:12)(cid:12) P (cid:0) S n,p ≤ t (cid:1) − P (cid:0) e S p ≤ t (cid:1)(cid:12)(cid:12)(cid:12) + 2P (Π p > δ α ) ≤ π p ( δ α ) + 2 sup t ≥ (cid:12)(cid:12)(cid:12) P (cid:0) S n,p ≤ t (cid:1) − P (cid:0) e S p ≤ t (cid:1)(cid:12)(cid:12)(cid:12) + 2P (Π p > δ α ) + o (1) , (92)where we have used thatP (cid:16) ˜ c p (cid:0) − α − π p ( δ α ) (cid:1) < e S p + ξ ≤ ˜ c p (cid:0) − α + π p ( δ α ) (cid:1)(cid:17) − P (cid:16) ˜ c p (cid:0) − α − π p ( δ α ) (cid:1) + ξ < e S p + ξ ≤ ˜ c p (cid:0) − α + π p ( δ α ) (cid:1) + ξ (cid:17) = − P (cid:16) ˜ c p (cid:0) − α + π p ( δ α ) (cid:1) < e S p + ξ ≤ ˜ c p (cid:0) − α + π p ( δ α ) (cid:1) + ξ (cid:17) + P (cid:16) ˜ c p (cid:0) − α − π p ( δ α ) (cid:1) < e S p + ξ ≤ ˜ c p (cid:0) − α − π p ( δ α ) (cid:1) + ξ (cid:17) o (1) , since by Assumption 6 π p ( δ α ) = o (1).By another application of Assumption 6 and Theorem 1 we conclude that the remainingterms in eq. (92) are negligible as well.Since e S p has no point mass, the first term on the far right hand side of eq. (91) is strictlyless than 1 whenever √ n k M µ n − m k p < E[ e S p ] − q Var[ e S p ] ≤ ˜ c p (1 − α ) , (93)where the second inequality follows from Lemma 10. This inequality holds by definition ofthe set A p ((1 , ∞ ])To conclude the proof combine eq. (90)–(93). B.3 Proofs for Appendix A
B.3.1 Proofs for Appendix A.1
Proof of Theorem 5 . Proof of Case (i).Step 1. Fundamental smoothing inequality.
Let Y = { Y i } ni =1 be an independentcopy of Z = { Z i } ni =1 and define W n ( s ) := n X i =1 r sn X i + r − sn Y i ! , s ∈ [0 , . Consider the family of sets I = { A ⊆ R : A = [0 , t ] , t ≥ } . Let p ∈ [1 , ∞ ) be arbitrary.Define p + = 2 ⌈ p ⌉ to be the smallest even integer larger than (or equal to) p . By Lemma 1for A ∈ I , we haveP ( k W n ( s ) k p ∈ A ) − P (cid:0) k S Zn k p ∈ A κ p +3 δ (cid:1) ≤ E h h p + ,d,β,A κp + (cid:0) W n ( s ) (cid:1) − h p + ,d,β,A κp + ( S Zn ) i . Re-arrange the terms in above inequality and take the supremum over A ∈ I to obtainsup A ∈I (cid:16) P ( k W n ( s ) k p ∈ A ) − P (cid:0) k S Zn k p ∈ A (cid:1) (cid:17) ≤ sup A ∈I P (cid:0) k S Zn k p ∈ A κ p +3 δ \ A (cid:1) + sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W n ( s ) (cid:1) − h p + ,d,β,δ,A ( S Zn ) (cid:3) (cid:12)(cid:12)(cid:12) , (94)By Lemma 1 we also have for A ∈ I ,P (cid:0) k S Zn k p ∈ A − (12 κ p +3 δ ) (cid:1) − P ( k W n ( s ) k p ∈ A ) ≤ E h h p + ,d,β,A − (12 κp + +3 δ ) ( S Zn ) − h p + ,d,β,A − (12 κp + +3 δ ) (cid:0) W n ( v ) (cid:1)i . Observe that sup A ∈I P (cid:0) k S Zn k p ∈ A \ A − (12 κ p +3 δ ) (cid:1) ≤ sup A ∈I P (cid:0) k S Zn k p ∈ A κ p +3 δ \ A (cid:1) . To-gether with the preceding inequality this yieldssup A ∈I (cid:16) P (cid:0) k S Zn k p ∈ A (cid:1) − P ( k W n ( s ) k p ∈ A ) (cid:17) ≤ sup A ∈I P (cid:0) k S Zn k p ∈ A κ p +3 δ \ A (cid:1) + sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W n ( s ) (cid:1) − h p + ,d,β,δ,A ( S Zn ) (cid:3) (cid:12)(cid:12)(cid:12) , (95)46ombine eq. (94) and eq. (95) to obtainsup s ∈ [0 , sup A ∈I (cid:12)(cid:12)(cid:12) P (cid:0) k S Zn k p ∈ A (cid:1) − P ( k W n ( s ) k p ∈ A ) (cid:12)(cid:12)(cid:12) ≤ sup A ∈I P (cid:0) k S Zn k p ∈ A κ p +3 δ \ A (cid:1) + sup s ∈ [0 , sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W n ( s ) (cid:1) − h p + ,d,β,δ,A ( S Zn ) (cid:3) (cid:12)(cid:12)(cid:12) Note that above inequality holds also for p = p + . Thus, we have the following fundamentalsmoothing inequalitysup s ∈ [0 , sup A ∈I sup r ∈{ p,p + } (cid:12)(cid:12)(cid:12) P (cid:0) k S Zn k r ∈ A (cid:1) − P ( k W n ( s ) k r ∈ A ) (cid:12)(cid:12)(cid:12) ≤ sup A ∈I sup r ∈{ p,p + } P (cid:0) k S Zn k r ∈ A κ r +3 δ \ A (cid:1) + sup s ∈ [0 , sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W n ( s ) (cid:1) − h p + ,d,β,δ,A ( S Zn ) (cid:3) (cid:12)(cid:12)(cid:12) (96)We now bound the second term on the right hand side of eq. (96). Step 2. Slepian-Stein interpolation.
Define the Slepian interpolant as V ( t ; s ) := n X i =1 V i ( t ; s ) , where V i ( t ; s ) := r stn X i + r (1 − s ) tn Y i + r − tn Z i , s, t ∈ [0 , , the Stein leave-one-out term as V ( i ) ( t ; s ) := V ( t ; s ) − V i ( t ; s ) , i = 1 , . . . , n, and denote the derivative of the i th summand V i ( t ; s ) with respect to t by˙ V i ( t ; s ) := ddt V i ( t ; s ) = 12 " √ t r sn X i + r − sn Y i ! − √ − t √ n Z i . Since V (0; s ) = S Zn and V (1; s ) = W n ( s ), by expressing the difference as integration of thederivative function, we haveE (cid:2) h p + ,d,β,δ,A (cid:0) W n ( s ) (cid:1) − h p + ,d,β,δ,A ( S Zn ) (cid:3) = n X i =1 X | α | =1 Z E h ˙ V αi ( t ; s ) (cid:0) D α h p + ,d,β,δ,A (cid:1)(cid:0) V ( t ; s ) (cid:1)i dt. (97)For brevity of notation, we now drop the subscripts p + , d, β, δ, A and write h instead of h p + ,d,β,δ,A . We also write X ni , Y ni , and Z ni instead of √ n X i , √ n Y i , and √ n Z i , respectively. Inabove display, expanding the summands over i = 1 , . . . , n via a first-order Taylor expansionaround V ( i ) ( t ; s ) in direction V i ( t ; s ) yields, for all s ∈ [0 , (cid:2) h (cid:0) W n ( s ) (cid:1) − h ( S Zn ) (cid:3) = n X i =1 X | α | =1 Z E h ˙ V αi ( t ; s ) (cid:0) D α h (cid:1)(cid:0) V ( i ) ( t ; s ) (cid:1)i dt n X i =1 X | α ′ | =1 X | α | =1 Z E h V α ′ i ( t ; s ) ˙ V αi ( t ; s ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) (cid:1)i dt + n X i =1 X | α ′ | =2 X | α | =1 Z Z (1 − u )E h V α ′ i ( t ; s ) ˙ V αi ( t ; s ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1)i dtdu = I + II + III . (98)It is standard to verify that I = II = 0 (because E [ ˙ V i ( t ; s )] = 0 and E [ V α ′ i ( t ; s ) ˙ V αi ( t ; s )] =0 for | α ′ | = | α | = 1, and ˙ V i ( t ; s ) and V i ( t ; s ) are independent of V ( i ) ( t ; s ); see p. 2327in Chernozhukov et al. (2017a)). Thus, we only need to bound the third term. Let ξ > τ ≥
1, set χ i = {k X ni k τp + ∨ k Y ni k τp + ∨ k Z ni k τp + ≤ ξ } , and compute | III | == n X i =1 X | α ′ | =2 X | α | =1 Z Z (1 − u )E h χ i (cid:12)(cid:12) V α ′ i ( t ; s ) ˙ V αi ( t ; s ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1)(cid:12)(cid:12)i dtdu + n X i =1 X | α ′ | =2 X | α | =1 Z Z (1 − u )E h (1 − χ i ) (cid:12)(cid:12) V α ′ i ( t ; s ) ˙ V αi ( t ; s ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1)(cid:12)(cid:12)i dtdu = | III | + | III | . (99) Step 3. Bound on III . Set q = τp + τp + − and B = (cid:13)(cid:13)(cid:13)(cid:13)(cid:16)P | α | =3 | D α h | q (cid:17) /q (cid:13)(cid:13)(cid:13)(cid:13) ∞ . By repeatedapplications of H¨older’s inequality, | III | ≤ E n X i =1 Z Z (1 − u )(1 − χ i ) X | α ′ | =2 X | α | =1 | V α ′ i ( t ; s ) ˙ V αi ( t ; s ) | τp + / ( τp + ) × X | α ′ | =2 X | α | =1 (cid:12)(cid:12)(cid:12)(cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1)(cid:12)(cid:12)(cid:12) q /q dtdu ≤ B E n X i =1 Z (1 − χ i ) X | α ′ | =2 X | α | =1 | V α ′ i ( t ; s ) ˙ V αi ( t ; s ) | τp + / ( τp + ) dt ≤ B (cid:18)Z dt √ t ∧ √ − t (cid:19) E n X i =1 (1 − χ i ) X | α ′ | =2 X | α | =1 τp + (cid:12)(cid:12)(cid:12)(cid:0) | X ni | α ′ ∨ | Y ni | α ′ ∨ | Z ni | α ′ (cid:1) × (cid:0) | X ni | α + | Y ni | α + | Z ni | α (cid:1)(cid:12)(cid:12)(cid:12) τp + /τp + . B E n X i =1 (1 − χ i ) X | α ′ | =2 X | α | =1 (cid:12)(cid:12)(cid:12) | X ni | α + α ′ ∨ | Y ni | α + α ′ ∨ | Z ni | α + α ′ (cid:12)(cid:12)(cid:12) τp + / ( τp + ) (100)48e now apply inequality (20) of Lemma B.1 in Chernozhukov et al. (2017a) to the far righthand side of (100) and obtain | III | . B E n X i =1 X | α | =3 (cid:12)(cid:12)(cid:12) | X ni | α {k X ni k τp + > ξ } (cid:12)(cid:12)(cid:12) τp + / ( τp + ) + X | α | =3 (cid:12)(cid:12)(cid:12) | Y ni | α {k Y ni k τp + > ξ } (cid:12)(cid:12)(cid:12) τp + / ( τp + ) + X | α | =3 (cid:12)(cid:12)(cid:12) | Z ni | α {k Z ni k τp + > ξ } (cid:12)(cid:12)(cid:12) τp + / ( τp + ) . B E " n X i =1 k X ni k τp + {k X ni k τp + > ξ } + k Z ni k τp + {k Z ni k τp + > ξ } , (101)where the last inequality holds because Y ni d = Z ni for all i = 1 , . . . , n . Step 4. Bound on III . Recall from Lemma 1 that h ≡ h p + ,d,β,δ,A is non-constanton the set (cid:8) z ∈ R d : M p + ,κ p + ( z ) ∈ A δ \ A (cid:9) only. By construction of M p + ,κ p + it holdsthat (cid:8) z ∈ R d : k z k p + ∈ A δ \ A − κ p + (cid:9) ⊇ (cid:8) z ∈ R d : M p + ,κ p + ( z ) ∈ A δ \ A (cid:9) . Thus, for ϕ ( x ) = (cid:8) k x k p + ∈ A δ \ A − κ p + (cid:9) we have χ i X | α ′ | =2 X | α | =1 Z (1 − u ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1) du = χ i ϕ (cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1) X | α ′ | =2 X | α | =1 Z (1 − u ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1) du. (102)Set γ p + = d /p + − / ( τp + ) ξ . Note that for all t, s, u ∈ [0 , k V ( i ) ( t ; s ) k p + χ i ≤ k V ( i ) ( t ; s ) + uV i ( t ; s ) k p + χ i + k V i ( t ; s ) k p + χ i ≤ k V ( i ) ( t ; s ) + uV i ( t ; s ) k p + χ i + 3 γ p + , and, similarly, k V ( i ) ( t ; s ) k p + χ i ≥ k V ( i ) ( t ; s ) + uV i ( t ; s ) k p + χ i − γ p + . Thus, for all t, s, u ∈ [0 , (cid:12)(cid:12) k V ( i ) ( t ; s ) + uV i ( t ; s ) k p + − k V ( i ) ( t ; s ) k p + (cid:12)(cid:12) χ i ≤ γ p + . (103)Define φ ( x ) = (cid:8) k x k p + ∈ A δ +3 γ p + \ A − ( κ p + +3 γ p + ) (cid:9) . Now, eq. 
(102) and (103) imply χ i X | α ′ | =2 X | α | =1 Z (1 − u ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1) du ≤ χ i φ (cid:0) V ( i ) ( t ; s ) (cid:1) X | α ′ | =2 X | α | =1 Z (1 − u ) (cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1) du. (104)49ecall that q = τp + τp + − and B = (cid:13)(cid:13)(cid:13)(cid:13)(cid:16)P | α | =3 | D α h | q (cid:17) /q (cid:13)(cid:13)(cid:13)(cid:13) ∞ . Two applications of H¨older’sinequality and above inequality (104) give | III | . E n X i =1 Z Z (1 − u ) χ i φ (cid:0) V ( i ) ( t ; s ) (cid:1) X | α ′ | =2 , | α | =1 (cid:12)(cid:12)(cid:12) V α ′ i ( t ; s ) ˙ V αi ( t ; s ) (cid:12)(cid:12)(cid:12) τp + / ( τp + ) × X | α ′ | =2 X | α | =1 (cid:12)(cid:12)(cid:12)(cid:0) D α + α ′ h (cid:1)(cid:0) V ( i ) ( t ; s ) + uV i ( t ; s ) (cid:1)(cid:12)(cid:12)(cid:12) q /q dtdu . B E n X i =1 Z χ i φ (cid:0) V ( i ) ( t ; s ) (cid:1) X | α ′ | =2 , | α | =1 | V α ′ i ( t ; s ) ˙ V αi ( t ; s ) | τp + / ( τp + ) dt . B E n X i =1 χ i Z φ (cid:0) V ( i ) ( t ; s ) (cid:1) √ t ∧ √ − t dt ! X | α | =3 (cid:16) | X ni | α ∨ | Y ni | α ∨ | Z ni | α (cid:17) τp + / ( τp + ) , (105)where the last inequality follows as the third inequality in the bound on III .To bound the expected value in the expression in eq. (105) we plan to apply Harris’association inequality (e.g. Boucheron et al., 2013, Theorem 2.15). We note the following:First, the map ( x ′ , y ′ , z ′ ) ′ ( P | α | =3 ( | X ni | α ∨ | Y ni | α ∨ | Z ni | α | τp + ) / ( τp + ) is non-decreasingin each coordinate of ( x ′ , y ′ , z ′ ) ′ ∈ R d (while keeping all other coordinates fixed at anyvalue). Second, the map ( x ′ , y ′ , z ′ ) ′ χ i ( x, y, z ) = {k x k τp + ∨ k y k τp + ∨ k z k τp + ≤ ξ } isnon-increasing in each coordinate of ( x ′ , y ′ , z ′ ) ′ ∈ R d (while keeping all other coordinatesfixed). Third, V ( i ) ( t ; s ) and ( X ni , Y ni , Z ni ) are independent. Therefore, Fubini’s theorem andHarris’ association inequality applied conditionally on V ( i ) ( t ; s ) imply thatE n X i =1 χ i Z φ (cid:0) V ( i ) ( t ; s ) (cid:1) √ t ∧ √ − t dt ! X | α | =3 (cid:16) | X ni | α ∨ | Y ni | α ∨ | Z ni | α (cid:17) τp + / ( τp + ) ≤ E "Z n X i =1 E " χ i φ (cid:0) V ( i ) ( t ; s ) (cid:1) √ t ∧ √ − t ! | V ( i ) ( t ; s ) × E X | α | =3 (cid:12)(cid:12)(cid:12) | X ni | α ∨ | Y ni | α ∨ | Z ni | α (cid:12)(cid:12)(cid:12) τp + / ( τp + ) | V ( i ) ( t ; s ) dt = n X i =1 E " χ i Z φ (cid:0) V ( i ) ( t ; s ) (cid:1) √ t ∧ √ − t dt ! E X | α | =3 (cid:16) | X ni | α ∨ | Y ni | α ∨ | Z ni | α (cid:17) τp + / ( τp + ) . (106)50o bound the first factor in eq. (106) define ψ ( x ) = (cid:8) k x k p + ∈ A δ +6 γ p + \ A − ( κ p + +6 γ p + ) (cid:9) and compute χ i Z φ (cid:0) V ( i ) ( t ; s ) (cid:1) √ t ∧ √ − t dt ! = χ i Z ψ (cid:0) V ( t ; s ) (cid:1) φ (cid:0) V ( i ) ( t ; s ) (cid:1) √ t ∧ √ − t dt ! ≤ Z ψ (cid:0) V ( t ; s ) (cid:1) √ t ∧ √ − t dt. Hence, E " χ i Z φ (cid:0) V ( i ) ( t ; s ) (cid:1) √ t ∧ √ − t dt ! ≤ E "Z ψ (cid:0) V ( t ; s ) (cid:1) √ t ∧ √ − t dt ≤ √ t ∈ [0 , P (cid:18)(cid:13)(cid:13)(cid:13) √ tW n ( s ) + √ − tS Zn (cid:13)(cid:13)(cid:13) p + ∈ A δ +6 γ p + \ A − ( κ p + +6 γ p + ) (cid:19) ≤ √ s ∈ [0 , P (cid:16) k W n ( s ) k p + ∈ A δ +6 γ p + \ A − ( κ p + +6 γ p + ) (cid:17) , (107)where the last inequality follows from √ tW n ( s ) + √ − tS Zn d = W n ( st ). We bound theprobability in eq. 
(107) bysup s ∈ [0 , P (cid:0) k W n ( s ) k p + ∈ A δ +6 γ p + \ A − ( κ p + +6 γ p + ) (cid:1) = sup s ∈ [0 , n P (cid:0) k W n ( s ) k p + ∈ A δ +6 γ p + (cid:1) − P (cid:0) k S Zn k p + ∈ A δ +6 γ p + (cid:1) + P (cid:0) k S Zn k p + ∈ A − ( κ p + +6 γ p + ) (cid:1) − P (cid:0) k W n ( s ) k p + ∈ A − ( κ p + +6 γ p + ) (cid:1) + P (cid:0) k S Zn k p + ∈ A δ +6 γ p + \ A − ( κ p + +6 γ p + ) (cid:1) o ≤ A ∈I sup s ∈ [0 , sup r ∈{ p,p + } (cid:12)(cid:12)(cid:12) P ( k W n ( s ) k r ∈ A ) − P (cid:0) k S Zn k r ∈ A (cid:1) (cid:12)(cid:12)(cid:12) + sup A ∈I sup r ∈{ p,p + } P (cid:0) k S Zn k r ∈ A δ + κ r +12 γ r \ A (cid:1) . (108)Combine eq. (105)–(108) to conclude that | III | . B n X i =1 E X | α | =3 (cid:16) | X ni | α ∨ | Y ni | α ∨ | Z ni | α (cid:17) τp + / ( τp + ) × sup s ∈ [0 , P (cid:16) k W n ( s ) k p + ∈ A δ +6 γ p + \ A − ( κ p + +6 γ p + ) (cid:17) . B n X i =1 E[ k X ni k τp + ] + E[ k Z ni k τp + ] ! × sup s ∈ [0 , sup A ∈I sup r ∈{ p,p + } (cid:12)(cid:12)(cid:12) P ( k W n ( s ) k r ∈ A ) − P (cid:0) k S Zn k r ∈ A (cid:1) (cid:12)(cid:12)(cid:12) + sup A ∈I sup r ∈{ p,p + } P (cid:0) k S Zn k r ∈ A δ + κ r +12 γ r \ A (cid:1)! , (109)51here the second inequality follows from Y ni d = Z ni for all i = 1 , . . . , n . Step 5. Recursive bound on eq. (96) . To simplify notation, let us write ̺ n,p = sup s ∈ [0 , sup A ∈I sup r ∈{ p,p + } (cid:12)(cid:12)(cid:12) P ( k W n ( s ) k r ∈ A ) − P (cid:0) k S Zn k r ∈ A (cid:1) (cid:12)(cid:12)(cid:12) . Recall that q = τp + τp + − and B = (cid:13)(cid:13)(cid:13)(cid:13)(cid:16)P | α | =3 | D α h | q (cid:17) /q (cid:13)(cid:13)(cid:13)(cid:13) ∞ . Now, the bounds from Step 1through 4 imply that ̺ n,p . B M n,τp + ( ξ √ n ) √ n + B L n,τp + √ n ̺ n,p + B L n,τp + √ n sup A ∈I sup r ∈{ p,p + } P (cid:0) k S Zn k r ∈ A δ + κ r +12 γ r \ A (cid:1)! + sup A ∈I sup r ∈{ p,p + } P (cid:0) k S Zn k r ∈ A δ +12 κ r \ A (cid:1) . (110)By Lemma 1 and Theorem 7 we can simplify eq. (110) to ̺ n,p ≤ C (cid:18) δ + βδ + β δ (cid:19) d τ − τp + M n,τp + ( ξ √ n ) √ n + C (cid:18) δ + βδ + β δ (cid:19) d τ − τp + L n,τp + √ n ̺ n,p + C (cid:18) δ + βδ + β δ (cid:19) d τ − τp + L n,τp + √ n sup A ∈I sup r ∈{ p,p + } P (cid:0) k S Zn k r ∈ A δ + κ r +12 γ r \ A (cid:1) + sup A ∈I sup r ∈{ p,p + } P (cid:0) k S Zn k r ∈ A δ +12 κ r \ A (cid:1) ≤ C (cid:18) δ + βδ + β δ (cid:19) d τ − τp M n,τp ( ξ √ n ) √ n + C (cid:18) δ + βδ + β δ (cid:19) d τ − τp L n,τp √ n ̺ n,p + C δ + β − pd / ( τp ) + γ p ω − p ( d, r n ) k σ n k p (cid:18) C (cid:18) δ + βδ + β δ (cid:19) d τ − τp L n,τp √ n (cid:19) , (111)where C , C ≥ p ≤ p + andby Remark 9. Now, set β = p / (3 τ ) d − ( τ − / ( τp ) d / (3 τp ) n / L − / n,τp and δ = 6 C p − / (3 τ ) d ( τ − / ( τp ) d / (3 τp ) n − / L / n,τp , and note that (cid:18) δ + βδ + β δ (cid:19) d τ − τp ≤ d τ − τp β δ = √ n C p − /τ L n,τp . Therefore, eq. (111) reduces to ̺ n,p ≤ M n,τp ( ξ √ n )2 p − /τ L n,τp + ̺ n p − /τ + 3 C ( pd /p ) − / (3 τ ) L / n,τp n / ω − p ( d, r n ) k σ n k p + 3 C γ p ω − p ( d, r n ) k σ n k p . γ p = d /p (1 − /τ ) ξ . Set ξ = p − / (3 τ ) n − / L / n,τp and solve above inequality for ̺ n,p to obtain ̺ n,p . M n,τp (cid:0) p − / (3 τ ) n / L / n,τp (cid:1) p − /τ L n,τp + ( pd /p ) − / (3 τ ) L / n,τp n / ω − p ( d, r n ) k σ n k p . Lastly, note that sup t ≥ (cid:12)(cid:12)(cid:12) P (cid:0) k S Xn k p ≤ t (cid:1) − P (cid:0) k S Zn k p ≤ t (cid:1)(cid:12)(cid:12)(cid:12) ≤ ̺ n,p . This concludes the proofof the first statement. Proof of Case (ii).
The proof is identical to the proof of the first statement exceptfor the following four changes: First, the fundamental smoothing inequality is the inequalitythat directly precedes inequality (96); there is no need to introduce p + , the smallest evenexponent larger than p . Second, replace arguments involving H¨older’s inequality with theconjugate exponents ( q, τ p + ) by arguments based on H¨older’s inequality with the conjugateexponents (1 , ∞ ). Third, replace Lemma 1 by Lemma 2 throughout. Lastly, set β = n / (log d ) / L − / n, ∞ , δ = 6 C (log d ) / n − / L / n, ∞ , ξ = (log d ) / n − / L / n, ∞ , and proceed as in Step 5. This concludes the proof of the second statement. B.3.2 Proofs for Appendix A.2
Proof of Theorem 6 . Observe that for p ∈ N even, k X k pp − t = P dj =1 X pj − t is a (multi-variate) polynomial of degree p in X ∈ R d . Therefore, by Theorem 8 in Carbery and Wright(2001) uniformly in t ≥ q ≥
1, and p ∈ N even, q − (cid:16) E (cid:12)(cid:12) k X k pp − t (cid:12)(cid:12) q/p (cid:17) /q P (cid:0)(cid:12)(cid:12) k X k pp − t (cid:12)(cid:12) ≤ ε p (cid:1) . ε. (112)Furthermore, note that for all p, q ≥ Z, Z ′ of independent and identicallydistributed random variables, (cid:16) E (cid:12)(cid:12) Z − Z ′ (cid:12)(cid:12) q/p (cid:17) p/q ≤ (2 p/q ∨ (cid:0) E | Z | q/p (cid:1) p/q . Thus, for all t ≥ q ≥
1, and p ∈ N even, (cid:16) E (cid:12)(cid:12) k X k pp − t (cid:12)(cid:12) q/p (cid:17) /q & (cid:16) E (cid:12)(cid:12) k X k pp − k X ′ k pp (cid:12)(cid:12) q/p (cid:17) /q & (cid:0) E (cid:12)(cid:12) k X k p − k X ′ k p (cid:12)(cid:12) q (cid:1) /q , (113)where the second inequality follows from the reverse triangle inequality applied to | · | /p .Combine eq. (112) and eq. (113) to conclude thatsup q ≥ q − (cid:0) E (cid:12)(cid:12) k X k p − k X ′ k p (cid:12)(cid:12) q (cid:1) /q sup t ≥ P (cid:0)(cid:12)(cid:12) k X k pp − t (cid:12)(cid:12) ≤ ε p (cid:1) . ε. (114)53his is a statement about the polynomial k X k pp − t = P dj =1 X pj − t . We reduce it to astatement about k X k p by lower bounding the probability as follows:sup t ≥ P ( t ≤ k X k p ≤ t + ε ) = sup t ≥ P (cid:16) t p ≤ k X k pp ≤ ( t + ε ) p (cid:17) ≤ sup t ≥ P (cid:0) t p − p − ε p ≤ k X k pp ≤ p − t p + 2 p − ε p (cid:1) = sup t ≥ P (cid:0)(cid:8) (cid:12)(cid:12) k X k pp − t p (cid:12)(cid:12) ≤ p − ε p (cid:9) ∪ (cid:8) (cid:12)(cid:12) k X k pp − p − t p (cid:12)(cid:12) ≤ p − ε p (cid:9)(cid:1) ≤ t ≥ P (cid:0)(cid:12)(cid:12) k X k pp − t (cid:12)(cid:12) ≤ p − ε p (cid:1) , (115)where the first inequality holds since ( a + b ) p ≤ p − a p + 2 p − b p for all a, b ≥ p > q ≥ q − (cid:0) E (cid:12)(cid:12) k X k p − k X ′ k p (cid:12)(cid:12) q (cid:1) /q sup t ≥ P ( t ≤ k X k p ≤ t + ε ) . ε. (116) Proof of Corollary 5 . Note that by eq. (113),sup q ≥ q − (cid:16) E (cid:12)(cid:12) k X k − k X ′ k (cid:12)(cid:12) q/ (cid:17) /q sup t ≥ P ( t ≤ k X k ≤ t + ε ) . ε. The expected value on the left hand side can be lower bounded assup q ≥ q − (cid:16) E (cid:12)(cid:12) k X k − k X ′ k (cid:12)(cid:12) q/ (cid:17) /q ≥ (cid:16) E (cid:12)(cid:12) k X k − k X ′ k (cid:12)(cid:12) (cid:17) / = 12 Var[ k X k ] / = 12 (cid:0) tr(Σ ) + µ ′ Σ µ (cid:1) / , where the last line follows from direct calculations. This completes the proof. Proof of Theorem 7 . We split the proof into two parts. First, for p ∈ [1 , ∞ ) arbitrarywe show that sup t ≥ P( t ≤ k X k p ≤ t + εp − / r − / (2 p ) k σ k p ) . ε . Then, for p ≥ log d , we showthat sup t ≥ P( t ≤ k X k p ≤ t + ε k σ k p / √ log d ) . ε . Step 1.
Let p ≥ p + = 2 ⌈ p ⌉ to be the smallest even integerlarger than (or equal to) p . Then, k x k p + ≤ k x k p for all x ∈ R d and thereforesup t ≥ P ( t ≤ k X k p ≤ t + ε ) ≤ sup t ≥ P (cid:0) t ≤ k X k p + ≤ t + ε (cid:1) ≤ t ≥ P (cid:0)(cid:12)(cid:12) k X k p + p + − t (cid:12)(cid:12) ≤ p + − ε p + (cid:1) , (117)where the second inequality follows as in eq. (115). Thus, by eq. (113) and (114), we havesup q ≥ q − (cid:16) E (cid:12)(cid:12) k X k p + p + − k X ′ k p + p + (cid:12)(cid:12) q/p + (cid:17) /q sup t ≥ P ( t ≤ k X k p ≤ t + ε ) . ε. (118)54et q = 2 p + and lower bound the expected value on the left hand side in above display asfollowssup q ≥ q − (cid:16) E (cid:12)(cid:12) k X k p + p + − k X ′ k p + p + (cid:12)(cid:12) q/p + (cid:17) /q ≥ p + (cid:16) E (cid:12)(cid:12) k X k p + p + − k X ′ k p + p + (cid:12)(cid:12) (cid:17) / (2 p + ) = p − Var (cid:2) k X k p + p + (cid:3) / (2 p + ) . (119)Since Σ is positive semi-definite and has rank r , there exists a lower triangular matrixΓ such that ΓΓ ′ = Σ and γ kj = 0 for all ( k, j ) which satisfy k < j or j > r (via LDL ′ decomposition). Let Z ∼ N (0 , I d ) be a standard normal random vector in R d and set X d = Γ Z . By Lemma 11.1 in Chatterjee (2014) and Cauchy-Schwarz we haveVar (cid:2) k X k p + p + (cid:3) ≥ d X j =1 E " p + d X k =1 γ kj Z j ( γ ′ k Z ) p + − = 12 r X j =1 E " p + d X k =1 γ kj Z j ( γ ′ k Z ) p + − ≥ p r E " d X k =1 ( γ ′ k Z ) p + = p r (cid:0) E k X k p + p + (cid:1) . (120)By Stirling’s formula the central moments of a standard normal random variable Z satisfythe following asymptotic estimate: E [ | Z | p ] = 2 p/ √ π Γ (cid:18) p + 12 (cid:19) ≍ p/ (cid:16) pe (cid:17) p/ as p → ∞ . Thus, since p ≤ p + ≤ p , it follows that (cid:0) E k X k p + p + (cid:1) /p + & (cid:0) E k X k pp (cid:1) /p & d X j =1 σ pj ! /p (cid:16) pe (cid:17) / & p / k σ k p . (121)Combine eq. (119)–(121) to conclude thatsup q ≥ q − (cid:16) E (cid:12)(cid:12) k X k p + p + − k X ′ k p + p + (cid:12)(cid:12) q/p + (cid:17) /q & p − p / r − / (2 p ) k σ k p & p − / r − / (2 p ) k σ k p . Hence, by eq. (118), for any p ∈ [1 , ∞ ),sup t ≥ P (cid:0) t ≤ k X k p ≤ t + εp − / r − / (2 p ) k σ k p (cid:1) . ε. (122) Step 2.
Let p ≥ log d be arbitrary. Note that r / (2 log d ) ≤ e / . Also, k x k p ≤ k x k log d ≤ e k x k p . Therefore, by Step 1,sup t ≥ P (cid:18) t ≤ k X k p ≤ t + ε k σ k p √ log d (cid:19) = sup t ≥ P (cid:18) et ≤ e k X k p ≤ et + eε k σ k p √ log d (cid:19) sup t ≥ P (cid:18) et ≤ k X k log d ≤ et + ε e / k σ k log d √ log d r / (2 log d ) (cid:19) . ε. (123)To combine Step 1 and 2 as in the statement of the theorem, simply note that for p ≤ log d , we have r /p p ≤ d /p p ≤ e log d. Hence, k σ k p p − / r / (2 p ) ≥ k σ k p / √ log d . For p ≥ log d , the inequality is reversed. B.3.3 Proofs for Appendix A.3
Proof of Theorem 8 . Proof of Case (i).Step 1. Fundamental smoothing inequality.
Let Z be an independent copy of Y and define W ( s ) := √ sX + √ − sZ, s ∈ [0 , . Consider the family of sets I = { A ⊆ R : A = [0 , t ] , t ≥ } . Let p ∈ [1 , ∞ ) be arbitrary.Define p + = 2 ⌈ p ⌉ to be the smallest even integer larger than (or equal to) p . By Lemma 1for A ∈ I , we haveP ( k W ( s ) k p ∈ A ) − P (cid:0) k Y k p ∈ A κ p +3 δ (cid:1) ≤ E (cid:2) h p + ,d,β,A κp (cid:0) W ( s ) (cid:1) − h p ++ ,d,β,A κp ( Y ) (cid:3) . Re-arrange the terms in above inequality and take the supremum over A ∈ A to obtainsup A ∈A (cid:16) P ( k W ( s ) k p ∈ A ) − P ( k Y k p ∈ A ) (cid:17) ≤ sup A ∈A P (cid:0) k Y k p ∈ A κ p +3 δ \ A (cid:1) + sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W ( s ) (cid:1) − h p + ,d,β,δ,A ( Y ) (cid:3) (cid:12)(cid:12)(cid:12) , (124)By Lemma 1 we also haveP (cid:0) k Y k p ∈ A − (12 κ p +3 δ ) (cid:1) − P ( k W ( s ) k p ∈ A ) ≤ E h h p + ,d,β,A − (12 κp +3 δ ) ( T ) − h p + ,d,β,A − (12 κp +3 δ ) (cid:0) W ( s ) (cid:1)i . Observe that sup A ∈I P (cid:0) k Y k p ∈ A \ A − (12 κ p +3 δ ) (cid:1) ≤ sup A ∈I P (cid:0) k Y k p ∈ A κ p +3 δ \ A (cid:1) . Togetherwith the preceding inequality this yieldssup A ∈I (cid:16) P ( k Y k p ∈ A ) − P ( k W ( s ) k p ∈ A ) (cid:17) ≤ sup A ∈I P (cid:0) k Y k p ∈ A κ p +3 δ \ A (cid:1) + sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W ( s ) (cid:1) − h p + ,d,β,δ,A ( Y ) (cid:3) (cid:12)(cid:12)(cid:12) , (125)Combine eq. (124) and eq. (125) to obtainsup s ∈ [0 , sup A ∈I (cid:12)(cid:12)(cid:12) P ( k Y k p ∈ A ) − P ( k W ( s ) k p ∈ A ) (cid:12)(cid:12)(cid:12) sup A ∈I P (cid:0) k Y k p ∈ A κ p +3 δ \ A (cid:1) + sup s ∈ [0 , sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W ( s ) (cid:1) − h p + ,d,β,δ,A ( Y ) (cid:3) (cid:12)(cid:12)(cid:12) Note that above inequality also holds for p = p + . Thus, we have the following fundamentalsmoothing inequalitysup s ∈ [0 , sup A ∈I sup r ∈{ p,p + } (cid:12)(cid:12)(cid:12) P ( k Y k r ∈ A ) − P ( k W ( s ) k r ∈ A ) (cid:12)(cid:12)(cid:12) ≤ sup A ∈I sup r ∈{ p,p + } P (cid:0) k Y k r ∈ A κ r +3 δ \ A (cid:1) + sup s ∈ [0 , sup A ∈I (cid:12)(cid:12)(cid:12) E (cid:2) h p + ,d,β,δ,A (cid:0) W ( s ) (cid:1) − h p + ,d,β,δ,A ( Y ) (cid:3) (cid:12)(cid:12)(cid:12) (126)We now bound the second term on the right hand side of eq. (126).We now bound the second term. Step 2. Stein’s Lemma.
Define the Slepian-Stein (double) interpolant as V ( t ; s ) := √ tW ( s ) + √ − tY, s, t ∈ [0 , , and its derivative with respect to t by˙ V ( t ; s ) := ddt V ( t ; s ) = 12 (cid:20) √ t (cid:0) √ sX + √ − sZ (cid:1) − √ − t Y (cid:21) . Since V (0; s ) = Y and V (1; s ) = W ( s ), the mean value theorem givesE (cid:2) h p + ,d,β,δ,A (cid:0) W ( s ) (cid:1) − h p + ,d,β,δ,A ( Y ) (cid:3) = X | α | =1 Z E h ˙ V α ( t ; s ) (cid:0) D α h p + ,d,β,δ,A (cid:1)(cid:0) V ( t ; s ) (cid:1)i dt. (127)For brevity of notation, we now drop the subscripts p + , d, β, δ, A and write h instead of h p + ,d,β,δ,A . By Stein’s identity, for multi-indices α , α ′ with | α | = | α ′ | = 1,E (cid:2) W α ( s )( D α h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) = √ t X | α ′ | =1 E (cid:2) W α ( s ) W α ′ ( s ) (cid:3) E (cid:2) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) , E (cid:2) Y α ( D α h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) = √ − t X | α ′ | =1 E (cid:2) Y α Y α ′ (cid:3) E (cid:2) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) . Hence, eq. (127) simplifies toE (cid:2) h (cid:0) W ( s ) (cid:1) − h ( Y ) (cid:3) = 12 X | α | =1 X | α ′ | =1 Z E (cid:2) W α ( s ) W α ′ ( s ) − Y α Y α ′ (cid:3) E (cid:2) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) dt = s X | α | =1 X | α ′ | =1 Z E (cid:2) X α X α ′ − Y α Y α ′ (cid:3) E (cid:2) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) dt, (128)where the second equality follows since Z is an independent copy of Y .57ecall from Lemma 1 that h is non-constant on the set (cid:8) z ∈ R d : M p + ,κ p + ( z ) ∈ A δ \ A (cid:9) only. Set φ ( x ) = (cid:8) k x k p + ∈ A δ \ A − κ p + (cid:9) and note that φ ( x ) = 1 if x ∈ (cid:8) z ∈ R d : M p + ,κ p + ( z ) ∈ A δ \ A (cid:9) . Therefore, for all s ∈ [0 , s X | α | =1 X | α ′ | =1 Z E (cid:2) X α X α ′ − Y α Y α ′ (cid:3) E (cid:2) φ (cid:0) V ( t ; s ) (cid:1) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) dt. (129)By H¨older’s inequality, for 1 /p + + 1 /q = 1, s X | α | =1 X | α ′ | =1 Z E (cid:2) X α X α ′ − Y α Y α ′ (cid:3) E (cid:2) φ (cid:0) V ( t ; s ) (cid:1) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) dt ≤ s X | α | =1 X | α ′ | =1 (cid:12)(cid:12) E (cid:2) X α X α ′ − Y α Y α ′ (cid:3)(cid:12)(cid:12) p + /p + × Z X | α | =1 X | α ′ | =1 (cid:12)(cid:12) E (cid:2) φ (cid:0) V ( t ; s ) (cid:1) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3)(cid:12)(cid:12) q /q dt ≤ (cid:13)(cid:13) vec(Σ X − Σ Y ) (cid:13)(cid:13) p + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =2 | D α h | q /q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ × P (cid:18)(cid:13)(cid:13)(cid:13) √ tW ( s ) + √ − tY (cid:13)(cid:13)(cid:13) p + ∈ A δ \ A − κ p + (cid:19) . (130)We bound the last factor in eq. (130) bysup s ∈ [0 , P (cid:18)(cid:13)(cid:13)(cid:13) √ tW ( s ) + √ − tY (cid:13)(cid:13)(cid:13) p + ∈ A δ \ A − κ p + (cid:19) ≤ sup s ∈ [0 , P (cid:0) k W ( s ) k p + ∈ A δ \ A − κ p + (cid:1) = sup s ∈ [0 , (cid:16) P (cid:0) k W ( s ) k p + ∈ A δ (cid:1) − P (cid:0) k Y k p + ∈ A δ (cid:1) − P (cid:0) k W ( s ) k p + ∈ A − κ p + (cid:1) (cid:17) + P (cid:0) k Y k p + ∈ A − κ p + (cid:1) + P (cid:0) k Y k p + ∈ A δ \ A − κ p + (cid:1) ≤ A ∈I sup r ∈{ p,p + } (cid:12)(cid:12)(cid:12) P ( k W ( s ) k r ∈ A ) − P ( k Y k r ∈ A ) (cid:12)(cid:12)(cid:12) + sup A ∈I sup r ∈{ p,p + } P (cid:0) k Y k r ∈ A δ + κ r \ A (cid:1) , (131)where the first inequality holds since √ tW ( s ) + √ − tY d = W ( st ). Step 3. Recursive bound on eq. (126) . 
To simplify notation we define ̺ X = sup s ∈ [0 , sup A ∈I sup r ∈{ p,p + } (cid:12)(cid:12)(cid:12) P ( k W ( s ) k r ∈ A ) − P ( k Y k r ∈ A ) (cid:12)(cid:12)(cid:12) . ̺ X ≤ ∆ p + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =2 | D α h | q /q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ̺ X + ∆ p + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =2 | D α h | q /q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ sup A ∈I sup r ∈{ p,p + } P (cid:0) k Y k r ∈ A δ + κ r \ A (cid:1) + sup A ∈I sup r ∈{ p,p + } P (cid:0) k Y k r ∈ A δ +12 κ r \ A (cid:1) . (132)By Lemma 1 and Theorem 7, eq. (132) reduces to ̺ X ≤ C (cid:18) δ − + βδ (cid:19) ∆ p + ̺ X + sup A ∈I sup r ∈{ p,p + } P (cid:0) k Y k r ∈ A δ +12 κ r \ A (cid:1) + C (cid:18) δ − + βδ (cid:19) ∆ p + sup A ∈I sup r ∈{ p,p + } P (cid:0) k Y k r ∈ A δ + κ r \ A (cid:1)! ≤ C (cid:18) δ − + βδ (cid:19) ∆ p ̺ X + C δ + β − pd /p ω − p ( d, r X ) k σ X k p (cid:18) C (cid:18) δ − + βδ (cid:19) ∆ p (cid:19) , (133)where C ≥ p ≤ p + and byRemark 9.Set β = p / d / (2 p ) ∆ − / p and δ = 4 C p / d / (2 p ) ∆ / p . Note that δ − + δ − β ≤ δ − β .Thus, eq. (133) simplifies to ̺ X ≤ ̺ X C p / d / (2 p ) ∆ / p ω − p ( d, r X ) k σ X k p , which implies ̺ X . p / d / (2 p ) ∆ / p ω − p ( d, r X ) k σ X k p . (134)Since X and Y are both Gaussian, we can interchange their role in the proof and ob-tain analogous bounds on ̺ Y involving σ Y and r Y . Since ̺ X and ̺ Y both upper bound | P ( k X k p ∈ A ) − P ( k Y k p ∈ A ) | the first claim of Theorem 8 follows. Proof of Case (ii).
We split the proof into two parts.
Step 1.
We derive the bound involving ∆ ∞ . The proof of this result is identical to theproof of the four statement except for the following three changes: First, we do not need tointroduce p + , the smallest even integer larger than p . Instead, as fundamental smoothinginequality we may take the inequality directly preceding eq. (126). Second, we replacearguments involving H¨older’s inequality with the conjugate exponents ( q, p + ) by argumentsbased on H¨older’s inequality with the conjugate exponents (1 , ∞ ). Third, replace Lemma 1by Lemma 2 throughout. Lastly, set β = (log d ) / ∆ − / ∞ and δ = 4 C (log d ) / ∆ / ∞ , Step 2.
To derive the bound involving ∆ op we have to make the following changes:Denote by ∇ h the Hessian of h . Then, by H¨older’s inequality for matrix inner products(and a rough upper bound following from Gershgorin’s circle theorem) we can upper boundeq. (129) as follows: s X | α | =1 X | α ′ | =1 Z E (cid:2) X α X α ′ − Y α Y α ′ (cid:3) E (cid:2) φ (cid:0) V ( t ; s ) (cid:1) ( D α + α ′ h ) (cid:0) V ( t ; s ) (cid:1)(cid:3) dt = s Z E (cid:2) φ (cid:0) V ( t ; s ) (cid:1) tr (cid:8) (Σ X − Σ Y ) ∇ h (cid:0) V ( t ) (cid:1)(cid:9)(cid:3) dt ≤ (cid:13)(cid:13) Σ X − Σ Y (cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∇ h (cid:13)(cid:13) S (cid:13)(cid:13)(cid:13) ∞ P (cid:18)(cid:13)(cid:13)(cid:13) √ tW ( s ) + √ − tY (cid:13)(cid:13)(cid:13) p ∈ A δ + κ p \ A (cid:19) ≤ (cid:13)(cid:13) Σ X − Σ Y (cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =2 | D α h | (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ P (cid:18)(cid:13)(cid:13)(cid:13) √ tW ( s ) + √ − tY (cid:13)(cid:13)(cid:13) p ∈ A δ + κ p \ A (cid:19) , where k · k S denotes the Schatten 1-norm. Now, proceed as in Step 1 replacing ∆ p by ∆ op .This concludes the proof of the second statement. B.3.4 Proofs for Appendix A.4
Proof of Lemma 1 . Step 1. Smooth approximation of an indicator function.
Let δ ≥ ǫ >
0. For x ∈ R and A ∈ B ( R ) define I δ,A ( x ) = (cid:18) − δ − inf y ∈ A δ | x − y | (cid:19) ∨ . Note that I δ,A is Lipschitz continuous with Lipschitz constant δ − . Let ψ ∈ C ∞ ( R ) be amollifier with compact support [ − , ψ ( t ) = ( c exp (cid:0) x − (cid:1) | x | < | x | ≥ , where c > R ψ ( x ) dx = 1, and set g δ,ǫ,A ( x ) = 1 ǫ Z | x − y |≤ ǫ ψ (cid:18) x − yǫ (cid:19) I δ,A ( y ) dy. We observe the following facts: First, δ ≥ ǫ implies that g δ,ǫ,A ( x ) = 1 for x ∈ A , and g δ,ǫ,A ( x ) = 0 for x / ∈ A δ , and hence, A ( x ) ≤ g δ,ǫ,A ( x ) ≤ A δ ( x ) . (135)Second, g δ,ǫ,A ∈ C ∞ b ( R ) and its derivatives up to order three satisfy | g ′ δ,ǫ,A ( x ) | ≤ C δ − A δ \ A ( x ) , | g ′′ δ,ǫ,A ( x ) | ≤ C ǫ − δ − A δ \ A ( x ) , | g ′′′ δ,ǫ,A ( x ) | ≤ C ǫ − δ − A δ \ A ( x ) , (136)60here C > ψ . Step 2. Smooth approximation of indicator functions of ℓ p -norms for even p ∈ N . For η > M p,η ( x ) = η p + d X j =1 | x j | p ! /p . (137)Since η > p ∈ N , the map M p,η is C ∞ ( R d ). Moreover, M p,η ( x ) approximates k x k p ,i.e. k x k p ≤ M p,η ( x ) ≤ k x k p + η, (138)and M p,η ( x ) is bounded away from zero, i.e.min x ∈ R d M p,η ( x ) ≥ η > . (139)Compose g δ,ǫ,A and M p,η to obtain a smooth approximation of the map x A ( k x k p ), i.e. h p,δ,ǫ,η,A ( x ) = ( g δ,ǫ,A ◦ M p,η ) ( x ) . Combine eq. (135) and (138) and take expectation with respect to the law of X to concludethat for A ∈ B ( R ) and p ∈ N ,P ( k X k p ∈ A ) ≤ E [ h p,δ,ǫ,η,A η ( X )] ≤ P (cid:0) k X k p ∈ A δ +2 η (cid:1) . Step 3. Bounds on partial derivatives of h p,δ,ǫ,η,A in transformed conjugatenorm. Let τ ∈ [1 , ∞ ] and set q = τpτp − . By Fa`a di Bruno’s chain rule and H¨older’sinequality, X | α | =2 | D α h p,δ,ǫ,η,A ( x ) | q /q ≤ k g ′′ δ,ǫ,A ◦ M p,η k ∞ X | α | =1 X | α ′ | =1 | D α M p,η ( x ) | q | D α M p,η ( x ) | q /q + k g ′ δ,ǫ,A ◦ M p,η k ∞ X | α | =2 | D α M p,η ( x ) | q /q , (140)61 X | α | =3 | D α h p,δ,ǫ,η,A ( x ) | q /q ≤ k g ′′′ δ,ǫ,A ◦ M p,η k ∞ X | α | =1 X | α ′ | =1 X | α ′′ | =1 | D α M p,η ( x ) | q (cid:12)(cid:12)(cid:12) D α ′ M p,η ( x ) (cid:12)(cid:12)(cid:12) q (cid:12)(cid:12)(cid:12) D α ′′ M p,η ( x ) (cid:12)(cid:12)(cid:12) q /q + 3 k g ′′ δ,ǫ,A ◦ M p,η k ∞ X | α | =1 X | α ′ | =2 | D α M p,η ( x ) | q (cid:12)(cid:12)(cid:12) D α ′ M p,η ( x ) (cid:12)(cid:12)(cid:12) q /q + k g ′ δ,ǫ,A ◦ M p,η k ∞ X | α | =3 | D α M p,η ( x ) | q /q . (141)Next, we bound the partial derivatives of M p,η . By Lemma 13, the chain rule X | α | =1 X | α ′ | =1 | D α M p,η ( x ) | q (cid:12)(cid:12)(cid:12) D α ′ M p,η ( x ) (cid:12)(cid:12)(cid:12) q /q = X k,ℓ (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q ! /q = d X k =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q ! /q ≤ d τ − / ( τp ) , (142) X | α | =1 X | α ′ | =1 X | α ′′ | =1 | D α M p,η ( x ) | q (cid:12)(cid:12)(cid:12) D α ′ M p,η ( x ) (cid:12)(cid:12)(cid:12) q (cid:12)(cid:12)(cid:12) D α ′′ M p,η ( x ) (cid:12)(cid:12)(cid:12) q /q = X k,ℓ,m (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x m (cid:12)(cid:12)(cid:12)(cid:12) q ! /q = d X k =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q ! 
/q ≤ d τ − / ( τp ) , (143)and X | α | =2 | D α M p,η ( x ) | q /q ≤ d X k =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p,η ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q ! /q + X k,ℓ (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p,η ( x ) ∂x k ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q ! /q ≤ p − d (2 τ − / ( τp ) M p,η ( x ) + 2( p − d ( τ − / ( τp ) M p,η ( x ) + ( p − d τ − / ( τp ) M p,η ( x ) . η − pd (2 τ − / ( τp ) , (144)62here the third inequality follows from the lower bound (139); and X | α | =1 X | α ′ | =2 | D α M p,η ( x ) | q (cid:12)(cid:12)(cid:12) D α ′ M p,η ( x ) (cid:12)(cid:12)(cid:12) q /q ≤ X k,ℓ (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p,η ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q ! /q + X k,ℓ,m (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p,η ( x ) ∂x k ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q (cid:12)(cid:12)(cid:12)(cid:12) ∂M p,η ( x ) ∂x m (cid:12)(cid:12)(cid:12)(cid:12) q ! /q . η − pd (3 τ − / ( τp ) , (145)where we have used the results from eq. (142) and (144); and X | α | =3 | D α M p,η ( x ) | q /q ≤ X k (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p,η ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q ! /q + X k,ℓ (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p,η ( x ) ∂x k ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q ! /q + X k,ℓ,m (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p,η ( x ) ∂x k ∂x ℓ ∂x m (cid:12)(cid:12)(cid:12)(cid:12) q ! /q ≤ p − p − d (3 τ − / ( τp ) M p,η ( x ) + 12( p − d ( τ − / ( τp ) M p,η ( x ) + 4(2 p − p − d ( τ − / ( τp ) M p,η ( x )+ 2( p − d (3 τ − / ( τp ) M p,η ( x ) + 2(2 p − p − d τ − / ( τp ) M p,η ( x ) + (2 p − p − d τ − / ( τp ) M p,η ( x ) . η − p d (3 τ − / ( τp ) , (146)To conclude, set δ = ǫ > η = β − pd / ( τp ) , β >
0. Then, the upper bounds (136)and (140)–(146) imply, uniformly in x ∈ R d and A ∈ A , X | α | =2 | D α h p,δ,ǫ,η,A ( x ) | q /q . (cid:0) δ − + δ − β (cid:1) d τ − / ( τp ) , X | α | =3 | D α h p,δ,ǫ,ηA ( x ) | q /q . (cid:0) δ − + δ − β + δ − β (cid:1) d τ − / ( τp ) . Note that due to the substitutions h p,δ,ǫ,η,A depends only on p, d, β, δ, A . Step 4. Smooth approximation of indicator function of ℓ p -norms with p ∈ [1 , ∞ ) . Let I = { A ⊆ R : A = [0 , t ] , t ≥ } . Let p ∈ [1 , ∞ ) be arbitrary and define p + = 2 ⌈ p ⌉ to be the smallest even integer larger than (or equal to) p . Then, k x k p + ≤ k x k p for all x ∈ R d . We have the following relation between M p + ,β ( x ) and k x k p k x k p + ≤ M p + ,η ( x ) ≤ k x k p + + η ≤ k x k p + η. (147)Combine eq. (135) and (147) and take expectation with respect to the law of X to concludethat for A ∈ I and p ∈ [1 , ∞ ),P ( k X k p ∈ A ) ≤ P (cid:0) k X k p + ∈ A (cid:1) ≤ E (cid:2) h p + ,β,δ,ǫ,A η ( X ) (cid:3) ≤ P (cid:0) k X k p ∈ A δ +2 η (cid:1) , ≤ k x k p + ≤ k x k p for all x ∈ R d and thefact that A = [0 , t ] or some t ≥ Proof of Lemma 2 . For p ≥ log d we can approximate any ℓ p -norm by the smooth maxfunction. We can therefore sharpen the result from Lemma 1. Step 1. Smooth approximation of indicator functions of ℓ p -norms with p ≥ log d . Let δ ≥ ǫ > A = { A ⊆ R : A = [0 , t ] , t ≥ } . For x ∈ R and A ∈ A define g δ,ǫ,A ( x ) = 1 ǫ Z | x − y |≤ ǫ ψ (cid:18) x − yǫ (cid:19) I δ,A ( y ) dy, with I δ,A and ψ as in the proof of Lemma 1. Recall that A ( x ) ≤ g δ,ǫ,A ( x ) ≤ A δ ( x ) . (148)For β > F β ( x ) := β − log p X k =1 e βx k d − /p + e − βx k d − /p ! . (149)Let x ∈ R d be arbitrary. Set u ∗ = arg max k u k q =1 | x ′ u | , where q = p/ ( p −
1) is the conjugateexponent to p . Note that d − /p k u ∗ k ≤ ≤ d /p ≤ e . Therefore, we have, for β > k x k p = β − d /p d X k =1 u ∗ k d /p log (cid:0) e | x k | β (cid:1) ≤ β − d /p log d X k =1 u ∗ k d /p e | x k | β ! ≤ β − d /p log d X k =1 e | x k | βd − /p ! ≤ β − d /p log d X k =1 e x k βd − /p + e − x k βd − /p ! ≤ k x k ∞ + d /p β − log(2 d ) ≤ k x k p + eβ − log(2 d ) , (150)where the first inequality follows from Jensen’s inequality, and the second and third inequal-ities from elementary calculations.We define h p,β,δ,ǫ,A ( x ) = ( g δ,ǫ,A ◦ F β ) ( x ) . Combine eq. (148) and (150) and take expectation with respect to the law of X to concludethat P ( k X k p ∈ A ) ≤ E h h p,β,δ,ǫ,A eβ − d ) ( X ) i ≤ P (cid:16) k X k p ∈ A δ +2 eβ − log(2 d ) (cid:17) . tep 2. Bounds on partial derivatives of h p,β,δ,ǫ,A in ℓ -norm. By Lemma A.2–A.6in Chernozhukov et al. (2013) we havesup A ∈A (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =2 | D α h p,d,β,δ,A | (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . δ + βδ , sup A ∈A (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X | α | =3 | D α h p,d,β,δ,A | (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . δ + βδ + β δ . This concludes the proof.
B.3.5 Proofs for Appendix A.5
Proof of Lemma 3 . For t ≥ p , it follows from Minkowski’s integral inequality that (cid:0) E k X k tp (cid:1) p/t ≤ d X k =1 (cid:0) E | X k | t (cid:1) p/t . K ps k σ k pp . While for t ≤ p , H¨older’s inequality yields (cid:0) E k X k tp (cid:1) p/t ≤ E " d X k =1 | X k | p = d X k =1 E | X k | p . K ps k σ k pp . Combine both inequalities to conclude.
Proof of Lemma 4 . Recall Young’s inequality: Q Ki =1 x α i i ≤ P Ki =1 α i x i for all x i , α i ≥ i = 1 , . . . , K with P Ki =1 α i = 1. Without loss of generality, we can assume that k X i k ψ = 1for all i = 1 , . . . , K . Thus, the claim of the lemma follows if we can show the following: IfE [exp( X i )] ≤ i = 1 , . . . , K , then E[exp( Q Ki =1 | X i | /K )] ≤
2. This assertion followsfrom straightforward calculations:E " ψ /K K Y i =1 X i ! = E " exp K Y i =1 | X i | /K ! ≤ E " exp K K X i =1 | X i | ! = E " K Y i =1 exp (cid:18) K | X i | (cid:19) ≤ K K X i =1 E (cid:2) exp (cid:0) | X i | (cid:1)(cid:3)! ≤ , where in the first and second inequalities we have used Young’s inequality.65 roof of Lemma 5 . Proof of Case (i).
We have the following: b Σ − Σ = 1 n n X i =1 (cid:0) X i X ′ i − E[ X i X ′ i ] (cid:1) − n X ≤ i,j ≤ n X i X ′ j ≡ I − II . For 1 ≤ j, k ≤ d , set b Σ jk = n − P ni =1 X ij X ik , Σ jk = n − P ni =1 E[ X ij X ik ], and σ k = n − P ni =1 E[ X ik ]. By Assumption 1 there exists an absolute constant K > ≤ i ≤ n , (cid:13)(cid:13) X ij X ik (cid:13)(cid:13) ψ ≤ k X ij k ψ k X ik k ψ ≤ K E[ X ij ] / E[ X ik ] / = K σ j σ k . Hence, by union bound and Bernstein’s inequality there exists an absolute constant
C > t > (cid:16) k vec( I ) k p > tK k σ k p (cid:17) ≤ X ≤ j,k ≤ d P (cid:16)(cid:12)(cid:12)b Σ jk − Σ jk (cid:12)(cid:12) > tK σ j σ k (cid:17) ≤ d exp (cid:0) − C min (cid:8) t , t (cid:9) n (cid:1) . Above tail bound implies that with probability at least 1 − ζ , k vec( I ) k p ≤ K k σ k p r C r log d + log(2 /ζ ) n _ C log d + log(2 /ζ ) n ! . To bound k vec( II ) k p we directly use the sub-gaussianity of the X i ’s. By union bound andHoeffding’s inequality there exists an absolute constant C > t > (cid:0) k vec( II ) k p > t K k σ k p (cid:1) ≤ X ≤ k ≤ d P ( | X ik | > tK σ k n ) ≤ de exp (cid:0) − Ct n (cid:1) . Hence, with probability at least 1 − ζ , k vec( II ) k p ≤ K k σ k p C (cid:18) log d + log(2 /ζ ) n (cid:19) . Conclude that with probability at least 1 − ζ , k vec( b Σ − Σ) k p . k vec( I ) k p ∨ k vec( II ) k p . K k σ k p r log d + log(2 /ζ ) n ∨ log d + log(2 /ζ ) n ! . (151) Proof of Case (ii).
We have k b Σ − Σ k op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 (cid:0) X i X ′ i − E[ X i X ′ i ] (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X ≤ i,j ≤ n X i X ′ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 (cid:0) X i X ′ i − E[ X i X ′ i ] (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k I k op + k II k . By Theorem 5.39 and Remarks 5.40 and 5.53 in Vershynin (2012), with probability at least1 − ζ , k I k op . k Σ k op r r(Σ) log d + log(2 /ζ ) n ∨ r(Σ) log d + log(2 /ζ ) n ! , Similar arguments as in the second part of Case 1 with p = 2 and the fact that k σ k = tr(Σ),yield, with probability at least 1 − ζ , k II k . k Σ k op (cid:18) r(Σ) log d + log(2 /ζ ) n (cid:19) . The claim follows from combining both bounds.
Proof of Case (iii).
Denote by ⊘ the Hadamard division and observe thatmax ≤ k ≤ d (cid:12)(cid:12) ( b σ k /σ k ) − (cid:12)(cid:12) = (cid:13)(cid:13)(cid:13) diag( b Σ) ⊘ diag(Σ) − I d (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) diag n n X i =1 X i X ′ i − E[ X i X ′ i ] ! ⊘ diag(Σ) − I d !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) diag n X ≤ i,j ≤ d X i X ′ j ! ⊘ diag(Σ) !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . Moreover, k diag( I d ) k ∞ = 1. Hence, the claim follows from Case 1 with p = ∞ . Proof of Lemma 6 . Proof of Case (i).
We have the following: b Σ − Σ = 1 n n X i =1 (cid:0) X i X ′ i − E[ X i X ′ i ] (cid:1) − n X ≤ i,j ≤ d X i X ′ j ≡ I − II . Step 1.
We begin with the analysis of I . Let X ∈ R d satisfy Assumption 3. Then, thereexists an absolute constant C > (cid:2) k vec( XX ′ ) k p (cid:3) = E d X j =1 d X k =1 | X j X k | p ! /p = E d X j =1 | X j | p ! /p ≤ C K p ∨ k σ k p . (152)Since E[ X ] = 0, eq. (152) implies thatsup k u k q =1 Var [vec( XX ′ ) ′ u ] = sup k u k q =1 E h(cid:0) vec( XX ′ ) ′ u (cid:1) i ≤ E (cid:2) k vec( XX ′ ) k p (cid:3) ≤ C K p ∨ k σ k p . t ≥ n − / C / K p ∨ k σ k p ,P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) vec n n X i =1 X i X ′ i − E[ X i X ′ i ] !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p > t ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec( X i X ′ i ) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p > t/ , (153)where ε , . . . , ε n are i.i.d. Rademacher random variables independent of the X i ’s. Let θ > A ( θ ) := ω ∈ Ω : E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec( X i X ′ i ) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p | X , . . . , X n ( ω ) ≤ θ . Expand the tail probability on the right hand side in above eq. (153),P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec( X i X ′ i ) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p > t/ , E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec( X i X ′ i ) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p | X , . . . , X n ≤ θ + P E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec( X i X ′ i ) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p | X , . . . , X n > θ ≤ Z A ( θ ) exp − t h(cid:13)(cid:13) n P ni =1 vec( X i X ′ i ) ε i (cid:13)(cid:13) p | X , . . . , X n i d P X ,...,X n ( ω )+ P E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec( X i X ′ i ) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p | X , . . . , X n > θ ≤ (cid:18) − t θ (cid:19) + θ − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec( X i X ′ i ) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p , (154)where the first inequality follows from the sub-gaussianity of Rademacher random vari-ables (e.g. Ledoux and Talagrand, 1991, Theorem 4.7 and eq. (4.12) on p. 101) and thesecond inequality by Markov’s inequality.We now determine the choice of θ >
0. By Theorem 2.2 in D¨umbgen et al. (2010)(refinement of Nemirovski’s inequality) there exists an absolute constant C > (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 vec (cid:0) X i X ′ i (cid:1) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p ≤ C (cid:0) p ∧ log d (cid:1) n X i =1 E (cid:2) k vec (cid:0) X i X ′ i (cid:1) ε i k p (cid:3) ≤ C (cid:0) p ∧ log d (cid:1) n E (cid:2) k vec (cid:0) XX ′ (cid:1) k p (cid:3) . (155)68ombine eq. (152) and eq. (155) to conclude thatE (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec (cid:0) X i X ′ i (cid:1) ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p ≤ C C K p ∨ k σ k p (cid:18) p ∧ log dn (cid:19) . Thus, we set θ = M ( C C K p ∨ ∨ k σ k p (cid:0) p ∧ log dn (cid:1) and t = M / θ / ≥ n − / C / K p ∨ k σ k p ,where M ≥ M large enough we can make theleft hand side of eq. (154) arbitrarily small. Hence, we conclude that k vec ( I ) k p = O p K p ∨ k σ k p r p ∧ log dn ! . (156) Step 2.
We now analyze term II . Let X ∈ R d satisfy Assumption 3 and let e X be anindependent copy of X . Then, there exists an absolute constant C > h k vec( X e X ′ ) k p i = E d X j =1 d X k =1 | X j e X k | p ! /p = E d X j =1 | X j | p ! /p ≤ CK p k σ k p . (157)Let 1 < p, q < ∞ be conjugate exponents such that 1 /p + 1 /q = 1. By standarddecoupling arguments (e.g. Foucart and Rauhut, 2013, Theorem 8.11) we havesup k u k q =1 E X i = j vec( X i X ′ j ) ′ u ! ≤
16 sup k u k q =1 E X ≤ i,j ≤ n vec( X i e X ′ j ) ′ u ! , (158)where e X , . . . , e X n are mutually independent copies of the corresponding X i ’s. Since E[ X ] =E[ e X ] = 0, we can further bound the right hand side of above inequality using eq. (157),16 sup k u k q =1 E " X ≤ i,j ≤ n (cid:16) vec( X i e X ′ j ) ′ u (cid:17) ≤ " X ≤ i,j ≤ n (cid:13)(cid:13)(cid:13) vec( X i e X ′ j ) (cid:13)(cid:13)(cid:13) p ≤ n C K p k σ k p . (159)Therefore, by Symmetrization Lemma 2.3.7 in van der Vaart and Wellner (1996), for any t ≥ n − C / K p k σ k p ,P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) vec n X i = j X i X ′ j !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p > t ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) vec n X i = j X i X ′ j ! ε ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p > t/ , (160)where ε , . . . , ε nn are i.i.d. Rademacher random variables independent of the X i X ′ j ’s. Pro-ceeding as in Step 1, we upper bound the tail probability in (160) by8 exp (cid:18) − t θ (cid:19) + 4 θ − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i = j vec (cid:0) X i X ′ j (cid:1) ε ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p , (161)69here θ > X , . . . , X n , the summands vec (cid:0) X i X ′ j (cid:1) ε ij areindependent with mean zero. Thus, by Theorem 2.2 in D¨umbgen et al. (2010) and eq. (157)there exists an absolute constant C > (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i = j vec (cid:0) X i X ′ j (cid:1) ε ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p ≤ C ( p ∧ log d ) 1 n X i = j E (cid:2) k vec (cid:0) X i X ′ j (cid:1) ε i k p (cid:3) ≤ C C K p k σ k p (cid:18) p ∧ log dn (cid:19) . Hence, we set θ = M ( C C K p ∨ k σ k p (cid:0) p ∧ log dn (cid:1) and t = M / θ / ≥ n − C / K p k σ k p ,where M ≥ M large enough we can make theleft hand side of eq. (161) arbitrarily small, i.e. (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) vec n X i = j X i X ′ j !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p = O p K p k σ k p r p ∧ log dn ! . (162)Lastly, by triangle inequality, eq. (156) and eq. (152) we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) vec n n X i =1 X i X ′ i !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 vec ( X i X ′ i − E[ X i X ′ i ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) p + 1 n n X i =1 E (cid:2) k vec( X i X ′ i ) k p (cid:3) / = O p K p ∨ k σ k p r p ∧ log dn ! + O (cid:18) K p ∨ k σ k p n (cid:19) . (163)Combine eq. (162) and eq. (163) to conclude that k vec ( II ) k p = O p K p ∨ k σ k p r p ∧ log dn ! . (164)Therefore, k vec( b Σ − Σ) k p . k vec( I ) k p ∨ k vec( II ) k p = O p K p ∨ k σ k p r p ∧ log dn ! . (165) Proof of Case (ii).
Since n P n ≤ ,j ≤ n X i X ′ j = (cid:0) n P ni =1 X i (cid:1) (cid:0) n P ni =1 X i (cid:1) ′ has rank one,we have k b Σ − Σ k op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 (cid:0) X i X ′ i − E[ X i X ′ i ] (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X ≤ i,j ≤ n X i X ′ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 (cid:0) X i X ′ i − E[ X i X ′ i ] (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ≤ ,j ≤ n vec( X i X ′ j ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≡ k I k op + k vec( II ) k .
70y Theorem 5.48 in Vershynin (2012), k I k op = O p k Σ k op r m(Σ) log( d ∧ n ) n ∨ m(Σ) log( d ∧ n ) n !! . (166)Since k σ k = tr(Σ) ≤ m(Σ) k Σ k op , we have by eq. (164) with p = 2, k vec( II ) k . O p (cid:18) k Σ k op m(Σ) n (cid:19) . The claim follows from combining the last two bounds.
Case (iii).
Denote by ⊘ the Hadamard division. Suppose that Assumption 3 holds with s ≥
4. Observe thatmax ≤ k ≤ d (cid:12)(cid:12) ( b σ k /σ k ) − (cid:12)(cid:12) = (cid:13)(cid:13)(cid:13) diag( b Σ) ⊘ diag(Σ) − I d (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) diag n n X i =1 X i X ′ i − E[ X i X ′ i ] ! ⊘ diag(Σ) − I d !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) diag n X ≤ i,j ≤ d X i X ′ j ! ⊘ diag(Σ) !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) diag n n X i =1 X i X ′ i − E[ X i X ′ i ] ! ⊘ diag(Σ) − I d !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) diag n X ≤ i,j ≤ d X i X ′ j ! ⊘ diag(Σ) !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r Moreover, k diag( I d ) k ∞ = 1. Hence, from Case 1,max ≤ k ≤ d (cid:12)(cid:12) ( b σ k /σ k ) − (cid:12)(cid:12) = O p K s d /s r s ∧ log dn ! . Suppose Assumption 2 holds with r ≥
2. Observe thatmax ≤ k ≤ d (cid:12)(cid:12) ( b σ k /σ k ) − (cid:12)(cid:12) = (cid:13)(cid:13)(cid:13) diag( b Σ) ⊘ diag(Σ) − I d (cid:13)(cid:13)(cid:13) op . Note that tr (cid:0) diag(Σ) ⊘ diag(Σ) (cid:1) ≤ e m (cid:0) diag(Σ) (cid:1) . Thus, by Case 2,max ≤ k ≤ d (cid:12)(cid:12) ( b σ k /σ k ) − (cid:12)(cid:12) = O p s e m (cid:0) diag(Σ) (cid:1) log( d ∧ n ) n ∨ e m (cid:0) diag(Σ) (cid:1) log( d ∧ n ) n . roof of Lemma 7 . Proof of Case (i).
By Lemma 5 (i) with probability at least 1 − ζ , k vec( b Σ − Σ) k ∞ . k σ k ∞ λ n . On this event, it is straight forward to show (e.g. Wainwright, 2019, p. 181) that (cid:13)(cid:13) vec (cid:0) T λ n ( b Σ) − Σ (cid:1)(cid:13)(cid:13) p . k vec( A ) k p k σ k ∞ λ n . and (cid:13)(cid:13) T λ n ( b Σ) − Σ (cid:13)(cid:13) op . k vec( A ) k op k σ k ∞ λ n . Proof of Case (ii).
The claim about the difference in operator norm follows verbatimfrom the proof of Theorem 6.27 in Wainwright (2019). The statement about the differencein vectorized ℓ p -norm follows from an easy modification of the proof of Theorem 6.27. Forcompleteness we provide a sketch of the modified argument. Wainright’s proofs are easier togeneralize to our setup than the original proofs in Bickel and Levina (2008b).Suppose that k vec( b Σ − Σ) k ∞ ≤ λ/ λ > j ∈ { , . . . , d } and define S j ( λ/
2) := (cid:8) k ∈ { , . . . , d } : | Σ jk | > λ/ (cid:9) . For any k ∈ S j ( λ/ | T λ ( b Σ jk ) − Σ jk | ≤ | T λ ( b Σ jk ) − b Σ jk | + | b Σ jk − Σ jk | ≤ λ, (167)where the second inequality follows from k vec( b Σ − Σ) k ∞ ≤ λ/ T λ . For any k / ∈ S j ( λ/ k vec( b Σ − Σ) k ∞ ≤ λ/ T λ that T λ ( b Σ jk ) = 0. Hence, | T λ ( b Σ jk ) − Σ jk | = | Σ jk | . (168)Combine eq. (167) and (168) to conclude that d X j =1 | T λ ( b Σ jk ) − Σ jk | p ! /p ≤ X j ∈ S k ( λ/ | T λ ( b Σ jk ) − Σ jk | p /p + X j / ∈ S k ( λ/ | T λ ( b Σ jk ) − Σ jk | p /p ≤ | S k ( λ/ | λ + X j / ∈ S k ( λ/ | Σ jk | p /p . (169)To bound the first term on the far right hand side in above display note that R p,γ ≥ d X j =1 | Σ jk | pγ ! /p ≥ | S k ( λ/ | ( λ/ γ . | S k ( λ/ | ≤ γ R p,γ λ − γ . To bound the second term on thefar right hand side in eq. (169) observe that X j / ∈ S k ( λ/ | Σ jk | p /p = λ X j / ∈ S k ( λ/ (cid:12)(cid:12)(cid:12)(cid:12) Σ jk λ/ (cid:12)(cid:12)(cid:12)(cid:12) p /p ≤ λ X j / ∈ S k ( λ/ (cid:12)(cid:12)(cid:12)(cid:12) Σ jk λ/ (cid:12)(cid:12)(cid:12)(cid:12) pγ /p ≤ λ − γ R p,γ . Combine the preceding two inequalities with eq. (169) and conclude that d X j =1 | T λ ( b Σ jk ) − Σ jk | p ! /p ≤ γ R p,γ λ − γ
32 + R p,γ λ − γ ≤ R p,γ λ − γ . (170)We now determine the choice of λ >
0. Consider the following,P (cid:16) k vec( T λ ( b Σ) − Σ) k p > d /p R p,γ λ − γ (cid:17) = P (cid:16) k vec( T λ ( b Σ) − Σ) k p > d /p R p,γ λ − γ , k vec( b Σ − Σ) k ∞ ≤ λ/ (cid:17) + P (cid:16) k vec( b Σ − Σ) k ∞ ≤ λ/ (cid:17) ≤ d X k =1 P d X j =1 | T λ ( b Σ jk ) − Σ jk | p ! /p > R p,γ λ − γ , k vec( b Σ − Σ) k ∞ ≤ λ/ + P (cid:16) k vec( b Σ − Σ) k ∞ ≤ λ/ (cid:17) = P (cid:16) k vec( b Σ − Σ) k ∞ ≤ λ/ (cid:17) , where the last line follows from eq. (170). Now, set λ = 2 k σ k ∞ λ n and conclude by Lemma 5(i) (applied with p = ∞ ) that with probability at least 1 − ζ , k vec( T λ ( b Σ) − Σ) k p . d /p R p,γ k σ k − γ ) ∞ λ − γn . Proof of Case (iii).
Note that for all s < ∞ and λ > k vec( b Σ − Σ) k s ≤ λ implies k vec( b Σ − Σ) k ∞ ≤ λ . Thus, by Lemma 6 (i) we have for all s ≥ ( p ∧ log p ) ∨ (cid:13)(cid:13) vec( T λ n ( b Σ) − Σ) (cid:13)(cid:13) p = O p k vec( A ) k p K s k σ k s r s ∧ log dn ! , (cid:13)(cid:13) T λ n ( b Σ) − Σ (cid:13)(cid:13) op = O p k A k op K s k σ k s r s ∧ log dn ! . Proof of Case (iv).
The claim follows as Case 2 but using Lemma 6 (i) with s ≥ ( p ∧ log p ) ∨ p = ∞ . See also Case 3. Proof of Lemma 8 . Proof of Case (i).
The claim about the difference in operator normfollows from the proof of Theorem 1 in Bickel and Levina (2008a). The statement about the73ifference in vectorized ℓ p -norm follows from an easy modification of the proof of Theorem1. For completeness we give the modified argument below.Fix ℓ ∈ { , . . . , d − } and compute k vec( B ℓ ( b Σ) − Σ) k p ≤ (cid:13)(cid:13) vec (cid:0) B ℓ ( b Σ) − B ℓ (Σ) (cid:1)(cid:13)(cid:13) p + k vec( B ℓ (Σ) − Σ) k p = d X j =1 d X k =1 | b Σ jk − Σ jk | p {| j − k | ≤ ℓ } ! /p + d X j =1 d X k =1 | Σ jk | p {| j − k | > ℓ } ! /p ≤ (cid:0) d + ℓ (2 d − ℓ − (cid:1) /p (cid:13)(cid:13) vec (cid:0) B ℓ ( b Σ) − B ℓ (Σ) (cid:1)(cid:13)(cid:13) ∞ + d /p B p ℓ − α , (171)where the first term on the far right hand side follows since the double sum has d + ℓ (2 d − ℓ − ℓ = ℓ n ≡ B p/ (1+ pα ) p k σ k − p/ (1+ pα ) ∞ λ − p/ (1+ pα ) n . By Lemma 5 (i) (applied to p = ∞ ), withprobability at least 1 − ζ , k vec( B ℓ ( b Σ) − Σ) k p . B / (1+ pα ) p d /p k σ k pα/ (1+ pα ) ∞ λ pα/ (1+ pα ) n . (172) Proof of Case (ii).
The claim follows as Case 1 but using Lemma 6 (i) with s ≥ ( p ∧ log d ) ∨
4, instead of Lemma 5 (i) with p = ∞ . See also Case 1. B.3.6 Proofs for Appendix A.6
Proof of Lemma 9 . The proof is identical to the one of Lemma 3.2 in Chernozhukov et al.(2013). We sketch it for completeness. By Theorem 8, on the event { Π p ≤ δ } , we have | P( S ∗ n,p ≤ t | X ) − P( e S p ≤ t ) | ≤ π p ( δ ) or all t ∈ R ; in particular, for t = ˜ c n,p (cid:0) π p ( δ ) + α (cid:1) wehave P (cid:16) S ∗ n,p ≤ ˜ c p (cid:0) π p ( δ ) + α (cid:1) | X (cid:17) ≥ P (cid:16) e S p ≤ ˜ c p (cid:0) π p ( δ ) + α (cid:1) | X (cid:17) − π p ( δ ) ≥ π p ( δ ) + α − π p ( δ ) = α. This implies the first inequality in the lemma. The second follows similarly.
Proof of Lemma 10 . We first establish the upper bound for all α ∈ (0 , e S p := k Ω / Z k p with Z ∼ N (0 , I ) and that the map f ( Z ) = k Ω / Z k p is Lipschitz continuous (withrespect to the Euclidean norm) with Lipschitz constant k Ω / k → p := sup k u k ≤ k Ω / u k p .Thus, by the Gaussian concentration inequality for Lipschitz continuous functions (e.g.van der Vaart and Wellner, 1996, Lemma A.2.2), for all t > (cid:16) e S p − E[ e S p ] ≥ t (cid:17) ≤ exp (cid:26) − t k Ω / k → p (cid:27) . In particular, P (cid:16) e S p > E[ e S p ] + p /α ) k Ω / k → p (cid:17) ≤ α. (cid:16) e S p − E[ e S p ] ≥ t (cid:17) ≤ Var[ e S p ] t , and therefore P (cid:18) e S p > E[ e S p ] + p /α q Var[ e S p ] (cid:19) ≤ α. Now, the upper bound follows from the definition of ˜ c p (1 − α ).To establish the lower bound for α ∈ (0 , / (cid:12)(cid:12)(cid:12) E[ e S p ] − ˜ c p (1 / (cid:12)(cid:12)(cid:12) ≤ q Var[ e S p ] . Whence, for all α ∈ (0 , /
2] it follows that˜ c p (1 − α ) ≥ ˜ c p (1 / ≥ E[ e S p ] − q Var[ e S p ] . To conclude, note that by the Gaussian Poincar´e inequality, q Var[ e S p ] ≤ k Ω / k → p . B.3.7 Proofs for Appendix A.7
Proof of Lemma 11 . The claim follows from straightforward computations. The mostconvenient way to carry out those calculations is to notice that M p ( x ) ≡ M f,g ( x ) = g − (cid:16)P dj =1 f ( x j ) (cid:17) for f ( x ) = g ( x ) = x p , x ≥
0. Now, repeated applications of the implicit function theoremand the chain rule yield the claim.
Proof of Lemma 12 . Note that M p ( x ) = k x k p for any x ∈ (cid:8) z ∈ R d : z i ≥ , i = 1 , . . . , d (cid:9) \{ } and p >
1. With slight abuse of notation, we will also use this formulation when theexponent is less than one or negative. First, since conjugate exponents satisfy ( p − q = p , d X k =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂M p ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q = k x k ( p − q ( p − q k x k ( p − qp = k x k pp k x k pp = 1 . Second, suppose that p ≥
2. Since ( p − q = p − q and (2 p − q = 2 p + q , d X k =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q ≤ q − ( p − q k x k ( p − q ( p − q k x k ( p − qp + 2 q − ( p − q k x k p − q p − q k x k (2 p − qp = 2 q − ( p − q k x k p − qp − q k x k pp + 2 q − ( p − q k x k p p k x k p + qp ≤ q − ( p − q d q/p k x k p − qp k x k pp + 2 q − ( p − q k x k p p k x k p + qp = 2 q − ( p − q d q/p k x k qp + 2 q − ( p − q k x k qp , p − q =2 p + q , X k,ℓ (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p ( x ) ∂x k ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q ≤ ( p − q k x k p − q ( p − q k x k (2 p − qp = ( p − q k x k pp k x k p + qp = ( p − q k x k qp . Fourth, since (3 p − q = 3 p + 2 q , X k,ℓ,m (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p ( x ) ∂x k ∂x ℓ ∂x m (cid:12)(cid:12)(cid:12)(cid:12) q ≤ (2 p − q ( p − q k x k p − q ( p − q k x k (3 p − qp = (2 p − q ( p − q k x k pp k x k p +2 qp = (2 p − q ( p − q k x k qp . Fifth, suppose that p ≥
2, and compute X k,ℓ (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p ( x ) ∂x k ∂x ℓ (cid:12)(cid:12)(cid:12)(cid:12) q ≤ q − ( p − q k x k ( p − q ( p − q k x k ( p − q ( p − q k x k (2 p − qp + 2 q − (2 p − q ( p − q k x k p − q p − q k x k ( p − q ( p − q k x k (3 p − qp = 2 q − ( p − q k x k p − qp − q k x k pp k x k p + qp + 2 q − (2 p − q ( p − q k x k p p k x k pp k x k p +2 qp ≤ q − ( p − q d q/p k x k p − qp k x k p + qp + 2 q − (2 p − q ( p − q k x k p p k x k p +2 qp ≤ q − ( p − q d q/p k x k qp + 2 q − (2 p − q ( p − q k x k qp , where the second inequality follows from the power mean inequality. Sixth, suppose that p ≥
3. Since ( p − q = p − q and (2 p − q = 2 p − q = p + ( p − q ≥ p , d X k =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂ M p ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) q ≤ q − ( p − q ( p − q k x k ( p − q ( p − q k x k ( p − qp + 2 q − q ( p − q k x k (2 p − q (2 p − q k x k (2 p − qp + 2 q − (2 p − q ( p − q k x k p − q p − q k x k (3 p − qp = 2 q − ( p − q ( p − q k x k p − qp − q k x k pp + 2 q − q ( p − q k x k p − q p − q k x k p + qp + 2 q − (2 p − q ( p − q k x k p p k x k p +2 qp ≤ q − ( p − q ( p − q d q/p k x k p − qp k x k pp + 2 q − q ( p − q k x k p − qp k x k p + qp + 2 q − (2 p − q ( p − q k x k qp = 2 q − ( p − q ( p − q d q/p k x k qp + 2 q − q ( p − q k x k qp + 2 q − (2 p − q ( p − q k x k qp , p = 2, then the firstterm vanishes, and we have d X k =1 (cid:12)(cid:12)(cid:12)(cid:12) ∂ M ( x ) ∂x k (cid:12)(cid:12)(cid:12)(cid:12) ≤ q q k x k qp . Proof of Lemma 13 . The claim follows from Lemma 12 and the power mean inequality.
References
Antonini, R. G. (1997). Subgaussian random variables in Hilbert spaces.
Rendiconti delSeminario Matematico della Universit`a di Padova , 98:89–99.Avella-Medina, M., Battey, H. S., Fan, J., and Li, Q. (2018). Robust estimation of high-dimensional covariance and precision matrices.
Biometrika , 105(2):271–284.Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sampleproblem.
Statistica Sinica , 6(2):311–329.Bentkus, V. (1985). Lower bounds for the rate of convergence in the central limit theoremin Banach spaces.
Lithuanian Mathematical Journal , 25(4):312–320.Bentkus, V. (2003). On the dependence of the BerryEsseen bound on dimension.
Journal ofStatistical Planning and Inference , 113(2):385 – 402.Bentkus, V. and G¨otze, F. (1997). Uniform rates of convergence in the CLT for quadraticforms in multidimensional spaces.
Probability theory and related fields , 109(3):367–416.Bhattacharya, R. N. (1977). Refinements of the multidimensional central limit theorem andapplications.
Ann. Probab. , 5(1):1–27.Biau, G. and Mason, D. M. (2015). High-dimensional p -norms. In Mathematical Statisticsand Limit Theorems , pages 21–40. Springer.Bickel, P. J. and Freedman, D. A. (1983). Bootstrapping regression models with manyparameters.
A Festschrift for Erich L. Lehmann , pages 28–48.Bickel, P. J. and Levina, E. (2008a). Covariance regularization by thresholding.
Annals ofStatistics , 36(6):2577–2604.Bickel, P. J. and Levina, E. (2008b). Regularized estimation of large covariance matrices.
Annals of Statistics , 36(1):199–227.Bolthausen, E. (1984). An estimate of the remainder in a combinatorial central limit theorem.
Zeitschrift f¨ur Wahrscheinlichkeitstheorie und verwandte Gebiete , 66(3):379–386.77oucheron, S., Lugosi, G., and Massart, P. (2013).
Concentration Inequalities: A Nonasymp-totic Theory of Independence . Oxford University Press, Oxford.Bunea, F. and Xiao, L. (2015). On the sample covariance matrix estimator of reducedeffective rank population matrices, with applications to fPCA.
Bernoulli , 21(2):1200–1230.Cai, T. and Liu, W. (2011). Adaptive Thresholding for Sparse Covariance Matrix Estimation.
Journal of the American Statistical Association , 106(494):672–684.Cai, T. T., Zhang, C.-H., and Zhou, H. H. (2010). Optimal rates of convergence for covariancematrix estimation.
Annals of Statistics , 38(4):2118–2144.Carbery, A. and Wright, J. (2001). Distributional and L q norm inequalities for polynomialsover convex bodies in R n . Mathematical Research Letters , 8.Chatterjee, S. (2014).
Superconcentration and Related Topics . Springer Monographs inMathematics. Springer.Chen, S. X., Qin, Y.-L., et al. (2010). A two-sample test for high-dimensional data withapplications to gene-set testing.
The Annals of Statistics , 38(2):808–835.Chernozhukov, V., Chetverikov, D., and Kato, K. (2013). Gaussian approximations andmultiplier bootstrap for maxima of sums of high-dimensional random vectors.
The Annalsof Statistics , 41(6):2786–2819.Chernozhukov, V., Chetverikov, D., and Kato, K. (2015). Comparison and anti-concentrationbounds for maxima of gaussian random vectors.
Probability Theory and Related Fields ,162(1):47–70.Chernozhukov, V., Chetverikov, D., and Kato, K. (2017a). Central limit theorems andbootstrap in high dimensions.
The Annals of Probability , 45(4):2309–2352.Chernozhukov, V., Chetverikov, D., and Kato, K. (2017b). Detailed Proof of Nazarov’sInequality. arXiv preprint, arXiv:1711.10696 .Deng, H. and Zhang, C.-H. (2020). Beyond gaussian approximation: Bootstrap for maximaof sums of independent random vectors. arXiv preprint, arXiv:1705.09528 .D¨umbgen, L., van de Geer, S. A., Veraar, M. C., and Wellner, J. A. (2010). Nemirovski’sInequalities Revisited.
The American Mathematical Monthly , 117(2):138–160.Fan, J., Guo, S., and Hao, N. (2012). Variance estimation using refitted cross-validationin ultrahigh dimensional regression.
Journal of the Royal Statistical Society: Series B(Statistical Methodology) , 74(1):37–65.Fan, J., Liao, Y., and Mincheva, M. (2011). High dimensional covariance matrix estimationin approximate factor models.
Annals of statistics , 39(6):3320.78an, J., Liao, Y., and Yao, J. (2015). Power enhancement in high-dimensional cross-sectionaltests.
Econometrica , 83(4):1497–1541.Foucart, S. and Rauhut, H. (2013).
A Mathematical Introduction to Compressive Sensing .Applied and Numerical Harmonic Analysis. Springer New York.G¨otze, F. (1991). On the rate of convergence in the multivariate clt.
The Annals of Proba-bility , 19(2):724–739.G¨otze, F., Naumov, A., Spokoiny, V., and Ulyanov, V. (2019). Large ball probabilities,Gaussian comparison and anti-concentration.
Bernoulli , 25(4A):2538–2563.G¨otze, F. and Zaitsev, A. Y. (2014). Explicit rates of approximation in the CLT for quadraticforms.
Ann. Probab. , 42(1):354–397.Koike, Y. (2019). Notes on the dimension dependence in high-dimensional central limittheorems for hyperrectangles. arXiv preprint, arXiv:1911.00160 .Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrixestimation.
Annals of statistics , 37(6B):4254.Ledoux, M. and Talagrand, M. (1991).
Probability in Banach Spaces . Springer.Liu, R. Y. (1988). Bootstrap Procedures under some Non-I.I.D. Models.
Ann. Statist. ,16(4):1696–1708.Lopes, M. E., Blandino, A., and Aue, A. (2019). Bootstrapping spectral statistics in highdimensions.
Biometrika , 106(4):781–801.Lopes, M. E., Lin, Z., and Mller, H.-G. (2020). Bootstrapping max statistics in high dimen-sions: Near-parametric rates under weak variance decay and application to functional andmultinomial data.
Annals of Statistics , 48(2):1214–1229.Mammen, E. (1993). Bootstrap and Wild Bootstrap for High Dimensional Linear Models.
Ann. Statist. , 21(1):255–285.Nazarov, F. (2003).
On the Maximal Perimeter of a Convex Set in R n with Respect to aGaussian Measure , pages 169–187. Springer Berlin Heidelberg, Berlin, Heidelberg.Paouris, G. and Valettas, P. (2018). On Dvoretzky’s theorem for subspaces of L p . Journalof Functional Analysis , 275(8):2225 – 2252.Pouzo, D. (2015). Bootstrap consistency for quadratic forms of sample averages with in-creasing dimension.
Electron. J. Statist. , 9(2):3046–3097.Radulovi´c, D. (1998). Can we bootstrap even if CLT fails?
Journal of Theoretical Probability ,11(3):813–830.Raiˇc, M. (2019). A multivariate BerryEsseen theorem with explicit constants.
Bernoulli ,25(4A):2824–2853. 79¨ollin, A. (2013). Stein’s method in high dimensions with applications.
Annales de l’IHPProbabilit´es et Statistiques , 49(2):529–549.Schechtman, G. and Zinn, J. (1990). On the volume of the intersection of two l np balls. Proceedings of the American Mathematical Society , 110(1):217–224.Spokoiny, V. and Zhilova, M. (2015). Bootstrap confidence sets under model misspecification.
Ann. Statist. , 43(6):2653–2675.van de Geer, S., B¨uhlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptoticallyoptimal confidence regions and tests for high-dimensional models.
Annals of Statistics ,42(3):1166–1202.van der Vaart, A. W. and Wellner, J. A. (1996).
Weak Convergence and Empirical Processes .Springer.Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. InEldar, Y. and Kutyinok, G., editors,
Compressed Sensing, Theory and Applications , pages210–268, Cambridge. Cambridge University Press.Vershynin, R. (2018).
High-Dimensional Probability: An Introduction with Applications inData Science . Cambridge Series in Statistical and Probabilistic Mathematics. CambridgeUniversity Press.Wainwright, M. J. (2019).
High-dimensional statistics: A non-asymptotic viewpoint . Cam-bridge University Press.Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods in regressionanalysis.
Annals of Statistics , 14(4):1261–1295.Xu, M., Zhang, D., and Wu, W. B. (2019). Pearson’s chi-squared statistics: approximationtheory and beyond.
Biometrika , 106:716–723.Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parametersin high dimensional linear models.
Journal of the Royal Statistical Society: Series B(Statistical Methodology) , 76(1):217–242.Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models.