Central Limit Theorem and Bootstrap Approximation in High Dimensions with Near $1/\sqrt{n}$ Rates
BY MILES E. LOPES
University of California, Davis, [email protected]
Non-asymptotic bounds for Gaussian and bootstrap approximation have recently attracted significant interest in high-dimensional statistics. This paper studies Berry–Esseen bounds for such approximations (with respect to the multivariate Kolmogorov distance), in the context of a sum of $n$ random vectors that are $p$-dimensional and i.i.d. Up to now, a growing line of work has established bounds with mild logarithmic dependence on $p$. However, the problem of developing corresponding bounds with near $n^{-1/2}$ dependence on $n$ has remained largely unresolved. Within the setting of random vectors that have sub-Gaussian entries, this paper establishes bounds with near $n^{-1/2}$ dependence, for both Gaussian and bootstrap approximation. In addition, the proofs are considerably distinct from other recent approaches.
1. Introduction.
In recent years, the analysis of Berry–Esseen bounds for Gaussian and bootstrap approximation has become a quickly growing topic in high-dimensional statistics. Indeed, much of the work in this direction has been propelled by the fact that such approximations are essential tools for a wide variety of inference problems. A survey of related applications and results may be found in (Belloni et al., 2018).

To briefly review the modern literature on multivariate Berry–Esseen bounds, a natural starting point is the seminal paper (Bentkus, 2003). In that work, Bentkus studied Gaussian approximation of a sum $S_n = n^{-1/2}\sum_{i=1}^n X_i$ of i.i.d. centered isotropic random vectors in $\mathbb{R}^p$. Letting $Y$ denote a centered Gaussian vector with $\mathbb{E}[YY^\top] = \mathbb{E}[X_1X_1^\top]$, and letting $\mathcal{A}$ denote the class of all Borel convex subsets of $\mathbb{R}^p$, Bentkus' work showed that under suitable moment conditions, the measure of distance $\sup_{A\in\mathcal{A}}|P(S_n \in A) - P(Y \in A)|$ is at most of order $p^{1/4}n^{-1/2}$. (See also (Bentkus, 2005; Raič, 2019) for refinements and further references.) However, despite the strength of this result, it typically does not lend itself to applications where $p$ is large.

In high-dimensional settings, the paper (Chernozhukov, Chetverikov and Kato, 2013) achieved a breakthrough by demonstrating that if $\mathcal{A}$ is taken instead to be a certain class of hyperrectangles, then the corresponding measure of distance can be bounded at a rate that has a logarithmic dependence on $p$, such as $\log(pn)^{7/8}\,n^{-1/8}$ (and similarly for bootstrap approximation). Subsequently, the papers (Chernozhukov, Chetverikov and Kato, 2017a) and (Chernozhukov et al., 2019) showed that when $\mathcal{A}$ includes all hyperrectangles, the rates for Gaussian and bootstrap approximation can be improved further, with the dependence on $n$ reaching the orders $n^{-1/6}$ and $n^{-1/4}$ in these two papers respectively, up to factors of $\log(pn)$. Meanwhile, a parallel series of works (Deng and Zhang (2020+); Kuchibhotla, Mukherjee and Banerjee (2018); Koike (2019); Deng (2020); Das and Lahiri (2020)) developed further improvements, by showing that Gaussian and bootstrap approximation can succeed asymptotically when $\log(p)^{\kappa} = o(n)$ for various fixed values of $\kappa$.

With regard to rates of Gaussian approximation that go beyond these forms of dependence on $n$, some results have appeared in (Lopes, Lin and Müller, 2020) and (Fang and Koike, 2020).

*Supported in part by NSF grant DMS-1915786.
MSC2020 subject classifications:
Primary 60F05, 62E17.
Keywords and phrases: central limit theorem, bootstrap, Berry–Esseen theorem, high dimensions.

The first of these papers considered a setting of "weak variance decay", where $\mathrm{var}(X_{1j}) = \mathcal{O}(j^{-a})$ for all $1 \le j \le p$, with $a > 0$ being an arbitrarily small parameter. Under this type of structure, the authors established the rate $n^{-1/2+\delta}$ for arbitrarily small $\delta > 0$, when $\mathcal{A}$ is a certain class of hyperrectangles. In a different direction, the paper (Fang and Koike, 2020) dealt with a setting where $\mathcal{A}$ includes all hyperrectangles, and where the vector $X_1$ is isotropic. Within this setting, the authors established a rate with near $n^{-1/2}$ dependence on $n$ (up to factors of $\log(p)$ and $\log(n)$) when $X_1$ has a continuous log-concave density, as well as a rate with slower polynomial dependence on $n$ when $X_1$ has sub-Gaussian entries (but need not have a density). In the current work, we focus on the latter case, and the main contribution of our first result (Theorem 2.1) is to establish a rate with near $n^{-1/2}$ dependence on $n$.

In addition to the work on Gaussian approximation described above, there are a few special cases where near $n^{-1/2}$ rates are known to be achievable via bootstrap approximation. First, in the setting of weak variance decay, it was shown in (Lopes, Lin and Müller, 2020) that the mentioned $n^{-1/2+\delta}$ rate holds for bootstrap approximation as well. Second, the paper (Chernozhukov et al., 2019) showed that near $n^{-1/2}$ rates can be achieved when bootstrap methods are used in particular ways. Namely, this was demonstrated in the case when the data have a symmetric distribution and Rademacher weights are chosen for the multiplier bootstrap, or when bootstrap quantiles are adjusted in a conservative manner. (See also (Deng, 2020) for further work in this direction.) In relation to these results, the current paper makes a second contribution in Theorem 2.3 by showing that a near $n^{-1/2}$ rate of bootstrap approximation holds, without relying on variance decay, symmetry, or conservative adjustments.

Concerning the proofs, perhaps the most important point to discuss is the use of smoothing techniques. As is well known, these techniques are based on using a smooth function, say $\psi : \mathbb{R}^p \to \mathbb{R}$, depending on a set $A \subset \mathbb{R}^p$, such that $\mathbb{E}[\psi(S_n)] \approx P(S_n \in A)$. Although these techniques are of fundamental importance, one of their drawbacks is that they often incur an extra smoothing error $|P(S_n \in A) - \mathbb{E}[\psi(S_n)]|$, which must be balanced with errors from various other approximations. Moreover, this balancing process often turns out to be a bottleneck for the overall rate of distributional approximation.

As a way of avoiding this bottleneck, we use a smoothing function that arises "implicitly" as part of the Lindeberg interpolation scheme, which has the benefit that it does not create any smoothing error. More concretely, if $X_1,\dots,X_n$ are non-Gaussian vectors and if $Y_1,\dots,Y_n$ are Gaussian vectors, then this notion of smoothing is based on the fact that the probability $P(\sum_{i=1}^k X_i + \sum_{j=k+1}^n Y_j \in A)$ can be equivalently written as $\mathbb{E}[\tilde\psi(\sum_{i=1}^k X_i)]$, for a particular smooth random function $\tilde\psi$ defined in terms of $Y_{k+1},\dots,Y_n$. (This is explained in detail in Section 4.1.) Furthermore, it turns out that the derivatives of $\tilde\psi$ may be controlled effectively, as a consequence of the work of (Bentkus, 1990). However, by itself, this type of smoothing does not seem to provide a way to handle every step of the Lindeberg interpolation, because the smoothing effect from the Gaussian vectors $Y_{k+1},\dots,Y_n$ runs out of steam when $k$ becomes close to $n$.
To overcome this issue, a second important ingredient in the proof is the use of induction, which makes it possible to re-use good approximations from small values of $k$ at larger values of $k$. In particular, the use of induction here is influenced by the paper (Bentkus, 2003) (even though the approach to smoothing in that work is different).

Notation.
We write $s \preceq r$ for two vectors $s, r \in \mathbb{R}^p$ satisfying the inequalities $s_j \le r_j$ for all $1 \le j \le p$. A scalar random variable $U$ is said to be sub-Gaussian if it has a finite $\psi_2$-Orlicz norm, defined by $\|U\|_{\psi_2} = \inf\{t > 0 \mid \mathbb{E}[\exp(U^2/t^2)] \le 2\}$. If $V$ is another random variable that is equal in distribution to $U$, then we write $V \stackrel{\mathcal{L}}{=} U$. If $x$ is a vector, matrix, or tensor with real entries, we use $\|x\|_\infty$ to refer to the maximum absolute value of the entries, and $\|x\|_1$ to refer to the sum of the absolute values of the entries. Also, the identity matrix in $\mathbb{R}^{p\times p}$ is denoted by $I_p$. Throughout the paper, the symbol $c$ will denote a positive absolute constant whose value may vary at each occurrence. (Different symbols will be used when it is necessary to track constants.) Lastly, in order to simplify presentation, we will use the function $\operatorname{Log}(t) = \max\{\log(t), 1\}$, where $\log$ is the ordinary natural logarithm.
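As a concrete illustration of the $\psi_2$-norm (a worked example added here for the reader; it is not part of the formal development), consider a standard Gaussian variable $U \sim N(0,1)$:

```latex
% Worked example: the psi_2-norm of U ~ N(0,1).
% For t^2 > 2, a direct Gaussian integral gives
%     E[exp(U^2/t^2)] = (1 - 2/t^2)^{-1/2}.
% The defining condition E[exp(U^2/t^2)] <= 2 is therefore
%     (1 - 2/t^2)^{-1/2} <= 2   <=>   1 - 2/t^2 >= 1/4   <=>   t^2 >= 8/3,
% and minimizing over all such t yields
\[
\|U\|_{\psi_2}
 \;=\; \inf\Bigl\{\, t > 0 \;\Big|\; \bigl(1 - \tfrac{2}{t^2}\bigr)^{-1/2} \le 2 \,\Bigr\}
 \;=\; \sqrt{8/3}.
\]
```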
2. Main results.
The following theorem is the core result of the paper. Later on, a Gaussian comparison result (Theorem 2.2) and a bootstrap approximation result (Theorem 2.3) are obtained as extensions. The main aspects of the proof of Theorem 2.1 are given in Section 3.

THEOREM 2.1. There is an absolute constant $\bar C > 0$, such that the following holds for all $n$ and $p$: Let $X_1,\dots,X_n \in \mathbb{R}^p$ be centered i.i.d. random vectors, and suppose that $\nu = \max_{1\le j\le p} \|X_{1j}/\sqrt{\mathrm{var}(X_{1j})}\|_{\psi_2}$ is finite. In addition, let $\rho$ be the smallest eigenvalue of the correlation matrix of $X_1$, and suppose that $\rho > 0$. Lastly, let $Y \in \mathbb{R}^p$ be a centered Gaussian random vector with $\mathbb{E}[YY^\top] = \mathbb{E}[X_1X_1^\top]$. Then,

(1)  $\sup_{r\in\mathbb{R}^p} \Big| P\Big(\tfrac{1}{\sqrt n}\sum_{i=1}^n X_i \preceq r\Big) - P\big(Y \preceq r\big) \Big| \;\le\; \bar C\Big(\tfrac{\nu^4}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pn)^4\operatorname{Log}(n)}{n^{1/2}}.$
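To make the statement concrete, the following is a minimal simulation sketch (not from the paper; the sample sizes, correlation level, and Rademacher design are our own illustrative choices). Restricting to thresholds of the form $r = t\mathbf{1}$ only probes a subfamily of rectangles, so this estimates a lower bound on the left side of (1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 200, 50, 2000

# Equicorrelated correlation matrix with off-diagonal 0.5, so rho = 0.5 here.
r_off = 0.5
R = (1 - r_off) * np.eye(p) + r_off * np.ones((p, p))
M = np.linalg.cholesky(R)

def max_of_sum(gaussian: bool) -> np.ndarray:
    """Monte Carlo draws of max_j S_{n,j}, where S_n = n^{-1/2} sum_i X_i."""
    out = np.empty(reps)
    for b in range(reps):
        if gaussian:
            W = rng.standard_normal((n, p))
        else:
            # Rademacher coordinates: mean 0, variance 1, sub-Gaussian.
            W = rng.choice([-1.0, 1.0], size=(n, p))
        S = (W @ M.T).sum(axis=0) / np.sqrt(n)  # rows of W @ M.T have covariance R
        out[b] = S.max()
    return out

mx, my = max_of_sum(False), max_of_sum(True)
# Empirical CDFs of the max statistic, compared over a grid of thresholds t.
grid = np.linspace(-2.0, 6.0, 400)
Fx = (mx[:, None] <= grid).mean(axis=0)
Fy = (my[:, None] <= grid).mean(axis=0)
print("estimated distance over rectangles r = t*1:", np.abs(Fx - Fy).max())
```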
Remarks. To discuss some of the characteristics of the bound, it should be mentioned that the reliance on the sub-Gaussian entries of $X_1$ can be relaxed. In particular, a corresponding result for sub-exponential entries can be obtained with a larger power of $\operatorname{Log}(pn)$.

Next, there are a few items to consider with regard to the dependence on the parameter $\rho$. First, it is important to clarify that $\rho$ is often much larger than the smallest eigenvalue of the covariance matrix $\mathbb{E}[X_1X_1^\top]$, say $\lambda$. This is easiest to see in the context of uncorrelated variables, where $\rho = 1$ holds for arbitrarily small $\lambda > 0$. More generally, in the context of strongly correlated variables, the parameter $\rho$ need not be very small either. For instance, in the case when $\mathrm{cor}(X_{1i}, X_{1j}) = 0.9$ for all $i \neq j$, it follows that $\rho = 0.1$ (regardless of the dimension $p$). With regard to the Gaussian approximation results in (Fang and Koike, 2020), it is difficult to make a comparison with respect to the dependence on $\rho$, since those results are formulated in an isotropic case where $\rho = 1$. Meanwhile, it should be noted that the Gaussian approximation result in (Chernozhukov et al., 2019) does not rely on the condition $\rho > 0$. Instead, that result depends on how well separated the parameter $\varsigma = \min_{1\le j\le p}\mathrm{var}(X_{1j})$ is from 0. These different relative merits also apply to the Gaussian comparison and bootstrap approximation results in that work, vis-à-vis Theorems 2.2 and 2.3 given below.

At an informal level, if $\rho$ is well separated from 0, this can be interpreted to mean that the distribution of $X_1$ is "fully high-dimensional", which is precisely the case we are interested in. On the other hand, if the correlation matrix of $X_1$ has some eigenvalues that are very small, this is an indication that the distribution of $X_1$ has some low-dimensional structure, and in that case, it may be preferable to pursue a different approach that takes the structure directly into account, such as in (Lopes, Lin and Müller, 2020). Nevertheless, it turns out that it is possible to extend Theorem 2.1 to handle the case when $\rho = 0$, provided that the correlation matrix of $X_1$ is close to a positive definite matrix in an entrywise sense. However, this will not be needed in order to develop the bootstrap approximation result later on.
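The equicorrelation example above can be checked directly (a short verification added for the reader's convenience):

```latex
% Verification of the equicorrelation example. With off-diagonal correlation 0.9,
% the correlation matrix is R = (1 - 0.9) I_p + 0.9 * 1 1^T, with 1 the all-ones
% vector. Every vector orthogonal to 1 is an eigenvector with eigenvalue
% 1 - 0.9 = 0.1 (multiplicity p - 1), while 1 itself has eigenvalue 1 + 0.9(p-1).
% Hence
\[
\rho \;=\; \lambda_{\min}(R) \;=\; 0.1
\qquad \text{for every } p \ge 2,
\]
% even though the largest eigenvalue grows linearly in p.
```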
2.1. Gaussian comparison. Our second result provides a bound on the Kolmogorov distance between two Gaussian vectors in terms of a normalized $\ell_\infty$-distance between their covariance matrices. In addition to being of basic interest by itself, this result will serve as a bridge to connect the Gaussian approximation result in Theorem 2.1 with the bootstrap approximation result in Theorem 2.3.
THEOREM 2.2. There is an absolute constant $c > 0$, such that the following holds for all $p$: Let $Y$ and $Z$ be centered Gaussian vectors in $\mathbb{R}^p$, having respective covariance matrices $\Sigma^Y$ and $\Sigma^Z$. In addition, let $\rho$ be the smallest eigenvalue of the correlation matrix of $Y$, and suppose that $\rho > 0$. Lastly, let $D = \operatorname{diag}(\Sigma^Y_{11},\dots,\Sigma^Y_{pp})$, and let $\Delta = \|D^{-1/2}(\Sigma^Z - \Sigma^Y)D^{-1/2}\|_\infty$. Then,

(2)  $\sup_{r\in\mathbb{R}^p}\big| P(Z \preceq r) - P(Y \preceq r)\big| \;\le\; \Big(\tfrac{c}{\rho}\Big)\operatorname{Log}(p)^2\operatorname{Log}\big(\tfrac{1}{\Delta}\big)\,\Delta.$

Remarks.
Although the bound depends on the invertibility of $\Sigma^Y$, it is important to note that the bound does not depend on the invertibility of $\Sigma^Z$. This is a key property in the context of bootstrap approximation, where $\Sigma^Z$ will represent a sample covariance matrix that is possibly non-invertible. Another comment to make about Theorem 2.2 is its relation to Corollary 5.1 of the paper (Chernozhukov et al., 2019). That result establishes a Gaussian comparison bound of the form $c(\varsigma)\operatorname{Log}(p)\sqrt{\|\Sigma^Y - \Sigma^Z\|_\infty}$, where the parameter $\varsigma = \min_{1\le j\le p}\Sigma^Y_{jj}$ is assumed to be positive, and the constant $c(\varsigma)$ depends only on $\varsigma$. In this connection, the essential point to notice is that the bound (2) has a near-linear dependence on the parameter $\Delta$.

2.2. Bootstrap approximation.
For a set of observations $X_1,\dots,X_n \in \mathbb{R}^p$, the associated sample covariance matrix is defined as

(3)  $\widehat\Sigma \;=\; \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^\top,$

where $\bar X = \frac1n\sum_{i=1}^n X_i$. In terms of this matrix, the Gaussian multiplier bootstrap method developed by (Chernozhukov, Chetverikov and Kato, 2013) is based on generating a set of independent random vectors $X_1^*,\dots,X_n^* \in \mathbb{R}^p$ from the Gaussian distribution $N(0, \widehat\Sigma)$. The general purpose of this method is to use the distribution of the sum $X_1^* + \cdots + X_n^*$ (conditional on the original observations) as an approximation to the distribution of the sum $X_1 + \cdots + X_n$. Accordingly, we will use the notation $P(\cdot \mid X)$ to refer to probability that is conditional on $X_1,\dots,X_n$.
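As an illustration of this method, the following is a minimal sketch (the max statistic and quantile level are our own choices for the example, not prescribed by the paper) that computes a bootstrap quantile for $\max_j \frac{1}{\sqrt n}\sum_i X^*_{ij}$:

```python
import numpy as np

def gaussian_multiplier_bootstrap(X: np.ndarray, B: int = 1000,
                                  alpha: float = 0.1, seed: int = 0) -> float:
    """Bootstrap (1 - alpha)-quantile of max_j n^{-1/2} sum_i X*_{ij},
    with X*_1, ..., X*_n i.i.d. from N(0, Sigma_hat)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)   # center the observations
    # Conditionally on the data, n^{-1/2} sum_i g_i (X_i - Xbar) with g ~ N(0, I_n)
    # is exactly N(0, Sigma_hat), so the bootstrap sum can be sampled directly.
    stats = np.empty(B)
    for b in range(B):
        g = rng.standard_normal(n)
        stats[b] = (Xc.T @ g).max() / np.sqrt(n)
    return np.quantile(stats, 1 - alpha)

# Usage on simulated, strongly correlated data:
rng = np.random.default_rng(1)
p = 100
R = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))
X = rng.standard_normal((500, p)) @ np.linalg.cholesky(R).T
print(gaussian_multiplier_bootstrap(X))
```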
THEOREM 2.3. There is an absolute constant $c > 0$, such that the following holds for all $n$ and $p$: Suppose that the conditions of Theorem 2.1 hold, and let $X_1^*,\dots,X_n^*$ be independent Gaussian random vectors drawn from $N(0,\widehat\Sigma)$. Then, the following event holds with probability at least $1 - \tfrac{c}{n}$,

(4)  $\sup_{r\in\mathbb{R}^p}\Big| P\Big(\tfrac{1}{\sqrt n}\sum_{i=1}^n X_i^* \preceq r \,\Big|\, X\Big) - P\Big(\tfrac{1}{\sqrt n}\sum_{i=1}^n X_i \preceq r\Big)\Big| \;\le\; c\Big(\tfrac{\nu^4}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pn)^4\operatorname{Log}(n)}{n^{1/2}}.$

Remarks.
This result follows directly from Theorem 2.1 and Theorem 2.2 by letting $\widehat\Sigma$ play the role of $\Sigma^Z$, and letting $\Sigma = \mathbb{E}[X_1X_1^\top]$ play the role of $\Sigma^Y$. To provide a bit more detail, we need only consider the case when $n \ge \operatorname{Log}(pn)^8$, for otherwise there is nothing to prove. In this case, there is an absolute constant $c > 0$, such that the random variable $\widehat\Delta = \|D^{-1/2}(\widehat\Sigma - \Sigma)D^{-1/2}\|_\infty$ satisfies the bound $\widehat\Delta \le c\,\nu^2\sqrt{\operatorname{Log}(pn)}\,n^{-1/2}$ with probability at least $1 - \tfrac{c}{n}$, as recorded in Lemma 7.1 of Section 7.

Outline.
After a high-level proof of Theorem 2.1 is given in Section 3, some preparatory items are developed in Section 4, which will be used in the more technical arguments given in Section 5. Later on, Theorem 2.2 is proven in Section 6, and various background results are summarized in Section 7.
3. The main steps in the proof of Theorem 2.1.
Without loss of generality, we may assume that the covariance matrix $\mathbb{E}[X_1X_1^\top]$ has all ones along the diagonal, because the Kolmogorov distance is invariant to diagonal rescaling. Also, for future reference, it will be useful to take note of the general bounds $\rho \le 1$ and $\nu \ge 1$, which are implied by the definitions of $\rho$ and $\nu$.

To lay out the beginning of the proof, let $Y_1,\dots,Y_n \in \mathbb{R}^p$ be i.i.d. copies of $Y$ that are independent of $X_1,\dots,X_n$. In order to write various partial sums, we use the following notation for $1 \le k \le k' \le n$,

$S_{k:k'}(X) = n^{-1/2}(X_k + \cdots + X_{k'}), \qquad S_{k:k'}(Y) = n^{-1/2}(Y_k + \cdots + Y_{k'}),$

and we abbreviate $S_k(X) = S_{1:k}(X)$ and $S_k(Y) = S_{1:k}(Y)$. In addition, it will be convenient to denote the Kolmogorov distance between $S_k(X)$ and $S_k(Y)$ as

$D_k \;=\; \sup_{r\in\mathbb{R}^p}\big| P(S_k(X) \preceq r) - P(S_k(Y) \preceq r)\big|.$

To write down a basic form of Lindeberg interpolation for bounding $D_n$, define the following quantities for any $r \in \mathbb{R}^p$,

$\delta_k^X(r) \;=\; P\big(S_{k-1}(X) + \tfrac{1}{\sqrt n}X_k + S_{k+1:n}(Y) \preceq r\big) - P\big(S_{k-1}(X) + S_{k+1:n}(Y) \preceq r\big),$
$\delta_k^Y(r) \;=\; P\big(S_{k-1}(X) + \tfrac{1}{\sqrt n}Y_k + S_{k+1:n}(Y) \preceq r\big) - P\big(S_{k-1}(X) + S_{k+1:n}(Y) \preceq r\big).$

This notation yields the interpolation

$P(S_n(X) \preceq r) - P(S_n(Y) \preceq r) \;=\; \sum_{k=1}^n \big(\delta_k^X(r) - \delta_k^Y(r)\big),$

as verified below. Next, define the supremum of the $k$th difference

(5)  $\delta_k \;=\; \sup_{r\in\mathbb{R}^p} |\delta_k^X(r) - \delta_k^Y(r)|,$

which leads to the bound $D_n \le \delta_1 + \cdots + \delta_n$. However, rather than working directly with the entire sum $\delta_1 + \cdots + \delta_n$, we will begin with a lemma that reduces the problem to bounding $\delta_1 + \cdots + \delta_{n-m}$ for an integer $m$ that will be carefully chosen later on.
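To spell out the telescoping step behind this interpolation (a routine verification added for completeness), introduce the hybrid sums $T_k = S_k(X) + S_{k+1:n}(Y)$ for $0 \le k \le n$, with $T_0 = S_n(Y)$ and $T_n = S_n(X)$:

```latex
% Telescoping identity behind the Lindeberg interpolation. Both delta_k^X(r)
% and delta_k^Y(r) are differences against the same baseline
% S_{k-1}(X) + S_{k+1:n}(Y), and since
%   T_k     = S_{k-1}(X) + n^{-1/2} X_k + S_{k+1:n}(Y),
%   T_{k-1} = S_{k-1}(X) + n^{-1/2} Y_k + S_{k+1:n}(Y),
% the baseline cancels in the difference:
\[
\delta_k^X(r) - \delta_k^Y(r) \;=\; P(T_k \preceq r) - P(T_{k-1} \preceq r).
\]
% Summing over k = 1, ..., n collapses the telescope:
\[
\sum_{k=1}^{n}\bigl(\delta_k^X(r) - \delta_k^Y(r)\bigr)
 \;=\; P(T_n \preceq r) - P(T_0 \preceq r)
 \;=\; P(S_n(X) \preceq r) - P(S_n(Y) \preceq r).
\]
```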
LEMMA 3.1. There is an absolute constant $c > 0$ such that the following holds for all $n, p \ge 1$ and $1 \le m \le n$: If the conditions of Theorem 2.1 hold, then

(6)  $D_n \;\le\; c\,\nu\sqrt{\tfrac{m}{n}}\,\operatorname{Log}(pn) \;+\; 3(\delta_1 + \cdots + \delta_{n-m}).$

It is worthwhile to proceed straight to the proof of this lemma, since the argument is fairly short, and since the notation in it will be used later on.
PROOF. As a temporary shorthand, let $\zeta = S_n(X)$ and $\xi = S_{n-m}(X) + S_{n-m+1:n}(Y)$. Observe that the Kolmogorov distance between $\xi$ and $S_n(Y)$ is at most $\delta_1 + \cdots + \delta_{n-m}$, which gives

$D_n \;\le\; \sup_{r\in\mathbb{R}^p}\big| P(\zeta \preceq r) - P(\xi \preceq r)\big| + (\delta_1 + \cdots + \delta_{n-m}).$

To control the first term on the right side, define the corner set associated with a fixed $r \in \mathbb{R}^p$ and $t \in \mathbb{R}$,

$C(r,t) \;=\; \big\{ x \in \mathbb{R}^p \,\big|\, x \preceq r + t\mathbf{1} \big\},$

where $\mathbf{1} \in \mathbb{R}^p$ is the all-ones vector. Also, define an associated boundary set of "width" $t$,

(7)  $\partial C(r,t) \;=\; C(r,t) \setminus C(r,-t).$

In terms of this notation, we have the following basic inequality (Lemma 7.3), which holds for any $r \in \mathbb{R}^p$ and any $t > 0$,

$|P(\zeta \preceq r) - P(\xi \preceq r)| \;\le\; P\big(\xi \in \partial C(r,t)\big) + P(\|\zeta - \xi\|_\infty \ge t).$

Next, we control the probability that $\xi$ hits $\partial C(r,t)$ by essentially replacing $\xi$ with $S_n(Y)$. To do this, again note that the Kolmogorov distance between $\xi$ and $S_n(Y)$ is at most $\delta_1 + \cdots + \delta_{n-m}$, and so

$P\big(\xi \in C(r,t)\big) \;\le\; P\big(S_n(Y) \in C(r,t)\big) + (\delta_1 + \cdots + \delta_{n-m}).$

Similarly,

$P\big(\xi \in C(r,-t)\big) \;\ge\; P\big(S_n(Y) \in C(r,-t)\big) - (\delta_1 + \cdots + \delta_{n-m}),$

and then combining gives

$P\big(\xi \in \partial C(r,t)\big) \;\le\; P\big(S_n(Y) \in \partial C(r,t)\big) + 2(\delta_1 + \cdots + \delta_{n-m}).$

In turn, Nazarov's Gaussian anti-concentration inequality (Lemma 7.2) gives

$P\big(S_n(Y) \in \partial C(r,t)\big) \;\le\; ct\sqrt{\operatorname{Log}(p)},$

where we have made use of the reduction that $\mathbb{E}[YY^\top]$ has all ones along the diagonal. Lastly, observe that by a sub-Gaussian tail bound (Lemma 7.1), if we take $t = c\nu\sqrt{\tfrac{m}{n}\operatorname{Log}(pn)}$ for a sufficiently large absolute constant $c > 0$, then the coupling probability $P(\|\zeta - \xi\|_\infty > t)$ is at most $\tfrac{c}{n}$. Furthermore, given that the parameter $\nu$ is at least 1, it follows that the quantity $\tfrac{c}{n}$ is of negligible order in comparison to $t\sqrt{\operatorname{Log}(p)}$.

Comments on induction.
In order to prove Theorem 2.1, we will use a form of strong induction. Specifically, for a given absolute constant $\bar C > 0$, and given integers $n$ and $p$, the associated induction hypothesis is that the inequality (H$_k(\bar C)$) below holds simultaneously for all $k = 1,\dots,n-1$,

(H$_k(\bar C)$)  $D_k \;\le\; \bar C\Big(\tfrac{\nu^4}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pk)^4\operatorname{Log}(k)}{k^{1/2}}.$

Although it is common in high-dimensional statistics to think of $n$ and $p$ as growing together, it is worth clarifying that the inductive approach here is based on showing that, for any fixed $p$, the entire sequence H$_1(\bar C)$, H$_2(\bar C)$, $\dots$ holds. Hence, because $p$ is arbitrary, it will follow that the statement of the main result holds for all pairs $(n,p)$.

Proof of Theorem 2.1.
Observe that if $\bar C \ge \sqrt 2$, then it is clear that H$_1(\bar C)$ and H$_2(\bar C)$ hold. To carry out the induction, fix any $n \ge 3$, and suppose that H$_1(\bar C),\dots,$H$_{n-1}(\bar C)$ hold for some absolute constant $\bar C \ge \sqrt 2$. Our goal is now to show that H$_n(\bar C)$ holds (with the same value of $\bar C$). The main tool for this purpose is the proposition below, whose proof is deferred to Section 5.

PROPOSITION 3.2.
There is a positive absolute constant $c_1$ such that the following holds for all $n \ge 3$, $p \ge 1$, and $1 \le m \le n/2$: If the conditions of Theorem 2.1 hold, and if H$_1(\bar C),\dots,$H$_{n-1}(\bar C)$ hold for some absolute constant $\bar C \ge \sqrt 2$, then

(8)  $\delta_2 + \cdots + \delta_{n-m} \;\le\; c_1\Big(\tfrac{\nu^3}{\rho}\Big)\tfrac{\operatorname{Log}(pn)^4\operatorname{Log}(n)}{n^{1/2}} \;+\; c_1\bar C\Big(\tfrac{\nu^7}{\rho^3}\Big)\tfrac{\operatorname{Log}(pn)^7\operatorname{Log}(n)}{n^{1/2}m^{1/2}}.$

At a high level, Lemma 3.1 and Proposition 3.2 reduce the proof of Theorem 2.1 to exhibiting suitable values of $m$ and $\bar C$. To proceed, let $c_0$ and $c_1$ be the absolute constants in the statements of these results. We may assume without loss of generality that $c_0 = c_1$ and $c_1 \ge 1$, because these results remain true if $c_0$ and $c_1$ are both replaced by $\max\{c_0, c_1, 1\}$. Next, let $1 \le m \le n/2$, and define the quantities $\alpha$, $\beta$, and $\gamma$ according to

$\alpha \;=\; c_0\,\nu\, m^{1/2}\,\tfrac{\operatorname{Log}(pn)}{n^{1/2}}$

$\beta \;=\; c_1\Big(\tfrac{\nu^3}{\rho}\Big)\tfrac{\operatorname{Log}(pn)^4\operatorname{Log}(n)}{n^{1/2}} + c_1\bar C\Big(\tfrac{\nu^7}{\rho^3}\Big)\tfrac{\operatorname{Log}(pn)^7\operatorname{Log}(n)}{n^{1/2}m^{1/2}}$

$\gamma \;=\; \bar C\Big(\tfrac{\nu^4}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pn)^4\operatorname{Log}(n)}{n^{1/2}}.$

In terms of this notation, Lemma 3.1 and Proposition 3.2 give the bound

$D_n \;\le\; \alpha + 3\beta.$

Therefore, in order to show H$_n(\bar C)$, it is enough to show that there exist choices of $m$ and $\bar C$ for which

(9)  $\alpha + 3\beta \;\le\; \gamma.$

(It is not immediately obvious that such choices exist, because both sides of (9) depend on $\bar C$, and also, because $m$ must simultaneously satisfy the constraint $m \le n/2$.)

In the remainder of the proof, we will construct feasible choices of $\bar C$ and $m$ explicitly in terms of $c_1$. For this purpose, let $\kappa \ge 1$ be a value that will be tuned later, and consider a choice of $m$ whose square root is given by

(10)  $m^{1/2} \;=\; \Big\lceil \kappa\Big(\tfrac{\nu^3}{\rho^{3/2}}\Big)\operatorname{Log}(pn)^3 \Big\rceil.$

When $m$ is chosen this way, the quantities $\alpha$ and $\beta$ satisfy

(11)  $\alpha \;\le\; 2\kappa c_0\Big(\tfrac{\nu^4}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pn)^4}{n^{1/2}}$

(12)  $\beta \;\le\; c_1\Big(\tfrac{\nu^3}{\rho}\Big)\tfrac{\operatorname{Log}(pn)^4\operatorname{Log}(n)}{n^{1/2}} + \Big(\tfrac{c_1\bar C}{\kappa}\Big)\Big(\tfrac{\nu^4}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pn)^4\operatorname{Log}(n)}{n^{1/2}}.$

(Note that in (11), the prefactor of 2 is introduced so that the ceiling function in (10) can be ignored.) Since $\nu \ge 1$ and $\rho \le 1$, we have $\nu^3/\rho \le \nu^4/\rho^{3/2}$, which implies

$\alpha + 3\beta \;\le\; \Big( 2\kappa c_1 + 3c_1 + \tfrac{3c_1\bar C}{\kappa} \Big)\cdot\tfrac{\gamma}{\bar C}.$
Thus, in order to show $\alpha + 3\beta \le \gamma$, it suffices to select $\kappa$ and $\bar C$ in terms of $c_1$ so that

$2\kappa c_1 + 3c_1 + \tfrac{3c_1\bar C}{\kappa} \;\le\; \bar C.$

Likewise, if we put

(13)  $\kappa \;=\; \sqrt{\tfrac{3\bar C}{2}},$

then $\bar C$ should be chosen to satisfy

$(2\sqrt 6\, c_1)\sqrt{\bar C} + 3c_1 \;\le\; \bar C.$

This is a quadratic inequality in $\sqrt{\bar C}$, which holds when

(14)  $\bar C \;\ge\; \Big( \sqrt 6\, c_1 + \sqrt{6c_1^2 + 3c_1} \Big)^2.$

In particular, this is compatible with the condition $\bar C \ge \sqrt 2$ mentioned earlier, since $c_1 \ge 1$. Moreover, since the right side of (14) is purely a function of $c_1$, the only remaining consideration is to make sure that (14) allows for a feasible choice of $m \le n/2$. (Note that $m$ is now determined by $\bar C$ through (10) and (13).) To do this, we may assume without loss of generality that the inequality

$n^{1/2} \;\ge\; \bar C\Big(\tfrac{\nu^4}{\rho^{3/2}}\Big)\operatorname{Log}(pn)^4\operatorname{Log}(n)$

holds, for otherwise H$_n(\bar C)$ is trivially true. Comparing this inequality with (10) shows that the condition $m \le n/2$ holds, for instance, when $\sqrt{(3/2)\bar C} \le \bar C/\sqrt 2$, i.e. when $\bar C \ge 3$. But at the same time, the right side of (14) is already greater than 18, and so it suffices to take $\bar C$ equal to the right side of (14).
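For completeness, the quadratic step can be verified as follows (a routine calculation added for the reader):

```latex
% Verification of (14). Writing x = sqrt(C-bar), the requirement
%     2 sqrt(6) c_1 x + 3 c_1 <= x^2
% is the quadratic inequality x^2 - 2 sqrt(6) c_1 x - 3 c_1 >= 0, whose larger
% root follows from the quadratic formula:
\[
x \;\ge\; \frac{2\sqrt{6}\,c_1 + \sqrt{24c_1^2 + 12c_1}}{2}
  \;=\; \sqrt{6}\,c_1 + \sqrt{6c_1^2 + 3c_1},
\]
% and squaring both sides gives (14). Also, the choice kappa = sqrt(3 C-bar / 2)
% in (13) exactly balances the two kappa-dependent terms, since
\[
2\kappa c_1 \;=\; c_1\sqrt{6\bar C} \;=\; \frac{3c_1\bar C}{\kappa}.
\]
```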
4. Preparatory items.
This section develops the notation and key objects that will be needed to prove Proposition 3.2 in Section 5.
4.1. Implicit smoothing.
The main idea in this subsection is to represent the quantities $\delta_k^X(r)$ and $\delta_k^Y(r)$ in terms of a certain implicit Gaussian smoothing function. We use the word "implicit", because the smoothing function is automatically built into the Lindeberg interpolation through the Gaussian partial sums.

To proceed, let $\zeta \sim N(0, I_p)$ be a standard Gaussian vector in $\mathbb{R}^p$, and for any fixed $r, s \in \mathbb{R}^p$ and $\epsilon > 0$, define

(15)  $\varphi_\epsilon(s, r) \;=\; P(s + \epsilon\zeta \preceq r) \;=\; \prod_{j=1}^p \Phi\Big(\tfrac{r_j - s_j}{\epsilon}\Big).$

When $r$ is held fixed, the function $\varphi_\epsilon(\cdot, r)$ is a smoothed version of the indicator $s \mapsto \mathbf{1}\{s \preceq r\}$, with $\epsilon$ playing the role of a smoothing parameter. Next, for each $k = 1,\dots,n-1$, define

(16)  $\epsilon_k \;=\; \sqrt{\tfrac{n-k}{n}}\,\sqrt{\rho}.$

The parameter $\epsilon_k$ is used in order to simplify the following (distributional) decomposition of the Gaussian vector $S_{k+1:n}(Y)$,

$S_{k+1:n}(Y) \;\stackrel{\mathcal{L}}{=}\; \epsilon_k V_{k+1} + \sqrt{\tfrac{n-k}{n}}\, W_{k+1},$

where $V_{k+1} \sim N(0, I_p)$ and $W_{k+1} \sim N(0, R - \rho I_p)$ are independent, and $R$ is the correlation matrix of $X_1$. (Here, we continue to work under the reduction that $\mathbb{E}[X_1X_1^\top] = R$. Also, note that the vectors $V_{k+1}$ and $W_{k+1}$ may be taken to be independent of $X_1,\dots,X_n$.) Consequently, if we let

(17)  $\widehat r_{k+1} \;=\; r - \sqrt{\tfrac{n-k}{n}}\, W_{k+1},$

then we can connect $\varphi_{\epsilon_k}$ to the partial sums in the Lindeberg interpolation through the following exact relation

$P\big(S_k(X) + S_{k+1:n}(Y) \preceq r\big) \;=\; \mathbb{E}\big[\varphi_{\epsilon_k}\big(S_k(X),\, \widehat r_{k+1}\big)\big].$

In turn, this relation allows us to express $\delta_k^X(r)$ in terms of $\varphi_{\epsilon_k}$ for $k = 1,\dots,n-1$,

(18)  $\delta_k^X(r) \;=\; \mathbb{E}\Big[\varphi_{\epsilon_k}\Big(S_{k-1}(X) + \tfrac{1}{\sqrt n}X_k,\ \widehat r_{k+1}\Big) - \varphi_{\epsilon_k}\Big(S_{k-1}(X),\ \widehat r_{k+1}\Big)\Big].$

The formula (18) is the key item to take away from the current subsection. The corresponding expression for $\delta_k^Y(r)$ is nearly identical, with the only change being that the single occurrence of $X_k$ in (18) is replaced with $Y_k$.
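The decomposition above can be verified by a one-line covariance computation (added here as a check):

```latex
% Check of the distributional decomposition of S_{k+1:n}(Y). Both sides are
% centered Gaussian vectors, so it suffices to match covariance matrices:
\[
\operatorname{cov}\Bigl(\epsilon_k V_{k+1} + \sqrt{\tfrac{n-k}{n}}\,W_{k+1}\Bigr)
 \;=\; \epsilon_k^2 I_p + \tfrac{n-k}{n}(R - \rho I_p)
 \;=\; \tfrac{n-k}{n}\,\rho I_p + \tfrac{n-k}{n}\,R - \tfrac{n-k}{n}\,\rho I_p
 \;=\; \tfrac{n-k}{n}\,R,
\]
% which is exactly the covariance of S_{k+1:n}(Y) = n^{-1/2}(Y_{k+1}+...+Y_n).
% Note that R - rho I_p is positive semidefinite precisely because rho is the
% smallest eigenvalue of R, so W_{k+1} is a well-defined Gaussian vector.
```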
4.2. Moment matching. By expanding the function $\varphi_{\epsilon_k}(\cdot, \widehat r_{k+1})$ to second order at the point $S_{k-1}(X)$, we have the moment-matching formulas

(19)  $\delta_k^X(r) = \mathbb{E}[L_k^X(r)] + \mathbb{E}[Q_k^X(r)] + \mathbb{E}[R_k^X(r)]$
(20)  $\delta_k^Y(r) = \mathbb{E}[L_k^Y(r)] + \mathbb{E}[Q_k^Y(r)] + \mathbb{E}[R_k^Y(r)],$

where the terms corresponding to $\delta_k^X(r)$ are defined as follows. Specifically, if all derivatives are understood as being with respect to the first argument of $\varphi_{\epsilon_k}$, then

(21)  $L_k^X(r) = \big\langle \nabla\varphi_{\epsilon_k}(S_{k-1}(X), \widehat r_{k+1}),\ n^{-1/2}X_k \big\rangle$

(22)  $Q_k^X(r) = \tfrac12\big\langle \nabla^2\varphi_{\epsilon_k}(S_{k-1}(X), \widehat r_{k+1}),\ n^{-1}X_kX_k^\top \big\rangle$

(23)  $R_k^X(r) = \tfrac{(1-\tau)^2}{2}\Big\langle \nabla^3\varphi_{\epsilon_k}\Big(S_{k-1}(X) + \tfrac{\tau}{\sqrt n}X_k,\ \widehat r_{k+1}\Big),\ n^{-3/2}X_k^{\otimes 3} \Big\rangle,$

with $\tau$ being a Uniform[0,1] random variable that is independent of all other random variables. The notation $\nabla^3\varphi_{\epsilon_k}(s, r)$ refers to the tensor in $\mathbb{R}^{p\times p\times p}$ whose entries are comprised of all possible three-fold partial derivatives of $\varphi_{\epsilon_k}(\cdot, r)$ at the point $s$. Also, we use $\langle\cdot,\cdot\rangle$ to denote the entrywise inner product on vectors, matrices, and tensors. Lastly, the terms $L_k^Y(r)$, $Q_k^Y(r)$ and $R_k^Y(r)$ associated with $\delta_k^Y(r)$ in (20) only differ from those above insofar as each appearance of $X_k$ on the right sides of (21), (22), and (23) is replaced by $Y_k$.

The classical idea of the Lindeberg interpolation is that if (20) is subtracted from (19), then the first and second order terms cancel, because $X_k$ and $Y_k$ have matching mean vectors and covariance matrices. This leads to the relation

(24)  $\delta_k^X(r) - \delta_k^Y(r) \;=\; \mathbb{E}[R_k^X(r)] - \mathbb{E}[R_k^Y(r)].$

Hence, in order to control the supremum $\delta_k = \sup_{r\in\mathbb{R}^p}|\delta_k^X(r) - \delta_k^Y(r)|$ in (5), it remains to bound the expected remainders uniformly with respect to $r \in \mathbb{R}^p$, and this is handled in the next section.
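To spell out the cancellation step (a short verification added here), note that $S_{k-1}(X)$ and $\widehat r_{k+1}$ are independent of both $X_k$ and $Y_k$:

```latex
% Cancellation of the first- and second-order terms in (19)-(20). By
% independence, the gradient factors out of the expectation over X_k:
\[
\mathbb{E}[L_k^X(r)]
 \;=\; \bigl\langle \mathbb{E}\bigl[\nabla\varphi_{\epsilon_k}(S_{k-1}(X),\widehat r_{k+1})\bigr],\,
        n^{-1/2}\,\mathbb{E}[X_k] \bigr\rangle \;=\; 0,
\]
% and likewise E[L_k^Y(r)] = 0, since E[X_k] = E[Y_k] = 0. Similarly,
\[
\mathbb{E}[Q_k^X(r)] - \mathbb{E}[Q_k^Y(r)]
 \;=\; \tfrac{1}{2n}\bigl\langle \mathbb{E}\bigl[\nabla^2\varphi_{\epsilon_k}(S_{k-1}(X),\widehat r_{k+1})\bigr],\,
        \mathbb{E}[X_kX_k^\top] - \mathbb{E}[Y_kY_k^\top] \bigr\rangle \;=\; 0,
\]
% because X_k and Y_k share the same covariance matrix. Only the third-order
% remainders survive, which is exactly the relation (24).
```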
5. Bounds for $\delta_k$, and the proof of Proposition 3.2. The next lemma handles $\delta_k$ for $k = 2,\dots,n-1$. This lemma is of special significance to the overall structure of the proof of Theorem 2.1, because it sets up the opportunity to apply the induction hypothesis to $D_{k-1}$. Apart from this, the quantity $\delta_1$ will be handled separately in Lemma 5.2 later on. (It will not be necessary to handle $\delta_n$, due to Lemma 3.1.) At the end of the section, the proof of Proposition 3.2 will be given.

LEMMA 5.1.
There is an absolute constant $c > 0$ such that the following holds for all $n \ge 3$, $p \ge 1$, and $2 \le k \le n-1$: If the conditions of Theorem 2.1 hold, then

(25)  $\delta_k \;\le\; \tfrac{c\,\nu^3\operatorname{Log}(pn)^3}{\epsilon_k^3\, n^{3/2}}\Big( \epsilon_k\operatorname{Log}(pn)\sqrt{\tfrac{n}{k-1}} \;+\; D_{k-1} \;+\; \tfrac{1}{pn} \Big).$

PROOF. From the previous section, we have the following bound on $\delta_k$,

(26)  $\delta_k \;\le\; \sup_{r\in\mathbb{R}^p}\mathbb{E}[|R_k^X(r)|] + \sup_{r\in\mathbb{R}^p}\mathbb{E}[|R_k^Y(r)|].$

The current proof will only establish a bound on $\sup_{r\in\mathbb{R}^p}\mathbb{E}[|R_k^X(r)|]$, since the argument is the same for $\sup_{r\in\mathbb{R}^p}\mathbb{E}[|R_k^Y(r)|]$. To begin, define the random vector $\tilde r_{k+1} = \widehat r_{k+1} - \tfrac{\tau}{\sqrt n}X_k$, and for any fixed $\varepsilon > 0$, define the event

$A_k(\varepsilon) \;=\; \big\{ S_{k-1}(X) \in \partial C(\tilde r_{k+1}, \varepsilon) \big\}.$

Below, we will separately analyze $R_k^X(r)$ on the event $A_k(\varepsilon)$ and its complement $A_k^c(\varepsilon)$, via

$\mathbb{E}[|R_k^X(r)|] \;=\; \mathbb{E}[|R_k^X(r)|\mathbf{1}\{A_k(\varepsilon)\}] + \mathbb{E}[|R_k^X(r)|\mathbf{1}\{A_k^c(\varepsilon)\}].$

Handling the remainder on $A_k(\varepsilon)$. By applying Hölder's inequality to the definition of $R_k^X(r)$ in (23), we have

(27)  $|R_k^X(r)|\mathbf{1}\{A_k(\varepsilon)\} \;\le\; \tfrac{1}{2n^{3/2}}\cdot\Big(\sup_{s,r\in\mathbb{R}^p}\|\nabla^3\varphi_{\epsilon_k}(s,r)\|_1\Big)\cdot\|X_k\|_\infty^3\cdot\mathbf{1}\{A_k(\varepsilon)\},$

where $\|\nabla^3\varphi_{\epsilon_k}(s,r)\|_1$ refers to the sum of the absolute values of the entries in the 3-tensor $\nabla^3\varphi_{\epsilon_k}(s,r)$. Crucially, it is known from (Bentkus, 1990, Theorem 3) that

(28)  $\sup_{s,r\in\mathbb{R}^p}\|\nabla^3\varphi_{\epsilon_k}(s,r)\|_1 \;\le\; \tfrac{c\operatorname{Log}(p)^{3/2}}{\epsilon_k^3}.$

To be precise, the result (Bentkus, 1990, Theorem 3) is stated for functions that are slightly different from $\varphi_{\epsilon_k}(s,r)$, but a more recent statement of the result that matches the form of (28) can be found in (O'Donnell, Servedio and Tan, 2019, Theorem 6.5).

Thus, it remains to control the expectation $\mathbb{E}[\|X_k\|_\infty^3\mathbf{1}\{A_k(\varepsilon)\}]$. Noting that $S_{k-1}(X)$ is independent of $\tilde r_{k+1}$ and $X_k$, we have

$\mathbb{E}[\|X_k\|_\infty^3\mathbf{1}\{A_k(\varepsilon)\}] \;=\; \mathbb{E}\Big[\|X_k\|_\infty^3\, P\big(S_{k-1}(X) \in \partial C(\tilde r_{k+1},\varepsilon) \,\big|\, \tilde r_{k+1}, X_k\big)\Big]$
$\le\; \mathbb{E}\Big[\|X_k\|_\infty^3\Big(P\big(S_{k-1}(Y) \in \partial C(\tilde r_{k+1},\varepsilon) \,\big|\, \tilde r_{k+1}, X_k\big) + 2D_{k-1}\Big)\Big]$
$\le\; \mathbb{E}[\|X_k\|_\infty^3]\Big(c\varepsilon\sqrt{\tfrac{n}{k-1}}\sqrt{\operatorname{Log}(p)} + 2D_{k-1}\Big),$

where we note that $S_{k-1}(X)$ has been replaced with $S_{k-1}(Y)$ at the price of $2D_{k-1}$, and Nazarov's Gaussian anti-concentration inequality (Lemma 7.2) has been used in the last step. Combining the last several steps with the bound $\mathbb{E}[\|X_k\|_\infty^3] \le c(\nu^2\operatorname{Log}(p))^{3/2}$ from Lemma 7.1 yields

(29)  $\mathbb{E}[|R_k^X(r)|\mathbf{1}\{A_k(\varepsilon)\}] \;\le\; \tfrac{c\,\nu^3\operatorname{Log}(p)^3}{\epsilon_k^3\, n^{3/2}}\Big(\varepsilon\sqrt{\tfrac{n}{k-1}}\sqrt{\operatorname{Log}(p)} + D_{k-1}\Big),$

which holds uniformly with respect to $r \in \mathbb{R}^p$.

Handling the remainder on $A_k^c(\varepsilon)$. For this part, the idea is that for any $r \in \mathbb{R}^p$, the quantity $\|\nabla^3\varphi_{\epsilon_k}(s,r)\|_1$ is essentially negligible when $s \notin \partial C(r,\varepsilon)$ and $\varepsilon$ is chosen to be sufficiently large. To this end, define the deterministic quantity

$b_k(\varepsilon) \;=\; \sup\big\{\|\nabla^3\varphi_{\epsilon_k}(s,r)\|_1 \,\big|\, r \in \mathbb{R}^p \text{ and } s \notin \partial C(r,\varepsilon)\big\},$

where the supremum involves both $s$ and $r$. Thus, Hölder's inequality gives

$\mathbb{E}[|R_k^X(r)|\mathbf{1}\{A_k^c(\varepsilon)\}] \;\le\; \tfrac{1}{2n^{3/2}}\cdot b_k(\varepsilon)\cdot\mathbb{E}[\|X_k\|_\infty^3].$

Given that $\|\nabla^3\varphi_{\epsilon_k}(s,r)\|_1$ can be written down explicitly based on (15), it is straightforward to verify that if we choose $\varepsilon = c\,\epsilon_k\sqrt{\operatorname{Log}(pn)}$ for a sufficiently large absolute constant $c > 0$, then

$b_k(\varepsilon) \;\le\; \tfrac{c}{\epsilon_k^3\, pn}.$

Combining this with the fact that $\mathbb{E}[\|X_k\|_\infty^3] \le c(\nu^2\operatorname{Log}(p))^{3/2}$ leads to the stated result.

LEMMA 5.2.
There is an absolute constant $c > 0$, such that the following holds for all $n \ge 2$ and $p \ge 1$: If the conditions of Theorem 2.1 hold, then

$\delta_1 \;\le\; \tfrac{c\,\nu^3\operatorname{Log}(p)^3}{\rho^{3/2}\, n^{3/2}}.$

PROOF. As in the proof of the previous lemma, it suffices to bound $\sup_{r\in\mathbb{R}^p}\mathbb{E}[|R_1^X(r)|]$. Using the same steps as in (27) and (28), but ignoring the role of the indicator $\mathbf{1}\{A_k(\varepsilon)\}$, we have

$\sup_{r\in\mathbb{R}^p}\mathbb{E}[|R_1^X(r)|] \;\le\; \tfrac{c\,\mathbb{E}[\|X_1\|_\infty^3]\operatorname{Log}(p)^{3/2}}{\epsilon_1^3\, n^{3/2}}.$

Applying the previously used bound on $\mathbb{E}[\|X_1\|_\infty^3]$ from Lemma 7.1 completes the proof.

Proof of Proposition 3.2.
By Lemma 5.2, the quantity $\delta_1$ is negligible in comparison to the right side of (8), and so it is enough to focus on $\delta_2 + \cdots + \delta_{n-m}$. By Lemma 5.1, we have that for $k = 2,\dots,n-m$,

$\delta_k \;\le\; \tfrac{c\,\nu^3\operatorname{Log}(pn)^3}{\epsilon_k^3\, n^{3/2}}\Big(\epsilon_k\operatorname{Log}(pn)\sqrt{\tfrac{n}{k-1}} + D_{k-1} + \tfrac{1}{pn}\Big).$

Since we assume that H$_1(\bar C),\dots,$H$_{n-1}(\bar C)$ hold, we may derive a bound on $\delta_k$ for each $k = 2,\dots,n-m$ by applying H$_{k-1}(\bar C)$ to $D_{k-1}$,

$\delta_k \;\le\; \tfrac{c\,\nu^3\operatorname{Log}(pn)^4}{\epsilon_k^2\, n\sqrt{k-1}} + \tfrac{c\,\nu^3\operatorname{Log}(pn)^3\,(D_{k-1} + \tfrac{1}{pn})}{\epsilon_k^3\, n^{3/2}}$
$\le\; c\Big(\tfrac{\nu^3}{\rho}\Big)\tfrac{\operatorname{Log}(pn)^4}{(n-k)\sqrt{k-1}} \;+\; c\bar C\Big(\tfrac{\nu^7}{\rho^3}\Big)\tfrac{\operatorname{Log}(pn)^7\operatorname{Log}(k-1)}{(n-k)^{3/2}\sqrt{k-1}}.$

Finally, to bound the sum $\delta_2 + \cdots + \delta_{n-m}$, observe that

$\sum_{k=2}^{n-m}\tfrac{1}{(n-k)\sqrt{k-1}} \;\le\; \tfrac{c\operatorname{Log}(n)}{n^{1/2}}, \qquad\text{and}\qquad \sum_{k=2}^{n-m}\tfrac{1}{(n-k)^{3/2}\sqrt{k-1}} \;\le\; \tfrac{c}{n^{1/2}m^{1/2}}.$

Combining the last few steps leads to the stated result.
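The two sums above can be bounded by splitting at $k = \lceil n/2\rceil$ (a standard calculation, included here for completeness):

```latex
% First sum: for k <= n/2 we have n - k >= n/2, while for k > n/2 we have
% k - 1 >= c*n; substituting j = n - k in the second range gives
\[
\sum_{k=2}^{n-m}\frac{1}{(n-k)\sqrt{k-1}}
 \;\le\; \frac{2}{n}\sum_{k=2}^{\lceil n/2\rceil}\frac{1}{\sqrt{k-1}}
        \;+\; \frac{c}{\sqrt n}\sum_{j=m}^{n}\frac{1}{j}
 \;\le\; \frac{c}{\sqrt n} + \frac{c\operatorname{Log}(n)}{\sqrt n}.
\]
% Second sum: the same splitting yields
\[
\sum_{k=2}^{n-m}\frac{1}{(n-k)^{3/2}\sqrt{k-1}}
 \;\le\; \Bigl(\frac{2}{n}\Bigr)^{3/2}\sum_{k=2}^{\lceil n/2\rceil}\frac{1}{\sqrt{k-1}}
        \;+\; \frac{c}{\sqrt n}\sum_{j=m}^{n}\frac{1}{j^{3/2}}
 \;\le\; \frac{c}{n} + \frac{c}{\sqrt n\,\sqrt m},
\]
% and since m <= n implies 1/n <= 1/sqrt(n m), both right-hand sides are of
% the stated orders.
```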
6. Proof of Theorem 2.2.
Let $N$ be a positive integer that will be chosen later. Also, let $Z_1,\dots,Z_N$ be i.i.d. copies of $Z$, and let $Y_1,\dots,Y_N$ be an independent sequence of i.i.d. copies of $Y$. Due to the scale invariance of the Kolmogorov metric, we may assume without loss of generality that $\Sigma^Y$ has all ones along the diagonal. In addition, we will apply previous notations such as $S_k(X)$, $\delta_k^X(r)$, etc. in a corresponding manner to the random vectors $Z_1,\dots,Z_N$, with $N$ playing the role of $n$ in the proof of Theorem 2.1. In particular, we have the following equalities in distribution for every choice of $N$,

$S_N(Z) \;\stackrel{\mathcal{L}}{=}\; Z, \qquad S_N(Y) \;\stackrel{\mathcal{L}}{=}\; Y.$

In order to re-use the proof of Theorem 2.1, the main part that needs to be revised is the moment matching argument in Section 4.2. Specifically, the relation (24) must be modified, because in the current context, there is no guaranteed cancellation of the quadratic terms in the expansions (19) and (20). If we account for this detail in the reasoning leading up to (24), then we have the following relation for every $k = 1,\dots,N-1$,

$\delta_k^Z(r) - \delta_k^Y(r) \;=\; \mathbb{E}[Q_k^Z(r)] - \mathbb{E}[Q_k^Y(r)] + \mathbb{E}[R_k^Z(r)] - \mathbb{E}[R_k^Y(r)].$

The terms $R_k^Z(r)$ and $R_k^Y(r)$ can be handled in the same manner as before in Section 5. To handle the difference of the quadratic terms $Q_k^Z(r)$ and $Q_k^Y(r)$, observe that in the current context, the random vector $\widehat r_{k+1}$ defined in (17) is independent of both $Z_k$ and $Y_k$, and so for $k = 1,\dots,N-1$, we have

(30)  $\mathbb{E}\big[Q_k^Z(r) - Q_k^Y(r)\big] \;=\; \tfrac12\,\mathbb{E}\Big[\big\langle \nabla^2\varphi_{\epsilon_k}(S_{k-1}(Z), \widehat r_{k+1}),\ N^{-1}\big(Z_kZ_k^\top - Y_kY_k^\top\big)\big\rangle\Big] \;=\; \tfrac{1}{2N}\Big\langle \mathbb{E}\big[\nabla^2\varphi_{\epsilon_k}(S_{k-1}(Z), \widehat r_{k+1})\big],\ \Sigma^Z - \Sigma^Y\Big\rangle.$

Next, with regard to the Hessian of $\varphi_{\epsilon_k}(\cdot, r)$, the previously used result (Bentkus, 1990, Theorem 3) underlying (28) implies

$\sup_{s,r\in\mathbb{R}^p}\|\nabla^2\varphi_{\epsilon_k}(s,r)\|_1 \;\le\; \tfrac{c\operatorname{Log}(p)}{\epsilon_k^2}.$

(See also (O'Donnell, Servedio and Tan, 2019, Theorem 6.5).) So, combining this with (30), Hölder's inequality, and the fact that $N\epsilon_k^2 = \rho(N-k)$, we have

(31)  $\sup_{r\in\mathbb{R}^p}\Big|\mathbb{E}\big[Q_k^Z(r) - Q_k^Y(r)\big]\Big| \;\le\; \tfrac{c\operatorname{Log}(p)\,\Delta}{\rho(N-k)},$

where $\Delta = \|\Sigma^Z - \Sigma^Y\|_\infty$, due to the reduction that $\Sigma^Y$ has all ones along the diagonal. Hence, when re-using the proof of Theorem 2.1, the right side of (31) should be added to the bound on $\delta_k$ in the statement of Lemma 5.1. Apart from this, the only other modification needed is to replace the inequality (H$_k(\bar C)$) in the induction hypothesis with

$D_k \;\le\; \bar C\bigg(\Big(\tfrac{1}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pk)^4\operatorname{Log}(k)}{k^{1/2}} \;+\; \tfrac{1}{\rho}\operatorname{Log}(p)\operatorname{Log}(k)\,\Delta\bigg),$

where $\nu$ is absent above, because it is an absolute constant in the context of Gaussian vectors. Once these two updates are made, all of the corresponding steps in the proof of Theorem 2.1 can be repeated to show there is an absolute constant $c > 0$ such that the bound

(32)  $\sup_{r\in\mathbb{R}^p}\big| P(Z \preceq r) - P(Y \preceq r)\big| \;\le\; \Big(\tfrac{c}{\rho^{3/2}}\Big)\tfrac{\operatorname{Log}(pN)^4\operatorname{Log}(N)}{N^{1/2}} \;+\; \Big(\tfrac{c}{\rho}\Big)\operatorname{Log}(p)\operatorname{Log}(N)\,\Delta$

holds for all $N$ and $p$. (The "new" term $(\tfrac{c}{\rho})\operatorname{Log}(p)\operatorname{Log}(N)\Delta$ is simply a consequence of including the right side of (31) in the sum $\delta_1 + \cdots + \delta_{N-m}$.)

The only remaining task is to choose $N$ in the bound (32). To do this, first observe that we may assume

(33)  $\tfrac{1}{\rho}\operatorname{Log}(p)^2\,\Delta \;\le\; 1,$

for otherwise there is nothing to prove.
Also, for purposes of simplification, there is an absolute constant $c > 0$ such that

$\tfrac{\operatorname{Log}(pN)^4\operatorname{Log}(N)}{N^{1/2}} \;\le\; \tfrac{c\operatorname{Log}(p)^4}{N^{1/4}},$

with the exponent $1/4$ being unimportant. If we choose $N$ such that

$N^{1/4} \;=\; \Big\lceil \tfrac{\operatorname{Log}(p)^2}{\sqrt{\rho}\,\Delta} \Big\rceil,$

then the bound (32) leads to

$\sup_{r\in\mathbb{R}^p}\big| P(Z \preceq r) - P(Y \preceq r)\big| \;\le\; \Big(\tfrac{c}{\rho}\Big)\operatorname{Log}(p)^2\operatorname{Log}(N)\,\Delta.$

Finally, to simplify this, observe that (33) implies $\operatorname{Log}(N) \le c\operatorname{Log}(\tfrac1\Delta)$, which leads to the stated result.
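To verify that this choice of $N$ balances the two terms in (32) (a short check added here for the reader), one can argue as follows:

```latex
% With N^{1/4} >= Log(p)^2 / (sqrt(rho) * Delta), the first term of (32) obeys
\[
\Bigl(\frac{c}{\rho^{3/2}}\Bigr)\frac{\operatorname{Log}(p)^4}{N^{1/4}}
 \;\le\; \Bigl(\frac{c}{\rho^{3/2}}\Bigr)\operatorname{Log}(p)^4
         \cdot\frac{\sqrt{\rho}\,\Delta}{\operatorname{Log}(p)^2}
 \;=\; \Bigl(\frac{c}{\rho}\Bigr)\operatorname{Log}(p)^2\,\Delta,
\]
% which is absorbed into the displayed bound. Moreover, (33) gives
% Log(p)^2 / sqrt(rho) <= sqrt(rho)/Delta <= 1/Delta, so that
\[
N^{1/4} \;\le\; 1 + \frac{1}{\Delta^{2}},
\qquad\text{and hence}\qquad
\operatorname{Log}(N) \;\le\; c\operatorname{Log}\bigl(\tfrac{1}{\Delta}\bigr).
\]
```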
7. Background results.
The following facts about random vectors with sub-Gaussian entries are essentially standard, and can be proved using the results in (van der Vaart and Wellner, 1996, Section 2.2) and (Vershynin, 2018, Sections 2.5-2.6).

LEMMA 7.1.
There is an absolute constant $c > 0$ such that the following statements hold for all $n$ and $p$, provided that the conditions of Theorem 2.1 hold, and $\mathrm{var}(X_{1j}) = 1$ for all $1 \le j \le p$:

(i) The expectation of $\|X_1\|_\infty^3$ satisfies $\mathbb{E}[\|X_1\|_\infty^3] \le c(\nu^2\operatorname{Log}(p))^{3/2}$.

(ii) If $t = c\,\nu\sqrt{\operatorname{Log}(pn)}\,n^{-1/2}$, then $P\big(\tfrac{1}{\sqrt n}\|X_1 - Y_1\|_\infty \ge t\big) \le \tfrac{c}{n}$.

(iii) If $1 \le m \le n$, and the vectors $\zeta = S_n(X)$ and $\xi = S_{n-m}(X) + S_{n-m+1:n}(Y)$ are as in the proof of Lemma 3.1, then the following bound holds when $t' = c\,\nu\sqrt{\tfrac{m}{n}\operatorname{Log}(pn)}$,

(34)  $P(\|\zeta - \xi\|_\infty \ge t') \;\le\; \tfrac{c}{n}.$

(iv) Let $\widehat\Sigma$ be as defined in (3), let $\Sigma = \mathbb{E}[X_1X_1^\top]$, and suppose that $n \ge \operatorname{Log}(pn)$. Then, the event $\|\widehat\Sigma - \Sigma\|_\infty \le c\,\nu^2\sqrt{\operatorname{Log}(pn)/n}$ holds with probability at least $1 - \tfrac{c}{n}$.
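To indicate the flavor of these facts, the following is the standard tail-integration calculation behind maximal bounds of the type in item (i) (a sketch, not the full proof):

```latex
% Sub-Gaussian tail plus union bound. If ||X_{1j}||_{psi_2} <= nu for all j,
% then Markov's inequality applied to exp(X_{1j}^2 / nu^2) gives
% P(|X_{1j}| >= s) <= 2 exp(-s^2/nu^2), and hence
\[
P\bigl(\|X_1\|_\infty \ge s\bigr) \;\le\; 2p\,e^{-s^2/\nu^2}.
\]
% Integrating the tail beyond the threshold s_0 = nu * sqrt(log(2p)) yields
\[
\mathbb{E}\bigl[\|X_1\|_\infty^3\bigr]
 \;=\; \int_0^\infty 3s^2\,P\bigl(\|X_1\|_\infty \ge s\bigr)\,ds
 \;\le\; s_0^3 + 2p\int_{s_0}^\infty 3s^2 e^{-s^2/\nu^2}\,ds
 \;\le\; c\bigl(\nu^2\operatorname{Log}(p)\bigr)^{3/2},
\]
% since 2p e^{-s_0^2/nu^2} = 1, which makes the integral term of lower order.
```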
The next result is known as Nazarov's Gaussian anti-concentration inequality, which originates from the paper (Nazarov, 2003), and was further elucidated by (Chernozhukov, Chetverikov and Kato, 2017b, Theorem 1).

LEMMA 7.2.
There is an absolute constant $c > 0$ such that the following holds for all $p$: Let $\xi \in \mathbb{R}^p$ be a Gaussian random vector, and suppose that $\varsigma = \min_{1\le j\le p}\sqrt{\mathrm{var}(\xi_j)}$ is positive. Then, the following inequality holds for any $t > 0$,

$\sup_{r\in\mathbb{R}^p} P\big(\xi \in \partial C(r,t)\big) \;\le\; \tfrac{ct}{\varsigma}\sqrt{\operatorname{Log}(p)},$

where the set $\partial C(r,t)$ is defined in (7).
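In the univariate case, this inequality reduces to an elementary density bound (a worked special case added for intuition):

```latex
% Special case p = 1. Here the boundary set is the interval (r - t, r + t],
% and if xi ~ N(mu, sigma^2) with sigma >= varsigma, its density is at most
% 1/(varsigma * sqrt(2 pi)) everywhere, so
\[
\sup_{r\in\mathbb{R}} P\bigl(r - t < \xi \le r + t\bigr)
 \;\le\; \frac{2t}{\varsigma\sqrt{2\pi}}
 \;\le\; \frac{ct}{\varsigma},
\]
% which matches Lemma 7.2 with Log(1) = 1. The content of Nazarov's inequality
% is that in p dimensions the cost is only a factor of sqrt(Log(p)).
```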
Here, we introduce some notation for the statement and proof of Lemma 7.3 below. For any set $A \subset \mathbb{R}^p$ and any $t > 0$, define the outer $t$-neighborhood $A^t = \{x \in \mathbb{R}^p \,|\, d(x,A) \le t\}$, where $d(x,A) = \inf\{\|x - y\| \,|\, y \in A\}$, with $\|\cdot\|$ being any norm on $\mathbb{R}^p$. In addition, a corresponding inner $t$-neighborhood may be defined as $A^{-t} = \{x \in A \,|\, B(x,t) \subset A\}$, where $B(x,t) = \{y \in \mathbb{R}^p \,|\, \|x - y\| \le t\}$. Although the following result is commonly used for scalar random variables, it seems to be stated less frequently in the case of random vectors.

LEMMA 7.3. Let $\|\cdot\|$ be any norm on $\mathbb{R}^p$, and let $\zeta, \xi \in \mathbb{R}^p$ be any two random vectors. Then, the following inequality holds for any Borel set $A \subset \mathbb{R}^p$, and any $t > 0$,

$|P(\zeta \in A) - P(\xi \in A)| \;\le\; P\big(\xi \in (A^t \setminus A^{-t})\big) + P\big(\|\zeta - \xi\| \ge t\big).$

PROOF. Let $\delta = \zeta - \xi$ and observe that

$P\big(\xi \in A^{-\|\delta\|}\big) \;\le\; P(\zeta \in A) \;\le\; P\big(\xi \in A^{\|\delta\|}\big).$

This implies

$|P(\zeta \in A) - P(\xi \in A)| \;\le\; P\big(\xi \in (A^{\|\delta\|} \setminus A^{-\|\delta\|})\big) \;\le\; P\big(\xi \in (A^t \setminus A^{-t})\big) + P(\|\delta\| \ge t).$

REFERENCES

BELLONI, A., CHERNOZHUKOV, V., CHETVERIKOV, D., HANSEN, C. and KATO, K. (2018). High-dimensional econometrics and regularized GMM. arXiv:1806.01888.
BENTKUS, V. (1990). Smooth approximations of the norm and differentiable functions with bounded support in Banach space $\ell_\infty^k$. Lithuanian Mathematical Journal.
BENTKUS, V. (2003). On the dependence of the Berry–Esseen bound on dimension. Journal of Statistical Planning and Inference.
BENTKUS, V. (2005). A Lyapunov-type bound in $\mathbb{R}^d$. Theory of Probability & Its Applications.
CHERNOZHUKOV, V., CHETVERIKOV, D. and KATO, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics.
CHERNOZHUKOV, V., CHETVERIKOV, D. and KATO, K. (2017a). Central limit theorems and bootstrap in high dimensions. The Annals of Probability.
CHERNOZHUKOV, V., CHETVERIKOV, D. and KATO, K. (2017b). Detailed proof of Nazarov's inequality. arXiv:1711.10696.
CHERNOZHUKOV, V., CHETVERIKOV, D., KATO, K. and KOIKE, Y. (2019). Improved central limit theorem and bootstrap approximations in high dimensions. arXiv:1912.10529.
DAS, D. and LAHIRI, S. (2020). Central limit theorem in high dimensions: The optimal bound on dimension growth rate. arXiv:2008.04389.
DENG, H. (2020). Slightly conservative bootstrap for maxima of sums. arXiv:2007.15877.
DENG, H. and ZHANG, C.-H. (2020+). Beyond Gaussian approximation: Bootstrap for maxima of sums of independent random vectors. The Annals of Statistics (to appear). arXiv:1705.09528.
FANG, X. and KOIKE, Y. (2020). High-dimensional central limit theorems by Stein's method. arXiv:2001.10917.
KOIKE, Y. (2019). Notes on the dimension dependence in high-dimensional central limit theorems for hyperrectangles. arXiv:1911.00160.
KUCHIBHOTLA, A. K., MUKHERJEE, S. and BANERJEE, D. (2018). High-dimensional CLT: Improvements, non-uniform extensions and large deviations. arXiv:1806.06153.
LOPES, M. E., LIN, Z. and MÜLLER, H.-G. (2020). Bootstrapping max statistics in high dimensions: Near-parametric rates under weak variance decay and application to functional and multinomial data. The Annals of Statistics.
NAZAROV, F. (2003). On the maximal perimeter of a convex set in $\mathbb{R}^n$ with respect to a Gaussian measure. In Geometric Aspects of Functional Analysis. Springer.
O'DONNELL, R., SERVEDIO, R. A. and TAN, L.-Y. (2019). Fooling polytopes. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing.
RAIČ, M. (2019). A multivariate Berry–Esseen theorem with explicit constants. Bernoulli.
VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical Processes. Springer.
VERSHYNIN, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.