[PDF] Approximating high-dimensional infinite-order U -statistics: statistical and computational guarantees

Abstract

We study the problem of distributional approximations to high-dimensional non-degenerate U -statistics with random kernels of diverging orders. Infinite-order U -statistics (IOUS) are a useful tool for constructing simultaneous prediction intervals that quantify the uncertainty of ensemble methods such as subbagging and random forests. A major obstacle in using the IOUS is their computational intractability when the sample size and/or order are large. In this article, we derive non-asymptotic Gaussian approximation error bounds for an incomplete version of the IOUS with a random kernel. We also study data-driven inferential methods for the incomplete IOUS via bootstraps and develop their statistical and computational guarantees.

Full PDF

aa r X i v : . [ m a t h . S T ] N ov arXiv: arXiv:1901.01163 Approximating high-dimensionalinﬁnite-order U -statistics: statistical andcomputational guarantees Yanglei Song Xiaohui Chen and Kengo Kato Department of Mathematics and Statistics, Queen’s University, 48 University Ave,Kingston, ON, Canada, K7L 3N6e-mail: [email protected] Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. WrightStreet, Champaign, IL 61820e-mail: [email protected] Department of Statistics and Data Science, Cornell University, 1194 Comstock Hall,Ithaca, NY 14853e-mail: [email protected]

Abstract:

We study the problem of distributional approximations to high-dimensional non-degenerate U -statistics with random kernels of divergingorders. Inﬁnite-order U -statistics (IOUS) are a useful tool for constructingsimultaneous prediction intervals that quantify the uncertainty of ensemblemethods such as subbagging and random forests. A major obstacle in usingthe IOUS is their computational intractability when the sample size and/ororder are large. In this article, we derive non-asymptotic Gaussian approxi-mation error bounds for an incomplete version of the IOUS with a randomkernel. We also study data-driven inferential methods for the incompleteIOUS via bootstraps and develop their statistical and computational guar-antees. Keywords and phrases:

Inﬁnite-order U -statistics, incomplete U -statistics,Gaussian approximation, bootstrap, random forests, uncertainty quantiﬁ-cation.

1. Introduction

Let X , . . . , X n be independent and identically distributed (i.i.d.) random vari-ables taking value in a measurable space ( S, S ) with common distribution P ,and let h : S r → R d be a symmetric and measurable function with respect tothe product space S r equipped with the product σ -ﬁeld S r = S ⊗ · · · ⊗ S ( r times). Assume E [ | h j ( X , . . . , X r ) | ] < ∞ for 1 j d , and consider the sta-tistical inference on the mean vector θ = ( θ , . . . , θ d ) T = E [ h ( X , . . . , X r )]. Anatural estimator for θ is the U -statistic with kernel h : U n := 1 | I n,r | X ι ∈ I n,r h ( X i , . . . , X i r ) := 1 | I n,r | X ι ∈ I n,r h ( X ι ) , (1)where I n,r := { ι = ( i , . . . , i r ) : 1 i < . . . < i r n } is the set of all ordered r -tuples of 1 , . . . , n and | · | denotes the set cardinality. The positive integer r is . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics called the order or degree of the kernel h or the U -statistic U n . We refer to [21]as an excellent monograph on U -statistics.In the present paper, we are interested in the situation where the order r may be nonneglible relative to the sample size n , i.e., r = r n → ∞ as n → ∞ . U -statistics with divergent orders are called inﬁnite-order U -statistic (IOUS)[14]. IOUS has attracted renewed interests in the recent statistics and machinelearning literature in relation to uncertainty quantiﬁcation for Breiman’s bag-ging [3] and random forests [4]. In such applications, the tree-based predictionrules can be thought of as U -statistics with deterministic and random kernels,respectively, and their order corresponds to the sub-sample size of the trainingdata [23]. Statistically, the subsample size r used to build each tree needs toincrease with the total sample size n to produce reliable predictions. As a lead-ing example, we consider construction of simultaneous prediction intervals for aversion of random forests discussed in [23]. Example 1.1 (Simultaneous prediction intervals for random forests) . Considera training dataset of size n , { ( Y , Z ) , . . . , ( Y n , Z n ) } = { X , . . . , X n } = X n ,where Y i ∈ Y is a vector of features and Z i ∈ R is a response. Let h be adeterministic prediction rule that takes as input a sub-sample { X i , . . . , X i r } and outputs predictions on d testing points ( y ∗ , . . . , y ∗ d ) in the feature space Y . Then U n in (1) are the overall predictions by averaging over all possiblesub-samples of size r .For random forests [4, 23], the tree-based prediction rule may be constructedwith additional randomness: in building a tree or multiple trees based on a sub-sample, the split at each node may only occur on a randomly selected subsetof features. Thus, let { W ι : ι ∈ I n,r } be a collection of i.i.d. random variablestaking value in a measurable space ( S ′ , S ′ ) that are independent of the data X n , and that determine the potential splits for each sub-sample. Here, each W ι captures the random mechanism in building a prediction function basedon X ι = ( X i , . . . , X i r ), but are assumed to be independent for diﬀerent sub-samples. Further, let H : S r × S ′ → R d be an S r ⊗ S ′ -measurable function,that represents the random forest algorithm, such that E [ H ( x , . . . , x r , W )] = h ( x , . . . , x r ). Then predictions of random forests are given by a d -dimensional U -statistic with random kernel H : b U n := | I n,r | − X ι ∈ I n,r H ( X i , . . . , X i r , W ι ) = | I n,r | − X ι ∈ I n,r H ( X ι , W ι ) , (2)where the random kernel H varies with r .Compared to U -statistics with ﬁxed orders (i.e., r being ﬁxed), the analy-sis of IOUS brings nontrivial computational and statistical challenges due toincreasing orders. First, even for a moderately large value of r , exact computa-tion of all possible (cid:0) nr (cid:1) trees is intractable. For diverging r , it is not possibleto compute U n in polynomial-time of n . Second, the variance of the H´ajekprojection (i.e., the ﬁrst-order term in the Hoeﬀding decomposition [19]) of U n − θ tends to zero as r → ∞ . To wit, deﬁne a function g : S → [0 , ∞ ) by . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics g ( x ) = E [ h ( x , X , . . . , X r )], and σ g,j := E [( g j ( X ) − θ j ) ] for 1 j d, σ g := min j d σ g,j . Then the H´ajek projection of U n − θ is given by n − r P ni =1 ( g ( X i ) − θ ). By theorthogonality of the projections, we have E [( h j ( X , . . . , X r ) − θ j ) ] > X i r E [( g j ( X i ) − θ j ) ] = rσ g,j . Thus the variances of the kernel h and its associated H´ajek projection g havediﬀerent magnitudes. In particular, if the variance of h j ( X , . . . , X r ) is boundedby a constant C > σ g,j C/r , which vanishes as r diverges. Thus standard Gaussian ap-proximation results in literature are no longer applicable in our setting sincethey require that the componentwise variances are bounded below from zero toavoid degeneracy, i.e., there is an absolute constant σ > σ g > σ (cf. [6, 10, 11]).In this work, our focus is to derive computationally tractable and statisticallyvalid sub-sampling procedures for making inference on θ with a class of high-dimensional random kernels (i.e., large d ) of diverging orders (i.e., increasing r ).To break the computational bottleneck, we consider the incomplete version of b U n by sampling (possibly much) fewer terms than | I n,r | . In particular, we considerthe Bernoulli sampling scheme introduced in [8]. Given a positive integer N ,which represents our computational budget, deﬁne the sparsity design parameter p n := N/ | I n,r | , and let { Z ι : ι ∈ I n,r } be a collection of i.i.d. Bernoulli randomvariables with success probability p n , that are independent of the data X n and { W ι : ι ∈ I n,r } . Consider the following incomplete U -statistic (on the data X n )with random kernel and weights: U ′ n,N := b N − X ι ∈ I n,r Z ι H ( X ι , W ι ) , where b N := X ι ∈ I n,r Z ι . (3)Obviously, U ′ n,N is an unbiased estimator of θ and it only involves computing b N terms, which on average is much smaller than | I n,r | if p n ≪ h is both deterministic and of ﬁxed order, ﬁnite samplebounds for the Gaussian and bootstrap approximations of U ′ n,N − θ (after asuitable normalization) are established in [8]. Roughly speaking, error boundanalysis in [8] has two major steps: i ) establish the Gaussian approximationto the H´ajek projection, and ii ) bound the maximum norm of all higher-orderdegenerate terms. As discussed above, the ﬁrst-order H´ajek projection in theHoeﬀding decomposition is asymptotically vanishing for the IOUS, and we mustcontrol the moments of an increasing number of degenerate terms, which makesthe analysis of the incomplete IOUS with random kernels substantially moresubtle.In Section 2, we derive non-asymptotic Gaussian approximation error boundsfor approximating the distribution of the incomplete IOUS U ′ n,N with random . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics kernels subject to sub-exponential moment conditions. Speciﬁcally, our ratesof convergence for the Gaussian approximation of U ′ n,N have the explicit de-pendence on all parameters ( n, N, d, r, σ g , D n ), where D n is an upper boundfor the ψ norms of the random kernels (for precise statements, see conditions(C3), (C4), and (C3’) ahead). In particular, asymptotic validity of the Gaussianapproximation can be achieved if σ − g r D n log ( dn ) = o ( n ∧ N ). The order of σ − g will be application speciﬁc. As we shall verify in Section 4, under certainregularity conditions, σ − g = O ( r ) . (4)It is worth noting that (4) is sharp in the sense that for the linear kernel h ( x , · · · , x r ) = ( x + · · · + x r ) /r , we have σ − g ≍ r if c Var( X j ) C .If further D n = O (1), log( d ) = O (log( n )) and n = O ( N ) (i.e., the compu-tational complexity is at least linear in sample size), then the order of U ′ n,N isallowed to increase at the rate of r = o ( n / − ǫ ) for any ǫ ∈ (0 , / d = O ( e n c )for some constant c ∈ (0 , / r is still allowed to increase at a polynomial rate in n .The proof of our Gaussian approximation results for IOUS builds upon anumber of recently developed technical tools such as Gaussian approximationresults for sum of independent random vectors and U -statistics of ﬁxed orders [6,7, 10, 11], anti-concentration inequality for Gaussian maxima [9], and iterativeconditioning argument for high-dimensional incomplete U -statistics (with theﬁxed kernel and order) [8]. However, there are three technical innovations inour proof to accommodate the issues of diverging orders and randomness ofthe kernel. First, we use the iterative renormalization for each dimension of g and also H by its variance. This simple trick turns out to be the crux to avoidthe lower bound assumption for Gaussian approximation in the literature [8,10]. Second, we derive an order-explicit maximal inequality for the expectedsupremum of the remainder of the H´ajek projection of the IOUS (cf. Section5). This maximal inequality is new in literature and our main tools include asymmetrization inequality of [27] and Bonami inequality [13, Theorem 3.2.2]for the Rademacher chaos, both with the explicit dependence on r . Third, wedevelop new tail probability inequalities for U -statistics with random kernels byleveraging the independence between { W ι , ι ∈ I n,r } and the data X n .In Section 3, we derive computationally tractable and fully data-driven infer-ential methods of θ based on the incomplete IOUS when the sample size n , thedimension d , and the order r , are all large. We consider a multiplier bootstrapprocedure consisting of two partial bootstraps that are conditionally indepen-dent given X n and { W ι , Z ι : ι ∈ I n,r } : one estimates the covariance matrix ofthe randomized kernel, and the other estimates the H´ajek projection. The latteris usually computationally demanding, and we develop a divide and conquer al-gorithm to maintain the overall computational cost of our multiplier bootstrapprocedure at most O ( n d + B ( N + n ) d ), where B denotes the number of boot-strap iterations. Thus the computational cost of the bootstrap to approximatethe sampling distribution for incomplete IOUS can be made independent of the . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics order r , even though r diverges.In Section 4, we discuss the key non-degeneracy condition (4) for derivingthe validity of Gaussian and bootstrap approximations. We provide a generalembedding scheme where a Cram´er-Rao type lower bound can be establishedfor the minimum σ g of the projection variances. Speciﬁcally, the lower boundfor r σ g only involves the sensitivity of E [ h ( X , . . . , X r )] under perturbationand the Fisher information of the embedded family, which in some cases remainconstants as r diverges. In non-parametric regressions, there is a natural em-bedding of the response variable into a location family such that the sensitivityand Fisher information can be explicitly computed. For univariate U -statistics ( d = 1), the asymptotic distributions are derived inthe seminal paper [19] for the non-degenerate case. [14] introduced the notion“inﬁnite-order U statistics” (IOUS) with diverging orders and established thecentral limit theorem for U n when d = 1. For univariate IOUS, asymptotic nor-mality of IOUS can be found in [2, Chapter 4.6], and the Berry-Esseen typebounds for IOUS were established by [16, 30, 31]. Further, [23] applied IOUSto construct a prediction interval for one test point. However, i ) . [23] does notaddress the issue that the variance of the H´ajek projection is vanishing: the twoconditions in Theorem 1 therein, E h k n ( Z , . . . , Z k n ) C < ∞ and lim ζ ,k n = 0,are not compatible based on our previous discussions ; ii ) . in practice, the size d of a test set may be comparable to or even much larger than the size n of a train-ing set, and the current work is motivated by such consideration. Limit theoremsof the related inﬁnite-order V -statistics and the inﬁnite-order U -processes werestudied in [18, 28]. The high-dimensional Gaussian approximation results andbootstrap methods were established in [10, 11] for sum of independent randomvectors, and in [6, 8] for U -statistics. We refer readers to these references forextensive literature review.Incomplete U -statistics were ﬁrst introduced in [1], which can be viewed as aspecial case of weighted U -statistics. There is a large literature on limit theoremsfor weighted U -statistics; see [22, 24, 25, 26]. The asymptotic distributions ofincomplete U -statistics (for ﬁxed d ) were derived in [5] and [20]; see also Section4.3 in [21] for a review on incomplete U -statistics. Recently, incomplete U-statistics have gained renewed interests in the statistics and machine learningliteratures [12, 23]. To the best of our knowledge, the current paper is theﬁrst work that establishes distributional approximation theorems for incompleteIOUS with random kernels and increasing orders in high dimensions.The remaining of the paper is organized as follows. We develop Gaussianapproximation results for above U -statistics in Section 2, and bootstrap methodsfor the variance of the approximating Gaussian distribution in Section 3. Weapply the theoretical results to several examples in Section 4. We highlight amaximal inequality in Section 5, and present all other proofs in Appendix A. . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics We write l.h.s. . r.h.s. if there exists a ﬁnite and positive absolute constant C such that l.h.s. C × r.h.s.. We shall use c, C, C , C , . . . to denote ﬁniteand positive absolute constants, whose value may diﬀer from place to place. Wedenote X i , . . . X i ′ by X i ′ i for i i ′ .For a, b ∈ R , let ⌊ a ⌋ denote the largest integer that does not exceed a , a ∨ b =max { a, b } and a ∧ b = min { a, b } . For a, b ∈ R d , we write a b if a j b j for 1 j d , and write [ a, b ] for the hyperrectangle Q dj =1 [ a j , b j ] if a b .We denote by R := { Q dj =1 [ a j , b j ] : −∞ a j b j ∞} the collection ofhyperrectangles in R d . Further, for a ∈ R d , r, t ∈ R , ra + t is a vector in R d with j th component being ra j + t . For a matrix A = ( a ij ), denote k A k ∞ = max i,j | a ij | .For a diagonal matrix Λ with positive diagonal entries, Λ − / (resp. Λ / ) is thediagonal matrix, with j -th diagonal entry being Λ − / jj (resp. Λ / jj ).For β >

0, let ψ β : [0 , ∞ ) → R be a function deﬁned by ψ β ( x ) = e x β −

1, and for any real-valued random variable ξ , deﬁne k ξ k ψ β = inf { C > E [ ψ β ( | ξ | /C )] } . Further, we deﬁne a family of functions { e ψ β ( · ) } on [0 , ∞ )indexed by β >

0. For β >

1, deﬁne e ψ β = ψ β . For β ∈ (0 , τ β = ( βe ) /β , x β = (1 /β ) /β , and e ψ β ( x ) = τ β x { x x β } .For a generic random variable Y , let P | Y ( · ) and E | Y [ · ] denote the conditionalprobability and expectation given Y , respectively. Further, we write “a.s.” for“almost surely” and “w.r.t.” for “with respect to”. Throughout the paper, weassume that r > d > n > p n := N/ | I n,r | /

2. Gaussian approximations for IOUS

In this section, we shall derive non-asymptotic Gaussian approximation errorbounds for: (i) the IOUS with random kernel b U n in (2), which includes the IOUSwith deterministic kernel U n in (1) as a special case, and (ii) the incompleteIOUS U ′ n,N in (3) under the Bernoulli sampling scheme.Recall that h ( x r ) = E [ H ( x r , W )], g ( x ) = E [ h ( x , X r )], θ = E [ g ( X )], σ g,j = E [( g j ( X ) − θ j ) ] and σ g = min j d σ g,j . Further, deﬁneΓ g := Cov( g ( X )) , Γ H := Cov( H ( X r , W )) ,σ H,j := E [( H j ( X r , W ) − θ j ) ] for 1 j d. Clearly, for 1 j d , σ H,j > σ g,j and thus σ H := min j d σ H,j > σ g . Deﬁnetwo d × d diagonal matrices Λ g and Λ H such thatΛ g,jj := σ g,j σ H,j := Λ

H,jj for 1 j d. (5)Let Y A and Y B be two independent d -dimensional zero mean Gaussian ran-dom vectors with variance Γ g and Γ H respectively. We may take Y A and Y B to . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics be independent of any other random variables. Further, for any two zero mean d -dimensional random vectors U and Y , ρ ( U, Y ) := sup R ∈R | P ( U ∈ R ) − P ( Y ∈ R ) | , where we recall that R := { Q dj =1 [ a j , b j ] : −∞ a j b j ∞} is the collectionof hyperrectangles in R d .Finally, in view of the discussions in the Introduction (Section 1) and tosimplify presentation, we assume σ g

1. Otherwise, the conclusions in thispaper hold with σ g replaced by min { σ g , } . We start with b U n . Deﬁne for 1 j d , q >

0, and ( x , . . . , x r ) ∈ S r , B n,j ( x , . . . , x r ) := k H j ( x , . . . , x r , W ) − h j ( x , . . . , x r ) k ψ q . (6)We make following assumptions: there exist D n > q > σ g,j > , for all j = 1 , . . . , d, (C1-ND) E | g j ( X ) − θ j | σ g,j D n , for all j = 1 , . . . , d, (C2) k h j ( X r ) − θ j k ψ q D n , for all j = 1 , . . . , d, (C3) k B n,j ( X r ) k ψ q D n for all j = 1 , . . . , d. (C4)Clearly, if | H j ( X r , W ) | . D n a.s. for 1 j d , then the latter three conditionshold. Indeed, (C3) and (C4) follow immediately from the deﬁnition, and (C2)is due to the observation that E | g j ( X ) − θ j | . E | g j ( X ) − θ j | D n = σ g,j D n . Theorem 2.1.

Assume (C1-ND) , (C2) , (C3) and (C4) hold. Then ρ ( √ n ( b U n − θ ) , rY A ) . (cid:18) r D n log q ∗ ( dn ) σ g n (cid:19) / , where q ∗ := (6 /q + 1) ∨ , Y A ∼ N (0 , Γ g ) and . means up to a multiplicativeconstant that only depends on q .Proof. See Section A.3. We highlight that a key step to establish Theorem 2.1 isto control the expected supremum of the remainder of the H´ajek projection of thecomplete IOUS with deterministic kernel (See Theorem 5.1). Then the Gaussianapproximation result for IOUS follows from Gaussian approximation results forsum of independent random vectors [10] and anti-concentration inequality [9],by a similar argument in [8] with proper normalization. (cid:4)

Clearly, in the special case of non-random kernel, i.e., H ( x , . . . , x r , W ) = h ( x , . . . , x r ), (C4) trivially holds. Thus we have the following immediate resultfor the IOUS with deterministic kernel U n in (1). . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Corollary 2.2.

Assume (C1-ND) , (C2) and (C3) hold. Then ρ ( √ n ( U n − θ ) , rY A ) . (cid:18) r D n log q ∗ ( dn ) σ g n (cid:19) / . where q ∗ := (6 /q + 1) ∨ , and . means up to a multiplicative constant that onlydepends on q . Remark 2.3 (Comparisons with existing results for d = 1) . For the univariateIOUS with non-random kernels, asymptotic normality and its rate of conver-gence are well understood in literature; see [2] for a survey of results in this di-rection. In [31], a Berry-Esseen bound is derived for symmetric statistics, whichinclude IOUS (with non-random kernels) as a special case. In particular, apply-ing Corollary 4.1 in [31] to IOUS, the rate of convergence to normality is of order O ( r n − / σ H /σ g ) for a bounded kernel, which implies that asymptotic normal-ity requires (at least) r = o ( n / ). A related Berry-Esseen bound is given in [16].In both papers, the rates of convergence are suboptimal. For elementary sym-metric polynomials (which are U -statistics corresponding to the product kernel h ( x , . . . , x r ) = x · · · x r ), it is shown in [30] that the sharp rate of convergenceto normality is of order O ( rn − / ), provided that E [ X ] = 0 , Var( X ) ∈ (0 , ∞ ), E [ | X | ] < ∞ and r = O ((log n ) − (log ( n )) − n / ). This result implies thatasymptotic normality for the IOUS with the product kernel is achieved when r = O (log − ( n ) n / ). If σ − g = O ( r ), which holds under regularity condi-tions in Lemma 4.1, our Corollary 2.2 with q = 1 implies that the rate of con-vergence for high-dimensional IOUS is O (( r log ( dn ) n − ) / ) (with suitablybounded moments). In particular, Gaussian approximation is asymptoticallyvalid if log d = O (log n ) and r = o ( n / − ǫ ) for any ǫ ∈ (0 , / r and the rate is slower than the optimalrate in the case d = 1, Corollary 2.2 does allow the dimension to grow sub-exponentially fast in sample size, which is a useful feature for high-dimensionalstatistical inference. In addition, to the best of our knowledge, the validity ofbootstrap procedures proposed in Section 3 to approximate the sampling dis-tribution of IOUS (on hyperrectangles in R d ) are new in literature. Now we consider U ′ n,N , where we recall that N is some given computationalbudget. We will assume the following conditions: for q > k H j ( X r , W ) − θ j k ψ q D n , for all j = 1 , . . . , d, (C3’) E | H j ( X r ) − θ j | σ H,j D n , for all j = 1 , . . . , d. (C5)Clearly, (C4) and (C3’) implies (C3) up to a multiplicative constant. Fur-ther, (C3’) and (C5) hold if | H j ( X r , W ) | . D n a.s. for 1 j d . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Theorem 2.4.

Assume (C1-ND) , (C2) , (C4) , (C3’) and (C5) hold. Then ρ (cid:16) √ n ( U ′ n,N − θ ) , rY A + α / n Y B (cid:17) . ̟ n , where ̟ n := (cid:18) r q D n log q ∗ ( dn ) σ g ( n ∧ N ) (cid:19) / , where α n := n/N , q := 2 ∨ (2 /q ) , q ∗ := (6 /q + 1) ∨ , . means up to amultiplicative constant that only depends on q , and we recall that Y A ∼ N (0 , Γ g ) , Y B ∼ N (0 , Γ H ) and Y A , Y B are independent.Proof. See Section A.4.4. (cid:4)

Remark 2.5. If q >

1, then q = 2 and q ∗ = 7. Since k ξ k ψ . k ξ k ψ q for anyrandom variable ξ and q >

1, we may assume without loss of generality that q r is ﬁxed, q = 1, the kernel is deterministic, andthere exists some absolute constant σ > σ g > σ , then the aboveTheorem recovers Theorem 3.1 from [8].Further, by ﬁrst conditioning on X n , we haveΓ H = Cov ( H ( X r , W )) (cid:23) Cov ( h ( X r )) := Γ h , where for two square matrices, A (cid:23) B means A − B is positive semi-deﬁnite.Thus the random kernel H ( · ) increases the variance of the approximating Gaus-sian distribution compared to the associated deterministic kernel h ( · ).

3. Bootstrap approximations

In Section 2.2, we have seen that the incomplete U -statistic with random kernelis approximated by a Gaussian distribution N (0 , r Γ g + α n Γ H ). However, thecovariance term is typically unknown in practice. In this section, we will estimateΓ g and Γ H by bootstrap methods. Γ H Let D n := { X , . . . , X n } ∪ { W ι , Z ι : ι ∈ I n,r } be the data involved in thedeﬁnition of U ′ n,N , and take a collection of independent N (0 ,

1) random variables { ξ ′ ι : ι ∈ I n,r } that is independent of the data D n . Deﬁne the following bootstrapdistribution: U n,B := 1 p b N X ι ∈ I n,r ξ ′ ι p Z ι (cid:0) H ( X ι , W ι ) − U ′ n,N (cid:1) . (7)The next theorem establishes the validity of U n,B . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Theorem 3.1.

Assume the conditions (C1-ND) (C2) , (C4) , (C3’) and (C5) hold. If r q D n log q ( dn )( σ H ∧

1) ( n ∧ N ) C n − ζ , (8) for q := 2 ∨ (2 /q ) , q := (4 /q + 1) ∨ , some constants C > and ζ ∈ (0 , ,then there exists a constant C depending only on q , C and ζ such that withprobability at least − C/n , sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) U n,B ∈ R (cid:17) − P ( Y B ∈ R ) (cid:12)(cid:12)(cid:12) Cn − ζ/ . Proof.

See Section A.5.1. (cid:4)

Let S ⊂ { , . . . , n } , and n = | S | . Further, consider a collection of D n -measurable R d -valued random vectors { G i : i ∈ S } , where G i is some “good”estimator of g ( X i ), and its form is speciﬁed later. We use the following quantityto measure the quality of G i as an estimator of g ( X i ) b ∆ A, := max j d n σ g,j X i ∈ S ( G i ,j − g j ( X i )) . (9)Deﬁne G := n P i ∈ S G i and consider the following bootstrap distributionfor N (0 , Γ g ): U n ,A := 1 √ n X i ∈ S ξ i (cid:0) G i − G (cid:1) , (10)where { ξ i : i ∈ S } is a collection of independent N (0 ,

1) random variablesthat is independent of D n and { ξ ′ ι : ι ∈ I n,r } . Lemma 3.2.

Assume the conditions (C1-ND) , (C2) and (C3’) hold. If D n log q ( dn ) σ g n C n − ζ , and P (cid:16) b ∆ A, log ( d ) > C n − ζ (cid:17) C n − , (11) for q := (4 /q + 1) ∨ , some constants C , and ζ , ζ ∈ (0 , . Then there existsa constant C depending only on q , C and ζ such that with probability at least − C/n , sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) U n ,A ∈ R (cid:17) − P ( Y A ∈ R ) (cid:12)(cid:12)(cid:12) Cn − ( ζ ∧ ζ ) / , where we recall that Y A ∼ N (0 , Γ g ) .Proof. See Subsection A.5.2. (cid:4) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Hereafter we consider a special case of the divide and conquer bootstrapalgorithm in [8] to estimate Γ g . For each i ∈ S , partition the remaining indexes, { , . . . , n } \ { i } , into disjoint subsets { S ( i )2 ,k : k = 1 , . . . , K } , each of size L = r −

1, where K = ⌊ ( n − / ( r − ⌋ .Now deﬁne for each i ∈ S and k = 1 , . . . , K , S ( i )2 ,k := { i } ∪ S ( i )2 ,k , G i := 1 K K X k =1 H ( X S ( i ,k , W S ( i ,k ) . Finally, deﬁne U n,n := rU n ,A + α / n U n,B . Theorem 3.3.

Assume the conditions (C1-ND) (C2) , (C4) (C3’) and (C5) hold. If r q D n log q ∗ ( dn ) σ g ( n ∧ N ) C n − ζ , (12) for q := 2 ∨ (2 /q ) , q ∗ := (6 /q + 1) ∨ , some constants C > , ζ ∈ (0 , . Forany ν ∈ (max { / , /ζ } , ∞ ) , there exists a constant C depending only on q , ζ , ν and C such that with probability at least − C/n , sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:0) U n,n ∈ R (cid:1) − P ( rY A + α / n Y B ∈ R ) (cid:12)(cid:12)(cid:12) Cn − ( ζ − /ν ) / . Proof.

See Subsection A.5.3. (cid:4)

We ﬁrst combine the Gaussian approximation result with the bootstrap result.

Corollary 3.4.

Assume (C1-ND) , (C2) (C4) (C3’) and (C5) hold. Further,assume that for some constants C > , ζ ∈ (0 , , (12) holds. Then thereexists a constant C depending only on q , C and ζ such that with probability atleast − C/n , sup R ∈R (cid:12)(cid:12) P (cid:0) √ n (cid:0) U ′ n,N − θ (cid:1) ∈ R (cid:1) − P |D n (cid:0) U n,n ∈ R (cid:1)(cid:12)(cid:12) Cn − ζ/ . Proof.

It follows from Theorem 2.4 and Theorem 3.3 (with ν = 7 /ζ ). (cid:4) In simultaneous conﬁdence interval construction, it is sometimes desirableto normalize the variance of each dimension, so that if we use maximum-typestatistics, the critical value is not dominated by terms with large variance. Deﬁnefor 1 j d , b σ g,j := 1 n X i ∈ S (cid:0) G i ,j − G j (cid:1) , b σ H,j := 1 b N X ι ∈ I n,r Z ι (cid:0) H j ( X ι , W ι ) − U ′ n,N,j (cid:1) , . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics which are the diagonal elements in the conditional covariance matrices of U n,A (10)and U n,B (7) respectively. Further, deﬁne a d × d diagonal matrix b Λ with b Λ j,j = r b σ g,j + α n b σ H,j , for each 1 j d. Corollary 3.5.

Assume the conditions in Corollary 3.4. Then there exists aconstant C depending only on q , C and ζ such that with probability at least − C/n , sup R ∈R (cid:12)(cid:12)(cid:12) P (cid:16) √ n b Λ − / (cid:0) U ′ n,N − θ (cid:1) ∈ R (cid:17) − P |D n (cid:16)b Λ − / U n,n ∈ R (cid:17)(cid:12)(cid:12)(cid:12) Cn − ζ/ . Consequently, sup t> (cid:12)(cid:12)(cid:12) P (cid:16) k√ n b Λ − / ( U ′ n,N − θ ) k ∞ t (cid:17) − P |D n (cid:16) k b Λ − / U n,n k ∞ t (cid:17)(cid:12)(cid:12)(cid:12) Cn − ζ/ . Proof.

See Subsection A.5.4. (cid:4)

Remark 3.6.

From Corollary 3.5, we can immediately construct conﬁdenceintervals for θ in a data-dependent way. Speciﬁcally, let b q − α be a (1 − α ) th quantile of the conditional distribution of k b Λ − / U n,n k ∞ given D n . Then oneway to construct simultaneous conﬁdence intervals with conﬁdence level (1 − α )is as follows: for 1 j d , U ′ n,N,j ± b q − α n − / b Λ / j,j .

4. Applications

In many applications, g ( x ) = E [ h ( x, X , . . . , X r )] does not admit an explicitform, and thus it is usually hard to compute σ g in conditions (C1-ND) and(12) directly. When the kernel h has special structures, we can establish a lowerbound on σ g with explicit dependence on r , which can be applied to Exam-ple 1.1. We shall give additional examples in Section 4.3 and 4.4 to illustratethe usefulness of U -statistics as a tool to estimate and make inference of certainstatistical functionals of X , . . . , X r . In Section 4.3 for the expected maximumand log-mean functionals, we also establish a lower bound on σ g with explicitdependence on r . In Section 4.4 for the kernel density estimation problem, r isassumed to be ﬁxed, but we allow the diameter of the design points to diverge.For simplicity of the presentation, in this section, we assume that all involvedderivatives and integrals exist and are ﬁnite, and that the order of integrals andthe order of integral and diﬀerentiation can be exchanged. These assumptionscan be justiﬁed under standard smoothness and moment conditions. For illus-tration, we use q = 1 in (C4) and (C3’). σ g Suppose that the distribution P of X has a density function f with respect tosome σ -ﬁnite (reference) measure µ , i.e., P ( A ) = Z A f ( x ) µ ( dx ) for any A ∈ S . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics We ﬁrst embed f into a family of densities { f β : β ∈ B ⊂ R ℓ } , where B isan open neighborhood of 0 ∈ R ℓ . Such embeddings always exist and below aresome examples for S = R ℓ .1. Location and scale family. If µ is the Lebesgue measure on R ℓ , we mayconsider the following location or scaling families: for x ∈ R ℓ , f β ( x ) = f ( x − β ) with β ∈ R ℓ , or f β ( x ) =(1 + β ) f ((1 + β ) x ) with β ∈ ( − , . Exponential family. If φ ( β ) := log (cid:16)R f ( x ) e β T x µ ( dx ) (cid:17) < ∞ for β ∈ B ,then we may consider the exponential family: f β ( x ) = f ( x ) exp( β T x − φ ( β )) , for x ∈ R ℓ , β ∈ B. Additive noise model.

Let Υ be a R ℓ -dimensional random vector inde-pendent of X , whose distribution is absolutely continuous w.r.t. µ , then X + β Υ has a density f β given by the convolution of those of X and β Υ.For β ∈ B , deﬁne the following perturbed expectation θ ( β ) := Z h ( x , . . . , x r ) r Y i =1 f β ( x i ) µ ( dx i ) := E β [ h ( X , . . . , X r )] , where E β denotes the expectation when X , . . . , X r have density f β . Further,deﬁne Ψ( β ) := r X i =1 ∇ ln f β ( X i ) , J ( β ) := r − Var β (Ψ( β )) , where ∇ denotes the gradient (or derivative when β is a scalar) with respectto β and Var β denotes the covariance matrix when X , . . . , X r have the density f β . Thus Ψ( β ) is the score function and J ( β ) is the Fisher-information for asingle observation. Lemma 4.1.

If we assume J (0) is positive deﬁnite, then σ g,j > r − ( ∇ θ j (0)) T J − (0) ∇ θ j (0) , for j d. (13) In particular, if there exists an absolute positive constant c such that ( ∇ θ j (0)) T J − (0) ∇ θ j (0) > c for j d, then σ g > cr − .Proof. See Subsection A.6. (cid:4) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Consider the Example 1.1 and assume that ( Y , Z ) has density q ( y ) p ( z ; y )w.r.t. the product measure ν ( dy ) ⊗ dz on Y × R , i.e., for A ∈ B ( Y ) , A ∈ B ( R ), P ( Y ∈ A , Z ∈ A ) = Z A × A q ( y ) p ( z ; y ) ν ( dy ) dz. That is, the feature Y has the density q ( y ) w.r.t. some σ -ﬁnite measure ν on Y , and thus is allowed to have both continuous and discrete components. Theresponse Z given Y = y has a conditional density p ( z ; y ) w.r.t. the Lebesguemeasure.For many regression algorithms such as tree based methods, if we ﬁx thefeatures and increase the responses of training samples by β ∈ R , the predictionat any test point will increase by β , i.e., for 1 j d , H j (( y , z + β ) , . . . , ( y r , z r + β ) , w ) = H j (( y , z ) , . . . , ( y r , z r ) , w ) + β, which implies that h (( y , z + β ) , . . . , ( y r , z r + β )) = h (( y , z ) , . . . , ( y r , z r )) + β . Now we consider the embedding into the “location” family { q ( y ) p ( z − β ; y ) : β ∈ R } . Observe that θ j ( β ) = E β [ h j ( X , . . . , X r )] = θ j (0) + β, for 1 j d, which implies that θ ′ j (0) = 1. In addition, J ( β ) = Var β (cid:18) ddβ ln( q ( Y ) p ( Z − β ; Y )) (cid:19) = E β "(cid:18) ∂ z p ( Z − β ; Y ) p ( Z − β ; Y ) (cid:19) . Thus if we assume that there exists c such that J (0) = E "(cid:18) ∂ z p ( Z ; Y ) p ( Z ; Y ) (cid:19) c − , (14)then (13) reduces to σ g > cr − . If further we assume that H j ( X r , W ) C a.s. for some constant C and each 1 j d (this holds for example whenthe response is bounded a.s.), then the conditions (C2), (C3), (C4) and (C5)hold with D n = ln − (2) C . With these assumptions, the condition (12) in Corol-lary 3.5 simpliﬁes as r log ( dn ) n ∧ N C n − ζ . Thus if r = O ( n / − ǫ ) for some ǫ >

0, log( d ) = O (log( n )), and n = O ( n ∧ N ),then Corollary 3.5 can be used to construct asymptotically valid simultaneousprediction intervals with the error of approximation decaying polynomially fastin n . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Remark 4.2 (Fisher information in nonparametric regressions) . Let us take acloser look at the condition (14). Consider the nonparametric regression model Z i = κ ( Y i ) + ǫ i , for 1 i n, where κ : Y → R is a deterministic measurable function, and ǫ , . . . ǫ n arei.i.d. with some density f with respect to the Lebesgue measure. Then p ( z ; y ) = f ( z − κ ( y )) and thus J (0) = Z (cid:18) f ′ ( z − κ ( y )) f ( z − κ ( y )) (cid:19) q ( y ) f ( z − κ ( y )) ν ( dy ) dz = Z ( f ′ ( z )) f ( z ) dz, where for the last equality, we ﬁrst perform integration w.r.t. dz and apply achange-of-variable. Thus J (0) only depends the density of the noise. Next we compute the lower bounds on σ g for two additional statistical func-tionals. Example 4.3.

Let S = R d and consider the following two kernels: for 1 j d , h j ( x , . . . , x r ) = max i r x ij , and h j ( x , . . . , x r ) = log r r X i =1 x ij ! . In the former case, we are interested in estimating the expectation for thecoordinate-wise maxima of r independent random vectors, { E [max i r X ij ] :1 j d } . In the latter, we assume X j > j d and are interested inestimating { E [log( r − P ri =1 X ij )] : 1 j d } . In both cases, the coordinatesof X can have arbitrary dependence, and we allow r → ∞ .Consider the ﬁrst kernel in Example 4.3, where S = R d , and h j ( x , . . . , x r ) =max i r x ij for 1 j d . Assume X j has a density f j w.r.t. the Lebesguemeasure on R for 1 j d , and we consider the following embedding { f j ( ·− β ) : β ∈ R } . As in the previous example, for β ∈ R θ ′ j ( β ) = 1 , J j ( β ) = Var β (cid:18) dd β ln f j ( X j − β ) (cid:19) = Z ( f ′ j ( x − β )) f j ( x − β ) d x. Thus, by Lemma 4.1, if we assume for some absolute positive constant c Z ( f ′ j ( x )) f j ( x ) d x c − , j d, we have σ g > cr − . Further, if we assume that there exists a positive constant C such that k X j k ψ C, j d, . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics then by maximal inequality (e.g., see [29, Lemma 2.2.2]), k max i r X ij k ψ . log( r ). Then if we select D n = C ′ σ − g log ( r ) , the conditions (C2), (C3) and(C5) hold. Further, (C4) trivially holds for non-random kernels. With aboveassumptions and selection of D n , the condition (12) in Corollary 3.5 simpliﬁesas ( n ∧ N ) − r log ( r ) log ( dn ) C n − ζ .Now consider the second kernel in Example 4.3, where h j ( x , . . . , x r ) =log (cid:0) r − P ri =1 x ij (cid:1) and X j > j d . Assume X j has a density f j w.r.t. the Lebesgue measure on R for 1 j d , and consider the followingembedding { (1 + β ) f j ((1 + β ) · ) : β ∈ ( − , } . As before, it is easy to see thatfor 1 j d , θ ′ j (0) = 1 , and J j (0) = Z ( xf ′ j ( x ) + f j ( x )) f j ( x ) d x. Thus if there exists a constant c such that max j d J j (0) c − , then σ g > cr − . Further, if there exists a constant C > P (0 < X j C ) = 1 , j d, then the conditions (C2), (C3), (C4) and (C5) hold with D n = ln − (2) log( C ).With these assumptions, the condition (12) in Corollary 3.5 simpliﬁes as ( n ∧ N ) − r log ( dn ) C n − ζ . Example 4.4 (Kernel density estimation) . Let τ : S r → R ℓ be a measurablefunction that is symmetric in its r arguments, and { t j : 1 j d } ⊂ R ℓ be d design points. [15, 17] used U n as a kernel density estimator (KDE) for thedensity of τ ( X , . . . , X r ) at the given design points with h j ( x , . . . , x r ) = 1 b ℓn κ (cid:18) t j − τ ( x , . . . , x r ) b n (cid:19) , j d, where b n > κ ( · ) is the density estimation kernelwith R κ ( z ) dz = 1, which should not be confused with the U -statistic kernel h .For this example, we will assume r ﬁxed and the bandwidth b n →

0, but allowthe diameter of the design points, max j d k t j k , to grow, where k · k denotesthe usual Euclidean norm.Assume that given X = x , τ ( x , X r ) has a density f ( z ; x ) w.r.t. theLebesgue measure on R ℓ , i.e., P ( τ ( x , X , . . . , X r ) ∈ A ) = R A f ( z ; x ) dz forany A ∈ B ( R ℓ ). Then by deﬁnition, for 1 j d , g j ( x ) = E [ h j ( x , X r )] = Z b ℓn κ (cid:18) t j − zb n (cid:19) f ( z ; x ) dz = Z κ ( z ) f ( t j − b n z ; x ) dz. . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics For t ∈ R ℓ , denote V n ( t ) := Var (cid:18)Z κ ( z ) f ( t − b n z ; X ) dz (cid:19) , V ( t ) := Var( f ( t ; X )) . As in [15], if R κ ( z ) dz < ∞ and sup t E [ f ( t ; X )] < ∞ , then lim n →∞ V n ( t ) = V ( t ) for any ﬁxed t . If there exists some R > j d k t j k R for any d ∈ N and inf t ∈ R ℓ : | t | R V ( t ) >

0, under mild continuity assumptions(e.g. the equicontinuty of V n ( t )), there exists an absolute constant c > σ g > c for large n . Then we can apply the result in [8], which does notallow σ g to vanish.In this work, we allow σ g to vanish, and thus allow the diameter of the designpoints to grow as n becomes large. Speciﬁcally, if we assume κ ( · ) is boundedby some constant C , we can select D n = ln − (2) Cb − n in conditions (C2), (C3),(C4) and (C5). Then the condition (12) in Corollary 3.5 simpliﬁes aslog ( dn ) σ g b n ( n ∧ N ) C n − ζ . Thus if log( d ) = O (log( n )) and n = O ( n ∧ N ), to apply Corollary 3.5, werequire that σ − g = O ( b n n − ǫ ) for any ǫ > Remark 4.5. [15] considers the case d = 1, and shows the √ n -convergence rateof the KDE. The same discussion applies here. [17] constructs conﬁdence bands(without computational considerations and bootstrap results) for the density of τ ( X r ), under the additional assumptions required to establish the convergenceof empirical processes.

5. Maximal inequality

In this section, we derive an upper bound on the expected supremum of theremainder of the H´ajek projection of the complete IOUS with deterministickernel. This maximal inequality (with the explicit dependence on r ) serves asa key step to establish the Gaussian approximation result for the incompleteIOUS with random kernel. Theorem 5.1.

Assume (C3) hold. Then there exist constants c, C , dependingonly on q , such that if r log( d ) /n c , then E " max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( U n,j − θ j ) − rn n X i =1 ( g j ( X i ) − θ j ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C r log /q ( d ) D n n . The proof of Theorem 5.1 is quite involved: we need to develop a numberof technical tools such as the symmetrization inequality and Bonami inequality(i.e., exponential moment bound) for the Rademacher chaos, all with the explicitdependence on r . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics We start with some notation. Let X ′ := ( X ′ , . . . , X ′ n ) be an independentcopy of X := ( X , . . . , X n ), and ǫ := ( ǫ , . . . , ǫ n ) be i.i.d. Rademacher randomvariables, i.e., P ( ǫ = 1) = P ( ǫ = −

1) = 1 /

2, that are independent of X and X ′ . If all involved random variables are independent, we write E ǫ (resp. E X ′ )for expectation only w.r.t. ǫ (resp. X ′ ).For a given probability space ( X, A , Q ), a measurable function f on X and x ∈ X , we use the notation Qf = R f dQ whenever the latter integral is well-deﬁned, and denote δ x the Dirac measure on X , i.e., δ x ( A ) = { x ∈ A } for any A ∈ A . For a measurable symmetric function f on S r and k = 0 , , . . . , r , let P r − k f denote the function on S k deﬁned by P r − k f ( x , . . . , x k ) := E [ f ( x , . . . , x k , X k +1 , . . . , X r )] , whenever it is well deﬁned. To prove Theorem 5.1, without loss of generality,we may assume θ = P r h = 0 , since we can always consider h ( · ) − θ instead. For 0 k r , deﬁne e π k h ( x , . . . , x k ) := P r − k h,π k h ( x , . . . , x k ) := ( δ x − P ) × · · · × ( δ x k − P ) × P r − k h. (15)Clearly π k is degenerate of order k with respect to the distribution P in thesense of (16) below. For any ι = ( i , . . . , i k ) ∈ I n,k , and J = ( j , . . . , j ℓ ) ∈ I k,ℓ where 0 ℓ k , deﬁne ι J := ( i j , . . . , i j ℓ ) ∈ I n,ℓ . Then π k h ( x ι ) = E X ′ hP kℓ =0 ( − k − ℓ P J ∈ I k,ℓ e π k h ( x ι J , X ′ ι \ ι J ) i for all ι ∈ I n,k . Further, the Hoeﬀding decomposition [19] for the U -statistic (with θ = 0) isas follows: U n = 1 | I n,r | X ι ∈ I n,r h ( X ι ) = r X k =1 (cid:18) nr (cid:19) − (cid:18) n − kr − k (cid:19) X ι ∈ I n,k π k h ( X ι ) . = r X k =1 (cid:18) rk (cid:19)(cid:18) nk (cid:19) − X ι ∈ I n,k π k h ( X ι ) =: r X k =1 (cid:18) rk (cid:19) U ( k ) n ( π k h ) . Finally, for any 1 k r , deﬁne the envelope function F k ( x , . . . , x k ) := max j d | e π k h j ( x , . . . , x k ) | . For each integer k , consider a symmetric kernel f : S k → R d . We say that f is degenerate of order k with respect to the distribution P if E X [ f j ( X , X , . . . , X k )] = 0 a.s. , for any 1 j d. (16) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics The following result is essentially due to [27, Section 3, Symmetrization in-equality] in the U -process setting. We provide a self-contained (and perhapsmore transparent) proof for completeness. Theorem 5.2 (Symmetrization inequality) . Assume (16) holds. E  max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k f j ( X i , . . . , X i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k E  max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k ǫ i · · · ǫ i k f j ( X i , . . . , X i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Remark 5.3.

In Theorem 5.2, the symmetrization costs a multiplicative factorof 2 k for a degenerate kernel of order k . Standard symmetrization argument forsuch degenerate U -statistics (cf. [13, Theorem 3.5.3]) together with the decou-pling inequalities (cf. [13, Theorem 3.1.1]) in literature yield that E  max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k f j ( X i , . . . , X i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C k E  max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k ǫ i · · · ǫ i k f j ( X i , . . . , X i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , where C k = 2 k − ( k − k k − k − k − − × · · · × (2 − k ≪ C k ,improvement of the constant to the exponential growth in k turns out to becrucial to obtain the maximal inequality for the IOUS in Theorem 5.1. Themajor component for the super-exponential behavior of C k is due to the stepfor applying the decoupling inequality in [13, Theorem 3.1.1], which is valid forany (measurable) symmetric kernel. If the kernel f is degenerate of order k , thensymmetrization can be directly done without the decoupling inequality (cf. theproof of Theorem 5.2 below). Proof of Theorem 5.2.

Deﬁne a new sequence of random variables { Z i : 1 i n } : Z i = X i { ǫ i =1 } + X ′ i { ǫ i = − } . Further, for each ι = { i , . . . , i k } ∈ I n,k , deﬁne e f j,ι = 2 k E ǫ [ f j ( Z i , . . . , Z i k ) ǫ i · · · ǫ i k ] . Due to degeneracy, we have E X ′ h e f j,ι i = 2 k E ǫ E X ′ [ f j ( Z i , . . . , Z i k ) ǫ i · · · ǫ i k ]= 2 k E ǫ h f j ( X i , . . . , X i k ) { ǫ i =1 ,...,ǫ ik =1 } i = f j ( X i , . . . , X i k ) , where the ﬁrst and third equalities follow from deﬁnitions and Fubini Theorem,and the second follows from the degeneracy. To wit, on the event that { ǫ i ℓ = − } for some 1 ℓ k , E X ′ iℓ (cid:2) f j ( Z i , . . . , Z i ℓ − , X ′ i ℓ , Z i ℓ +1 , . . . , Z i k ) ǫ i · · · ǫ i k (cid:3) = 0 . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics The rest of the argument is standard: by Jensen’s inequality,max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k f j ( X i , . . . , X i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k E X ′ h e f j,ι i(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k E ǫ,X ′ max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k f j ( Z i , . . . , Z i k ) ǫ i · · · ǫ i k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Since ( X , . . . , X n , ǫ , . . . , ǫ n ) and ( Z , . . . , Z n , ǫ , . . . , ǫ n ) have the same distri-bution, taking expectation on both sides completes the proof. (cid:4) We start with a lemma, whose proof is elementary and thus omitted. Recall thedeﬁnition of e ψ β in Subsection 1.2. Lemma 5.4.

For any β > , e ψ β ( · ) is strictly increasing, convex, and e ψ β (0) = 0 .Further, for any β > , e ψ β ( x ) e x β e ψ β ( x ) + e /β , and consequently e ψ − β ( m ) log /β (cid:16) m + e /β (cid:17) . Now we state the maximal inequality with explicit constants.

Lemma 5.5.

Fix β ∈ (0 , . Consider a sequence of non-negative random vari-ables { Z j : 1 j d } , and assume that there exists some real number ∆ > such that E [ e ψ β ( Z j / ∆)] , for j d . Then E (cid:20) max j d Z j (cid:21) ∆ log /β (2 d + e /β ) . Proof.

By monotonicity and convexity, e ψ β (cid:18) E (cid:20) max j d ( Z j / ∆) (cid:21)(cid:19) E (cid:20) e ψ β (cid:18) ∆ − max j d Z j (cid:19)(cid:21) = E (cid:20) max j d e ψ β ( Z j / ∆) (cid:21) X j d E h e ψ β ( Z j / ∆) i = 2 d. Then the proof is complete by Lemma 5.4. (cid:4)

The goal is to establish an exponential moment bound (i.e., Bonami inequality)of Rademacher chaos of order k . Based on the well-known hyper-contractivity of . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Rademacher chaos variables in literature (cf. [13, Corollary 3.2.6]), our Lemma 5.6below provides an exponential moment bound with an explicit dependence onthe order.

Lemma 5.6 (Exponential moment of Rademacher chaos) . Fix k > , β = 2 /k and let { x ι : ι ∈ I n,k } be a collection of real numbers. Consider the followinghomogeneous chaos of order k : Z = X ι ∈ I n,k x ι ǫ i · · · ǫ i k , where ǫ , . . . , ǫ n are i.i.d. Rademacher random variables. Then E h e ψ β ( | Z | / ∆ n ) i , where ∆ n = 7 k/ s X ι ∈ I n,k x ι . Proof.

Denote κ = p E [ Z ], c = √ n = c k κ . Observe that β βk = 2. From [13, Theorem 3.2.2], we have for any q > E | Z | q (cid:16) q q/β _ (cid:17) κ q ( q q/β + 1) κ q . Here, the ﬁrst inequality clearly holds for q

2, and we use [13, Theorem 3.2.2]for q >

2. Then using the fact that e x P ∞ ℓ =1 | x | ℓ /ℓ ! and by Lemma 5.4,we have E e ψ β ( | Z | / ∆ n ) E exp (cid:0) ( | Z | / ∆ n ) β (cid:1) ∞ X ℓ =1 E | Z | βℓ / ( ℓ !∆ βℓn ) ∞ X ℓ =1 ( βℓ ) ℓ κ βℓ ℓ !∆ βℓn + ∞ X ℓ =0 κ βℓ ℓ !∆ βℓn = ∞ X ℓ =1 β ℓ ℓ ℓ ℓ ! c ℓ + ∞ X ℓ =0 ℓ ! c ℓ . Using the fact that ℓ ℓ e ℓ ℓ !, we have E e ψ β ( | Z | / ∆ n ) ∞ X ℓ =1 (cid:18) βec (cid:19) ℓ + ∞ X ℓ =0 ℓ ! c ℓ ∞ X ℓ =1 (cid:16) ec (cid:17) ℓ + ∞ X ℓ =0 ℓ ! c ℓ . Since c = 7 > e , we have E e ψ β ( | Z | / ∆ n ) ec − e + e c − < , which completes the proof. (cid:4) Now we are in position to prove Theorem 5.1. Recall that we assume θ = 0.First, for each 2 k r and 1 j d , deﬁne Z k,j = E ǫ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k ǫ i · · · ǫ i k π k h j ( X i , . . . , X i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics where π k h is deﬁned in (15), and ǫ , . . . , ǫ n are i.i.d. Rademacher random vari-ables. Deﬁne∆ k,j = X ι ∈ I n,k ( π k h j ( X ι )) = X ι ∈ I n,k  E X ′  k X ℓ =0 ( − k − ℓ X J ∈ I k,ℓ e π k h j ( X ι J , X ′ ι \ ι J )  . By Jensen’s inequality and the fact that ( P ni =1 z n ) n P ni =1 z n , we have forany 1 j d ,∆ k,j k E X ′  X ι ∈ I n,k k X ℓ =0 X J ∈ I k,ℓ (cid:16)e π k h j ( X ι J , X ′ ι \ ι J ) (cid:17)  k E X ′  X ι ∈ I n,k k X ℓ =0 X J ∈ I k,ℓ F k ( X ι J , X ′ ι \ ι J )  . Then by Lemma 5.6, E ǫ (cid:20) e ψ /k (cid:18) | Z k,j | k/ ∆ k,j (cid:19)(cid:21) . Further, by Lemma 5.5 with β = 2 /k , we have E ǫ max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k ǫ i · · · ǫ i k π k h j ( X i , . . . , X i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k/ max j d (∆ k,j ) log k/ (2 d + e k/ ) k/ log k/ (2 d + e k/ ) vuuut E X ′  X ι ∈ I n,k k X ℓ =0 X J ∈ I k,ℓ F k ( X ι J , X ′ ι \ ι J )  . Then by Lemma 5.2 and Jensen’s inequality, we have E  max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X ι ∈ I n,k π k h j ( X ι ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k/ log k/ (2 d + e k/ ) E vuuut E X ′  X ι ∈ I n,k k X ℓ =0 X J ∈ I k,ℓ F k ( X ι J , X ′ ι \ ι J )  k/ log k/ (2 d + e k/ ) vuuut E  X ι ∈ I n,k k X ℓ =0 X J ∈ I k,ℓ F k ( X ι J , X ′ ι \ ι J )  = 56 k/ log k/ (2 d + e k/ ) s(cid:18) nk (cid:19) k E [ F k ( X , . . . , X k )] . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Now we bound E [ F k ( X , . . . , X k )]. By the deﬁnition of e π k h j , condition (C3),Lemma 5.4 and Jensen’s inequality, we have E h e ψ q ( | e π k h j ( X , . . . , X k ) | /D n ) i = E h e ψ q ( | E X ′ [ h j ( X , . . . , X k , X ′ k +1 , . . . , X ′ r )] | /D n ) i E h e ψ q ( | h j ( X , . . . , X k , X ′ k +1 , . . . , X ′ r ) | /D n ) i E [ ψ q ( | h j ( X , . . . , X k , X k +1 , . . . , X r ) | /D n )] + 1 . Since e ψ q (0) = 0, by Jensen’s inequality, we have k e π k h j ( X , . . . , X k ) |k e ψ q D n .Then by the standard maximal inequality (e.g., see [29, Lemma 2.2.2]), thereexists a constant C , depending only on q , such that for 1 k r , p E | F k ( X , . . . , X k ) | C log /q ( d ) D n . Thus we obtain that E " max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) U n,j − rn n X i =1 g j ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) r X k =2 (cid:18) rk (cid:19) E (cid:20) max j d (cid:12)(cid:12)(cid:12) U ( k ) n ( π k h j ) (cid:12)(cid:12)(cid:12)(cid:21) r X k =2 (cid:0) rk (cid:1)q(cid:0) nk (cid:1) (112) k/ log k/ (2 d + e k/ ) q E F k ( X , . . . , X k ) C log /q ( d ) D n r X k =2 (cid:0) rk (cid:1)q(cid:0) nk (cid:1) (112) k/ log k/ (2 d + e k/ ) . Observe that if r n , we have for any 1 i rr − i √ n − i r √ n ⇒ (cid:0) rk (cid:1)q(cid:0) nk (cid:1) √ k ! (cid:18) r n (cid:19) k/ . Further, for any x, y >

2, log k/ ( x + y ) k/ (log k/ ( x ) + log k/ ( y )). Now, take c = 1 / r n . Then E " max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) U n,j − rn n X i =1 g j ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C log /q ( d ) D n r X k =2 (cid:18) r n (cid:19) k/ (log k/ (2 d ) + 1 √ k ! ( k/ k/ ) . For the ﬁrst term, by geometric series formula, I = C log /q ( d ) D n r X k =2 (cid:18) r log(2 d ) n (cid:19) k/ C r log /q ( d ) D n n . For the second term, since for any ℓ > ℓ ℓ e ℓ ℓ !, we have II = C log /q ( d ) D n r X k =2 (cid:18) e r n (cid:19) k/ C r log /q ( d ) D n n , which completes the proof of Theorem 5.1. (cid:4) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Appendix A: Proofs

A.1. Tail probabilities

In this section, we collect and prove some results regarding tail probabilities forsum of independent random vectors, U -statistics, and U -statistics with randomkernels. For each type of statistics, we present two versions, one for non-negativerandom variables and the other for general cases.These inequalities are used in bounding the eﬀects due to sampling (Subsec-tion A.4.3), and also in controlling the k · k ∞ distance between the bootstrapcovariance matrices and their targets (Section A.5). A.1.1. Tail probabilities for sum of independent random vectors

In this subsection, m, n, d > Lemma A.1.

Let Z , . . . , Z m be independent R d -valued random vectors and β ∈ (0 , . Assume that Z ij > , k Z ij k ψ β u n , for all i = 1 , . . . , m, and j = 1 , . . . , d. Then there exists some constant C that only depends on β such that P max j d m X i =1 Z ij > C max j d E " m X i =1 Z ij + u n log /β ( dm ) (cid:16) log( dm ) + log /β ( n ) (cid:17)!! /n. Proof.

See Subsection A.7.1. (cid:4)

Lemma A.2.

Let Z , . . . , Z m be independent R d -valued random vectors and β ∈ (0 , . Assume that E [ Z ij ] = 0 , k Z ij k ψ β u n , for all i = 1 , . . . , m, and j = 1 , . . . , d. Then there exists some constant C that only depends on β such that P max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m X i =1 Z ij (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C (cid:16) σ log / ( dn ) + u n log /β ( dm ) (cid:16) log( dm ) + log /β ( n ) (cid:17)(cid:17)! /n, where σ := max j d P mi =1 E [ Z ij ] .Proof. See Subsection A.7.2 (cid:4)

Lemma A.3.

Let Z , . . . , Z m be independent and identical distributed Bernoullirandom variables with success probability p n , i.e., P ( Z i = 1) = 1 − P ( Z i = 0) = . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics p n for i m . Further, let a , . . . , a m be deterministic R d vectors. Thenthere exists an absolute constant C such that P max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m X i =1 ( Z i − p n ) a ij (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C (cid:16)p p n (1 − p n ) σ log / ( dn ) + M log( dn ) (cid:17)! /n, where σ := max j d P mi =1 a ij and M = max i m, j d | a ij | .Proof. See Subsection A.7.3 (cid:4)

A.1.2. Tail probabilities for U -statistics Lemma A.4.

Let X , . . . , X n be i.i.d. random variables taking value in ( S, S ) and ﬁx β ∈ (0 , . Let f : ( S r , S r ) → R d be a measurable, symmetric functionsuch that for all j = 1 , . . . , d , f j ( X , . . . , X r ) > a.s. , E [ f j ( X , . . . , X r )] v n , k f j ( X , . . . , X r ) k ψ β u n . Deﬁne U n := | I n,r | − P ι ∈ I n,r f ( X ι ) . Then there exists a constant C that onlydepends on β such that P (cid:18) max j d U n,j > C (cid:16) v n + n − r log /β +1 ( dn ) log /β − ( n ) u n (cid:17)(cid:19) n . Clearly, we can replace v n by u n .Proof. See Subsection A.7.4. (cid:4)

Lemma A.5.

Let X , . . . , X n be i.i.d. random variables taking value in ( S, S ) and ﬁx β ∈ (0 , . Let f : ( S r , S r ) → R d be a measurable, symmetric functionsuch that E [ f j ( X , . . . , X r )] = 0 , k f j ( X , . . . , X r ) k ψ β u n for all j = 1 , . . . , d. Deﬁne U n := | I n,r | − P ι ∈ I n,r f ( X ι ) and σ := max j d E [ f j ( X r )] . Then thereexists a constant C that only depends on β such that P (cid:18) max j d | U n,j | > C (cid:16) n − / r / log / ( dn ) σ + n − r log /β +1 ( dn ) log /β − ( n ) u n (cid:17)(cid:19) n . Clearly, we can replace σ by u n .Proof. See subsection A.7.5. (cid:4) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics A.1.3. Tail probabilities for U -statistics with random kernel Let X , . . . , X n be i.i.d. random variables taking value in ( S, S ) and W, { W ι , ι ∈ I n,r } be i.i.d. random variables taking value in ( S ′ , S ′ ), that are independent of X n . In this subsection, we consider a measurable function F : S r × S ′ → R d that is symmetric in the ﬁrst r variables, and ﬁx some β ∈ (0 , f ( x , . . . , x r ) := E [ F ( x , . . . , x r , W )] ,b j ( x , . . . , x r ) := k F j ( x , . . . , x r , W ) − f j ( x , . . . , x r ) k ψ β for all j = 1 , . . . , d. We ﬁrst consider the non-negative random kernels.

Lemma A.6.

Consider Z := max j d | I n,r | − P ι ∈ I n,r F j ( X ι , W ι ) . Assumethat for all j = 1 , . . . , d , F j ( · ) > , and that there exists u n > such that k b j ( X r ) k ψ β u n , k f j ( X r ) k ψ β u n , for all j = 1 , . . . , d. Then there exists some constant C that only depends on β such that with prob-ability at least − /n , Z C max j d E [ f j ( X r )] + Cn − r log /β +1 ( dn ) log /β − ( n ) u n + C | I n,r | − r /β log /β +1 ( dn ) log /β − ( n ) u n . Proof.

See subsection A.7.6. (cid:4)

Next, we consider centered random kernels.

Lemma A.7.

Consider Z := max j d (cid:12)(cid:12)(cid:12) | I n,r | − P ι ∈ I n,r ( F j ( X ι , W ι ) − f j ( X ι )) (cid:12)(cid:12)(cid:12) . Assume there exists u n > such that for all j = 1 , . . . , d , k b j ( X , . . . , X r ) k ψ β u n . Then there exists some constant C that only depends on β such that with prob-ability at least − /n , Z Cu n | I n,r | − / r / log / ( dn ) (cid:16) n − / r / log /β +1 / ( dn ) log /β − / ( n ) (cid:17) + Cu n | I n,r | − r /β log /β +1 ( dn ) log /β − ( n ) . Proof.

See subsection A.7.7. (cid:4)

A.2. Additional lemmas

The following Lemma concerns Gaussian approximation for sum of independentvectors. It replaces the k · k ψ condition in Proposition 2.1 of [10] by k · k ψ q . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Lemma A.8.

Let Z , . . . , Z n be independent R d -valued random vectors. Assumethat for some absolute constant σ > , and q > , n − n X i =1 E (cid:2) Z ij (cid:3) > σ , n − n X i =1 E (cid:2) Z kij (cid:3) D kn for j = 1 , . . . , d, k = 1 , , k Z ij k ψ q D n , for i = 1 , . . . , n, j = 1 , . . . , d. Then there exists some constant C that only depends on q and σ such that ρ ( n − / n X i =1 ( Z i − E [ Z i ]) , Y ) C (cid:18) D n log q ∗ ( dn ) n (cid:19) / , where q ∗ = (6 /q + 1) ∨ , Y ∼ N (0 , Σ) , and Σ := n − P ni =1 E [ Z i Z ′ i ] .Proof. See Subsection A.8. (cid:4)

The following lemmas are elementary, but used repeatedly.

Lemma A.9.

Let β > . There exits a constant C , only depending on β , suchthat for any positive integers r, n such that r √ n , n r β C k I n,r k . Proof.

Fix β . If r → ∞ , n r β / k I n,r k →

0. Thus there exits M such that if r > M , n r β k I n,r k . For r < M , the inequality holds with C = M β . (cid:4) Lemma A.10.

Let β, k > . For any random variable X , k X k k ψ β = k X k kψ kβ . Proof.

Observe that E (cid:20) exp (cid:16) | X | k / k X k kψ kβ (cid:17) β (cid:21) = E h exp (cid:0) | X | / k X k ψ kβ (cid:1) kβ i , which implies that k X k k ψ β k X k kψ kβ . The reverse direction is similar. (cid:4) For β < k · k ψ β is not a norm , but the usual triangle inequality andmaximal inequality hold up to a multiplicative constant. Lemma A.11.

Fix β ∈ (0 , .(i) For any random variables X and Y , k X + Y k ψ β /β (cid:0) k X k ψ β + k Y k ψ β (cid:1) . (ii) Let ξ , . . . , ξ n be a sequence of random variables such that k ξ i k ψ β D for i n , and n > . Then there exists a constant C depending only on β such that k max i n ξ i k ψ β C log /β ( n ) D. Proof.

See Subsection A.8. (cid:4) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics A.3. Proofs in Section 2.1

We ﬁrst prove Corollary 2.2 and then prove Theorem 2.1.

Proof of Corollary 2.2.

Let c be the constant in Theorem 5.1. Without loss ofgenerality, we assume r D n log q ∗ ( dn ) σ g n c, and θ = 0 , (17)since ρ ( · , · ) h ( · ) − θ instead. Recall that q ∗ =(6 /q + 1) ∨ R = [ a, b ] ∈ R , where a, b ∈ R d and a b . Deﬁne˜ a = r − Λ − / g a, ˜ b = r − Λ − / g b, ˜ U n = r − Λ − / g U n , ˜ G i = Λ − / g g ( X i ) . Denote ξ n := max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˜ U n,j − n n X i =1 ˜ G i,j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Then by Theorem 5.1, E [ ξ n ] r − σ − g max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) U n,j − rn n X i =1 g j ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . σ − g n − r log /q ( d ) D n . For any t >

0, by Markov inequality and deﬁnition, P ( √ nU n ∈ R ) = P ( −√ nU n − a ∩ √ nU n b ) = P ( −√ n ˜ U n − ˜ a ∩ √ n ˜ U n ˜ b ) P ( −√ n ˜ U n − ˜ a ∩ √ n ˜ U n ˜ b ∩ √ nξ n t ) + P ( √ nξ n > t ) P ( − √ n n X i =1 ˜ G i − ˜ a + t ∩ √ n n X i =1 ˜ G i ˜ b + t ) + Ct − σ − g n − / r log /q ( d ) D n . Due to assumptions (C2), (C3) and Cauchy-Schwarz inequality, E [ ˜ G i,j ] = 1 , for 1 i n, j d, E [ ˜ G i,j ] ( σ − g,j D n ) ( σ − g D n ) , for 1 i n, j d, E [ | ˜ G i,j | ] q E [ ˜ G i,j ] E [ ˜ G i,j ] σ − g D n , for 1 i n, j d, k ˜ G i,j k ψ q σ − g,j D n σ − g D n , for 1 i n, j d. Then due to Lemma A.8, we have P ( √ nU n ∈ R ) P ( − Λ − / g Y A − ˜ a + t ∩ Λ − / g Y A ˜ b + t )+ C (cid:0) σ − g n − D n log q ∗ ( dn ) (cid:1) / + Ct − σ − g n − / r log /q ( d ) D n . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Further, by anti-concentration inequality [10, Lemma A.1], P ( √ nU n ∈ R ) P ( − Λ − / g Y A − ˜ a ∩ Λ − / g Y A ˜ b ) + Ct p log( d )+ C (cid:0) σ − g n − D n log q ∗ ( dn ) (cid:1) / + Ct − σ − g n − / r log /q ( d ) D n . Finally, taking t = (cid:16) σ − g n − r log /q ( d ) D n (cid:17) / and due to convention (17),we have P ( √ nU n ∈ R ) P ( rY A ∈ R ) + C (cid:0) σ − g n − D n log q ∗ ( dn ) (cid:1) / + C (cid:16) σ − g n − r log /q ( d ) D n (cid:17) / P ( rY A ∈ R ) + C (cid:0) σ − g n − D n log q ∗ ( dn ) (cid:1) / + C (cid:0) σ − g n − r log q ∗ ( d ) D n (cid:1) / P ( rY A ∈ R ) + C (cid:0) σ − g n − r D n log q ∗ ( dn ) (cid:1) / . Likewise, we can show the lower inequality P ( √ nU n ∈ R ) > P ( rY A ∈ R ) − C (cid:0) σ − g n − r D n log q ∗ ( dn ) (cid:1) / , which completes the proof. (cid:4) Proof of Theorem 2.1.

As before, without loss of generality, we assume θ = 0 , and r D n log q ∗ ( dn ) σ g n c , (18)for some suﬃciently small c ∈ (0 , ι = ( i , . . . , i r ) ∈ I n,r , H ι := H ( X i , . . . , X i r , W ι ) − h ( X i , . . . , X i r ) := H ( X ι , W ι ) − h ( X ι ) . Then by deﬁnition, b U n = R n + U n , where R n := | I n,r | − X ι ∈ I n,r H ι . Step 1 . We ﬁrst show that E (cid:20) max j d | R n,j | (cid:21) . D n log / /q ( dn ) n . (19)Note that conditional on X n , R n is an average of independent random vectors.Thus by [9, Lemma 8], E | X n (cid:20) max j d | I n,r | | R n,j | (cid:21) . s log( d ) max j d X ι ∈ I n,r E | X n h H j ( X ι , W ι ) i + log( d ) s E | X n (cid:20) max ι ∈ I n,r max j d H j ( X ι , W ι ) (cid:21) . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics By deﬁnition (6) and maximal inequality ([29, Lemma 2.2.2] and Lemma A.11), E | X n h H j ( X ι , W ι ) i B n,j ( X ι ) for all ι ∈ I n,r , s E | X n (cid:20) max ι ∈ I n,r max j d H j ( X ι , W ι ) (cid:21) r /q log /q ( dn ) max ι ∈ I n,r max j d B n,j ( X ι ) . Deﬁne Z := max j d | I n,r | X ι ∈ I n,r B n,j ( X ι ) max ι ∈ I n,r max j d B n,j ( X ι ) := M . Under the assumption (C4) and again maximal inequality ([29, Lemma 2.2.2]and Lemma A.11), we have k M k ψ q r /q log /q ( dn ) D n . Then, we have E (cid:20) max j d | R n,j | (cid:21) . s log( d ) | I n,r | E [ M ] + r /q log /q ( dn ) | I n,r | E [ M ] . s log( d ) | I n,r | + r /q log /q ( dn ) | I n,r | ! r /q log /q ( dn ) D n . Then due to Lemma A.9 and (18), we have E (cid:20) max j d | R n,j | (cid:21) . D n log / /q ( dn ) n . Step 2 . We ﬁnish the proof by a similar argument as in the proof of Corol-lary 2.2.Fix any rectangle R = [ a, b ] ∈ R , where a, b ∈ R d and a b . Deﬁne˜ a = r − Λ − / g a, ˜ b = r − Λ − / g b, ˜ Y A = Λ − / g Y A . where we recall that Λ g is deﬁned in (5). Recall that b U n = U n + R n . For any t >

0, by Markov inequality, the result from Step 1, and Corollary 2.2, P ( √ n b U n ∈ R ) = P ( −√ n b U n − a ∩ √ n b U n b ) P ( −√ n b U n − a ∩ √ n b U n b ∩ √ n k R n k ∞ t ) + P ( √ n k R n k ∞ > t ) P ( −√ nU n − a + t ∩ √ nU n b + t ) + Ct − n − / D n log / /q ( dn ) P ( − rY A − a + t ∩ rY A b + t ) + C (cid:18) r D n log q ∗ ( dn ) σ g n (cid:19) / + Ct − n − / D n log / /q ( dn ) P ( − ˜ Y A − ˜ a + tr − σ − g ∩ ˜ Y A ˜ b + tr − σ − g ) + C (cid:18) r D n log q ∗ ( dn ) σ g n (cid:19) / + Ct − n − / D n log / /q ( dn ) . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Observe that E [ ˜ Y A,j ] = 1 for 1 j d . By anti-concentration inequality [10,Lemma A.1], P ( √ n b U n ∈ R ) P ( − ˜ Y A − ˜ a ∩ ˜ Y A ˜ b ) + C (cid:18) r D n log q ∗ ( dn ) σ g n (cid:19) / + Ctr − σ − g p log( d ) + Ct − n − / D n log / /q ( dn ) . Finally, taking t = (cid:16) σ g n − r log /q ( dn ) D n (cid:17) / and due to convention (18),we have P ( √ n b U n ∈ R ) P ( rY A ∈ R ) + C (cid:18) r D n log q ∗ ( dn ) σ g n (cid:19) / . By a similar argument, we can show P ( √ n b U n ∈ R ) > P ( rY A ∈ R ) − C (cid:18) r D n log q ∗ ( dn ) σ g n (cid:19) / . which completes the proof. (cid:4) A.4. Proofs in Section 2.2

In this subsection, without loss of generality, we assume θ = 0. Recall thedeﬁnition Λ H in (5). Further, deﬁne a function ˜ H : S r ∗ S ′ → R d by ˜ H ( x r , w ) =Λ − / H H ( x r , w ) for any x r ∈ S r , w ∈ S ′ , andΓ ˜ H := Cov( ˜ H ( X r , W )) = Λ − / H Γ H Λ − / H , b Γ ˜ H := 1 | I n,r | X ι ∈ I n,r ˜ H ( X ι , W ι ) ˜ H ( X ι , W ι ) T . (20)Clearly, if (C5) holds, then E | ˜ H j ( X r , W ) | k ( σ − H,j D n ) k ( σ − H D n ) k , for 1 j d, k = 1 , , (21)where again we applied CauchySchwarz inequality for k = 1. A.4.1. Bounding b N/N

The following lemma follows from an application of Bernstein’s inequality andis proved in the Step 5 of the proof of [8, Theorem 3.1]. It is included here foreasy reference.

Lemma A.12.

Assume p log( n ) /N / . Then P (cid:16) | b N /N − | > p log( n ) /N (cid:17) n − , P (cid:16) | N/ b N − | > p log( n ) /N (cid:17) n − . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics A.4.2. Bounding the normalized covariance estimator

Lemma A.13.

Assume (C3’) , (C4) and (C5) hold. Then there exists a constant C , depending only on q , such that with probability at least − /n , k b Γ ˜ H − Γ ˜ H k ∞ C (cid:16) σ − H n − / r / log / ( dn ) D n + σ − H n − r log /q +1 ( dn ) log /q − ( n ) D n (cid:17) + Cσ − H D n | I n,r | − / r / log / ( dn ) (cid:16) n − / r / log /q +1 / ( dn ) log /q − / ( n ) (cid:17) + Cσ − H D n | I n,r | − r /q log /q +1 ( dn ) log /q − ( n ) . Proof.

Deﬁne v ( x r ) := E [ ˜ H ( x r , W ) ˜ H ( x r , W ) T ] , b V := | I n,r | − P ι ∈ I n,r v ( X ι ).Observe that k b Γ ˜ H − Γ ˜ H k ∞ k b Γ ˜ H − b V k ∞ + k b V − Γ ˜ H k ∞ . We will bound these two terms separately.Step 0. We ﬁrst make a few observations. Clearly, E [ v ( X r )] = Γ ˜ H , and for all1 j, k d , by Jensen’s inequality for conditional expectation and (21), E | v jk ( X r ) | E [ ˜ H j ( X r , W ) ˜ H k ( X r , W )] E [ ˜ H j ( X r , W )] + E [ ˜ H k ( X r , W )] . σ − H D n . (22)Further, by deﬁnition | v jk ( x r ) | E [ ˜ H j ( x r , W )] + E [ ˜ H k ( x r , W )] . σ − H (cid:0) B n,j ( x r ) + h j ( x r ) + B n,k ( x r ) + h k ( x r ) (cid:1) . As a result, by the assumptions (C4) and (C3’), and Lemma A.10,max j,k d k v jk ( X r ) k ψ q/ . σ − H max j d (cid:0) k B n,j ( X r ) k ψ q/ + k h j ( X r ) k ψ q/ (cid:1) = σ − H max j d (cid:16) k B n,j ( X r ) k ψ q + k h j ( X r ) k ψ q (cid:17) . ( σ − H D n ) . (23) Step 1 . We bound k b Γ ˜ H − b V k ∞ using Lemma A.7 with F = ˜ H ˜ H T and ψ q/ . For1 j, k d , deﬁne b jk ( x r ) := k ˜ H j ( x r , W ) ˜ H k ( x r , W ) − v jk ( x r ) k ψ q/ . Observe that due to Lemma A.10 and A.11, b jk ( x r ) . k ˜ H j ( x r , W ) k ψ q/ + k ˜ H k ( x r , W ) k ψ q/ + v jk ( x r )= k ˜ H j ( x r , W ) k ψ q + k ˜ H k ( x r , W ) k ψ q + v jk ( x r ) . σ − H ( h j ( x r ) + B n,j ( x r ) + h k ( x r ) + B n,k ( x r )) + v jk ( x r ) . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Then due to (23) and the assumptions (C4) and (C3’), k b jk ( X r ) k ψ q/ . ( σ − H D n ) , for all 1 j, k d. Now we apply Lemma A.7, with probability at least 1 − /n , k b Γ ˜ H − b V k ∞ . σ − H D n | I n,r | − / r / log / ( dn ) (cid:16) n − / r / log /q +1 / ( dn ) log /q − / ( n ) (cid:17) + σ − H D n | I n,r | − r /q log /q +1 ( dn ) log /q − ( n ) . Step 2 . We bound k b V − Γ ˜ H k ∞ using Lemma A.5 with ψ q/ . By (22) and (23),with probability at least 1 − /n , k b V − Γ ˜ H k ∞ . n − / r / log / ( dn ) σ − H D n + n − r log /q +1 ( dn ) log /q − ( n ) σ − H D n . Then the proof is complete by combining step 1 and 2. (cid:4)

A.4.3. Bounding the eﬀect of sampling

The following quantity will appear in the proof of Theorem 2.4: √ N ζ n := 1 p | I n,r | X ι ∈ I n,r Z ι − p n p p n (1 − p n ) ˜ H ( X ι , W ι ) := 1 p | I n,r | X ι ∈ I n,r e Z ι , (24)The next lemma establishes conditional Gaussian approximation for √ N ζ n . Lemma A.14.

Suppose the assumptions in Theorem 2.4 hold. There exists aconstant C , depending on q , such that with probability at least − C/n , ρ R| X,W ( √ N ζ n , Λ − / H Y B ) := sup R ∈R (cid:12)(cid:12)(cid:12) P | X,W (cid:16) √ N ζ n ∈ R (cid:17) − P (Λ − / H Y B ∈ R ) (cid:12)(cid:12)(cid:12) C̟ n , where we recall that Y B ∼ N (0 , Γ H ) , and we abbreviate P | X,W for P | X n , { W ι : ι ∈ I n,r } .Proof. Consider conditionally independent (conditioned on

X, W ) R d -valuedrandom vectors { b Y ι : ι ∈ I n,r } such that b Y ι | X, W ∼ N (0 , ˜ H ( X ι , W ι ) ˜ H ( X ι , W ι ) T ) , b Y := | I n,r | − / X ι ∈ I n,r b Y ι . Clearly, b Y | X, W ∼ N (0 , b Γ ˜ H ). Further, deﬁne ρ R| X,W ( √ N ζ n , b Y ) := sup R ∈R (cid:12)(cid:12)(cid:12) P | X,W (cid:16) √ N ζ n ∈ R (cid:17) − P | X,W ( b Y ∈ R ) (cid:12)(cid:12)(cid:12) ,ρ R| X,W ( b Y , Λ − / H Y B ) := sup R ∈R (cid:12)(cid:12)(cid:12) P | X,W (cid:16) b Y ∈ R (cid:17) − P (Λ − / H Y B ∈ R ) (cid:12)(cid:12)(cid:12) . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics By triangle inequality, it then suﬃces to show that each of the following eventshappens with probability at least 1 − C/n , ρ R| X,W ( √ N ζ n , b Y ) C̟ n , ρ R| X,W ( b Y , Λ − / H Y B ) C̟ n , (25)on which we now focus. Without loss of generality, since σ g

1, we assume r q D n log q ∗ ( dn ) σ g n ∧ N c , and r q D n log q ∗ ( dn ) n ∧ N c . (26)for some suﬃciently small constant c ∈ (0 ,

1) that is to be determined. Recallthat q = 2 ∨ (2 /q ) and q ∗ = (6 /q + 1) ∨ Step 0 . By Lemma A.13 and A.9, P  k b Γ ˜ H − Γ ˜ H k ∞ C r log ∨ (2 /q − ( dn ) D n σ H n ! /  > − n . (27)In particular, since Γ ˜ H,jj = 1, if we take c small enough such that Cc / / P (cid:16) min j d b Γ ˜ H,jj > / (cid:17) > − /n . Step 1 . The goal is to show that the ﬁrst event in (25), ρ R| X,W ( √ N ζ n , b Y ) C̟ n ,holds with probability at least 1 − C/n . Step 1.1.

Deﬁne b L n := max j d | I n,r | − X ι ∈ I n,r E | X,W h | e Z ι,j | i . (28)Further, c M n ( φ ) := c M n,X ( φ ) + c M n,Y ( φ ), where c M n,X ( φ ) := | I n,r | − X ι ∈ I n,r E | X,W " max j d | e Z ι,j | ; max j d | e Z ι,j | > p | I n,r | φ log d , c M n,Y ( φ ) := | I n,r | − X ι ∈ I n,r E | X,W " max j d | b Y ι,j | ; max j d | b Y ι,j | > p | I n,r | φ log d , (29)By Theorem 2.1 in [10], there exist absolute constants K and K such thatfor any real numbers L n and M n , we have ρ R| X,W ( √ N ζ n , b Y ) K  L n log ( d ) | I n,r | ! / + M n L n  with φ n := K L n log ( d ) | I n,r | ! − / , on the event E n := { b L n L n } ∩ { c M n ( φ n ) M n } ∩ { min j d b Γ ˜ H,jj > / } . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics In Step 0, we have shown P (cid:16) min j d b Γ ˜ H,jj > / (cid:17) > − /n . In Step1.2-1.4, we select proper L n and M n such that the ﬁrst two events happen withprobability at least 1 − C/n . In Step 1.5, we plug in these values.

3. Thus for 1 j d ,deﬁne f j ( x r ) := E (cid:20)(cid:12)(cid:12)(cid:12) ˜ H j ( x r , W ) (cid:12)(cid:12)(cid:12) (cid:21) , b j ( x r ) := (cid:13)(cid:13)(cid:13)(cid:13)(cid:12)(cid:12)(cid:12) ˜ H j ( x r , W ) (cid:12)(cid:12)(cid:12) − f j ( x r ) (cid:13)(cid:13)(cid:13)(cid:13) ψ q/ . First, by iterated expectation and due to (21), E [ f j ( X r )] = E (cid:20)(cid:12)(cid:12)(cid:12) ˜ H j ( X r , W ) (cid:12)(cid:12)(cid:12) (cid:21) σ − H D n , for 1 j d. Second, observe that σ H,j f j ( x r ) . E (cid:2) | H j ( x r , W ) − h j ( x r ) | (cid:3) + | h j ( x r ) | . B n,j ( x r ) + | h j ( x r ) | , and thus due to (C3), (C4) and Lemma A.10 and A.11, k f j ( X r ) k ψ q/ . σ − H,j (cid:0) k B n,j ( X r ) k ψ q/ + k h j ( X r ) k ψ q/ (cid:1) = σ − H,j (cid:16) k B n,j ( X r ) k ψ q + k h j ( X r ) k ψ q (cid:17) . ( σ − H D n ) . Further, observe that by Lemma A.11, σ H,j b j ( x r ) . k | H j ( x r , W ) − h j ( x r ) | k ψ q/ + | h j ( x r ) | + σ H,j f j ( x r )= B n,j ( x r ) + | h j ( x r ) | + σ H,j f j ( x r ) . Thus by the same argument, k b j ( X r ) k ψ q/ . ( σ − H D n ) . Then by Lemma A.6,with probability at least 1 − n − , Z C (cid:16) σ − H D n + n − r log /q ( dn ) σ − H D n + | I n,r | − r /q log /q ( dn ) σ − H D n (cid:17) . Due to Lemma A.9 and assumption (26), P ( b L n Cσ − H p − / n D n ) > − /n .Thus there is a constant C , depending on q , such that if L n := C σ − H p − / n r /q D n , (30)then P ( b L n L n ) > − /n . Step 1.3: bounding c M n,X ( φ n ) . Since Z ι is a Bernoulli random variable, it isclear that c M n,X ( φ n ) = 0 on the event M := max ι ∈ I n,r max j d | ˜ H j ( X ι , W ι ) | √ N φ n log( d ) = 4 − K − C / (cid:18) r /q D n Nσ H log( d ) (cid:19) / , . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics where we use the value (30) for L n .By assumption (C3’) and Lemma A.11, k M k ψ q C ′ σ − H r /q D n log /q ( dn ) ⇒ P (cid:16) M C ′ σ − H r /q D n log /q ( dn ) (cid:17) > − /n. Due to (26), (cid:18) r /q D n Nσ H log( d ) (cid:19) / > c − / σ − H r /q D n log /q ( dn ) ,φ − n = K − C / (cid:18) r /q D n log ( d ) σ H N (cid:19) / K − C / c / . Thus if we take c in (26) to be suﬃciently small such that c − / − K − C / > C ′ and K − C / c / . then P ( c M n,X ( φ n ) = 0) > − /n and φ n > Step 1.4: select M n . From Step 1.3, we have shown that P ( E ′ n ) > − /n, where E ′ n := (cid:26) M := max ι ∈ I n,r max j d | ˜ H j ( X ι , W ι ) | C ′ σ − H r /q D n log /q ( dn ) (cid:27) . Then by the same argument as in Step 1.4 of the proof of [8, Theorem 3.1] anddue to (26) and φ n >

1, on the event E ′ n , for any ι ∈ I n,r , E | X,W " max j d | b Y ι,j | ; max j d | b Y j,ι | > p | I n,r | φ n log d C p | I n,r | φ n log d + CM log / ( d ) ! exp − p | I n,r | CM φ n log / d ! Cn r/ exp − | I n,r | / Cσ − / H r / q D / n log /q +5 / ( dn ) ! Cn r/ exp (cid:16) −| I n,r | / /C (cid:17) . Thus there exists an absolute constant C such that if we set M n := C n r/ exp (cid:16) −| I n,r | / /C (cid:17) , (31)then P ( c M n,Y ( φ n ) M n ) > − /n . Step 1.5: plug in L n and M n . Recall the deﬁnition L n and M n in (30) and (31).With these selections, we have shown that P ( E n ) > − C/n , where we recallthat E n := { b L n L n } ∩ { c M n ( φ n ) M n } ∩ { min j d b Γ ˜ H,jj > / } . Further, . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics on the event E n , ρ R| X,W ( √ N ζ n , b Y ) . L n log ( d ) | I n,r | ! / + M n L n . (cid:18) r /q D n log ( d ) σ H N (cid:19) / + p / n σ H D n r /q n r/ exp (cid:16) −| I n,r | / /C (cid:17) . C̟ n , which completes the proof of Step 1. Step 2.

The goal is to show that the second event in (25), ρ R| X,W ( b Y , Λ − / H Y B ) C̟ n , holds with probability at least 1 − C/n .Observe that Cov(Λ − / H Y B ) = Γ ˜ H and Γ ˜ H,jj = 1 for 1 j d . By theGaussian comparison inequality [8, Lemma C.5], ρ R| X,W ( b Y , Λ − / H Y B ) . ∆ / log / ( d ) , on the event that {k b Γ ˜ H − Γ ˜ H k ∞ ∆ } . From (27) in Step 0, P (cid:16) k b Γ ˜ H − Γ ˜ H k ∞ C ( σ − H n − r log ∨ (2 /q − ( dn ) D n ) / (cid:17) > − /n. Thus if we set ∆ = C ( σ − H n − r log ∨ (2 /q − ( dn ) D n ) / , then with probabilityat least 1 − C/n , ρ R| X,W ( b Y , Y B ) C r log ∨ (2 /q +3) ( dn ) D n σ H n ! / C̟ n . (cid:4) A.4.4. Proof of Theorem 2.4

Without loss of generality, we assume that r q D n log q ∗ ( dn ) σ g n ∧ N . (32)Observe that U ′ n,N = N b N  N X ι ∈ I n,r ( Z ι − p n )Λ / H ˜ H ( X ι , W ι ) + 1 | I n,r | X ι ∈ I n,r H ( X ι , W ι )  = N b N (cid:16)p − p n Λ / H ζ n + b U n (cid:17) := N b N Φ n , . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics where we recall that b U n and ζ n is deﬁned in Section 2.1 and in (24) respectively.Denote Y := rY A + α / n Y B .Step 1: the goal is to show that ρ (cid:16) √ n Φ n , rY A + α / n Y B (cid:17) . ̟ n . For any rectangle R ∈ R , observe that P ( √ n (cid:16) b U n + p − p n Λ / H ζ n (cid:17) ∈ R )= E " P | X,W √ Nζ n ∈ p α n (1 − p n ) Λ − / H R − s N − p n Λ − / H b U n !! . By Lemma A.14, since n − . ̟ n , we have P ( √ n (cid:16) b U n + p − p n Λ / H ζ n (cid:17) ∈ R ) E " P | X,W Λ − / H Y B ∈ p α n (1 − p n ) Λ − / H R − s N − p n Λ − / H b U n !! + C̟ n = P (cid:16) √ n b U n ∈ h R − p α n (1 − p n ) Y B i(cid:17) + C̟ n , where we recall that Y B is independent of all other random variables. Further,by Theorem 2.1, P ( √ n (cid:16) b U n + p − p n Λ / H ζ n (cid:17) ∈ R ) E h P | Y B (cid:16) √ n b U n ∈ h R − p α n (1 − p n ) Y B i(cid:17)i + C̟ n E h P | Y B (cid:16) rY A ∈ h R − p α n (1 − p n ) Y B i(cid:17)i + C̟ n , = P (cid:16) Λ − / g ( rY A + p α n (1 − p n ) Y B ) ∈ Λ − / g R (cid:17) + C̟ n . Observe that E [( σ − g,j rY A,j ) ] = r > j d , k Γ H k ∞ . D n due to (C3’), and α n p n = n/ | I n,r | . n − . Then by the Gaussian comparisoninequality [8, Lemma C.5] and due to (32) P ( √ n Φ n ∈ R ) P (cid:16) Λ − / g ( rY A + √ α n Y B ) ∈ Λ − / g R (cid:17) + C̟ n + C (cid:18) D n log ( d ) σ g n (cid:19) / P ( rY A + √ α n Y B ∈ R ) + C̟ n . Similarly, we can show P ( √ n Φ n ∈ R ) > P (cid:0) rY A + √ α n Y B ∈ R (cid:1) − C̟ n . Thusthe proof of Step 1 is complete.Step 2: we show that with probability at least 1 − C̟ n , k ( N b N − √ N Φ n k ∞ Cν n , where ν n := s log ( dn ) r D n n ∧ N . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Clearly, E [ Y j ] = r σ g,j + α n σ H,j . Then due to (C3’), E [ Y j ] ( r + α n ) D n . Since Y is a multivariate Gaussian, max j d k Y j k ψ p ( r + α n ) D n . Then by themaximal inequality [29, Lemma 2.2.2] k max j d | Y j |k ψ C p ( r + α n ) D n log( d ),which further implies that P (cid:18) max j d | Y j | > C p ( r + α n ) D n log( d ) log( n ) (cid:19) n − . Since n − . ̟ n , and from the result in Step 1, we have P (cid:16) k√ n Φ n k ∞ > C p ( r + α n ) D n log( d ) log( n ) (cid:17) C̟ n . Finally, due to Lemma A.12 and (32), we have with probability at least 1 − C̟ n , k ( N / b N − √ N Φ n k ∞ C q ( r + α n ) D n log( d ) log ( n ) N − α − n . Since ( r + α n ) N − α − n = r n − + N − r ( n ∧ N ) − , the proof is complete.Step 3: ﬁnal step. Recall that √ N U ′ n,N = √ N Φ n + ( N/ b N − √ N Φ n and ν n is deﬁned in Step 2. For any rectangle R = [ a, b ] with a b , by Step 2, P (cid:16) √ N U ′ n,N ∈ R (cid:17) P (cid:16) √ N U ′ n,N ∈ R ∩ k ( N/ b N − √ N Φ n k ∞ Cν n (cid:17) + C̟ n P (cid:16) √ N Φ n − a + Cν n ∩ √ N Φ n b + Cν n (cid:17) + C̟ n . Then by the result in Step 1, we have P (cid:16) √ N U ′ n,N ∈ R (cid:17) P (cid:16) α − / n Y − a + Cν n ∩ α − / n Y b + Cν n (cid:17) + C̟ n P (cid:16) α − / n ˜ Y − ˜ a + Cσ − H ν n ∩ α − / n ˜ Y ˜ b + Cσ − H ν n (cid:17) + C̟ n , where ˜ Y = Λ − H Y , ˜ a = Λ − H a and ˜ b = Λ − H b . Observe that E [( α − / n ˜ Y j ) ] > E [( σ − H,j Y B,j ) ] = 1 for 1 j d , and thus by anti-concentration inequality [10,Lemma A.1], P (cid:16) √ N U ′ n,N ∈ R (cid:17) P (cid:16) α − / n ˜ Y − ˜ a ∩ α − / n Y ˜ b (cid:17) + Cσ − H ν n log / ( d ) + C̟ n = P (cid:16) α − / n Y ∈ R (cid:17) + s log( d ) log ( dn ) r D n σ H n ∧ N + C̟ n P (cid:16) α − / n Y ∈ R (cid:17) + C̟ n , where the last inequality is due to (32). Similarly, we can show P (cid:16) √ N U ′ n,N ∈ R (cid:17) > P (cid:16) α − / n Y ∈ R (cid:17) − C̟ n , and thus ρ ( √ N U ′ n,N , α − / n Y ) . ̟ n , which completes the proof. . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics A.5. Proofs in Section 3

In this subsection, without loss of generality, we assume q A.5.1. Proof of Theorem 3.1Proof.

Without loss of generality, we can assume θ = E [ H ( X r , W )] = 0, sinceotherwise we can center H ﬁrst. Recall the deﬁnition of Λ H in (5), ˜ H ( · ) =Λ − / H H ( · ), and Γ ˜ H , b Γ ˜ H in (20). Observe that for any integer k , there existssome constant C that depends only on k and ζ such thatlog k ( n ) n − ζ C. (33)Step 0. Deﬁne ˜ U ′ n,N := Λ − / H U ′ n,N and b ∆ B := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b N X ι ∈ I n,r Z ι (cid:16) ˜ H ( X ι , W ι ) − ˜ U ′ n,N (cid:17) (cid:16) ˜ H ( X ι , W ι ) − ˜ U ′ n,N (cid:17) T − Γ ˜ H (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . Since Γ ˜ H,jj = 1 for 1 j d , by Gaussian comparison inequality [8, C.5],sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) U n,B ∈ R (cid:17) − P ( Y B ∈ R ) (cid:12)(cid:12)(cid:12) = sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) Λ − / H U n,B ∈ R (cid:17) − P (Λ − / H Y B ∈ R ) (cid:12)(cid:12)(cid:12) . (cid:16) b ∆ B log ( d ) (cid:17) / . Thus it suﬃces to show that with probability at least 1 − C/n , b ∆ B log ( d ) . n − ζ/ . Deﬁne b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N − X ι ∈ I n,r ( Z ι − p n ) ˜ H ( X ι , W ι ) ˜ H ( X ι , W ι ) T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)b Γ ˜ H − Γ ˜ H (cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := | N/ b N − | k Γ ˜ H k ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N − X ι ∈ I n,r Z ι ˜ H ( X ι , W ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . Then clearly b ∆ B | N/ b N | (cid:16) b ∆ B, + b ∆ B, (cid:17) + b ∆ B, + ( N/ b N ) b ∆ B, .Without loss of generality, we can assume C n − ζ /

16, since we can alwaystake C to be large enough. Then by Lemma A.12, P ( | N/ b N | C ) > − n − ,and thus it suﬃces to show that P (cid:16) b ∆ B,i log ( d ) Cn − ζ/ (cid:17) > − C/n, for all i = 1 , . . . , , . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics on which we now focus.Step 1: bounding b ∆ B, . Conditional on { X ι , W ι : ι ∈ I n,r } , by Lemma A.3, P (cid:16) N b ∆ B, C (cid:16)p N V n log( dn ) + M log( dn ) (cid:17)(cid:17) > − C/n, where V n := max j,k d | I n,r | − X ι ∈ I n,r ˜ H j ( X ι , W ι ) ˜ H k ( X ι , W ι ) , M := max ι ∈ I n,r max j d ˜ H j ( X ι , W ι ) . First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) anddue to (C3’) and Lemma A.10 and A.11, k M k ψ q/ . σ − H r /q log /q ( dn ) max ι ∈ I n,r max j d k H j ( X ι , W ι ) k ψ q/ . σ − H r /q log /q ( dn ) D n . As a result, P (cid:16) M Cσ − H r /q D n log /q ( n ) log /q ( dn ) (cid:17) > − /n .Second, we will apply Lemma A.6 to bound V n with F jk ( · ) = ˜ H j ( · ) ˜ H k ( · ) and β = q/

4. Note that by Lemma A.11, for 1 j, k d , σ H,j σ H,k f jk ( x r ) := E (cid:2) H j ( x r , W ) H k ( x r , W ) (cid:3) . E (cid:2) H j ( x r , W ) + H k ( x r , W ) (cid:3) . h j ( x r ) + B n,j ( x r ) + h k ( x r ) + B n,k ( x r ) ,σ H,j σ H,k b jk ( x r ) := k H j ( x r , W ) H k ( x r , W ) − σ H,j σ H,k f jk ( x r ) k ψ q/ . h j ( x r ) + B n,j ( x r ) + h k ( x r ) + B n,k ( x r ) + σ H,j σ H,k f jk ( x r ) . As a result, due to (C5), (C3) and (C4) E [ f jk ( X r )] . ( σ − H D n ) , k f jk ( X r ) k ψ q/ . ( σ − H D n ) , k b jk ( X r ) k ψ q/ . ( σ − H D n ) . Then by Lemma A.6 and A.9, and due to (8) and (33) P ( V n Cσ − H D n ) > − /n. Finally, putting the two results together and again by (33), we have P (cid:16) b ∆ B, C (cid:16) N − / log / ( dn ) σ − H D n + N − r /q log /q ( n ) log /q +1 ( dn ) σ − H D n (cid:17)(cid:17) > − C/n.

Then by (8), P (cid:16) b ∆ B, Cσ − H N − / r /q log / ( dn ) D n (cid:17) > − C/n , which im-plies that with probability at least 1 − C/n , b ∆ B, log ( d ) Cn − ζ/ . Step 2: bounding b ∆ B, . By Lemma A.13 and A.9, and due to assumptions (8)and (33) P (cid:16) b ∆ B, Cσ − H n − / r / log / ( dn ) D n (cid:17) > − /n, . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics which implies P ( b ∆ B, log ( d ) Cn − ζ/ ) > − /n .Step 3: bounding b ∆ B, . By deﬁnition, k Γ ˜ H k ∞ = 1. Then by Lemma A.12 and (8), b ∆ B, log ( d ) N − / log / ( n ) log ( d ) Cn − ζ/ , with probability at least 1 − n − .Step 4: bounding b ∆ B, . Deﬁne b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N − X ι ∈ I n,r ( Z ι − p n ) ˜ H ( X ι , W ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | I n,r | − X ι ∈ I n,r ˜ H ( X ι , W ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . Clearly, b ∆ B, (cid:16) b ∆ B, + b ∆ B, (cid:17) . In the next two sub-steps, we will boundthese two terms separately.Step 4.1: bounding b ∆ B, . Conditional on { X ι , W ι : ι ∈ I n,r } , by Lemma A.3, P (cid:18) N b ∆ B, C (cid:18)q N e V n log( dn ) + f M log( dn ) (cid:19)(cid:19) > − C/n, where e V n := max j d | I n,r | − P ι ∈ I n,r ˜ H j ( X ι , W ι ) , f M := max ι ∈ I n,r max j d | ˜ H j ( X ι , W ι ) | .First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) anddue to (C3’), k f M k ψ q . σ − H r /q log /q ( dn ) D n . As a result, P (cid:16) f M Cσ − H r /q D n log /q ( n ) log /q ( dn ) (cid:17) > − /n .Second, we will apply Lemma A.6 to bound e V n with F j ( · ) = ˜ H j ( · ) and β = q/

2. Deﬁne for 1 j d , f j ( x r ) := E h ˜ H j ( x r , W ) i , b j ( x r ) := k ˜ H j ( x r , W ) − f j ( x r ) k ψ q/ . By the similar argument as in Step 1, E [ f j ( X r )] = 1 , k f j ( X r ) k ψ q/ . ( σ − H D n ) , k b j ( X r ) k ψ q/ . ( σ − H D n ) . Then by Lemma A.6 and A.9, and due to (8) and (33) we have P ( e V n C ) > − /n .Finally, putting the two results together, we have P (cid:16) b ∆ B, C (cid:16) N − log( dn ) + σ − H N − r /q log /q +2 ( dn ) log /q ( n ) D n (cid:17)(cid:17) > − C/n.

Then by (8), P (cid:16) b ∆ B, CN − log( dn ) (cid:17) > − C/n , which implies that withprobability at least 1 − C/n , b ∆ B, log ( d ) Cn − ζ holds. . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Step 4.2: bounding b ∆ B, . Observe that b ∆ B, b ∆ B, + b ∆ B, , where b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | I n,r | − X ι ∈ I n,r Λ − / H ( H ( X ι , W ι ) − h ( X ι )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | I n,r | − X ι ∈ I n,r Λ − / H h ( X ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . By directly applying Lemma A.7 with β = q , due to (8) and Lemma A.9, P (cid:16) b ∆ B, Cσ − H D n n − log / ( dn ) (cid:17) > − /n. By directly applying Lemma A.5 with β = q and due to (8), P (cid:16) b ∆ B, Cσ − H n − / r / log / ( dn ) D n (cid:17) > − /n. Thus P (cid:16) b ∆ B, log ( d ) Cn − ζ (cid:17) > − C/n .Combining sub-step 4.1 and 4.2, we have P (cid:16) b ∆ B, log ( d ) Cn − ζ (cid:17) > − C/n . And combining Step 0-4, we ﬁnish the proof. (cid:4)

A.5.2. Proof of Lemma 3.2Proof.

Without loss of generality, we can assume θ = E [ H ( X r , W )] = 0. Recallthe deﬁnition Λ g is (5). By deﬁnition, E [( σ − g,j Y A,j ) ] = 1 for 1 j d . Thenby the Gaussian comparison inequality [8, Lemma C.5],sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) U n ,A ∈ R (cid:17) − P ( Y A ∈ R ) (cid:12)(cid:12)(cid:12) = sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) Λ − / g U n ,A ∈ R (cid:17) − P (Λ − / g Y A ∈ R ) (cid:12)(cid:12)(cid:12) . ( b ∆ A log ( d )) / , where b ∆ A := max j,k d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ g,j σ g,k n X i ∈ S ( G i ,j − G j )( G i ,k − G k ) − σ g,j σ g,k Γ g,jk (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . By the same argument as in the proof of [8, Theorem 4.2], b ∆ A . b ∆ / A, + b ∆ A, + b ∆ A, + b ∆ , where b ∆ A, is deﬁned in (9), and b ∆ A, := max j,k d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ g,j σ g,k n X i ∈ S ( g j ( X i ) g k ( X i ) − Γ g,jk ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , b ∆ A, := max j,k d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ g,j n X i ∈ S g j ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Step 1: bounding b ∆ A, . By the second part of (11), we have P (cid:16) b ∆ / A, log ( d ) C / n − ζ / (cid:17) − Cn − , P (cid:16) b ∆ A, log ( d ) C n − ζ (cid:17) − Cn − . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Step 2: bounding b ∆ A, . We apply Lemma A.2 with β = q/ m = n and notethat n n : P (cid:16) b ∆ A, > C (cid:16) n − σ log / ( dn ) + n − u n log /q +1 ( dn ) log /q ( n ) (cid:17)(cid:17) n − . where σ = max j,k d σ − g,j σ − g,k P i ∈ S E [( g j ( X i ) g k ( X i ) − Γ g,jk ) ] and u n = k σ − g,j σ − g,k ( g j ( X i ) g k ( X i ) − Γ g,jk ) k ψ q/ . By Lemma A.11, (C2) and (C3’), σ n (cid:0) σ − g D n (cid:1) , u n (cid:0) σ − g D n (cid:1) . Thus P (cid:16) b ∆ A, > C (cid:16) n − / σ − g D n log / ( dn ) + n − σ − g D n log /q +1 ( dn ) log /q ( n ) (cid:17)(cid:17) n − . Then due to the ﬁrst part of (11) and (33), P ( b ∆ A, log ( d ) > Cn − ζ / ) Cn − .Step 3: bounding b ∆ A, . We apply Lemma A.2 with β = q , m = n : P (cid:16) b ∆ A, > C (cid:16) n − / log / ( dn ) + n − σ − g D n log ( dn ) log( n ) (cid:17)(cid:17) n − . Then due to the ﬁrst part of (11) and (33), P ( b ∆ A, log ( d ) > Cn − ζ ) Cn − . (cid:4) A.5.3. Proof of Theorem 3.3Proof.

Without loss of generality, we can assume θ = E [ H ( X r , W )] = 0.Step 1. Let ζ := ζ , ζ := ζ − /ν . Due to Theorem 3.1, Lemma 3.2 and usingthe same argument as in the Step 3 of the proof of [8, Theorem 4.2], it suﬃcesto show the second part of (11) holds. From the deﬁnition (9), b ∆ A, σ − g max j d n X i ∈ S ( G i ,j − g j ( X i )) := σ − g ∆ A, . In Step 2, we will show that E h ∆ νA, i . (cid:16) n − rD n log /q +1 ( d ) (cid:17) ν . (34)Then by Markov inequality and (12), P (cid:16) b ∆ A, log ( d ) > C n − ζ (cid:17) . n ζ ν σ − νg log ν ( d ) (cid:16) n − rD n log /q +1 ( d ) (cid:17) ν = n − (cid:16) n ζ n − σ − g rD n log /q +5 ( d ) (cid:17) ν . n − , which completes the proof. . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Step 2. The goal is to show (34). Deﬁne F ( x r , w ) := max j d | H j ( x r , w ) | , g ( i ,k ) ( X i ) := H ( X S ( i ,k , W S ( i ,k ) for i ∈ S , k = 1 , . . . , K. By Jensen’s inequality, E [∆ νA, ] n X i ∈ S E (cid:20) max j d | G i ,j − g j ( X i ) | ν (cid:21) , and for each i ∈ S , conditional on X i , by Hoﬀmann-Jorgensen inequality [29,A.1.6.], E | Xi (cid:20) max j d | G i ,j − g j ( X i ) | ν (cid:21) . I i + II i := (cid:18) E | X i (cid:20) max j d | G i ,j − g j ( X i ) | (cid:21)(cid:19) ν + K − ν E | X i (cid:20) max k K max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21) . Step 2.1: bounding II i . Observe that for each 1 k K , E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21) = E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − E | X i h g ( i ,k ) j ( X i ) i(cid:12)(cid:12)(cid:12) ν (cid:21) . E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21) = E | X i (cid:20) F ν ( X S ( i ,k , W S ( i ,k ) (cid:21) = E | X i h F ν ( X S ( i , , W S ( i , ) i := b ( X i ) . Thus II i . K − ν +1 b ( X i ).Step 2.2: bounding I i . Observe that for each i ∈ S ,max j d K X k =1 E | X i (cid:20)(cid:16) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:17) (cid:21) . K E | X i h F ( X S ( i , , W S ( i , ) i := K e b ( X i ) . Further, by Jensen’s inequality, E | X i (cid:20) max k K max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) (cid:21) K X k =1 E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21)! /ν . K /ν b /ν ( X i ) , where b ( X i ) is deﬁned in Step 1. Then by the same argument as in the proofof [8, Proposition 4.4], I i . K − ν log ν ( d ) e b ν ( X i ) + K − ν +1 log ν ( d ) b ( X i ) . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Step 2.3: combining 2.1 and 2.2. By Jensen’s inequality, assumption (C3’) andby the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) E he b ν ( X i ) i log ν/q ( d ) D νn , E [ b ( X i )] log ν/q ( d ) D νn . Thus combining the results from 2.1 and 2.2, we have E (cid:20) max j d | G i ,j − g j ( X i ) | ν (cid:21) . K − ν D νn log ν ( d ) (cid:0) K − ν +1 log ν ( d ) (cid:1) . (cid:16) n − rD n log /q +1 ( d ) (cid:17) ν , where the second inequality is due to (12) and that ν > / K = ⌊ ( n − / ( r − ⌋ . (cid:4) A.5.4. Proof of Corollary 3.5Proof.

We have shown in Step 0 of the proof (Subsection A.5.1) for Theorem 3.1that P (cid:18) max j d | b σ H,j /σ H,j − | log ( d ) . Cn − ζ/ (cid:19) > − Cn − , Further, if we take ν = 7 /ζ in Theorem 3.3, then in the proof for Theorem 3.2and Theorem 3.3, we have shown that P (cid:18) max j d | b σ g,j /σ g,j − | log ( d ) . Cn − ζ/ (cid:19) > − Cn − . The rest of the proof is the same as the proof for [8, Corollary A.1], and thusomitted. (cid:4)

A.6. Proof of Lemma 4.1

Proof.

Clearly, the inequality is for each dimension, and thus without loss ofgenerality, we assume d = 1 and omit the dependence on j .We denote E β and Cov β the expectation and covariance when X , . . . , X r have densities f β . Further, deﬁne g β ( x ) = E β [ h ( x , X , . . . , X r )] for x ∈ S and by deﬁnition g ( · ) = g ( · ).First, note that by interchanging the order of integration and diﬀerentiation E β [Ψ( β )] = Z r X i =1 ∇ ln f β ( x i ) ! r Y i =1 f β ( x i ) µ ( dx i ) = Z ∇ r Y i =1 f β ( x i ) ! r Y i =1 µ ( dx i ) = 0 . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics Further, by a similar argument,Cov β ( g β ( X ) , Ψ( β )) = Z g β ( x ) r X i =1 ∇ ln f β ( x i ) ! r Y i =1 f β ( x i ) µ ( dx i )= Z g β ( x ) ∇ ln f β ( x ) f β ( x ) µ ( dx )= Z Z h ( x , x , . . . , x r ) r Y i =2 f β ( x i ) µ ( dx i ) ! ( ∇ ln f β ( x )) f β ( x ) µ ( dx )= Z h ( x , x , . . . , x r ) ∇ ln f β ( x ) r Y i =1 f β ( x i ) µ ( dx i ) , which implies that r X i =1 Cov β ( g β ( X i ) , Ψ( β )) = Z h ( x , x , . . . , x r ) r X i =1 ∇ ln f β ( x i ) ! r Y i =1 f β ( x i ) µ ( dx i )= Z h ( x , x , . . . , x r ) ∇ r Y i =1 f β ( x i ) ! r Y i =1 µ ( dx i ) = ∇ θ ( β ) . Finally, observe that0 Var β r X i =1 g β ( X i ) − ∇ θ ( β ) T ( r J ( β )) − Ψ( β ) ! = r X i =1 Var β ( g β ( X i )) − r − Cov β r X i =1 g β ( X i ) , ∇ θ ( β ) T J ( β ) − Ψ( β ) ! + r − Var β (cid:0) ∇ θ ( β ) T J ( β ) − Ψ( β ) (cid:1) = r Var β ( g β ( X )) − r − ∇ θ ( β ) T J ( β ) − ∇ θ ( β ) , which completes the proof. (cid:4) A.7. Proofs of tail probabilities in Section A.1

A.7.1. Proof of Lemma A.1Proof.

We ﬁrst deﬁne S := max j d m X i =1 Z ij , M := max i m max j d Z ij . Then by the maximal inequality [29, Lemma 2.2.2], k M k ψ β Cu n log /β ( dm ).By [10, Lemma E.4], P ( S > E [ S ] + t ) − (cid:18) tC k M k ψ β (cid:19) β ! . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics The right hand side is 3 /n if t = C k M k ψ β log /β ( n ) Cu n log /β ( n ) log /β ( dm ) . Further by [10, Lemma E.3], E [ S ] . max j d E " m X i =1 Z ij + log( d ) E [ M ] . max j d E " m X i =1 Z ij + u n log /β +1 ( dm ) . Combining two parts ﬁnishes the proof. (cid:4)

A.7.2. Proof of Lemma A.2Proof.

We ﬁrst deﬁne S := max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m X i =1 Z ij (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , M := max i m max j d | Z ij | . Then by the maximal inequality [29, Lemma 2.2.2], k M k ψ β Cu n log /β ( dm ).By [10, Lemma E.2], P ( S > E [ S ] + t ) exp( − t / (3 σ )) + 3 exp − (cid:18) tC k M k ψ β (cid:19) β ! . The right hand side is 4 /n if t = √ σ log / ( n ) + C k M k ψ β log /β ( n ) C (cid:16) σ log / ( n ) + log /β ( dm ) log /β ( n ) u n (cid:17) . Further by [10, Lemma E.1], E [ S ] . σ log / ( d ) + log( d ) p E [ M ] . σ log / ( d ) + log /β +1 ( dm ) u n . Combining two parts ﬁnishes the proof. (cid:4)

A.7.3. Proof of Lemma A.3Proof.

We ﬁrst deﬁne S := max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m X i =1 ( Z i − p n ) a ij (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , ˜ M := max i m max j d | ( Z ij − p n ) a ij | max i m max j d | a ij | ˜ σ := max j d m X i =1 E [( Z i − p n ) a ij ] p n (1 − p n ) max j d m X i =1 a ij . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics By [10, Lemma E.2], P ( S > E [ S ] + t ) exp( − t / (3˜ σ )) + 3 exp − tC k ˜ M k ψ ! . The right hand side is 4 /n if t = √ σ log / ( n ) + C k ˜ M k ψ log( n ) C (cid:16)p p n (1 − p n ) σ log / ( n ) + M log( n ) (cid:17) . Further by [10, Lemma E.1], E [ S ] . ˜ σ log / ( d ) + log( d ) q E [ ˜ M ] . p p n (1 − p n ) σ log / ( d ) + M log( d ) . Combining two parts ﬁnishes the proof. (cid:4)

A.7.4. Proof of Lemma A.4Proof.

Let m = ⌊ n/r ⌋ , and deﬁne the following quantity Z := max j d m X i =1 f j ( X ir ( i − r +1 ) , M := max i m max j d f j ( X ir ( i − r +1 ) . Then by the maximal inequality [29, Lemma 2.2.2], k M k ψ β Cu n log /β ( dn ).By [6, Lemma E.3], P (cid:18) m max j d U n,j > E [ Z ] + t (cid:19) − (cid:18) tC k M k ψ β (cid:19) β ! . The right hand side is 3 /n if we set t = C k M k ψ β log /β ( n ) Cu n log /β ( dn ) log /β ( n ) , Further, by [9, Lemma 9], E [ Z ] C max j d E " m X i =1 f j ( X ir ( i − r +1 ) + log( d ) E [ M ] ! C (cid:16) mv n + u n log /β +1 ( dn ) (cid:17) . Putting two parts together, we have P (cid:18) max j d U n,j > C (cid:16) v n + n − ru n log /β +1 ( dn ) + n − ru n log /β ( dn ) log /β ( n ) (cid:17)(cid:19) n , which completes the proof. (cid:4) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics A.7.5. Proof of Lemma A.5Proof.

Let m = ⌊ n/r ⌋ , and deﬁne the following quantity Z := max j d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m X i =1 f j ( X ir ( i − r +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , M := max i m max j d (cid:12)(cid:12)(cid:12) f j ( X ir ( i − r +1 ) (cid:12)(cid:12)(cid:12) . Then by the maximal inequality [29, Lemma 2.2.2], k M k ψ β Cu n log /β ( dn ).By [8, Lemma C.3], P (cid:18) m max j d | U n,j | > E [ Z ] + t (cid:19) exp (cid:18) − t mσ (cid:19) + 3 exp − (cid:18) tC k M k ψ β (cid:19) β ! . The right hand side is 4 /n if we take t = σ √ m log / ( n ) + C k M k ψ β log /β ( n ) C (cid:16) σm / log / ( n ) + u n log /β ( dn ) log /β ( n ) (cid:17) . Further, by [9, Lemma 8], E [ Z ] . p log( d ) mσ + q E [ M ] log( d ) . m / log / ( d ) σ + u n log /β +1 ( dn ) . Putting two parts together completes the proof. (cid:4)

A.7.6. Proof of Lemma A.6Proof.

First, observe that k F j ( x r , W ) k ψ β . f j ( x r ) + b j ( x r ). Denote Z := max j d | I n,r | X ι ∈ I n,r f j ( X ι ) , M := max ι ∈ I n,r max j d ( f j ( X ι ) + b j ( X ι )) , Then conditional on X n , by Lemma A.1, P | X n (cid:16) Z > C (cid:16) Z + | I n,r | − M r /β log /β +1 ( dn ) log /β − ( n ) (cid:17)(cid:17) | I n,r | n . By Lemma A.4, P (cid:18) Z > C (cid:18) max j d E [ f j ( X r )] + n − r log /β +1 ( dn ) log /β − ( n ) u n (cid:19)(cid:19) n . Further, by maximal inequality [29, Lemma 2.2.2] k M k ψ β Cr /β log /β ( dn ) u n ⇒ P (cid:16) M > Cr /β log /β ( n ) log /β ( dn ) u n (cid:17) n . Then the proof is complete by combining above results. (cid:4) . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics A.7.7. Proof of Lemma A.7Proof.

First, we deﬁne σ := max j d X ι ∈ I n,r E | X n h ( F j ( X ι , W ι ) − f j ( X ι )) i . max j d X ι ∈ I n,r b j ( X ι ) ,M := max ι ∈ I n,r max j d b j ( X ι ) . Then by ﬁrst conditional on X n and by Lemma A.2, P (cid:16) | I n,r | Z > C ( σr / log / ( dn ) + M r /β log /β +1 ( dn ) log /β − ( n )) (cid:17) | I n,r | n . Observe that k b j ( X r ) k ψ β/ = k b j ( X r ) k ψ β u n . Then by Lemma A.4 with ψ β/ , P (cid:18) σ | I n,r | > Cu n (cid:16) n − r log /β +1 ( dn ) log /β − ( n ) (cid:17)(cid:19) n . Further, by maximal inequality [29, Lemma 2.2.2] k M k ψ β Cr /β log /β ( dn ) u n ⇒ P ( M > Cr /β log /β ( dn ) log /β ( n ) u n ) n . Then the proof is complete by combining above results. (cid:4)

A.8. Proofs of additional lemmas

The following lemma is similar to [10, Lemma C.1], and is needed in provingLemma A.8.

Lemma A.15.

Let q ∈ (0 , , and ξ be a non-negative random variable suchthat k ξ k ψ q D . Then there exists a constant C , depending only on q , such that E (cid:2) ξ ; ξ > t (cid:3) C ( t + D ) e − ( t/D ) q , for t > . Proof.

Since k ξ k ψ q D , we have for x > P ( ξ > x ) e − ( x/D ) q E h e − ( ξ/D ) q i e − ( x/D ) q . . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics By change of variable, we have E (cid:2) ξ ; ξ > t (cid:3) t P ( ξ > t ) + 3 Z ∞ t P ( ξ > x ) x d x . t e − ( t/D ) q + D Z ∞ ( t/D ) q e − u u /q − d u . t e − ( t/D ) q + D e − ( t/D ) q Z ∞ e − u ( u + ( t/D ) q ) /q − d u . t e − ( t/D ) q + D e − ( t/D ) q Z ∞ e − u (cid:16) u /q − + ( t/D ) − q (cid:17) d u . t e − ( t/D ) q + D e − ( t/D ) q Z ∞ e − u (cid:16) u /q − + ( t/D ) − q (cid:17) d u . (cid:0) t + D + t − q D q (cid:1) e − ( t/D ) q . (cid:0) t + D (cid:1) e − ( t/D ) q . (cid:4) Proof of Lemma A.8.

For q >

1, it has been established by [10, Proposition 2.1].For q <

1, the proof is almost identical to that for [10, Proposition 2.1], exceptthat we replace [10, Lemma C.1] by Lemma A.15. (cid:4)

Proof of Lemma A.11. (i). Without loss of generality, we assume 0 < x := k X k ψ β < ∞ , and 0 < y := k Y k ψ β < ∞ . Observe that E " exp (cid:18) | X + Y | /β ( x + y ) (cid:19) β E (cid:20) exp (cid:18) | X | β + | Y | β x + y ) β (cid:19)(cid:21) E (cid:20)

12 exp (cid:18) | X | β ( x + y ) β (cid:19)(cid:21) + E (cid:20)

12 exp (cid:18) | Y | β ( x + y ) β (cid:19)(cid:21) . (ii). From Lemma 5.4, for 1 i n , E (cid:20) e ψ β (cid:18) | ξ i | D (cid:19)(cid:21) E (cid:20) ψ β (cid:18) | ξ i | D (cid:19)(cid:21) + 1 , which, by the convexity of e ψ β and the fact e ψ β (0) = 0, implies k ξ i k e ψ β D . Bythe standard maximal inequality (e.g., see [29, Lemma 2.2.2]) and Lemma 5.4, k max i n ξ i k e ψ β C log /β ( n ) D . Thus by Lemma 5.4, E  exp max i n ξ i C log /β ( n ) D ! β  E " ψ β max i n ξ i C log /β ( n ) D ! + e /β e /β . Now we let m > (cid:0) e /β (cid:1) /m

2. Then by Jensen’s inequality( E [ X /m ] ( E [ X ]) /m for X > E  exp max i n ξ i Cm /β log /β ( n ) D ! β  , . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics which implies that k max i n ξ i k e ψ β . log /β ( n ) D . (cid:4) Acknowledgements

X. Chen is supported in part by NSF DMS-1404891, NSF CAREER AwardDMS-1752614, and UIUC Research Board Awards (RB17092, RB18099).

References [1] Gunnar Blom. Some properties of incomplete U -statistics. Biometrika ,63(3):573–580, 1976.[2] Yu. V. Borovskikh.

U-Statistics in Banach Spaces . V.S.P. Intl Science,1996.[3] Leo Breiman. Bagging predictors.

Machine Learning , 24:123–140, 1996.[4] Leo Breiman. Random forests.

Machine Learning , 45:5–32, 2001.[5] B.M. Brown and D.G. Kildea. Reduced U -statistics and the Hodges-Lehmann estimator. Annals of Statistics , 6:828–835, 1978.[6] Xiaohui Chen. Gaussian and bootstrap approximations for high-dimensional u-statistics and their applications.

The Annals of Statistics ,46(2):642–678, 2018.[7] Xiaohui Chen and Kengo Kato. Jackknife multiplier bootstrap: ﬁnite sam-ple approximations to the U -process supremum with applications. 2017.arXiv:1708.02705.[8] Xiaohui Chen and Kengo Kato. Randomized incomplete u -statisticsin high dimensions. The Annals of Statistics, accepted (available atarXiv:1712.00771) , 2018+.[9] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Comparisonand anti-concentration bounds for maxima of gaussian random vectors.

Probability Theory and Related Fields , 162(1-2):47–70, 2015.[10] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Central limittheorems and bootstrap in high dimensions.

Ann. Probab. , 45(4):2309–2352, 07 2017.[11] Victor Chernozhukov, Denis Chetverikov, Kengo Kato, et al. Gaussianapproximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.

The Annals of Statistics , 41(6):2786–2819,2013.[12] St´ephan Cl´emen¸con, G´abor Lugosi, and Nicolas Vayatis. Ranking and em-pirical minimization of u -statistics. Annals of Statistics , 36(2), 844-874.[13] Victor De la Pena and Evarist Gin´e.

Decoupling: from dependence to inde-pendence . Springer Science & Business Media, 2012.[14] Edward W. Frees. Inﬁnite order u-statistics.

Scandinavian Journal ofStatistics , 16(1):29–45, 1989.[15] Edward W Frees. Estimating densities of functions of observations.

Journalof the American Statistical Association , 89(426):517–525, 1994. . Song, X. Chen, K. Kato/High-dimensional inﬁnite-order U -statistics [16] Karl O Friedrich. A berry-esseen bound for functions of independent ran-dom variables. The Annals of Statistics , pages 170–183, 1989.[17] Evarist Gin´e, David M Mason, et al. On local u-statistic processes and theestimation of densities of functions of several sample variables.

The Annalsof Statistics , 35(3):1105–1145, 2007.[18] Charles Heilig and Deborah Nolan. Limit theorems for the inﬁnite-degree u -process. Statistica Sinica , 11:289–302, 2001.[19] Wassily Hoeﬀding. A class of statistics with asymptotically normal distri-bution.

The Annals of Mathematical Statistics , 19(3):293–325, 1948.[20] Svante Janson. The asymptotic distributions of incomplete U -statistics. Z,Wahrscheinlichkeitstheorie verw. Gebiete , 66:495–505, 1984.[21] Alan J. Lee.

U-Statistics: Theory and Practice . Statistics: A Series ofTextbooks and Monographs (Book 110). CRC Press, 1990.[22] P. Major. Asymptotic distributions for weighted U-statistics.

Annals ofProbability , 21(2):1514–1535, 1994.[23] Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forestsvia conﬁdence intervals and hypothesis tests.

The Journal of MachineLearning Research , 17(1):841–881, 2016.[24] K.A. O’Neil and R.A. Redner. Asymptotic distributions of weighted U -statistics of degree 2. Annals of Probability , 21(2):1159–1169, 1993.[25] M. Riﬁ and F. Utzet. On the asymptotic behavior of weighted U-statistics.

Journal of Theoretical Probability , 13(1):141–167, 2000.[26] C.P. Shapiro and L. Hubert. Asymptotic normality of permutation statis-tics derived from weighted sums of bivariate functions.

Annals of Statistics ,7(4):788–794, 1979.[27] Robert P Sherman. Maximal inequalities for degenerate u-processes withapplications to optimization estimators.

The Annals of Statistics , pages439–459, 1994.[28] Grace S. Shieh. Inﬁnite-order v -statistics. Statistics & Probability Letters ,20:75–80, 1994.[29] Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In

Weakconvergence and empirical processes , pages 16–28. Springer, 1996.[30] A. J. van Es and R. Helmers. Elementary symmetric polynomials of in-creasing order.

Probability Theory and Related Fields , 80(1):21–35, Dec1988.[31] van W.R. Zwet. A berry-esseen bound for symmetric statistics.