Approximating high-dimensional infinite-order U-statistics: statistical and computational guarantees
arXiv:1901.01163 [math.ST]

Yanglei Song, Xiaohui Chen and Kengo Kato

Department of Mathematics and Statistics, Queen's University, 48 University Ave, Kingston, ON, Canada, K7L 3N6. e-mail: [email protected]

Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign, IL 61820. e-mail: [email protected]

Department of Statistics and Data Science, Cornell University, 1194 Comstock Hall, Ithaca, NY 14853. e-mail: [email protected]
Abstract:
We study the problem of distributional approximations to high-dimensional non-degenerate U-statistics with random kernels of diverging orders. Infinite-order U-statistics (IOUS) are a useful tool for constructing simultaneous prediction intervals that quantify the uncertainty of ensemble methods such as subbagging and random forests. A major obstacle in using the IOUS is their computational intractability when the sample size and/or order are large. In this article, we derive non-asymptotic Gaussian approximation error bounds for an incomplete version of the IOUS with a random kernel. We also study data-driven inferential methods for the incomplete IOUS via bootstraps and develop their statistical and computational guarantees.

Keywords and phrases: Infinite-order U-statistics, incomplete U-statistics, Gaussian approximation, bootstrap, random forests, uncertainty quantification.
1. Introduction
Let X_1, ..., X_n be independent and identically distributed (i.i.d.) random variables taking values in a measurable space (S, 𝒮) with common distribution P, and let h : S^r → ℝ^d be a symmetric and measurable function with respect to the product space S^r equipped with the product σ-field 𝒮^r = 𝒮 ⊗ ··· ⊗ 𝒮 (r times). Assume E[|h_j(X_1, ..., X_r)|] < ∞ for 1 ≤ j ≤ d, and consider statistical inference on the mean vector θ = (θ_1, ..., θ_d)^T = E[h(X_1, ..., X_r)]. A natural estimator for θ is the U-statistic with kernel h:

    U_n := |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} h(X_{i_1}, ..., X_{i_r}) := |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} h(X_ι),    (1)

where I_{n,r} := {ι = (i_1, ..., i_r) : 1 ≤ i_1 < ··· < i_r ≤ n} is the set of all ordered r-tuples of 1, ..., n and |·| denotes the set cardinality. The positive integer r is called the order or degree of the kernel h or of the U-statistic U_n. We refer to [21] as an excellent monograph on U-statistics.

In the present paper, we are interested in the situation where the order r may be non-negligible relative to the sample size n, i.e., r = r_n → ∞ as n → ∞. U-statistics with divergent orders are called infinite-order U-statistics (IOUS) [14]. IOUS have attracted renewed interest in the recent statistics and machine learning literature in relation to uncertainty quantification for Breiman's bagging [3] and random forests [4]. In such applications, the tree-based prediction rules can be thought of as U-statistics with deterministic and random kernels, respectively, and their order corresponds to the sub-sample size of the training data [23]. Statistically, the subsample size r used to build each tree needs to increase with the total sample size n to produce reliable predictions. As a leading example, we consider the construction of simultaneous prediction intervals for a version of random forests discussed in [23].
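As a concrete, minimal illustration of (1) for scalar output (d = 1), the complete U-statistic can be computed by brute-force enumeration of all r-subsets; the function name and the variance-kernel example below are ours, not from the paper:

```python
import itertools

def complete_u_statistic(data, h, r):
    """Complete U-statistic (1): average of the kernel h over all r-subsets
    of the data (d = 1 for simplicity).  This costs binom(n, r) kernel
    evaluations, which is exactly the intractability discussed below.
    """
    subsets = list(itertools.combinations(data, r))
    return sum(h(*s) for s in subsets) / len(subsets)

# Example: the order-2 kernel h(x1, x2) = (x1 - x2)^2 / 2 yields the
# usual unbiased sample variance.
data = [1.0, 2.0, 3.0, 4.0]
u = complete_u_statistic(data, lambda x, y: (x - y) ** 2 / 2, r=2)
```

For r = 1 the construction reduces to the sample mean, and for the variance kernel above `u` matches the textbook unbiased sample variance of `data`.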
Example 1.1 (Simultaneous prediction intervals for random forests). Consider a training dataset of size n, {(Y_1, Z_1), ..., (Y_n, Z_n)} = {X_1, ..., X_n} = X_1^n, where Y_i ∈ 𝒴 is a vector of features and Z_i ∈ ℝ is a response. Let h be a deterministic prediction rule that takes as input a sub-sample {X_{i_1}, ..., X_{i_r}} and outputs predictions at d testing points (y*_1, ..., y*_d) in the feature space 𝒴. Then U_n in (1) collects the overall predictions obtained by averaging over all possible sub-samples of size r.

For random forests [4, 23], the tree-based prediction rule may be constructed with additional randomness: in building a tree or multiple trees based on a sub-sample, the split at each node may only occur on a randomly selected subset of features. Thus, let {W_ι : ι ∈ I_{n,r}} be a collection of i.i.d. random variables taking values in a measurable space (S′, 𝒮′) that are independent of the data X_1^n and that determine the potential splits for each sub-sample. Here, each W_ι captures the random mechanism in building a prediction function based on X_ι = (X_{i_1}, ..., X_{i_r}), but the W_ι are assumed to be independent across sub-samples. Further, let H : S^r × S′ → ℝ^d be an 𝒮^r ⊗ 𝒮′-measurable function, representing the random forest algorithm, such that E[H(x_1, ..., x_r, W)] = h(x_1, ..., x_r). Then the predictions of random forests are given by a d-dimensional U-statistic with random kernel H:

    Û_n := |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} H(X_{i_1}, ..., X_{i_r}, W_ι) = |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} H(X_ι, W_ι),    (2)

where the random kernel H varies with r.

Compared to U-statistics with fixed orders (i.e., r fixed), the analysis of IOUS brings nontrivial computational and statistical challenges due to the increasing order. First, even for a moderately large value of r, exact computation over all binom(n, r) sub-samples (trees) is intractable. For diverging r, it is not possible to compute U_n in time polynomial in n.
Second, the variance of the H´ajekprojection (i.e., the first-order term in the Hoeffding decomposition [19]) of U n − θ tends to zero as r → ∞ . To wit, define a function g : S → [0 , ∞ ) by . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics g ( x ) = E [ h ( x , X , . . . , X r )], and σ g,j := E [( g j ( X ) − θ j ) ] for 1 j d, σ g := min j d σ g,j . Then the H´ajek projection of U n − θ is given by n − r P ni =1 ( g ( X i ) − θ ). By theorthogonality of the projections, we have E [( h j ( X , . . . , X r ) − θ j ) ] > X i r E [( g j ( X i ) − θ j ) ] = rσ g,j . Thus the variances of the kernel h and its associated H´ajek projection g havedifferent magnitudes. In particular, if the variance of h j ( X , . . . , X r ) is boundedby a constant C > σ g,j C/r , which vanishes as r diverges. Thus standard Gaussian ap-proximation results in literature are no longer applicable in our setting sincethey require that the componentwise variances are bounded below from zero toavoid degeneracy, i.e., there is an absolute constant σ > σ g > σ (cf. [6, 10, 11]).In this work, our focus is to derive computationally tractable and statisticallyvalid sub-sampling procedures for making inference on θ with a class of high-dimensional random kernels (i.e., large d ) of diverging orders (i.e., increasing r ).To break the computational bottleneck, we consider the incomplete version of b U n by sampling (possibly much) fewer terms than | I n,r | . In particular, we considerthe Bernoulli sampling scheme introduced in [8]. Given a positive integer N ,which represents our computational budget, define the sparsity design parameter p n := N/ | I n,r | , and let { Z ι : ι ∈ I n,r } be a collection of i.i.d. Bernoulli randomvariables with success probability p n , that are independent of the data X n and { W ι : ι ∈ I n,r } . 
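In code, the Bernoulli sampling scheme amounts to one coin flip per r-tuple; the resulting estimator is the incomplete U-statistic denoted U′_{n,N} in (3) below. A minimal d = 1 sketch, with our own function names:

```python
import itertools
import random

def incomplete_u_statistic(data, h, r, budget, seed=0):
    """Bernoulli-sampled incomplete U-statistic: keep each r-tuple iota
    independently with probability p_n = budget / |I_{n,r}| (the event
    Z_iota = 1) and average the kernel over the kept tuples.

    For clarity we still enumerate I_{n,r}; a practical implementation
    would instead draw hat-N ~ Binomial(|I_{n,r}|, p_n) and then sample
    that many tuples directly, avoiding the full enumeration.
    """
    rng = random.Random(seed)
    tuples = list(itertools.combinations(range(len(data)), r))
    p_n = min(1.0, budget / len(tuples))
    kept = [ix for ix in tuples if rng.random() < p_n]  # tuples with Z_iota = 1
    if not kept:  # hat-N = 0; in practice one would resample
        return float("nan")
    return sum(h(*(data[i] for i in ix)) for ix in kept) / len(kept)

# With budget >= |I_{n,r}| every tuple is kept (p_n = 1) and the
# complete U-statistic is recovered exactly.
full = incomplete_u_statistic([1.0, 2.0, 3.0, 4.0],
                              lambda x, y: (x - y) ** 2 / 2, r=2,
                              budget=10**6)
```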
Consider the following incomplete U-statistic (on the data X_1^n) with random kernel and weights:

    U′_{n,N} := N̂^{-1} Σ_{ι ∈ I_{n,r}} Z_ι H(X_ι, W_ι),  where N̂ := Σ_{ι ∈ I_{n,r}} Z_ι.    (3)

Obviously, U′_{n,N} is an unbiased estimator of θ, and it only involves computing N̂ terms, which on average is much smaller than |I_{n,r}| if p_n ≪ 1. When h is both deterministic and of fixed order, finite sample bounds for the Gaussian and bootstrap approximations of U′_{n,N} − θ (after a suitable normalization) are established in [8]. Roughly speaking, the error bound analysis in [8] has two major steps: i) establish the Gaussian approximation to the Hájek projection, and ii) bound the maximum norm of all higher-order degenerate terms. As discussed above, the first-order Hájek projection in the Hoeffding decomposition is asymptotically vanishing for the IOUS, and we must control the moments of an increasing number of degenerate terms, which makes the analysis of the incomplete IOUS with random kernels substantially more subtle.

In Section 2, we derive non-asymptotic Gaussian approximation error bounds for approximating the distribution of the incomplete IOUS U′_{n,N} with random
(4)It is worth noting that (4) is sharp in the sense that for the linear kernel h ( x , · · · , x r ) = ( x + · · · + x r ) /r , we have σ − g ≍ r if c Var( X j ) C .If further D n = O (1), log( d ) = O (log( n )) and n = O ( N ) (i.e., the compu-tational complexity is at least linear in sample size), then the order of U ′ n,N isallowed to increase at the rate of r = o ( n / − ǫ ) for any ǫ ∈ (0 , / d = O ( e n c )for some constant c ∈ (0 , / r is still allowed to increase at a polynomial rate in n .The proof of our Gaussian approximation results for IOUS builds upon anumber of recently developed technical tools such as Gaussian approximationresults for sum of independent random vectors and U -statistics of fixed orders [6,7, 10, 11], anti-concentration inequality for Gaussian maxima [9], and iterativeconditioning argument for high-dimensional incomplete U -statistics (with thefixed kernel and order) [8]. However, there are three technical innovations inour proof to accommodate the issues of diverging orders and randomness ofthe kernel. First, we use the iterative renormalization for each dimension of g and also H by its variance. This simple trick turns out to be the crux to avoidthe lower bound assumption for Gaussian approximation in the literature [8,10]. Second, we derive an order-explicit maximal inequality for the expectedsupremum of the remainder of the H´ajek projection of the IOUS (cf. Section5). This maximal inequality is new in literature and our main tools include asymmetrization inequality of [27] and Bonami inequality [13, Theorem 3.2.2]for the Rademacher chaos, both with the explicit dependence on r . 
Third, we develop new tail probability inequalities for U-statistics with random kernels by leveraging the independence between {W_ι : ι ∈ I_{n,r}} and the data X_1^n.

In Section 3, we derive computationally tractable and fully data-driven inferential methods for θ based on the incomplete IOUS when the sample size n, the dimension d, and the order r are all large. We consider a multiplier bootstrap procedure consisting of two partial bootstraps that are conditionally independent given X_1^n and {W_ι, Z_ι : ι ∈ I_{n,r}}: one estimates the covariance matrix of the randomized kernel, and the other estimates that of the Hájek projection. The latter is usually computationally demanding, and we develop a divide and conquer algorithm to keep the overall computational cost of our multiplier bootstrap procedure at most O(n² d + B(N + n) d), where B denotes the number of bootstrap iterations. Thus the computational cost of the bootstrap to approximate the sampling distribution of the incomplete IOUS can be made independent of the order r, even though r diverges.

In Section 4, we discuss the key non-degeneracy condition (4) for deriving the validity of the Gaussian and bootstrap approximations. We provide a general embedding scheme under which a Cramér-Rao type lower bound can be established for the minimum σ²_g of the projection variances. Specifically, the lower bound for r² σ²_g only involves the sensitivity of E[h(X_1, ..., X_r)] under perturbation and the Fisher information of the embedded family, which in some cases remain constant as r diverges. In non-parametric regressions, there is a natural embedding of the response variable into a location family such that the sensitivity and Fisher information can be explicitly computed.

For univariate U-statistics (d = 1), the asymptotic distributions are derived in the seminal paper [19] for the non-degenerate case.
[14] introduced the notion of "infinite-order U-statistics" (IOUS) with diverging orders and established the central limit theorem for U_n when d = 1. For univariate IOUS, asymptotic normality results can be found in [2, Chapter 4.6], and Berry-Esseen type bounds for IOUS were established by [16, 30, 31]. Further, [23] applied IOUS to construct a prediction interval at one test point. However, i) [23] does not address the issue that the variance of the Hájek projection is vanishing: the two conditions in Theorem 1 therein, E[h²_{k_n}(Z_1, ..., Z_{k_n})] ≤ C < ∞ and lim ζ_{1,k_n} ≠ 0, are not compatible based on our previous discussions; ii) in practice, the size d of a test set may be comparable to, or even much larger than, the size n of a training set, and the current work is motivated by such considerations. Limit theorems for the related infinite-order V-statistics and infinite-order U-processes were studied in [18, 28]. High-dimensional Gaussian approximation results and bootstrap methods were established in [10, 11] for sums of independent random vectors, and in [6, 8] for U-statistics. We refer readers to these references for extensive literature reviews.

Incomplete U-statistics were first introduced in [1], and can be viewed as a special case of weighted U-statistics. There is a large literature on limit theorems for weighted U-statistics; see [22, 24, 25, 26]. The asymptotic distributions of incomplete U-statistics (for fixed d) were derived in [5] and [20]; see also Section 4.3 in [21] for a review of incomplete U-statistics. Recently, incomplete U-statistics have gained renewed interest in the statistics and machine learning literature [12, 23]. To the best of our knowledge, the current paper is the first work that establishes distributional approximation theorems for incomplete IOUS with random kernels and increasing orders in high dimensions.

The remainder of the paper is organized as follows.
We develop Gaussian approximation results for the above U-statistics in Section 2, and bootstrap methods for the variance of the approximating Gaussian distribution in Section 3. We apply the theoretical results to several examples in Section 4. We highlight a maximal inequality in Section 5, and present all other proofs in Appendix A.

We write l.h.s. ≲ r.h.s. if there exists a finite and positive absolute constant C such that l.h.s. ≤ C × r.h.s.. We shall use c, C, C_1, C_2, ... to denote finite and positive absolute constants, whose values may differ from place to place. We denote X_i, ..., X_{i′} by X_i^{i′} for i ≤ i′.

For a, b ∈ ℝ, let ⌊a⌋ denote the largest integer that does not exceed a, a ∨ b = max{a, b} and a ∧ b = min{a, b}. For a, b ∈ ℝ^d, we write a ≤ b if a_j ≤ b_j for 1 ≤ j ≤ d, and write [a, b] for the hyperrectangle Π_{j=1}^d [a_j, b_j] if a ≤ b. We denote by 𝓡 := {Π_{j=1}^d [a_j, b_j] : −∞ ≤ a_j ≤ b_j ≤ ∞} the collection of hyperrectangles in ℝ^d. Further, for a ∈ ℝ^d and r, t ∈ ℝ, ra + t is the vector in ℝ^d with j-th component ra_j + t. For a matrix A = (a_{ij}), denote ‖A‖_∞ = max_{i,j} |a_{ij}|. For a diagonal matrix Λ with positive diagonal entries, Λ^{-1/2} (resp. Λ^{1/2}) is the diagonal matrix with j-th diagonal entry Λ_{jj}^{-1/2} (resp. Λ_{jj}^{1/2}).

For β > 0, let ψ_β : [0, ∞) → ℝ be the function defined by ψ_β(x) = e^{x^β} − 1, and for any real-valued random variable ξ, define ‖ξ‖_{ψ_β} = inf{C > 0 : E[ψ_β(|ξ|/C)] ≤ 1}. Further, we define a family of functions {ψ̃_β(·)} on [0, ∞) indexed by β > 0. For β ≥ 1, define ψ̃_β = ψ_β. For β ∈ (0, 1), let τ_β = (βe)^{1/β} and x_β = (1/β)^{1/β}, and define ψ̃_β(x) = τ_β x 1{x < x_β} + ψ_β(x) 1{x ≥ x_β}.
2. Gaussian approximations for IOUS
In this section, we shall derive non-asymptotic Gaussian approximation error bounds for: (i) the IOUS with random kernel Û_n in (2), which includes the IOUS with deterministic kernel U_n in (1) as a special case, and (ii) the incomplete IOUS U′_{n,N} in (3) under the Bernoulli sampling scheme.

Recall that h(x_1^r) = E[H(x_1^r, W)], g(x_1) = E[h(x_1, X_2^r)], θ = E[g(X_1)], σ²_{g,j} = E[(g_j(X_1) − θ_j)²] and σ²_g = min_{1≤j≤d} σ²_{g,j}. Further, define

    Γ_g := Cov(g(X_1)),  Γ_H := Cov(H(X_1^r, W)),  σ²_{H,j} := E[(H_j(X_1^r, W) − θ_j)²] for 1 ≤ j ≤ d.

Clearly, for 1 ≤ j ≤ d, σ²_{H,j} ≥ σ²_{g,j}, and thus σ²_H := min_{1≤j≤d} σ²_{H,j} ≥ σ²_g. Define two d × d diagonal matrices Λ_g and Λ_H such that

    Λ_{g,jj} := σ²_{g,j},  Λ_{H,jj} := σ²_{H,j},  for 1 ≤ j ≤ d.    (5)

Let Y^A and Y^B be two independent d-dimensional zero-mean Gaussian random vectors with covariance matrices Γ_g and Γ_H respectively. We may take Y^A and Y^B to be independent of any other random variables. Further, for any two zero-mean d-dimensional random vectors U and Y,

    ρ(U, Y) := sup_{R ∈ 𝓡} |P(U ∈ R) − P(Y ∈ R)|,

where we recall that 𝓡 := {Π_{j=1}^d [a_j, b_j] : −∞ ≤ a_j ≤ b_j ≤ ∞} is the collection of hyperrectangles in ℝ^d.

Finally, in view of the discussions in the Introduction (Section 1) and to simplify the presentation, we assume σ_g ≤ 1. Otherwise, the conclusions in this paper hold with σ_g replaced by min{σ_g, 1}. We start with Û_n. Define for 1 ≤ j ≤ d, q > 0, and (x_1, ..., x_r) ∈ S^r,

    B_{n,j}(x_1, ..., x_r) := ‖H_j(x_1, ..., x_r, W) − h_j(x_1, ..., x_r)‖_{ψ_q}.    (6)

We make the following assumptions: there exist D_n > 0 and q > 0 such that

    σ_{g,j} > 0, for all j = 1, ..., d,    (C1-ND)
    E[|g_j(X_1) − θ_j|³] ≤ σ²_{g,j} D_n, for all j = 1, ..., d,    (C2)
    ‖h_j(X_1^r) − θ_j‖_{ψ_q} ≤ D_n, for all j = 1, ..., d,    (C3)
    ‖B_{n,j}(X_1^r)‖_{ψ_q} ≤ D_n, for all j = 1, ..., d.    (C4)

Clearly, if |H_j(X_1^r, W)| ≲ D_n a.s. for 1 ≤ j ≤ d, then the latter three conditions hold. Indeed, (C3) and (C4) follow immediately from the definition, and (C2) is due to the observation that E[|g_j(X_1) − θ_j|³] ≲ E[(g_j(X_1) − θ_j)²] D_n = σ²_{g,j} D_n.

Theorem 2.1.
Assume that (C1-ND), (C2), (C3) and (C4) hold. Then

    ρ( √n (Û_n − θ), rY^A ) ≲ ( r² D_n² log^{q*}(dn) / (σ²_g n) )^{1/6},

where q* := (6/q + 1) ∨ 7, Y^A ∼ N(0, Γ_g), and ≲ means up to a multiplicative constant that only depends on q.

Proof. See Section A.3. We highlight that a key step in establishing Theorem 2.1 is to control the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel (see Theorem 5.1). The Gaussian approximation result for IOUS then follows from Gaussian approximation results for sums of independent random vectors [10] and the anti-concentration inequality [9], by an argument similar to that in [8] with proper normalization. □
Clearly, in the special case of a non-random kernel, i.e., H(x_1, ..., x_r, W) = h(x_1, ..., x_r), (C4) trivially holds. Thus we have the following immediate result for the IOUS with deterministic kernel U_n in (1).
Assume that (C1-ND), (C2) and (C3) hold. Then

    ρ( √n (U_n − θ), rY^A ) ≲ ( r² D_n² log^{q*}(dn) / (σ²_g n) )^{1/6},

where q* := (6/q + 1) ∨ 7, and ≲ means up to a multiplicative constant that only depends on q.

Remark 2.3 (Comparisons with existing results for d = 1). For the univariate IOUS with non-random kernels, asymptotic normality and its rate of convergence are well understood in the literature; see [2] for a survey of results in this direction. In [31], a Berry-Esseen bound is derived for symmetric statistics, which include IOUS (with non-random kernels) as a special case. In particular, applying Corollary 4.1 in [31] to IOUS, the rate of convergence to normality is of order O(r n^{-1/2} σ_H/σ_g) for a bounded kernel, which implies that asymptotic normality requires (at least) r = o(n^{1/3}), since σ_H/σ_g ≥ √r. A related Berry-Esseen bound is given in [16]. In both papers, the rates of convergence are suboptimal. For elementary symmetric polynomials (which are U-statistics corresponding to the product kernel h(x_1, ..., x_r) = x_1 ··· x_r), it is shown in [30] that the sharp rate of convergence to normality is of order O(r n^{-1/2}), provided that E[X_1] = 0, Var(X_1) ∈ (0, ∞), E[|X_1|³] < ∞ and r = O((log n)^{-1} (log log n)^{-1} n^{1/2}). This result implies that asymptotic normality for the IOUS with the product kernel is achieved when r = O(log^{-2}(n) n^{1/2}). If σ_g^{-2} = O(r²), which holds under the regularity conditions in Lemma 4.1, our Corollary 2.2 with q = 1 implies that the rate of convergence for high-dimensional IOUS is O((r⁴ log⁷(dn) n^{-1})^{1/6}) (with suitably bounded moments). In particular, the Gaussian approximation is asymptotically valid if log d = O(log n) and r = o(n^{1/4−ǫ}) for any ǫ ∈ (0, 1/4). Although this is a stronger requirement on r, and the rate is slower than the optimal rate in the case d = 1, Corollary 2.2 does allow the dimension to grow sub-exponentially fast in the sample size, which is a useful feature for high-dimensional statistical inference.

In addition, to the best of our knowledge, the validity of the bootstrap procedures proposed in Section 3 to approximate the sampling distribution of IOUS (on hyperrectangles in ℝ^d) is new in the literature.

Now we consider U′_{n,N}, where we recall that N is a given computational budget. We will assume the following conditions: for q > 0,

    ‖H_j(X_1^r, W) − θ_j‖_{ψ_q} ≤ D_n, for all j = 1, ..., d,    (C3')
    E[|H_j(X_1^r, W) − θ_j|³] ≤ σ²_{H,j} D_n, for all j = 1, ..., d.    (C5)

Clearly, (C4) and (C3') imply (C3) up to a multiplicative constant. Further, (C3') and (C5) hold if |H_j(X_1^r, W)| ≲ D_n a.s. for 1 ≤ j ≤ d.

Theorem 2.4.
Assume that (C1-ND), (C2), (C4), (C3') and (C5) hold. Then

    ρ( √n (U′_{n,N} − θ), rY^A + α_n^{1/2} Y^B ) ≲ ϖ_n,  where ϖ_n := ( r^{q_1} D_n² log^{q*}(dn) / (σ²_g (n ∧ N)) )^{1/6},

where α_n := n/N, q_1 := 2 ∨ (2/q), q* := (6/q + 1) ∨ 7, ≲ means up to a multiplicative constant that only depends on q, and we recall that Y^A ∼ N(0, Γ_g), Y^B ∼ N(0, Γ_H), and Y^A, Y^B are independent.

Proof. See Section A.4.4. □
Remark 2.5. If q ≥ 1, then q_1 = 2 and q* = 7. Since ‖ξ‖_{ψ_1} ≲ ‖ξ‖_{ψ_q} for any random variable ξ and q ≥ 1, we may assume without loss of generality that q ≤ 1. If r is fixed, q = 1, the kernel is deterministic, and there exists some absolute constant σ > 0 such that σ²_g ≥ σ², then the above theorem recovers Theorem 3.1 from [8].

Further, by first conditioning on X_1^n, we have

    Γ_H = Cov(H(X_1^r, W)) ⪰ Cov(h(X_1^r)) := Γ_h,

where for two square matrices, A ⪰ B means that A − B is positive semi-definite. Thus the random kernel H(·) increases the variance of the approximating Gaussian distribution compared to the associated deterministic kernel h(·).
3. Bootstrap approximations
In Section 2, we have seen that the incomplete U-statistic with random kernel is approximated by the Gaussian distribution N(0, r² Γ_g + α_n Γ_H). However, the covariance is typically unknown in practice. In this section, we estimate Γ_g and Γ_H by bootstrap methods, starting with Γ_H.

Let 𝒟_n := {X_1, ..., X_n} ∪ {W_ι, Z_ι : ι ∈ I_{n,r}} be the data involved in the definition of U′_{n,N}, and take a collection of independent N(0, 1) random variables {ξ′_ι : ι ∈ I_{n,r}} that is independent of the data 𝒟_n. Define the following bootstrap distribution:

    U_{n,B} := N̂^{-1/2} Σ_{ι ∈ I_{n,r}} ξ′_ι Z_ι ( H(X_ι, W_ι) − U′_{n,N} ).    (7)

The next theorem establishes the validity of U_{n,B}.

Theorem 3.1.
Assume that the conditions (C1-ND), (C2), (C4), (C3') and (C5) hold. If

    r^{q_1} D_n² log^{q_2}(dn) / ( (σ²_H ∧ 1)(n ∧ N) ) ≤ C_1 n^{-ζ_1},    (8)

for q_1 := 2 ∨ (2/q), q_2 := (4/q + 1) ∨ 5, and some constants C_1 > 0 and ζ_1 ∈ (0, 1), then there exists a constant C depending only on q, C_1 and ζ_1 such that with probability at least 1 − C/n,

    sup_{R ∈ 𝓡} | P_{|𝒟_n}( U_{n,B} ∈ R ) − P( Y^B ∈ R ) | ≤ C n^{-ζ_1/6}.

Proof.
See Section A.5.1. □
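For intuition, a single draw of the multiplier bootstrap (7) can be sketched in a few lines (d = 1, with the kept kernel evaluations collected in a list; the function name is ours):

```python
import random

def bootstrap_draw(kernel_values, rng):
    """One draw of U_{n,B} in (7), for d = 1: hat-N^{-1/2} times the sum
    of xi'_iota * (H(X_iota, W_iota) - U'_{n,N}) over the kept tuples,
    where the xi'_iota are i.i.d. N(0, 1) multipliers independent of the
    data.  Conditionally on the data, each draw is exactly Gaussian with
    mean 0 and variance sigma-hat_H^2 = hat-N^{-1} sum (H - U'_{n,N})^2.
    """
    n_hat = len(kernel_values)
    u_prime = sum(kernel_values) / n_hat  # U'_{n,N}
    s = sum(rng.gauss(0, 1) * (v - u_prime) for v in kernel_values)
    return s / n_hat ** 0.5

# Toy check of the conditional variance: for these values
# sigma-hat_H^2 = 2, so the empirical variance of many draws is near 2.
rng = random.Random(0)
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
draws = [bootstrap_draw(vals, rng) for _ in range(20_000)]
```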
Let S_1 ⊂ {1, ..., n}, and n_1 = |S_1|. Further, consider a collection of 𝒟_n-measurable ℝ^d-valued random vectors {G_i : i ∈ S_1}, where G_i is some "good" estimator of g(X_i), whose form is specified later. We use the following quantity to measure the quality of G_i as an estimator of g(X_i):

    Δ̂_{A,1} := max_{1≤j≤d} (n_1 σ²_{g,j})^{-1} Σ_{i ∈ S_1} ( G_{i,j} − g_j(X_i) )².    (9)

Define Ḡ := n_1^{-1} Σ_{i ∈ S_1} G_i and consider the following bootstrap distribution for N(0, Γ_g):

    U_{n_1,A} := n_1^{-1/2} Σ_{i ∈ S_1} ξ_i ( G_i − Ḡ ),    (10)

where {ξ_i : i ∈ S_1} is a collection of independent N(0, 1) random variables that is independent of 𝒟_n and {ξ′_ι : ι ∈ I_{n,r}}.

Lemma 3.2.
Assume that the conditions (C1-ND), (C2) and (C3') hold. If

    D_n² log^{q_2}(dn) / (σ²_g n_1) ≤ C_1 n^{-ζ_1},  and  P( Δ̂_{A,1} log²(d) > C_1 n^{-ζ_2} ) ≤ C_1 n^{-1},    (11)

for q_2 := (4/q + 1) ∨ 5, and some constants C_1 > 0 and ζ_1, ζ_2 ∈ (0, 1), then there exists a constant C depending only on q, C_1, ζ_1 and ζ_2 such that with probability at least 1 − C/n,

    sup_{R ∈ 𝓡} | P_{|𝒟_n}( U_{n_1,A} ∈ R ) − P( Y^A ∈ R ) | ≤ C n^{-(ζ_1 ∧ ζ_2)/6},

where we recall that Y^A ∼ N(0, Γ_g).

Proof. See Subsection A.5.2. □

Hereafter we consider a special case of the divide and conquer bootstrap algorithm in [8] to estimate Γ_g. For each i ∈ S_1, partition the remaining indices {1, ..., n} \ {i} into disjoint subsets {S^{(i)}_{2,k} : k = 1, ..., K}, each of size L = r − 1, where K = ⌊(n − 1)/(r − 1)⌋. Now define, for each i ∈ S_1 and k = 1, ..., K,

    S̄^{(i)}_{2,k} := {i} ∪ S^{(i)}_{2,k},  G_i := K^{-1} Σ_{k=1}^K H( X_{S̄^{(i)}_{2,k}}, W_{S̄^{(i)}_{2,k}} ).

Finally, define U_{n,n_1} := r U_{n_1,A} + α_n^{1/2} U_{n,B}.

Theorem 3.3.
Assume that the conditions (C1-ND), (C2), (C4), (C3') and (C5) hold. If

    r^{q_1} D_n² log^{q*}(dn) / ( σ²_g (n ∧ N) ) ≤ C_1 n^{-ζ},    (12)

for q_1 := 2 ∨ (2/q), q* := (6/q + 1) ∨ 7, and some constants C_1 > 0, ζ ∈ (0, 1), then for any ν ∈ (max{1/6, 1/ζ}, ∞), there exists a constant C depending only on q, ζ, ν and C_1 such that with probability at least 1 − C/n,

    sup_{R ∈ 𝓡} | P_{|𝒟_n}( U_{n,n_1} ∈ R ) − P( rY^A + α_n^{1/2} Y^B ∈ R ) | ≤ C n^{-(ζ − 1/ν)/6}.

Proof. See Subsection A.5.3. □
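A minimal d = 1 sketch of the divide and conquer construction of the G_i above (the function names are ours, and the `rng` argument stands in for the W_ι randomness):

```python
import random

def dnc_g_estimates(data, H, r, S1, seed=0):
    """Divide and conquer estimates G_i of g(X_i): for each i in S1, the
    remaining n - 1 indices are split into K = (n-1) // (r-1) disjoint
    blocks of size r - 1, and G_i averages the kernel H over the K
    tuples {i} union block.  Total cost is |S1| * K kernel calls,
    with no dependence on binom(n, r).
    """
    rng = random.Random(seed)
    n = len(data)
    K = (n - 1) // (r - 1)
    estimates = {}
    for i in S1:
        rest = [j for j in range(n) if j != i]
        rng.shuffle(rest)  # one random partition into size-(r-1) blocks
        total = 0.0
        for k in range(K):
            block = rest[k * (r - 1):(k + 1) * (r - 1)]
            total += H([data[i]] + [data[j] for j in block], rng)
        estimates[i] = total / K
    return estimates

# With the deterministic averaging kernel and r = 2, G_i is exactly
# x_i / 2 + mean(remaining points) / 2, regardless of the shuffle.
avg = lambda xs, rng: sum(xs) / len(xs)
G = dnc_g_estimates([1.0, 2.0, 3.0, 4.0, 5.0], avg, r=2, S1=[0, 1])
```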
We first combine the Gaussian approximation result with the bootstrap result.
Corollary 3.4.
Assume that (C1-ND), (C2), (C4), (C3') and (C5) hold. Further, assume that (12) holds for some constants C_1 > 0 and ζ ∈ (0, 1). Then there exists a constant C depending only on q, C_1 and ζ such that with probability at least 1 − C/n,

    sup_{R ∈ 𝓡} | P( √n (U′_{n,N} − θ) ∈ R ) − P_{|𝒟_n}( U_{n,n_1} ∈ R ) | ≤ C n^{-ζ/7}.

Proof.
It follows from Theorem 2.4 and Theorem 3.3 (with ν = 7/ζ). □

In simultaneous confidence interval construction, it is sometimes desirable to normalize the variance of each dimension, so that when maximum-type statistics are used the critical value is not dominated by the coordinates with large variances. Define for 1 ≤ j ≤ d,

    σ̂²_{g,j} := n_1^{-1} Σ_{i ∈ S_1} ( G_{i,j} − Ḡ_j )²,  σ̂²_{H,j} := N̂^{-1} Σ_{ι ∈ I_{n,r}} Z_ι ( H_j(X_ι, W_ι) − U′_{n,N,j} )²,

which are the diagonal elements of the conditional covariance matrices of U_{n_1,A} in (10) and U_{n,B} in (7), respectively. Further, define a d × d diagonal matrix Λ̂ with

    Λ̂_{j,j} = r² σ̂²_{g,j} + α_n σ̂²_{H,j}, for each 1 ≤ j ≤ d.
Assume the conditions of Corollary 3.4. Then there exists a constant C depending only on q, C_1 and ζ such that with probability at least 1 − C/n,

    sup_{R ∈ 𝓡} | P( √n Λ̂^{-1/2} (U′_{n,N} − θ) ∈ R ) − P_{|𝒟_n}( Λ̂^{-1/2} U_{n,n_1} ∈ R ) | ≤ C n^{-ζ/7}.

Consequently,

    sup_{t>0} | P( ‖√n Λ̂^{-1/2} (U′_{n,N} − θ)‖_∞ ≤ t ) − P_{|𝒟_n}( ‖Λ̂^{-1/2} U_{n,n_1}‖_∞ ≤ t ) | ≤ C n^{-ζ/7}.

Proof.
See Subsection A.5.4. □
Remark 3.6.
From Corollary 3.5, we can immediately construct confidence intervals for θ in a data-dependent way. Specifically, let q̂_{1−α} be a (1 − α)-th quantile of the conditional distribution of ‖Λ̂^{-1/2} U_{n,n_1}‖_∞ given 𝒟_n. Then one way to construct simultaneous confidence intervals with confidence level (1 − α) is as follows: for 1 ≤ j ≤ d,

    U′_{n,N,j} ± q̂_{1−α} n^{-1/2} Λ̂_{j,j}^{1/2}.
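Given bootstrap draws of U_{n,n_1} and the diagonal of Λ̂, the recipe of Remark 3.6 is a few lines of code. The following is a simplified sketch with our own function names: it takes an empirical quantile of the studentized max-norm over the draws, not the exact pipeline of the paper.

```python
import math

def simultaneous_cis(point_est, boot_draws, lam_diag, n, alpha=0.1):
    """Simultaneous (1 - alpha) confidence intervals as in Remark 3.6:
    q-hat is the empirical (1 - alpha) quantile of
    max_j |draw_j| / Lambda-hat_{j,j}^{1/2} over the bootstrap draws of
    U_{n,n_1}, and interval j is point_est_j +/- q-hat * sqrt(lam_j / n).
    """
    scale = [math.sqrt(l) for l in lam_diag]  # Lambda-hat_{j,j}^{1/2}
    maxima = sorted(max(abs(b_j) / s_j for b_j, s_j in zip(draw, scale))
                    for draw in boot_draws)
    idx = min(len(maxima) - 1, math.ceil((1 - alpha) * len(maxima)) - 1)
    q_hat = maxima[idx]
    return [(t - q_hat * s / math.sqrt(n), t + q_hat * s / math.sqrt(n))
            for t, s in zip(point_est, scale)]

# Toy usage: d = 2 coordinates, three bootstrap draws, n = 4.
cis = simultaneous_cis([0.0, 0.0], [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]],
                       [1.0, 4.0], n=4, alpha=0.5)
```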
4. Applications
In many applications, g(x) = E[h(x, X_2, ..., X_r)] does not admit an explicit form, and thus it is usually hard to compute σ²_g in conditions (C1-ND) and (12) directly. When the kernel h has special structures, we can establish a lower bound on σ²_g with explicit dependence on r, which can be applied to Example 1.1. We shall give additional examples in Sections 4.3 and 4.4 to illustrate the usefulness of U-statistics as a tool to estimate and make inference on certain statistical functionals of X_1, ..., X_r. In Section 4.3, for the expected maximum and log-mean functionals, we also establish a lower bound on σ²_g with explicit dependence on r. In Section 4.4, for the kernel density estimation problem, r is assumed to be fixed, but we allow the diameter of the design points to diverge.

For simplicity of the presentation, in this section we assume that all involved derivatives and integrals exist and are finite, and that the order of integrals, and the order of integration and differentiation, can be exchanged. These assumptions can be justified under standard smoothness and moment conditions. For illustration, we use q = 1 in (C4) and (C3').

We now turn to the lower bound on σ²_g. Suppose that the distribution P of X_1 has a density function f with respect to some σ-finite (reference) measure μ, i.e.,

    P(A) = ∫_A f(x) μ(dx) for any A ∈ 𝒮.

We first embed f into a family of densities {f_β : β ∈ B ⊂ ℝ^ℓ}, where B is an open neighborhood of 0 ∈ ℝ^ℓ. Such embeddings always exist, and below are some examples for S = ℝ^ℓ.

1. Location and scale family. If μ is the Lebesgue measure on ℝ^ℓ, we may consider the following location or scaling families: for x ∈ ℝ^ℓ,

    f_β(x) = f(x − β) with β ∈ ℝ^ℓ,  or  f_β(x) = (1 + β)^ℓ f((1 + β)x) with β ∈ (−1, 1).

2. Exponential family.
If φ(β) := log( ∫ f(x) e^{β^T x} μ(dx) ) < ∞ for β ∈ B, then we may consider the exponential family:

    f_β(x) = f(x) exp( β^T x − φ(β) ), for x ∈ ℝ^ℓ, β ∈ B.

3. Additive noise model.
Let Υ be an ℝ^ℓ-valued random vector independent of X_1, whose distribution is absolutely continuous w.r.t. μ; then X_1 + βΥ has a density f_β given by the convolution of the densities of X_1 and βΥ.

For β ∈ B, define the following perturbed expectation:

    θ(β) := ∫ h(x_1, ..., x_r) Π_{i=1}^r f_β(x_i) μ(dx_i) := E_β[h(X_1, ..., X_r)],

where E_β denotes the expectation when X_1, ..., X_r have density f_β. Further, define

    Ψ(β) := Σ_{i=1}^r ∇ ln f_β(X_i),  J(β) := r^{-1} Var_β(Ψ(β)),

where ∇ denotes the gradient (or the derivative when β is a scalar) with respect to β, and Var_β denotes the covariance matrix when X_1, ..., X_r have density f_β. Thus Ψ(β) is the score function and J(β) is the Fisher information for a single observation.

Lemma 4.1.
If $J(0)$ is positive definite, then
\[
\sigma_{g,j}^2 \ge r^{-2}\, (\nabla \theta_j(0))^T J^{-1}(0)\, \nabla \theta_j(0), \quad \text{for } 1 \le j \le d. \tag{13}
\]
In particular, if there exists an absolute positive constant $c$ such that $(\nabla \theta_j(0))^T J^{-1}(0)\, \nabla \theta_j(0) \ge c$ for $1 \le j \le d$, then $\sigma_g^2 \ge c\, r^{-2}$.

Proof. See Subsection A.6. $\Box$

Consider Example 1.1 and assume that $(Y_1, Z_1)$ has density $q(y) p(z; y)$ w.r.t. the product measure $\nu(dy) \otimes dz$ on $\mathcal{Y} \times \mathbb{R}$, i.e., for $A_1 \in \mathcal{B}(\mathcal{Y})$, $A_2 \in \mathcal{B}(\mathbb{R})$,
\[
P(Y_1 \in A_1,\ Z_1 \in A_2) = \int_{A_1 \times A_2} q(y)\, p(z; y)\, \nu(dy)\, dz.
\]
That is, the feature $Y_1$ has density $q(y)$ w.r.t. some $\sigma$-finite measure $\nu$ on $\mathcal{Y}$, and thus is allowed to have both continuous and discrete components. The response $Z_1$ given $Y_1 = y$ has a conditional density $p(z; y)$ w.r.t. the Lebesgue measure.

For many regression algorithms, such as tree-based methods, if we fix the features and increase the responses of the training samples by $\beta \in \mathbb{R}$, then the prediction at any test point also increases by $\beta$; i.e., for $1 \le j \le d$,
\[
H_j((y_1, z_1 + \beta), \ldots, (y_r, z_r + \beta), w) = H_j((y_1, z_1), \ldots, (y_r, z_r), w) + \beta,
\]
which implies that $h((y_1, z_1 + \beta), \ldots, (y_r, z_r + \beta)) = h((y_1, z_1), \ldots, (y_r, z_r)) + \beta$. Now we consider the embedding into the "location" family $\{q(y)\, p(z - \beta; y) : \beta \in \mathbb{R}\}$. Observe that
\[
\theta_j(\beta) = \mathrm{E}_\beta[h_j(X_1, \ldots, X_r)] = \theta_j(0) + \beta, \quad \text{for } 1 \le j \le d,
\]
which implies that $\theta_j'(0) = 1$. In addition,
\[
J(\beta) = \mathrm{Var}_\beta\left(\frac{d}{d\beta} \ln\left(q(Y_1)\, p(Z_1 - \beta; Y_1)\right)\right) = \mathrm{E}_\beta\left[\left(\frac{\partial_z p(Z_1 - \beta; Y_1)}{p(Z_1 - \beta; Y_1)}\right)^2\right].
\]
Thus, if we assume that there exists $c$ such that
\[
J(0) = \mathrm{E}\left[\left(\frac{\partial_z p(Z_1; Y_1)}{p(Z_1; Y_1)}\right)^2\right] \le c^{-1}, \tag{14}
\]
then (13) reduces to $\sigma_g^2 \ge c\, r^{-2}$. If we further assume that $|H_j(X_1^r, W)| \le C$ a.s. for some constant $C$ and each $1 \le j \le d$ (this holds, for example, when the response is bounded a.s.), then conditions (C2), (C3), (C4) and (C5) hold with $D_n = \ln^{-1}(2)\, C$.
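For intuition about condition (14), recall (see Remark 4.2 below) that in the regression model $p(z; y) = f(z - \kappa(y))$ the quantity $J(0) = \int (f'(z))^2/f(z)\, dz$ is the Fisher information of the noise density alone; for $N(0, s^2)$ noise it equals $1/s^2$. A minimal numerical check of this identity (our illustration; the Gaussian choice of $f$ is an assumption for the demo, not from the paper):

```python
import numpy as np

# Fisher information J(0) = integral of (f'(z))^2 / f(z) dz of the noise
# density, as in condition (14). For N(0, s^2) noise it equals 1/s^2.
s = 1.7
z = np.linspace(-15 * s, 15 * s, 400001)
dz = z[1] - z[0]
f = np.exp(-z**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
f_prime = -z / s**2 * f                      # analytic derivative of f
J0 = np.sum(f_prime**2 / f) * dz             # Riemann-sum approximation
print(J0, 1 / s**2)
assert abs(J0 - 1 / s**2) < 1e-6
```

Any noise density with $J(0) \le c^{-1}$ then yields the lower bound $\sigma_g^2 \ge c\, r^{-2}$ through (13).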
With these assumptions, condition (12) in Corollary 3.5 simplifies to
\[
\frac{r^4 \log^7(dn)}{n \wedge N} \le C_1\, n^{-\zeta}.
\]
Thus, if $r = O(n^{1/4 - \epsilon})$ for some $\epsilon > 0$, $\log(d) = O(\log(n))$, and $n = O(n \wedge N)$, then Corollary 3.5 can be used to construct asymptotically valid simultaneous prediction intervals, with the approximation error decaying polynomially fast in $n$.

Remark 4.2 (Fisher information in nonparametric regressions). Let us take a closer look at condition (14). Consider the nonparametric regression model
\[
Z_i = \kappa(Y_i) + \epsilon_i, \quad \text{for } 1 \le i \le n,
\]
where $\kappa : \mathcal{Y} \to \mathbb{R}$ is a deterministic measurable function, and $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. with some density $f$ with respect to the Lebesgue measure. Then $p(z; y) = f(z - \kappa(y))$, and thus
\[
J(0) = \int \left(\frac{f'(z - \kappa(y))}{f(z - \kappa(y))}\right)^2 q(y)\, f(z - \kappa(y))\, \nu(dy)\, dz = \int \frac{(f'(z))^2}{f(z)}\, dz,
\]
where for the last equality we first integrate w.r.t. $dz$ and apply a change of variables. Thus $J(0)$ depends only on the density of the noise.

Next we compute the lower bounds on $\sigma_g^2$ for two additional statistical functionals.

Example 4.3.
Let $S = \mathbb{R}^d$ and consider the following two kernels: for $1 \le j \le d$,
\[
h_j(x_1, \ldots, x_r) = \max_{1 \le i \le r} x_{ij}, \quad \text{and} \quad h_j(x_1, \ldots, x_r) = \log\left(\frac{1}{r}\sum_{i=1}^r x_{ij}\right).
\]
In the former case, we are interested in estimating the expectations of the coordinate-wise maxima of $r$ independent random vectors, $\{\mathrm{E}[\max_{1 \le i \le r} X_{ij}] : 1 \le j \le d\}$. In the latter, we assume $X_{1j} > 0$ for $1 \le j \le d$ and are interested in estimating $\{\mathrm{E}[\log(r^{-1}\sum_{i=1}^r X_{ij})] : 1 \le j \le d\}$. In both cases, the coordinates of $X_1$ can have arbitrary dependence, and we allow $r \to \infty$.

Consider the first kernel in Example 4.3, where $S = \mathbb{R}^d$ and $h_j(x_1, \ldots, x_r) = \max_{1 \le i \le r} x_{ij}$ for $1 \le j \le d$. Assume $X_{1j}$ has a density $f_j$ w.r.t. the Lebesgue measure on $\mathbb{R}$ for $1 \le j \le d$, and consider the following embedding: $\{f_j(\cdot - \beta) : \beta \in \mathbb{R}\}$. As in the previous example, for $\beta \in \mathbb{R}$,
\[
\theta_j'(\beta) = 1, \qquad J_j(\beta) = \mathrm{Var}_\beta\left(\frac{d}{d\beta} \ln f_j(X_{1j} - \beta)\right) = \int \frac{(f_j'(x - \beta))^2}{f_j(x - \beta)}\, dx.
\]
Thus, by Lemma 4.1, if we assume that for some absolute positive constant $c$,
\[
\int \frac{(f_j'(x))^2}{f_j(x)}\, dx \le c^{-1}, \quad 1 \le j \le d,
\]
then $\sigma_g^2 \ge c\, r^{-2}$. Further, if we assume that there exists a positive constant $C$ such that
\[
\|X_{1j}\|_{\psi_1} \le C, \quad 1 \le j \le d,
\]
then by the maximal inequality (e.g., see [29, Lemma 2.2.2]), $\|\max_{1 \le i \le r} X_{ij}\|_{\psi_1} \lesssim \log(r)$. Then, if we select $D_n = C' \sigma_g^{-1} \log(r)$, conditions (C2), (C3) and (C5) hold. Further, (C4) trivially holds for non-random kernels. With the above assumptions and this choice of $D_n$, condition (12) in Corollary 3.5 simplifies to
\[
(n \wedge N)^{-1}\, r^6 \log^2(r) \log^7(dn) \le C_1\, n^{-\zeta}.
\]
Now consider the second kernel in Example 4.3, where $h_j(x_1, \ldots, x_r) = \log(r^{-1}\sum_{i=1}^r x_{ij})$ and $X_{1j} > 0$ for $1 \le j \le d$. Assume $X_{1j}$ has a density $f_j$ w.r.t. the Lebesgue measure on $\mathbb{R}$ for $1 \le j \le d$, and consider the following embedding: $\{(1+\beta)\, f_j((1+\beta)\,\cdot) : \beta \in (-1/2, 1/2)\}$. As before, it is easy to see that for $1 \le j \le d$,
\[
\theta_j'(0) = 1, \qquad J_j(0) = \int \frac{(x f_j'(x) + f_j(x))^2}{f_j(x)}\, dx.
\]
Thus, if there exists a constant $c$ such that $\max_{1 \le j \le d} J_j(0) \le c^{-1}$, then $\sigma_g^2 \ge c\, r^{-2}$. Further, if there exists a constant $C > 1$ such that
\[
P(0 < X_{1j} \le C) = 1, \quad 1 \le j \le d,
\]
then conditions (C2), (C3), (C4) and (C5) hold with $D_n = \ln^{-1}(2)\log(C)$. With these assumptions, condition (12) in Corollary 3.5 simplifies to
\[
(n \wedge N)^{-1}\, r^4 \log^7(dn) \le C_1\, n^{-\zeta}.
\]
Example 4.4 (Kernel density estimation). Let $\tau : S^r \to \mathbb{R}^\ell$ be a measurable function that is symmetric in its $r$ arguments, and let $\{t_j : 1 \le j \le d\} \subset \mathbb{R}^\ell$ be $d$ design points. [15, 17] used $U_n$ as a kernel density estimator (KDE) for the density of $\tau(X_1, \ldots, X_r)$ at the given design points, with
\[
h_j(x_1, \ldots, x_r) = \frac{1}{b_n^\ell}\, \kappa\left(\frac{t_j - \tau(x_1, \ldots, x_r)}{b_n}\right), \quad 1 \le j \le d,
\]
where $b_n > 0$ is the bandwidth and $\kappa(\cdot)$ is the density estimation kernel with $\int \kappa(z)\, dz = 1$, which should not be confused with the $U$-statistic kernel $h$. For this example, we will assume $r$ fixed and the bandwidth $b_n \to 0$, but allow the diameter of the design points, $\max_{1 \le j \le d} \|t_j\|$, to grow, where $\|\cdot\|$ denotes the usual Euclidean norm.

Assume that given $X_1 = x_1$, $\tau(x_1, X_2, \ldots, X_r)$ has a density $f(z; x_1)$ w.r.t. the Lebesgue measure on $\mathbb{R}^\ell$, i.e., $P(\tau(x_1, X_2, \ldots, X_r) \in A) = \int_A f(z; x_1)\, dz$ for any $A \in \mathcal{B}(\mathbb{R}^\ell)$. Then by definition, for $1 \le j \le d$,
\[
g_j(x_1) = \mathrm{E}[h_j(x_1, X_2, \ldots, X_r)] = \int \frac{1}{b_n^\ell}\, \kappa\left(\frac{t_j - z}{b_n}\right) f(z; x_1)\, dz = \int \kappa(z)\, f(t_j - b_n z; x_1)\, dz.
\]
For $t \in \mathbb{R}^\ell$, denote
\[
V_n(t) := \mathrm{Var}\left(\int \kappa(z)\, f(t - b_n z; X_1)\, dz\right), \qquad V(t) := \mathrm{Var}(f(t; X_1)).
\]
As in [15], if $\int \kappa^2(z)\, dz < \infty$ and $\sup_t \mathrm{E}[f^2(t; X_1)] < \infty$, then $\lim_{n \to \infty} V_n(t) = V(t)$ for any fixed $t$. If there exists some $R > 0$ such that $\max_{1 \le j \le d} \|t_j\| \le R$ for any $d \in \mathbb{N}$ and $\inf_{t \in \mathbb{R}^\ell : \|t\| \le R} V(t) > 0$, then under mild continuity assumptions (e.g., the equicontinuity of $V_n(t)$), there exists an absolute constant $c > 0$ such that $\sigma_g^2 \ge c$ for large $n$. Then we can apply the result in [8], which does not allow $\sigma_g^2$ to vanish.

In this work, we allow $\sigma_g^2$ to vanish, and thus allow the diameter of the design points to grow as $n$ becomes large. Specifically, if we assume $\kappa(\cdot)$ is bounded by some constant $C$, we can select $D_n = \ln^{-1}(2)\, C\, b_n^{-\ell}$ in conditions (C2), (C3), (C4) and (C5). Then condition (12) in Corollary 3.5 simplifies to
\[
\frac{\log^7(dn)}{\sigma_g^2\, b_n^{2\ell}\, (n \wedge N)} \le C_1\, n^{-\zeta}.
\]
Thus, if $\log(d) = O(\log(n))$ and $n = O(n \wedge N)$, then to apply Corollary 3.5 we require that $\sigma_g^{-2} = O(b_n^{2\ell}\, n^{1-\epsilon})$ for some $\epsilon > 0$.

Remark 4.5. [15] considers the case $d = 1$ and shows the $\sqrt{n}$-convergence rate of the KDE. The same discussion applies here. [17] constructs confidence bands (without computational considerations and bootstrap results) for the density of $\tau(X_1, \ldots, X_r)$, under the additional assumptions required to establish the convergence of empirical processes.
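A small simulation of the $U$-statistic KDE of Example 4.4. The choices below are illustrative assumptions (ours, not from the paper): $r = 2$, $\tau(x_1, x_2) = x_1 + x_2$, a Gaussian smoothing kernel $\kappa$, and standard normal data, so that $\tau(X_1, X_2) \sim N(0, 2)$ has a known density to compare against:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, r, b = 400, 2, 0.35
X = rng.standard_normal(n)
t = np.array([0.0, 1.0])                  # two design points t_j

def kappa(z):                             # Gaussian smoothing kernel
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

# Complete U-statistic KDE: h_j(x1, x2) = b^{-1} kappa((t_j - (x1 + x2)) / b),
# averaged over all n-choose-2 ordered pairs.
pairs = np.array(list(combinations(range(n), r)))
tau = X[pairs[:, 0]] + X[pairs[:, 1]]
est = kappa((t[:, None] - tau[None, :]) / b).mean(axis=1) / b

truth = np.exp(-t**2 / 4) / np.sqrt(4 * np.pi)   # N(0, 2) density at t
print(est, truth)
assert np.all(np.abs(est - truth) < 0.05)
```

The estimate differs from the target only by the usual smoothing bias of order $b^2$ plus the $U$-statistic sampling error.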
5. Maximal inequality
In this section, we derive an upper bound on the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel. This maximal inequality (with explicit dependence on $r$) serves as a key step in establishing the Gaussian approximation result for the incomplete IOUS with random kernel.

Theorem 5.1.
Assume that (C3) holds. Then there exist constants $c, C$, depending only on $q$, such that if $r^2 \log(d)/n \le c$, then
\[
\mathrm{E}\left[\max_{1 \le j \le d}\left|(U_{n,j} - \theta_j) - \frac{r}{n}\sum_{i=1}^n (g_j(X_i) - \theta_j)\right|\right] \le C\, \frac{r^2 \log^{1+1/q}(d)\, D_n}{n}.
\]
The proof of Theorem 5.1 is quite involved: we need to develop a number of technical tools, such as a symmetrization inequality and a Bonami-type inequality (i.e., an exponential moment bound) for the Rademacher chaos, all with explicit dependence on $r$.

We start with some notation. Let $X' := (X_1', \ldots, X_n')$ be an independent copy of $X := (X_1, \ldots, X_n)$, and let $\epsilon := (\epsilon_1, \ldots, \epsilon_n)$ be i.i.d. Rademacher random variables, i.e., $P(\epsilon_1 = 1) = P(\epsilon_1 = -1) = 1/2$, that are independent of $X$ and $X'$. If all involved random variables are independent, we write $\mathrm{E}_\epsilon$ (resp. $\mathrm{E}_{X'}$) for the expectation only w.r.t. $\epsilon$ (resp. $X'$).

For a given probability space $(\mathcal{X}, \mathcal{A}, Q)$, a measurable function $f$ on $\mathcal{X}$ and $x \in \mathcal{X}$, we use the notation $Qf = \int f\, dQ$ whenever the latter integral is well defined, and denote by $\delta_x$ the Dirac measure on $\mathcal{X}$, i.e., $\delta_x(A) = 1\{x \in A\}$ for any $A \in \mathcal{A}$. For a measurable symmetric function $f$ on $S^r$ and $k = 0, 1, \ldots, r$, let $P^{r-k} f$ denote the function on $S^k$ defined by
\[
P^{r-k} f(x_1, \ldots, x_k) := \mathrm{E}[f(x_1, \ldots, x_k, X_{k+1}, \ldots, X_r)],
\]
whenever it is well defined. To prove Theorem 5.1, without loss of generality, we may assume
\[
\theta = P^r h = 0,
\]
since we can always consider $h(\cdot) - \theta$ instead. For $0 \le k \le r$, define
\[
\widetilde{\pi}_k h(x_1, \ldots, x_k) := P^{r-k} h, \qquad \pi_k h(x_1, \ldots, x_k) := (\delta_{x_1} - P) \times \cdots \times (\delta_{x_k} - P) \times P^{r-k} h. \tag{15}
\]
Clearly, $\pi_k h$ is degenerate of order $k$ with respect to the distribution $P$ in the sense of (16) below. For any $\iota = (i_1, \ldots, i_k) \in I_{n,k}$ and $J = (j_1, \ldots, j_\ell) \in I_{k,\ell}$, where $0 \le \ell \le k$, define $\iota_J := (i_{j_1}, \ldots, i_{j_\ell}) \in I_{n,\ell}$. Then
\[
\pi_k h(x_\iota) = \mathrm{E}_{X'}\left[\sum_{\ell=0}^k (-1)^{k-\ell} \sum_{J \in I_{k,\ell}} \widetilde{\pi}_k h(x_{\iota_J}, X'_{\iota \setminus \iota_J})\right] \quad \text{for all } \iota \in I_{n,k}.
\]
Further, the Hoeffding decomposition [19] for the $U$-statistic (with $\theta = 0$) is as follows:
\[
U_n = \frac{1}{|I_{n,r}|}\sum_{\iota \in I_{n,r}} h(X_\iota) = \sum_{k=1}^{r} \binom{n}{r}^{-1}\binom{n-k}{r-k} \sum_{\iota \in I_{n,k}} \pi_k h(X_\iota) = \sum_{k=1}^{r} \binom{r}{k}\binom{n}{k}^{-1} \sum_{\iota \in I_{n,k}} \pi_k h(X_\iota) =: \sum_{k=1}^{r} \binom{r}{k}\, U_n^{(k)}(\pi_k h).
\]
Finally, for any $1 \le k \le r$, define the envelope function
\[
F_k(x_1, \ldots, x_k) := \max_{1 \le j \le d} |\widetilde{\pi}_k h_j(x_1, \ldots, x_k)|.
\]
For each integer $k$, consider a symmetric kernel $f : S^k \to \mathbb{R}^d$. We say that $f$ is degenerate of order $k$ with respect to the distribution $P$ if
\[
\mathrm{E}_{X_1}[f_j(X_1, X_2, \ldots, X_k)] = 0 \ \text{a.s.}, \quad \text{for any } 1 \le j \le d. \tag{16}
\]
The following result is essentially due to [27, Section 3, symmetrization inequality] in the $U$-process setting. We provide a self-contained (and perhaps more transparent) proof for completeness.

Theorem 5.2 (Symmetrization inequality). Assume that (16) holds. Then
\[
\mathrm{E}\left[\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} f_j(X_{i_1}, \ldots, X_{i_k})\right|\right] \le 2^k\, \mathrm{E}\left[\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} \epsilon_{i_1} \cdots \epsilon_{i_k}\, f_j(X_{i_1}, \ldots, X_{i_k})\right|\right].
\]
Remark 5.3.
In Theorem 5.2, the symmetrization costs a multiplicative factor of $2^k$ for a degenerate kernel of order $k$. The standard symmetrization argument for such degenerate $U$-statistics (cf. [13, Theorem 3.5.3]), together with the decoupling inequalities (cf. [13, Theorem 3.1.1]) in the literature, yields
\[
\mathrm{E}\left[\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} f_j(X_{i_1}, \ldots, X_{i_k})\right|\right] \le C_k\, \mathrm{E}\left[\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} \epsilon_{i_1} \cdots \epsilon_{i_k}\, f_j(X_{i_1}, \ldots, X_{i_k})\right|\right],
\]
where the constant $C_k$ grows super-exponentially in $k$. Since $2^k \ll C_k$, the improvement of the constant to exponential growth in $k$ turns out to be crucial for obtaining the maximal inequality for the IOUS in Theorem 5.1. The major component of the super-exponential behavior of $C_k$ is the step applying the decoupling inequality in [13, Theorem 3.1.1], which is valid for any (measurable) symmetric kernel. If the kernel $f$ is degenerate of order $k$, then symmetrization can be done directly without the decoupling inequality (cf. the proof of Theorem 5.2 below).

Proof of Theorem 5.2.
Define a new sequence of random variables $\{Z_i : 1 \le i \le n\}$:
\[
Z_i = X_i\, 1\{\epsilon_i = 1\} + X_i'\, 1\{\epsilon_i = -1\}.
\]
Further, for each $\iota = (i_1, \ldots, i_k) \in I_{n,k}$, define
\[
\widetilde{f}_{j,\iota} = 2^k\, \mathrm{E}_\epsilon[f_j(Z_{i_1}, \ldots, Z_{i_k})\, \epsilon_{i_1} \cdots \epsilon_{i_k}].
\]
Due to degeneracy, we have
\[
\mathrm{E}_{X'}\left[\widetilde{f}_{j,\iota}\right] = 2^k\, \mathrm{E}_\epsilon \mathrm{E}_{X'}[f_j(Z_{i_1}, \ldots, Z_{i_k})\, \epsilon_{i_1} \cdots \epsilon_{i_k}] = 2^k\, \mathrm{E}_\epsilon\left[f_j(X_{i_1}, \ldots, X_{i_k})\, 1\{\epsilon_{i_1} = 1, \ldots, \epsilon_{i_k} = 1\}\right] = f_j(X_{i_1}, \ldots, X_{i_k}),
\]
where the first and third equalities follow from the definitions and the Fubini theorem, and the second follows from the degeneracy. To wit, on the event that $\{\epsilon_{i_\ell} = -1\}$ for some $1 \le \ell \le k$,
\[
\mathrm{E}_{X'_{i_\ell}}\left[f_j(Z_{i_1}, \ldots, Z_{i_{\ell-1}}, X'_{i_\ell}, Z_{i_{\ell+1}}, \ldots, Z_{i_k})\, \epsilon_{i_1} \cdots \epsilon_{i_k}\right] = 0.
\]
The rest of the argument is standard: by Jensen's inequality,
\[
\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} f_j(X_{i_1}, \ldots, X_{i_k})\right| = \max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} \mathrm{E}_{X'}\left[\widetilde{f}_{j,\iota}\right]\right| \le 2^k\, \mathrm{E}_{\epsilon, X'}\left[\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} f_j(Z_{i_1}, \ldots, Z_{i_k})\, \epsilon_{i_1} \cdots \epsilon_{i_k}\right|\right].
\]
Since $(X_1, \ldots, X_n, \epsilon_1, \ldots, \epsilon_n)$ and $(Z_1, \ldots, Z_n, \epsilon_1, \ldots, \epsilon_n)$ have the same distribution, taking expectations on both sides completes the proof. $\Box$

We start with a lemma, whose proof is elementary and thus omitted. Recall the definition of $\widetilde{\psi}_\beta$ in Subsection 1.2.

Lemma 5.4.
For any $\beta > 0$, $\widetilde{\psi}_\beta(\cdot)$ is strictly increasing and convex, with $\widetilde{\psi}_\beta(0) = 0$. Further, for any $\beta > 0$,
\[
\widetilde{\psi}_\beta(x) \le e^{x^\beta} \le \widetilde{\psi}_\beta(x) + e^{1/\beta},
\]
and consequently
\[
\widetilde{\psi}_\beta^{-1}(m) \le \log^{1/\beta}\left(m + e^{1/\beta}\right).
\]
Now we state the maximal inequality with explicit constants.
Lemma 5.5.
Fix $\beta \in (0, 1]$. Consider a sequence of non-negative random variables $\{Z_j : 1 \le j \le d\}$, and assume that there exists some real number $\Delta > 0$ such that $\mathrm{E}[\widetilde{\psi}_\beta(Z_j/\Delta)] \le 2$ for $1 \le j \le d$. Then
\[
\mathrm{E}\left[\max_{1 \le j \le d} Z_j\right] \le \Delta\, \log^{1/\beta}\left(2d + e^{1/\beta}\right).
\]
Proof. By monotonicity and convexity,
\[
\widetilde{\psi}_\beta\left(\mathrm{E}\left[\max_{1 \le j \le d}(Z_j/\Delta)\right]\right) \le \mathrm{E}\left[\widetilde{\psi}_\beta\left(\Delta^{-1}\max_{1 \le j \le d} Z_j\right)\right] = \mathrm{E}\left[\max_{1 \le j \le d}\widetilde{\psi}_\beta(Z_j/\Delta)\right] \le \sum_{1 \le j \le d}\mathrm{E}\left[\widetilde{\psi}_\beta(Z_j/\Delta)\right] \le 2d.
\]
Then the proof is complete by Lemma 5.4. $\Box$
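Lemma 5.5 can be illustrated by simulation for $\beta = 1$, where $e^x - 1$ is already convex, so one may take $\widetilde{\psi}_1(x) = e^x - 1$. In the toy construction below (ours, not from the paper), $Z_j = (2/3)\Delta E_j$ with $E_j \sim \mathrm{Exp}(1)$ i.i.d., so that $\mathrm{E}[e^{Z_j/\Delta}] = 3$, i.e., $\mathrm{E}[\widetilde{\psi}_1(Z_j/\Delta)] = 2$ and the hypothesis of the lemma holds with equality:

```python
import numpy as np

rng = np.random.default_rng(5)
d, Delta, reps = 50, 1.0, 20000

# Z_j = (2/3) * Delta * E_j, E_j ~ Exp(1): E[exp(Z_j / Delta)] = 1/(1 - 2/3) = 3,
# so E[psi~_1(Z_j / Delta)] = 2, matching the hypothesis of Lemma 5.5 (beta = 1).
Z = (2.0 / 3.0) * Delta * rng.exponential(size=(reps, d))
lhs = Z.max(axis=1).mean()                 # Monte Carlo estimate of E[max_j Z_j]
rhs = Delta * np.log(2 * d + np.e)         # bound: Delta * log(2d + e^{1/beta})
print(lhs, rhs)
assert lhs <= rhs
```

The Monte Carlo mean of $\max_j Z_j$ comes out near $3$, comfortably below the bound $\Delta\log(2d + e) \approx 4.6$.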
The goal is to establish an exponential moment bound (i.e., a Bonami-type inequality) for Rademacher chaos of order $k$. Based on the well-known hypercontractivity of Rademacher chaos variables in the literature (cf. [13, Corollary 3.2.6]), our Lemma 5.6 below provides an exponential moment bound with an explicit dependence on the order.
Lemma 5.6 (Exponential moment of Rademacher chaos). Fix $k \ge 2$, let $\beta = 2/k$, and let $\{x_\iota : \iota \in I_{n,k}\}$ be a collection of real numbers. Consider the following homogeneous chaos of order $k$:
\[
Z = \sum_{\iota \in I_{n,k}} x_\iota\, \epsilon_{i_1} \cdots \epsilon_{i_k},
\]
where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables. Then
\[
\mathrm{E}\left[\widetilde{\psi}_\beta(|Z|/\Delta_n)\right] \le 2, \quad \text{where } \Delta_n = 7^{k/2}\sqrt{\sum_{\iota \in I_{n,k}} x_\iota^2}.
\]
Proof. Denote $\kappa = \sqrt{\mathrm{E}[Z^2]}$ and $c = 7$, so that $\Delta_n = c^{k/2}\kappa$. Observe that $\beta k = 2$. From [13, Theorem 3.2.2], we have for any $q \ge 1$,
\[
\mathrm{E}|Z|^q \le \left(q^{q/\beta} \vee 1\right)\kappa^q \le \left(q^{q/\beta} + 1\right)\kappa^q.
\]
Here, the first inequality clearly holds for $q \le 2$, and we use [13, Theorem 3.2.2] for $q > 2$. Then, using the fact that $e^{|x|} \le \sum_{\ell=0}^\infty |x|^\ell/\ell!$ and Lemma 5.4, we have
\[
\mathrm{E}\,\widetilde{\psi}_\beta(|Z|/\Delta_n) \le \mathrm{E}\exp\left((|Z|/\Delta_n)^\beta\right) \le \sum_{\ell=1}^\infty \frac{\mathrm{E}|Z|^{\beta\ell}}{\ell!\,\Delta_n^{\beta\ell}} + 1 \le \sum_{\ell=1}^\infty \frac{(\beta\ell)^\ell\,\kappa^{\beta\ell}}{\ell!\,\Delta_n^{\beta\ell}} + \sum_{\ell=0}^\infty \frac{\kappa^{\beta\ell}}{\ell!\,\Delta_n^{\beta\ell}} = \sum_{\ell=1}^\infty \frac{\beta^\ell \ell^\ell}{\ell!\, c^\ell} + \sum_{\ell=0}^\infty \frac{1}{\ell!\, c^\ell}.
\]
Using the fact that $\ell^\ell \le e^\ell\, \ell!$ and $\beta = 2/k \le 1$, we have
\[
\mathrm{E}\,\widetilde{\psi}_\beta(|Z|/\Delta_n) \le \sum_{\ell=1}^\infty \left(\frac{\beta e}{c}\right)^\ell + \sum_{\ell=0}^\infty \frac{1}{\ell!\, c^\ell} \le \sum_{\ell=1}^\infty \left(\frac{e}{c}\right)^\ell + \sum_{\ell=0}^\infty \frac{1}{\ell!\, c^\ell}.
\]
Since $c = 7 > e$, we have
\[
\mathrm{E}\,\widetilde{\psi}_\beta(|Z|/\Delta_n) \le \frac{e}{c - e} + e^{1/c} < 2,
\]
which completes the proof. $\Box$

Now we are in position to prove Theorem 5.1. Recall that we assume $\theta = 0$. First, for each $2 \le k \le r$ and $1 \le j \le d$, define
\[
Z_{k,j} = \mathrm{E}_\epsilon\left[\left|\sum_{\iota \in I_{n,k}} \epsilon_{i_1} \cdots \epsilon_{i_k}\, \pi_k h_j(X_{i_1}, \ldots, X_{i_k})\right|\right],
\]
where $\pi_k h$ is defined in (15), and $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables. Define
\[
\Delta_{k,j}^2 = \sum_{\iota \in I_{n,k}} (\pi_k h_j(X_\iota))^2 = \sum_{\iota \in I_{n,k}} \left(\mathrm{E}_{X'}\left[\sum_{\ell=0}^k (-1)^{k-\ell}\sum_{J \in I_{k,\ell}} \widetilde{\pi}_k h_j(X_{\iota_J}, X'_{\iota \setminus \iota_J})\right]\right)^2.
\]
By Jensen's inequality and the fact that $(\sum_{i=1}^m z_i)^2 \le m\sum_{i=1}^m z_i^2$, we have for any $1 \le j \le d$,
\[
\Delta_{k,j}^2 \le 2^k\, \mathrm{E}_{X'}\left[\sum_{\iota \in I_{n,k}}\sum_{\ell=0}^k\sum_{J \in I_{k,\ell}} \left(\widetilde{\pi}_k h_j(X_{\iota_J}, X'_{\iota \setminus \iota_J})\right)^2\right] \le 2^k\, \mathrm{E}_{X'}\left[\sum_{\iota \in I_{n,k}}\sum_{\ell=0}^k\sum_{J \in I_{k,\ell}} F_k^2(X_{\iota_J}, X'_{\iota \setminus \iota_J})\right].
\]
Then by Lemma 5.6,
\[
\mathrm{E}_\epsilon\left[\widetilde{\psi}_{2/k}\left(\frac{Z_{k,j}}{7^{k/2}\Delta_{k,j}}\right)\right] \le 2.
\]
Further, by Lemma 5.5 with $\beta = 2/k$, we have
\[
\mathrm{E}_\epsilon\left[\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} \epsilon_{i_1} \cdots \epsilon_{i_k}\, \pi_k h_j(X_{i_1}, \ldots, X_{i_k})\right|\right] \le 7^{k/2}\max_{1 \le j \le d}(\Delta_{k,j})\log^{k/2}(2d + e^{k/2}) \le 14^{k/2}\log^{k/2}(2d + e^{k/2})\sqrt{\mathrm{E}_{X'}\left[\sum_{\iota \in I_{n,k}}\sum_{\ell=0}^k\sum_{J \in I_{k,\ell}} F_k^2(X_{\iota_J}, X'_{\iota \setminus \iota_J})\right]}.
\]
Then by Theorem 5.2 and Jensen's inequality, we have
\[
\mathrm{E}\left[\max_{1 \le j \le d}\left|\sum_{\iota \in I_{n,k}} \pi_k h_j(X_\iota)\right|\right] \le 2^k \cdot 14^{k/2}\log^{k/2}(2d + e^{k/2})\, \mathrm{E}\sqrt{\mathrm{E}_{X'}\left[\sum_{\iota \in I_{n,k}}\sum_{\ell=0}^k\sum_{J \in I_{k,\ell}} F_k^2(X_{\iota_J}, X'_{\iota \setminus \iota_J})\right]} \le 56^{k/2}\log^{k/2}(2d + e^{k/2})\sqrt{\binom{n}{k} 2^k\, \mathrm{E}[F_k^2(X_1, \ldots, X_k)]}.
\]
Now we bound $\mathrm{E}[F_k^2(X_1, \ldots, X_k)]$. By the definition of $\widetilde{\pi}_k h_j$, condition (C3), Lemma 5.4 and Jensen's inequality, we have
\[
\mathrm{E}\left[\widetilde{\psi}_q\left(|\widetilde{\pi}_k h_j(X_1, \ldots, X_k)|/D_n\right)\right] = \mathrm{E}\left[\widetilde{\psi}_q\left(\left|\mathrm{E}_{X'}[h_j(X_1, \ldots, X_k, X'_{k+1}, \ldots, X'_r)]\right|/D_n\right)\right] \le \mathrm{E}\left[\widetilde{\psi}_q\left(|h_j(X_1, \ldots, X_k, X'_{k+1}, \ldots, X'_r)|/D_n\right)\right] \le \mathrm{E}\left[\psi_q(|h_j(X_1, \ldots, X_r)|/D_n)\right] + 1 \le 2.
\]
Since $\widetilde{\psi}_q(0) = 0$, by Jensen's inequality we have $\|\widetilde{\pi}_k h_j(X_1, \ldots, X_k)\|_{\widetilde{\psi}_q} \le 2 D_n$. Then by the standard maximal inequality (e.g., see [29, Lemma 2.2.2]), there exists a constant $C$, depending only on $q$, such that for $1 \le k \le r$,
\[
\sqrt{\mathrm{E}|F_k(X_1, \ldots, X_k)|^2} \le C\log^{1/q}(d)\, D_n.
\]
Thus we obtain
\[
\mathrm{E}\left[\max_{1 \le j \le d}\left|U_{n,j} - \frac{r}{n}\sum_{i=1}^n g_j(X_i)\right|\right] \le \sum_{k=2}^{r}\binom{r}{k}\,\mathrm{E}\left[\max_{1 \le j \le d}\left|U_n^{(k)}(\pi_k h_j)\right|\right] \le \sum_{k=2}^{r}\frac{\binom{r}{k}}{\sqrt{\binom{n}{k}}}\,(112)^{k/2}\log^{k/2}(2d + e^{k/2})\sqrt{\mathrm{E} F_k^2(X_1, \ldots, X_k)} \le C\log^{1/q}(d)\, D_n \sum_{k=2}^{r}\frac{\binom{r}{k}}{\sqrt{\binom{n}{k}}}\,(112)^{k/2}\log^{k/2}(2d + e^{k/2}).
\]
Observe that if $r^2 \le n$, we have for any $1 \le i \le r$,
\[
\frac{r-i}{\sqrt{n-i}} \le \frac{r}{\sqrt{n}} \quad \Longrightarrow \quad \frac{\binom{r}{k}}{\sqrt{\binom{n}{k}}} \le \frac{1}{\sqrt{k!}}\left(\frac{r^2}{n}\right)^{k/2}.
\]
Further, for any $x, y \ge 2$, $\log^{k/2}(x+y) \le 2^{k/2}\left(\log^{k/2}(x) + \log^{k/2}(y)\right)$, and $\log^{k/2}(e^{k/2}) = (k/2)^{k/2}$. Now, take the constant $c$ in the assumption $r^2\log(d)/n \le c$ small enough (depending only on $q$) that the series below are dominated by geometric series with ratio at most $1/2$. Then
\[
\mathrm{E}\left[\max_{1 \le j \le d}\left|U_{n,j} - \frac{r}{n}\sum_{i=1}^n g_j(X_i)\right|\right] \le C\log^{1/q}(d)\, D_n \sum_{k=2}^{r}\left(\frac{C' r^2}{n}\right)^{k/2}\left(\log^{k/2}(2d) + \frac{(k/2)^{k/2}}{\sqrt{k!}}\right) =: I + II.
\]
For the first term, by the geometric series formula,
\[
I = C\log^{1/q}(d)\, D_n \sum_{k=2}^{r}\left(\frac{C' r^2\log(2d)}{n}\right)^{k/2} \le C\,\frac{r^2\log^{1+1/q}(d)\, D_n}{n}.
\]
For the second term, since $\ell^\ell \le e^\ell\,\ell!$ for any $\ell \ge 1$, we have
\[
II = C\log^{1/q}(d)\, D_n \sum_{k=2}^{r}\left(\frac{C' e\, r^2}{2n}\right)^{k/2} \le C\,\frac{r^2\log^{1/q}(d)\, D_n}{n},
\]
which completes the proof of Theorem 5.1. $\Box$

Appendix A: Proofs
A.1. Tail probabilities
In this section, we collect and prove some results on tail probabilities for sums of independent random vectors, $U$-statistics, and $U$-statistics with random kernels. For each type of statistic, we present two versions: one for non-negative random variables and one for the general case. These inequalities are used in bounding the effects due to sampling (Subsection A.4.3), and also in controlling the $\|\cdot\|_\infty$ distance between the bootstrap covariance matrices and their targets (Section A.5).

A.1.1. Tail probabilities for sums of independent random vectors
In this subsection, $m, n, d \ge 2$.

Lemma A.1. Let $Z_1, \ldots, Z_m$ be independent $\mathbb{R}^d$-valued random vectors and $\beta \in (0, 1]$. Assume that
\[
Z_{ij} \ge 0, \quad \|Z_{ij}\|_{\psi_\beta} \le u_n, \quad \text{for all } i = 1, \ldots, m,\ j = 1, \ldots, d.
\]
Then there exists some constant $C$ that depends only on $\beta$ such that
\[
P\left(\max_{1 \le j \le d}\sum_{i=1}^m Z_{ij} > C\left(\max_{1 \le j \le d}\mathrm{E}\left[\sum_{i=1}^m Z_{ij}\right] + u_n\log^{1/\beta}(dm)\left(\log(dm) + \log^{1/\beta}(n)\right)\right)\right) \le 1/n.
\]
Proof.
See Subsection A.7.1. $\Box$
Lemma A.2.
Let $Z_1, \ldots, Z_m$ be independent $\mathbb{R}^d$-valued random vectors and $\beta \in (0, 1]$. Assume that
\[
\mathrm{E}[Z_{ij}] = 0, \quad \|Z_{ij}\|_{\psi_\beta} \le u_n, \quad \text{for all } i = 1, \ldots, m,\ j = 1, \ldots, d.
\]
Then there exists some constant $C$ that depends only on $\beta$ such that
\[
P\left(\max_{1 \le j \le d}\left|\sum_{i=1}^m Z_{ij}\right| > C\left(\sigma\log^{1/2}(dn) + u_n\log^{1/\beta}(dm)\left(\log(dm) + \log^{1/\beta}(n)\right)\right)\right) \le 1/n,
\]
where $\sigma^2 := \max_{1 \le j \le d}\sum_{i=1}^m \mathrm{E}[Z_{ij}^2]$.

Proof. See Subsection A.7.2. $\Box$
Lemma A.3.
Let $Z_1, \ldots, Z_m$ be independent and identically distributed Bernoulli random variables with success probability $p_n$, i.e., $P(Z_i = 1) = 1 - P(Z_i = 0) = p_n$ for $1 \le i \le m$. Further, let $a_1, \ldots, a_m$ be deterministic vectors in $\mathbb{R}^d$. Then there exists an absolute constant $C$ such that
\[
P\left(\max_{1 \le j \le d}\left|\sum_{i=1}^m (Z_i - p_n)\, a_{ij}\right| > C\left(\sqrt{p_n(1 - p_n)}\,\sigma\log^{1/2}(dn) + M\log(dn)\right)\right) \le 1/n,
\]
where $\sigma^2 := \max_{1 \le j \le d}\sum_{i=1}^m a_{ij}^2$ and $M = \max_{1 \le i \le m,\, 1 \le j \le d}|a_{ij}|$.

Proof. See Subsection A.7.3. $\Box$
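The Bernoulli sampling controlled by Lemma A.3 is precisely how the incomplete $U$-statistic is generated: each tuple $\iota \in I_{n,r}$ is retained independently with probability $p_n$, so that the expected computational budget is $N = p_n |I_{n,r}|$. A toy sketch (our construction; the pairwise-maximum kernel and $p = 0.5$ are illustrative assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, r = 100, 2
X = rng.uniform(size=n)
h = np.array([max(X[i], X[j]) for i, j in combinations(range(n), r)])

# Complete U-statistic versus its Bernoulli-sampled incomplete version;
# keep each tuple independently with probability p.
p = 0.5
keep = rng.random(h.size) < p
U_complete = h.mean()
U_incomplete = h[keep].mean()
N_hat = int(keep.sum())                   # realized number of kept tuples
print(U_complete, U_incomplete, N_hat)
assert N_hat > 0 and abs(U_incomplete - U_complete) < 0.05
```

Conditionally on the data, $\widehat{U}_n - U_n$ is exactly a centered Bernoulli-weighted sum of the deterministic values $h(X_\iota)$, which is the object Lemma A.3 bounds.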
A.1.2. Tail probabilities for $U$-statistics

Lemma A.4.
Let $X_1, \ldots, X_n$ be i.i.d. random variables taking values in $(S, \mathcal{S})$, and fix $\beta \in (0, 1]$. Let $f : (S^r, \mathcal{S}^r) \to \mathbb{R}^d$ be a measurable, symmetric function such that for all $j = 1, \ldots, d$,
\[
f_j(X_1, \ldots, X_r) \ge 0 \ \text{a.s.}, \quad \mathrm{E}[f_j(X_1, \ldots, X_r)] \le v_n, \quad \|f_j(X_1, \ldots, X_r)\|_{\psi_\beta} \le u_n.
\]
Define $U_n := |I_{n,r}|^{-1}\sum_{\iota \in I_{n,r}} f(X_\iota)$. Then there exists a constant $C$ that depends only on $\beta$ such that
\[
P\left(\max_{1 \le j \le d} U_{n,j} > C\left(v_n + n^{-1} r\log^{1/\beta + 1}(dn)\log^{1/\beta - 1}(n)\, u_n\right)\right) \le \frac{2}{n}.
\]
Clearly, we can replace $v_n$ by $u_n$.

Proof. See Subsection A.7.4. $\Box$
Lemma A.5.
Let $X_1, \ldots, X_n$ be i.i.d. random variables taking values in $(S, \mathcal{S})$, and fix $\beta \in (0, 1]$. Let $f : (S^r, \mathcal{S}^r) \to \mathbb{R}^d$ be a measurable, symmetric function such that
\[
\mathrm{E}[f_j(X_1, \ldots, X_r)] = 0, \quad \|f_j(X_1, \ldots, X_r)\|_{\psi_\beta} \le u_n \quad \text{for all } j = 1, \ldots, d.
\]
Define $U_n := |I_{n,r}|^{-1}\sum_{\iota \in I_{n,r}} f(X_\iota)$ and $\sigma^2 := \max_{1 \le j \le d}\mathrm{E}[f_j^2(X_1^r)]$. Then there exists a constant $C$ that depends only on $\beta$ such that
\[
P\left(\max_{1 \le j \le d}|U_{n,j}| > C\left(n^{-1/2} r^{1/2}\log^{1/2}(dn)\,\sigma + n^{-1} r\log^{1/\beta + 1}(dn)\log^{1/\beta - 1}(n)\, u_n\right)\right) \le \frac{2}{n}.
\]
Clearly, we can replace $\sigma$ by $u_n$.

Proof. See Subsection A.7.5. $\Box$

A.1.3. Tail probabilities for $U$-statistics with random kernels

Let $X_1, \ldots, X_n$ be i.i.d. random variables taking values in $(S, \mathcal{S})$, and let $W, \{W_\iota : \iota \in I_{n,r}\}$ be i.i.d. random variables taking values in $(S', \mathcal{S}')$ that are independent of $X_1^n$. In this subsection, we consider a measurable function $F : S^r \times S' \to \mathbb{R}^d$ that is symmetric in its first $r$ arguments, and fix some $\beta \in (0, 1]$. Define
\[
f(x_1, \ldots, x_r) := \mathrm{E}[F(x_1, \ldots, x_r, W)], \qquad b_j(x_1, \ldots, x_r) := \|F_j(x_1, \ldots, x_r, W) - f_j(x_1, \ldots, x_r)\|_{\psi_\beta} \quad \text{for all } j = 1, \ldots, d.
\]
We first consider non-negative random kernels.
Lemma A.6.
Consider $Z := \max_{1 \le j \le d} |I_{n,r}|^{-1}\sum_{\iota \in I_{n,r}} F_j(X_\iota, W_\iota)$. Assume that $F_j(\cdot) \ge 0$ for all $j = 1, \ldots, d$, and that there exists $u_n > 0$ such that
\[
\|b_j(X_1^r)\|_{\psi_\beta} \le u_n, \quad \|f_j(X_1^r)\|_{\psi_\beta} \le u_n, \quad \text{for all } j = 1, \ldots, d.
\]
Then there exists some constant $C$ that depends only on $\beta$ such that with probability at least $1 - 3/n$,
\[
Z \le C\max_{1 \le j \le d}\mathrm{E}[f_j(X_1^r)] + C n^{-1} r\log^{1/\beta + 1}(dn)\log^{1/\beta - 1}(n)\, u_n + C |I_{n,r}|^{-1} r^{1/\beta}\log^{1/\beta + 1}(dn)\log^{1/\beta - 1}(n)\, u_n.
\]
Proof.
See Subsection A.7.6. $\Box$
Next, we consider centered random kernels.
Lemma A.7.
Consider $Z := \max_{1 \le j \le d}\left||I_{n,r}|^{-1}\sum_{\iota \in I_{n,r}}\left(F_j(X_\iota, W_\iota) - f_j(X_\iota)\right)\right|$. Assume that there exists $u_n > 0$ such that for all $j = 1, \ldots, d$, $\|b_j(X_1, \ldots, X_r)\|_{\psi_\beta} \le u_n$. Then there exists some constant $C$ that depends only on $\beta$ such that with probability at least $1 - 3/n$,
\[
Z \le C u_n\, |I_{n,r}|^{-1/2} r^{1/2}\log^{1/2}(dn)\left(1 + n^{-1/2} r^{1/2}\log^{1/\beta + 1/2}(dn)\log^{1/\beta - 1/2}(n)\right) + C u_n\, |I_{n,r}|^{-1} r^{1/\beta}\log^{1/\beta + 1}(dn)\log^{1/\beta - 1}(n).
\]
Proof.
See Subsection A.7.7. $\Box$
A.2. Additional lemmas
The following lemma concerns Gaussian approximation for sums of independent random vectors. It replaces the $\|\cdot\|_{\psi_1}$ condition in Proposition 2.1 of [10] by a $\|\cdot\|_{\psi_q}$ condition.

Lemma A.8. Let $Z_1, \ldots, Z_n$ be independent $\mathbb{R}^d$-valued random vectors. Assume that for some absolute constant $\underline{\sigma} > 0$ and some $q > 0$,
\[
n^{-1}\sum_{i=1}^n \mathrm{E}[Z_{ij}^2] \ge \underline{\sigma}^2, \qquad n^{-1}\sum_{i=1}^n \mathrm{E}[|Z_{ij}|^{2+k}] \le D_n^k \quad \text{for } j = 1, \ldots, d,\ k = 1, 2,
\]
\[
\|Z_{ij}\|_{\psi_q} \le D_n, \quad \text{for } i = 1, \ldots, n,\ j = 1, \ldots, d.
\]
Then there exists some constant $C$ that depends only on $q$ and $\underline{\sigma}$ such that
\[
\rho\left(n^{-1/2}\sum_{i=1}^n (Z_i - \mathrm{E}[Z_i]),\ Y\right) \le C\left(\frac{D_n^2\log^{q^*}(dn)}{n}\right)^{1/6},
\]
where $q^* = (6/q + 1) \vee 4$, $Y \sim N(0, \Sigma)$, and $\Sigma := n^{-1}\sum_{i=1}^n \mathrm{E}[Z_i Z_i^\top]$.

Proof. See Subsection A.8. $\Box$
The following lemmas are elementary, but used repeatedly.
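Lemmas A.9 to A.11 involve the Orlicz quasi-norm $\|X\|_{\psi_\beta} = \inf\{t > 0 : \mathrm{E}\exp((|X|/t)^\beta) \le 2\}$. The identity of Lemma A.10 below can be checked numerically for a finitely supported $X$; the bisection routine is our illustrative implementation, not from the paper:

```python
import numpy as np

def psi_norm(vals, probs, beta):
    """||X||_{psi_beta} = inf{t > 0 : E exp((|X|/t)^beta) <= 2},
    found by bisection for a finitely supported X."""
    def ok(t):
        return np.sum(probs * np.exp((np.abs(vals) / t) ** beta)) <= 2.0
    lo, hi = 0.5, 1e3          # ok(lo) is False and ok(hi) is True here
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

vals, probs = np.array([1.0, 2.0]), np.array([0.5, 0.5])
# Lemma A.10 with k = 2, beta = 1: || X^2 ||_{psi_1} = ||X||^2_{psi_2}.
lhs = psi_norm(vals ** 2, probs, 1.0)
rhs = psi_norm(vals, probs, 2.0) ** 2
print(lhs, rhs)
assert abs(lhs - rhs) < 1e-6
```

The equality is exact here because $\exp((x^2/t)^1) = \exp((x/\sqrt{t})^2)$, which is the change of variables behind Lemma A.10.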
Lemma A.9.
Let $\beta > 1$. There exists a constant $C$, depending only on $\beta$, such that for any positive integers $r, n$ with $r \le \sqrt{n}$,
\[
n^r \le C\,\beta^{r^2}\,|I_{n,r}|.
\]
Proof. Fix $\beta > 1$. As $r \to \infty$ (with $n \ge r^2$), $n^r\beta^{-r^2}/|I_{n,r}| \to 0$. Thus there exists $M$ such that if $r > M$, then $n^r \le \beta^{r^2}|I_{n,r}|$. For $r \le M$, the inequality holds with a constant $C$ depending only on $M$, and hence only on $\beta$. $\Box$
Let $\beta, k > 0$. For any random variable $X$,
\[
\||X|^k\|_{\psi_\beta} = \|X\|^k_{\psi_{k\beta}}.
\]
Proof. Observe that
\[
\mathrm{E}\left[\exp\left(\left(|X|^k/\|X\|^k_{\psi_{k\beta}}\right)^\beta\right)\right] = \mathrm{E}\left[\exp\left(\left(|X|/\|X\|_{\psi_{k\beta}}\right)^{k\beta}\right)\right] \le 2,
\]
which implies that $\||X|^k\|_{\psi_\beta} \le \|X\|^k_{\psi_{k\beta}}$. The reverse direction is similar. $\Box$

For $\beta < 1$, $\|\cdot\|_{\psi_\beta}$ is not a norm, but the usual triangle inequality and maximal inequality hold up to a multiplicative constant.

Lemma A.11.
Fix $\beta \in (0, 1)$.
(i) For any random variables $X$ and $Y$, $\|X + Y\|_{\psi_\beta} \le 2^{1/\beta}\left(\|X\|_{\psi_\beta} + \|Y\|_{\psi_\beta}\right)$.
(ii) Let $\xi_1, \ldots, \xi_n$ be a sequence of random variables such that $\|\xi_i\|_{\psi_\beta} \le D$ for $1 \le i \le n$, with $n \ge 2$. Then there exists a constant $C$ depending only on $\beta$ such that
\[
\left\|\max_{1 \le i \le n}\xi_i\right\|_{\psi_\beta} \le C\log^{1/\beta}(n)\, D.
\]
Proof.
See Subsection A.8. $\Box$

A.3. Proofs in Section 2.1
We first prove Corollary 2.2 and then prove Theorem 2.1.
Proof of Corollary 2.2.
Let $c$ be the constant in Theorem 5.1. Without loss of generality, we assume
\[
\frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n} \le c, \quad \text{and} \quad \theta = 0, \tag{17}
\]
since $\rho(\cdot, \cdot) \le 1$ and we can always consider $h(\cdot) - \theta$ instead. Recall that $q^* = (6/q + 1) \vee 4$. Fix any rectangle $R = [a, b] \in \mathcal{R}$, where $a, b \in \mathbb{R}^d$ and $a \le b$. Define
\[
\tilde{a} = r^{-1}\Lambda_g^{-1/2} a, \quad \tilde{b} = r^{-1}\Lambda_g^{-1/2} b, \quad \tilde{U}_n = r^{-1}\Lambda_g^{-1/2} U_n, \quad \tilde{G}_i = \Lambda_g^{-1/2} g(X_i).
\]
Denote
\[
\xi_n := \max_{1 \le j \le d}\left|\tilde{U}_{n,j} - \frac{1}{n}\sum_{i=1}^n \tilde{G}_{i,j}\right|.
\]
Then by Theorem 5.1,
\[
\mathrm{E}[\xi_n] \le r^{-1}\sigma_g^{-1}\,\mathrm{E}\left[\max_{1 \le j \le d}\left|U_{n,j} - \frac{r}{n}\sum_{i=1}^n g_j(X_i)\right|\right] \lesssim \sigma_g^{-1} n^{-1} r\log^{1+1/q}(d)\, D_n.
\]
For any $t > 0$, by the Markov inequality and the definitions,
\[
P(\sqrt{n} U_n \in R) = P\left(-\sqrt{n}\tilde{U}_n \le -\tilde{a},\ \sqrt{n}\tilde{U}_n \le \tilde{b}\right) \le P\left(-n^{-1/2}\sum_{i=1}^n \tilde{G}_i \le -\tilde{a} + t,\ n^{-1/2}\sum_{i=1}^n \tilde{G}_i \le \tilde{b} + t\right) + C t^{-1}\sigma_g^{-1} n^{-1/2} r\log^{1+1/q}(d)\, D_n.
\]
Due to assumptions (C2), (C3) and the Cauchy-Schwarz inequality,
\[
\mathrm{E}[\tilde{G}_{i,j}^2] = 1, \quad \mathrm{E}[\tilde{G}_{i,j}^4] \le (\sigma_{g,j}^{-1} D_n)^2 \le (\sigma_g^{-1} D_n)^2, \quad \mathrm{E}[|\tilde{G}_{i,j}|^3] \le \sqrt{\mathrm{E}[\tilde{G}_{i,j}^2]\,\mathrm{E}[\tilde{G}_{i,j}^4]} \le \sigma_g^{-1} D_n, \quad \|\tilde{G}_{i,j}\|_{\psi_q} \le \sigma_{g,j}^{-1} D_n \le \sigma_g^{-1} D_n,
\]
for $1 \le i \le n$ and $1 \le j \le d$. Then, due to Lemma A.8, we have
\[
P(\sqrt{n} U_n \in R) \le P\left(-\Lambda_g^{-1/2} Y^A \le -\tilde{a} + t,\ \Lambda_g^{-1/2} Y^A \le \tilde{b} + t\right) + C\left(\sigma_g^{-2} n^{-1} D_n^2\log^{q^*}(dn)\right)^{1/6} + C t^{-1}\sigma_g^{-1} n^{-1/2} r\log^{1+1/q}(d)\, D_n.
\]
Further, by the anti-concentration inequality [10, Lemma A.1],
\[
P(\sqrt{n} U_n \in R) \le P\left(-\Lambda_g^{-1/2} Y^A \le -\tilde{a},\ \Lambda_g^{-1/2} Y^A \le \tilde{b}\right) + C t\sqrt{\log(d)} + C\left(\sigma_g^{-2} n^{-1} D_n^2\log^{q^*}(dn)\right)^{1/6} + C t^{-1}\sigma_g^{-1} n^{-1/2} r\log^{1+1/q}(d)\, D_n.
\]
Finally, taking $t = \left(\sigma_g^{-1} n^{-1/2} r\log^{1/2+1/q}(d)\, D_n\right)^{1/2}$ and using convention (17), we have
\[
P(\sqrt{n} U_n \in R) \le P(r Y^A \in R) + C\left(\frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n}\right)^{1/6}.
\]
Likewise, we can show the lower inequality
\[
P(\sqrt{n} U_n \in R) \ge P(r Y^A \in R) - C\left(\frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n}\right)^{1/6},
\]
which completes the proof. $\Box$

Proof of Theorem 2.1.
As before, without loss of generality, we assume
\[
\theta = 0, \quad \text{and} \quad \frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n} \le c_1, \tag{18}
\]
for some sufficiently small $c_1 \in (0, 1)$. For $\iota = (i_1, \ldots, i_r) \in I_{n,r}$, define
\[
H_\iota := H(X_{i_1}, \ldots, X_{i_r}, W_\iota) - h(X_{i_1}, \ldots, X_{i_r}) := H(X_\iota, W_\iota) - h(X_\iota).
\]
Then by definition, $\widehat{U}_n = R_n + U_n$, where $R_n := |I_{n,r}|^{-1}\sum_{\iota \in I_{n,r}} H_\iota$.

Step 1. We first show that
\[
\mathrm{E}\left[\max_{1 \le j \le d}|R_{n,j}|\right] \lesssim \frac{D_n\log^{1/2 + 1/q}(dn)}{n}. \tag{19}
\]
Note that conditional on $X_1^n$, $R_n$ is an average of independent random vectors. Thus by [9, Lemma 8],
\[
\mathrm{E}_{|X_1^n}\left[\max_{1 \le j \le d}|I_{n,r}|\,|R_{n,j}|\right] \lesssim \sqrt{\log(d)\max_{1 \le j \le d}\sum_{\iota \in I_{n,r}}\mathrm{E}_{|X_1^n}\left[H_{\iota,j}^2\right]} + \log(d)\sqrt{\mathrm{E}_{|X_1^n}\left[\max_{\iota \in I_{n,r}}\max_{1 \le j \le d} H_{\iota,j}^2\right]}.
\]
By definition (6) and the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11),
\[
\mathrm{E}_{|X_1^n}\left[H_{\iota,j}^2\right] \le B_{n,j}^2(X_\iota) \ \text{for all } \iota \in I_{n,r}, \qquad \sqrt{\mathrm{E}_{|X_1^n}\left[\max_{\iota \in I_{n,r}}\max_{1 \le j \le d} H_{\iota,j}^2\right]} \lesssim r^{1/q}\log^{1/q}(dn)\max_{\iota \in I_{n,r}}\max_{1 \le j \le d} B_{n,j}(X_\iota).
\]
Define
\[
Z := \max_{1 \le j \le d}\frac{1}{|I_{n,r}|}\sum_{\iota \in I_{n,r}} B_{n,j}^2(X_\iota), \qquad M := \max_{\iota \in I_{n,r}}\max_{1 \le j \le d} B_{n,j}(X_\iota).
\]
Under assumption (C4) and, again, the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11), we have $\|M\|_{\psi_q} \lesssim r^{1/q}\log^{1/q}(dn)\, D_n$. Then, using $\mathrm{E}[Z] \le \mathrm{E}[M^2] \lesssim \|M\|_{\psi_q}^2$, we have
\[
\mathrm{E}\left[\max_{1 \le j \le d}|R_{n,j}|\right] \lesssim \sqrt{\frac{\log(d)}{|I_{n,r}|}}\,\sqrt{\mathrm{E}[Z]} + \frac{r^{1/q}\log^{1/q}(dn)}{|I_{n,r}|}\,\mathrm{E}[M] \lesssim \left(\sqrt{\frac{\log(d)}{|I_{n,r}|}} + \frac{r^{1/q}\log^{1/q}(dn)}{|I_{n,r}|}\right) r^{1/q}\log^{1/q}(dn)\, D_n.
\]
Then, due to Lemma A.9 and (18), (19) follows.

Step 2. We finish the proof by an argument similar to the proof of Corollary 2.2. Fix any rectangle $R = [a, b] \in \mathcal{R}$, where $a, b \in \mathbb{R}^d$ and $a \le b$. Define
\[
\tilde{a} = r^{-1}\Lambda_g^{-1/2} a, \quad \tilde{b} = r^{-1}\Lambda_g^{-1/2} b, \quad \tilde{Y}^A = \Lambda_g^{-1/2} Y^A,
\]
where we recall that $\Lambda_g$ is defined in (5). Recall that $\widehat{U}_n = U_n + R_n$. For any $t > 0$, by the Markov inequality, the result from Step 1, and Corollary 2.2,
\[
P(\sqrt{n}\widehat{U}_n \in R) \le P\left(-\sqrt{n} U_n \le -a + t,\ \sqrt{n} U_n \le b + t\right) + C t^{-1} n^{-1/2} D_n\log^{1/2+1/q}(dn) \le P\left(-r Y^A \le -a + t,\ r Y^A \le b + t\right) + C\left(\frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n}\right)^{1/6} + C t^{-1} n^{-1/2} D_n\log^{1/2+1/q}(dn).
\]
Observe that $\mathrm{E}[\tilde{Y}_{A,j}^2] = 1$ for $1 \le j \le d$. By the anti-concentration inequality [10, Lemma A.1],
\[
P(\sqrt{n}\widehat{U}_n \in R) \le P\left(-\tilde{Y}^A \le -\tilde{a},\ \tilde{Y}^A \le \tilde{b}\right) + C t\, r^{-1}\sigma_g^{-1}\sqrt{\log(d)} + C\left(\frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n}\right)^{1/6} + C t^{-1} n^{-1/2} D_n\log^{1/2+1/q}(dn).
\]
Finally, taking $t = \left(r\,\sigma_g\, n^{-1/2}\log^{1/q}(dn)\, D_n\right)^{1/2}$ and using convention (18), we have
\[
P(\sqrt{n}\widehat{U}_n \in R) \le P(r Y^A \in R) + C\left(\frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n}\right)^{1/6}.
\]
By a similar argument, we can show
\[
P(\sqrt{n}\widehat{U}_n \in R) \ge P(r Y^A \in R) - C\left(\frac{r^2 D_n^2\log^{q^*}(dn)}{\sigma_g^2\, n}\right)^{1/6},
\]
which completes the proof. $\Box$

A.4. Proofs in Section 2.2
In this subsection, without loss of generality, we assume \(\theta = 0\). Recall the definition of \(\Lambda_H\) in (5). Further, define a function \(\tilde H : S^r \times S' \to \mathbb{R}^d\) by \(\tilde H(x_1^r, w) = \Lambda_H^{-1/2} H(x_1^r, w)\) for any \(x_1^r \in S^r\), \(w \in S'\), and
\[
\Gamma_{\tilde H} := \operatorname{Cov}(\tilde H(X_1^r, W)) = \Lambda_H^{-1/2} \Gamma_H \Lambda_H^{-1/2}, \qquad
\widehat\Gamma_{\tilde H} := \frac{1}{|I_{n,r}|} \sum_{\iota \in I_{n,r}} \tilde H(X_\iota, W_\iota)\, \tilde H(X_\iota, W_\iota)^T. \tag{20}
\]
Clearly, if (C5) holds, then
\[
E \big| \tilde H_j(X_1^r, W) \big|^{2+k} \le (\sigma_{H,j}^{-1} D_n)^k \le (\sigma_H^{-1} D_n)^k, \qquad \text{for } 1 \le j \le d, \; k = 1, 2, \tag{21}
\]
where again we applied the Cauchy-Schwarz inequality for \(k = 1\).

A.4.1. Bounding \(\widehat N / N\)
The following lemma follows from an application of Bernstein’s inequality and is proved in Step 5 of the proof of [8, Theorem 3.1]. It is included here for easy reference.
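Lemma A.12, stated next, says that the realized number of sampled tuples \(\widehat N = \sum_\iota Z_\iota\) stays within a \(\sqrt{\log(n)/N}\) relative band around its mean \(N = p_n |I_{n,r}|\). A minimal simulation of this concentration; all sizes below are illustrative choices, not the paper's:

```python
import numpy as np

# Simulate N_hat = sum of |I_{n,r}| i.i.d. Bernoulli(p_n) sampling indicators
# and compare |N_hat/N - 1| with the sqrt(log(n)/N) scale of Lemma A.12.
# n, num_tuples, p_n are hypothetical sizes chosen for illustration.
rng = np.random.default_rng(42)
n = 200                      # sample size (hypothetical)
num_tuples = 50_000          # stands in for |I_{n,r}| (hypothetical)
p_n = 0.02                   # Bernoulli sampling rate
N = p_n * num_tuples         # expected number of sampled tuples

Z = rng.random(num_tuples) < p_n   # sampling indicators Z_iota
N_hat = int(Z.sum())

ratio_dev = abs(N_hat / N - 1.0)            # |N_hat/N - 1|
threshold = float(np.sqrt(np.log(n) / N))   # the lemma's deviation scale
```

With these sizes the relative deviation is typically a few percent, comfortably inside the \(\sqrt{\log(n)/N}\) band.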
Lemma A.12.
Assume \(\sqrt{\log(n)/N} \le 1/2\). Then
\[
P\Big( \big| \widehat N / N - 1 \big| > \sqrt{\log(n)/N} \Big) \le 2 n^{-1}, \qquad
P\Big( \big| N / \widehat N - 1 \big| > 2\sqrt{\log(n)/N} \Big) \le 2 n^{-1}.
\]

A.4.2. Bounding the normalized covariance estimator
Lemma A.13.
Assume (C3’) , (C4) and (C5) hold. Then there exists a constant C , depending only on q , such that with probability at least − /n , k b Γ ˜ H − Γ ˜ H k ∞ C (cid:16) σ − H n − / r / log / ( dn ) D n + σ − H n − r log /q +1 ( dn ) log /q − ( n ) D n (cid:17) + Cσ − H D n | I n,r | − / r / log / ( dn ) (cid:16) n − / r / log /q +1 / ( dn ) log /q − / ( n ) (cid:17) + Cσ − H D n | I n,r | − r /q log /q +1 ( dn ) log /q − ( n ) . Proof.
Define v ( x r ) := E [ ˜ H ( x r , W ) ˜ H ( x r , W ) T ] , b V := | I n,r | − P ι ∈ I n,r v ( X ι ).Observe that k b Γ ˜ H − Γ ˜ H k ∞ k b Γ ˜ H − b V k ∞ + k b V − Γ ˜ H k ∞ . We will bound these two terms separately.Step 0. We first make a few observations. Clearly, E [ v ( X r )] = Γ ˜ H , and for all1 j, k d , by Jensen’s inequality for conditional expectation and (21), E | v jk ( X r ) | E [ ˜ H j ( X r , W ) ˜ H k ( X r , W )] E [ ˜ H j ( X r , W )] + E [ ˜ H k ( X r , W )] . σ − H D n . (22)Further, by definition | v jk ( x r ) | E [ ˜ H j ( x r , W )] + E [ ˜ H k ( x r , W )] . σ − H (cid:0) B n,j ( x r ) + h j ( x r ) + B n,k ( x r ) + h k ( x r ) (cid:1) . As a result, by the assumptions (C4) and (C3’), and Lemma A.10,max j,k d k v jk ( X r ) k ψ q/ . σ − H max j d (cid:0) k B n,j ( X r ) k ψ q/ + k h j ( X r ) k ψ q/ (cid:1) = σ − H max j d (cid:16) k B n,j ( X r ) k ψ q + k h j ( X r ) k ψ q (cid:17) . ( σ − H D n ) . (23) Step 1 . We bound k b Γ ˜ H − b V k ∞ using Lemma A.7 with F = ˜ H ˜ H T and ψ q/ . For1 j, k d , define b jk ( x r ) := k ˜ H j ( x r , W ) ˜ H k ( x r , W ) − v jk ( x r ) k ψ q/ . Observe that due to Lemma A.10 and A.11, b jk ( x r ) . k ˜ H j ( x r , W ) k ψ q/ + k ˜ H k ( x r , W ) k ψ q/ + v jk ( x r )= k ˜ H j ( x r , W ) k ψ q + k ˜ H k ( x r , W ) k ψ q + v jk ( x r ) . σ − H ( h j ( x r ) + B n,j ( x r ) + h k ( x r ) + B n,k ( x r )) + v jk ( x r ) . . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics Then due to (23) and the assumptions (C4) and (C3’), k b jk ( X r ) k ψ q/ . ( σ − H D n ) , for all 1 j, k d. Now we apply Lemma A.7, with probability at least 1 − /n , k b Γ ˜ H − b V k ∞ . σ − H D n | I n,r | − / r / log / ( dn ) (cid:16) n − / r / log /q +1 / ( dn ) log /q − / ( n ) (cid:17) + σ − H D n | I n,r | − r /q log /q +1 ( dn ) log /q − ( n ) . Step 2 . We bound k b V − Γ ˜ H k ∞ using Lemma A.5 with ψ q/ . By (22) and (23),with probability at least 1 − /n , k b V − Γ ˜ H k ∞ . 
\( n^{-1/2} r^{1/2} \log^{1/2}(dn)\, \sigma_H^{-2} D_n^2 + n^{-1} r \log^{2/q+1}(dn) \log^{2/q-1}(n)\, \sigma_H^{-2} D_n^2 \). The proof is then complete by combining Steps 1 and 2. □
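Lemma A.13 controls \(\|\widehat\Gamma_{\tilde H} - \Gamma_{\tilde H}\|_\infty\), the sup-norm error of the covariance estimator built from kernel evaluations over all \(r\)-tuples. In the toy sketch below the kernel is the subsample mean scaled by \(\sqrt{r}\), for which the population covariance is exactly \(I_d\), so the sup-norm error is directly observable; the kernel and all sizes are our own illustrative choices:

```python
import itertools
import numpy as np

# Toy version of the covariance estimator hat{Gamma} in (20): kernel values
# over all r-tuples, outer products averaged. With the sqrt(r)-scaled
# subsample-mean kernel on i.i.d. N(0, I_d) data, Gamma = I_d exactly.
rng = np.random.default_rng(0)
n, r, d = 12, 3, 4
X = rng.normal(size=(n, d))

H_evals = np.array([np.sqrt(r) * X[list(iota)].mean(axis=0)
                    for iota in itertools.combinations(range(n), r)])
Gamma_hat = (H_evals[:, :, None] * H_evals[:, None, :]).mean(axis=0)
sup_err = float(np.abs(Gamma_hat - np.eye(d)).max())   # ||hat{Gamma} - Gamma||_inf
```

Because the \(r\)-tuples overlap, the effective sample size behind `Gamma_hat` is of order \(n/r\), not \(\binom{n}{r}\), which is why the error does not shrink just by enumerating more tuples.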
A.4.3. Bounding the effect of sampling
The following quantity will appear in the proof of Theorem 2.4:
\[
\sqrt{N}\, \zeta_n := \frac{1}{\sqrt{|I_{n,r}|}} \sum_{\iota \in I_{n,r}} \frac{Z_\iota - p_n}{\sqrt{p_n (1 - p_n)}}\, \tilde H(X_\iota, W_\iota) := \frac{1}{\sqrt{|I_{n,r}|}} \sum_{\iota \in I_{n,r}} \widetilde Z_\iota. \tag{24}
\]
The next lemma establishes a conditional Gaussian approximation for \(\sqrt{N}\, \zeta_n\).

Lemma A.14.
Suppose the assumptions in Theorem 2.4 hold. There exists aconstant C , depending on q , such that with probability at least − C/n , ρ R| X,W ( √ N ζ n , Λ − / H Y B ) := sup R ∈R (cid:12)(cid:12)(cid:12) P | X,W (cid:16) √ N ζ n ∈ R (cid:17) − P (Λ − / H Y B ∈ R ) (cid:12)(cid:12)(cid:12) C̟ n , where we recall that Y B ∼ N (0 , Γ H ) , and we abbreviate P | X,W for P | X n , { W ι : ι ∈ I n,r } .Proof. Consider conditionally independent (conditioned on
X, W ) R d -valuedrandom vectors { b Y ι : ι ∈ I n,r } such that b Y ι | X, W ∼ N (0 , ˜ H ( X ι , W ι ) ˜ H ( X ι , W ι ) T ) , b Y := | I n,r | − / X ι ∈ I n,r b Y ι . Clearly, b Y | X, W ∼ N (0 , b Γ ˜ H ). Further, define ρ R| X,W ( √ N ζ n , b Y ) := sup R ∈R (cid:12)(cid:12)(cid:12) P | X,W (cid:16) √ N ζ n ∈ R (cid:17) − P | X,W ( b Y ∈ R ) (cid:12)(cid:12)(cid:12) ,ρ R| X,W ( b Y , Λ − / H Y B ) := sup R ∈R (cid:12)(cid:12)(cid:12) P | X,W (cid:16) b Y ∈ R (cid:17) − P (Λ − / H Y B ∈ R ) (cid:12)(cid:12)(cid:12) . . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics By triangle inequality, it then suffices to show that each of the following eventshappens with probability at least 1 − C/n , ρ R| X,W ( √ N ζ n , b Y ) C̟ n , ρ R| X,W ( b Y , Λ − / H Y B ) C̟ n , (25)on which we now focus. Without loss of generality, since σ g
1, we assume r q D n log q ∗ ( dn ) σ g n ∧ N c , and r q D n log q ∗ ( dn ) n ∧ N c . (26)for some sufficiently small constant c ∈ (0 ,
1) that is to be determined. Recallthat q = 2 ∨ (2 /q ) and q ∗ = (6 /q + 1) ∨ Step 0 . By Lemma A.13 and A.9, P k b Γ ˜ H − Γ ˜ H k ∞ C r log ∨ (2 /q − ( dn ) D n σ H n ! / > − n . (27)In particular, since Γ ˜ H,jj = 1, if we take c small enough such that Cc / / P (cid:16) min j d b Γ ˜ H,jj > / (cid:17) > − /n . Step 1 . The goal is to show that the first event in (25), ρ R| X,W ( √ N ζ n , b Y ) C̟ n ,holds with probability at least 1 − C/n . Step 1.1.
Define b L n := max j d | I n,r | − X ι ∈ I n,r E | X,W h | e Z ι,j | i . (28)Further, c M n ( φ ) := c M n,X ( φ ) + c M n,Y ( φ ), where c M n,X ( φ ) := | I n,r | − X ι ∈ I n,r E | X,W " max j d | e Z ι,j | ; max j d | e Z ι,j | > p | I n,r | φ log d , c M n,Y ( φ ) := | I n,r | − X ι ∈ I n,r E | X,W " max j d | b Y ι,j | ; max j d | b Y ι,j | > p | I n,r | φ log d , (29)By Theorem 2.1 in [10], there exist absolute constants K and K such thatfor any real numbers L n and M n , we have ρ R| X,W ( √ N ζ n , b Y ) K L n log ( d ) | I n,r | ! / + M n L n with φ n := K L n log ( d ) | I n,r | ! − / , on the event E n := { b L n L n } ∩ { c M n ( φ n ) M n } ∩ { min j d b Γ ˜ H,jj > / } . . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics In Step 0, we have shown P (cid:16) min j d b Γ ˜ H,jj > / (cid:17) > − /n . In Step1.2-1.4, we select proper L n and M n such that the first two events happen withprobability at least 1 − C/n . In Step 1.5, we plug in these values.
Step 1.2: Select L n . Since p n / E | Z ι − p n | Cp n , and thus b L n Cp − / n Z , where Z := max j d | I n,r | X ι ∈ I n,r (cid:12)(cid:12)(cid:12) ˜ H j ( X ι , W ι ) (cid:12)(cid:12)(cid:12) . We will apply Lemma A.6 with F ( · ) = | ˜ H ( · ) | and β = q/
3. Thus for 1 j d ,define f j ( x r ) := E (cid:20)(cid:12)(cid:12)(cid:12) ˜ H j ( x r , W ) (cid:12)(cid:12)(cid:12) (cid:21) , b j ( x r ) := (cid:13)(cid:13)(cid:13)(cid:13)(cid:12)(cid:12)(cid:12) ˜ H j ( x r , W ) (cid:12)(cid:12)(cid:12) − f j ( x r ) (cid:13)(cid:13)(cid:13)(cid:13) ψ q/ . First, by iterated expectation and due to (21), E [ f j ( X r )] = E (cid:20)(cid:12)(cid:12)(cid:12) ˜ H j ( X r , W ) (cid:12)(cid:12)(cid:12) (cid:21) σ − H D n , for 1 j d. Second, observe that σ H,j f j ( x r ) . E (cid:2) | H j ( x r , W ) − h j ( x r ) | (cid:3) + | h j ( x r ) | . B n,j ( x r ) + | h j ( x r ) | , and thus due to (C3), (C4) and Lemma A.10 and A.11, k f j ( X r ) k ψ q/ . σ − H,j (cid:0) k B n,j ( X r ) k ψ q/ + k h j ( X r ) k ψ q/ (cid:1) = σ − H,j (cid:16) k B n,j ( X r ) k ψ q + k h j ( X r ) k ψ q (cid:17) . ( σ − H D n ) . Further, observe that by Lemma A.11, σ H,j b j ( x r ) . k | H j ( x r , W ) − h j ( x r ) | k ψ q/ + | h j ( x r ) | + σ H,j f j ( x r )= B n,j ( x r ) + | h j ( x r ) | + σ H,j f j ( x r ) . Thus by the same argument, k b j ( X r ) k ψ q/ . ( σ − H D n ) . Then by Lemma A.6,with probability at least 1 − n − , Z C (cid:16) σ − H D n + n − r log /q ( dn ) σ − H D n + | I n,r | − r /q log /q ( dn ) σ − H D n (cid:17) . Due to Lemma A.9 and assumption (26), P ( b L n Cσ − H p − / n D n ) > − /n .Thus there is a constant C , depending on q , such that if L n := C σ − H p − / n r /q D n , (30)then P ( b L n L n ) > − /n . Step 1.3: bounding c M n,X ( φ n ) . Since Z ι is a Bernoulli random variable, it isclear that c M n,X ( φ n ) = 0 on the event M := max ι ∈ I n,r max j d | ˜ H j ( X ι , W ι ) | √ N φ n log( d ) = 4 − K − C / (cid:18) r /q D n Nσ H log( d ) (cid:19) / , . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics where we use the value (30) for L n .By assumption (C3’) and Lemma A.11, k M k ψ q C ′ σ − H r /q D n log /q ( dn ) ⇒ P (cid:16) M C ′ σ − H r /q D n log /q ( dn ) (cid:17) > − /n. 
Due to (26), (cid:18) r /q D n Nσ H log( d ) (cid:19) / > c − / σ − H r /q D n log /q ( dn ) ,φ − n = K − C / (cid:18) r /q D n log ( d ) σ H N (cid:19) / K − C / c / . Thus if we take c in (26) to be sufficiently small such that c − / − K − C / > C ′ and K − C / c / . then P ( c M n,X ( φ n ) = 0) > − /n and φ n > Step 1.4: select M n . From Step 1.3, we have shown that P ( E ′ n ) > − /n, where E ′ n := (cid:26) M := max ι ∈ I n,r max j d | ˜ H j ( X ι , W ι ) | C ′ σ − H r /q D n log /q ( dn ) (cid:27) . Then by the same argument as in Step 1.4 of the proof of [8, Theorem 3.1] anddue to (26) and φ n >
1, on the event E ′ n , for any ι ∈ I n,r , E | X,W " max j d | b Y ι,j | ; max j d | b Y j,ι | > p | I n,r | φ n log d C p | I n,r | φ n log d + CM log / ( d ) ! exp − p | I n,r | CM φ n log / d ! Cn r/ exp − | I n,r | / Cσ − / H r / q D / n log /q +5 / ( dn ) ! Cn r/ exp (cid:16) −| I n,r | / /C (cid:17) . Thus there exists an absolute constant C such that if we set M n := C n r/ exp (cid:16) −| I n,r | / /C (cid:17) , (31)then P ( c M n,Y ( φ n ) M n ) > − /n . Step 1.5: plug in L n and M n . Recall the definition L n and M n in (30) and (31).With these selections, we have shown that P ( E n ) > − C/n , where we recallthat E n := { b L n L n } ∩ { c M n ( φ n ) M n } ∩ { min j d b Γ ˜ H,jj > / } . Further, . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics on the event E n , ρ R| X,W ( √ N ζ n , b Y ) . L n log ( d ) | I n,r | ! / + M n L n . (cid:18) r /q D n log ( d ) σ H N (cid:19) / + p / n σ H D n r /q n r/ exp (cid:16) −| I n,r | / /C (cid:17) . C̟ n , which completes the proof of Step 1. Step 2.
The goal is to show that the second event in (25), ρ R| X,W ( b Y , Λ − / H Y B ) C̟ n , holds with probability at least 1 − C/n .Observe that Cov(Λ − / H Y B ) = Γ ˜ H and Γ ˜ H,jj = 1 for 1 j d . By theGaussian comparison inequality [8, Lemma C.5], ρ R| X,W ( b Y , Λ − / H Y B ) . ∆ / log / ( d ) , on the event that {k b Γ ˜ H − Γ ˜ H k ∞ ∆ } . From (27) in Step 0, P (cid:16) k b Γ ˜ H − Γ ˜ H k ∞ C ( σ − H n − r log ∨ (2 /q − ( dn ) D n ) / (cid:17) > − /n. Thus if we set ∆ = C ( σ − H n − r log ∨ (2 /q − ( dn ) D n ) / , then with probabilityat least 1 − C/n , ρ R| X,W ( b Y , Y B ) C r log ∨ (2 /q +3) ( dn ) D n σ H n ! / C̟ n . (cid:4) A.4.4. Proof of Theorem 2.4
Without loss of generality, we assume that r q D n log q ∗ ( dn ) σ g n ∧ N . (32)Observe that U ′ n,N = N b N N X ι ∈ I n,r ( Z ι − p n )Λ / H ˜ H ( X ι , W ι ) + 1 | I n,r | X ι ∈ I n,r H ( X ι , W ι ) = N b N (cid:16)p − p n Λ / H ζ n + b U n (cid:17) := N b N Φ n , . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics where we recall that b U n and ζ n is defined in Section 2.1 and in (24) respectively.Denote Y := rY A + α / n Y B .Step 1: the goal is to show that ρ (cid:16) √ n Φ n , rY A + α / n Y B (cid:17) . ̟ n . For any rectangle R ∈ R , observe that P ( √ n (cid:16) b U n + p − p n Λ / H ζ n (cid:17) ∈ R )= E " P | X,W √ Nζ n ∈ p α n (1 − p n ) Λ − / H R − s N − p n Λ − / H b U n !! . By Lemma A.14, since n − . ̟ n , we have P ( √ n (cid:16) b U n + p − p n Λ / H ζ n (cid:17) ∈ R ) E " P | X,W Λ − / H Y B ∈ p α n (1 − p n ) Λ − / H R − s N − p n Λ − / H b U n !! + C̟ n = P (cid:16) √ n b U n ∈ h R − p α n (1 − p n ) Y B i(cid:17) + C̟ n , where we recall that Y B is independent of all other random variables. Further,by Theorem 2.1, P ( √ n (cid:16) b U n + p − p n Λ / H ζ n (cid:17) ∈ R ) E h P | Y B (cid:16) √ n b U n ∈ h R − p α n (1 − p n ) Y B i(cid:17)i + C̟ n E h P | Y B (cid:16) rY A ∈ h R − p α n (1 − p n ) Y B i(cid:17)i + C̟ n , = P (cid:16) Λ − / g ( rY A + p α n (1 − p n ) Y B ) ∈ Λ − / g R (cid:17) + C̟ n . Observe that E [( σ − g,j rY A,j ) ] = r > j d , k Γ H k ∞ . D n due to (C3’), and α n p n = n/ | I n,r | . n − . Then by the Gaussian comparisoninequality [8, Lemma C.5] and due to (32) P ( √ n Φ n ∈ R ) P (cid:16) Λ − / g ( rY A + √ α n Y B ) ∈ Λ − / g R (cid:17) + C̟ n + C (cid:18) D n log ( d ) σ g n (cid:19) / P ( rY A + √ α n Y B ∈ R ) + C̟ n . Similarly, we can show P ( √ n Φ n ∈ R ) > P (cid:0) rY A + √ α n Y B ∈ R (cid:1) − C̟ n . Thusthe proof of Step 1 is complete.Step 2: we show that with probability at least 1 − C̟ n , k ( N b N − √ N Φ n k ∞ Cν n , where ν n := s log ( dn ) r D n n ∧ N . . Song, X. Chen, K. 
Kato/High-dimensional infinite-order U -statistics Clearly, E [ Y j ] = r σ g,j + α n σ H,j . Then due to (C3’), E [ Y j ] ( r + α n ) D n . Since Y is a multivariate Gaussian, max j d k Y j k ψ p ( r + α n ) D n . Then by themaximal inequality [29, Lemma 2.2.2] k max j d | Y j |k ψ C p ( r + α n ) D n log( d ),which further implies that P (cid:18) max j d | Y j | > C p ( r + α n ) D n log( d ) log( n ) (cid:19) n − . Since n − . ̟ n , and from the result in Step 1, we have P (cid:16) k√ n Φ n k ∞ > C p ( r + α n ) D n log( d ) log( n ) (cid:17) C̟ n . Finally, due to Lemma A.12 and (32), we have with probability at least 1 − C̟ n , k ( N / b N − √ N Φ n k ∞ C q ( r + α n ) D n log( d ) log ( n ) N − α − n . Since ( r + α n ) N − α − n = r n − + N − r ( n ∧ N ) − , the proof is complete.Step 3: final step. Recall that √ N U ′ n,N = √ N Φ n + ( N/ b N − √ N Φ n and ν n is defined in Step 2. For any rectangle R = [ a, b ] with a b , by Step 2, P (cid:16) √ N U ′ n,N ∈ R (cid:17) P (cid:16) √ N U ′ n,N ∈ R ∩ k ( N/ b N − √ N Φ n k ∞ Cν n (cid:17) + C̟ n P (cid:16) √ N Φ n − a + Cν n ∩ √ N Φ n b + Cν n (cid:17) + C̟ n . Then by the result in Step 1, we have P (cid:16) √ N U ′ n,N ∈ R (cid:17) P (cid:16) α − / n Y − a + Cν n ∩ α − / n Y b + Cν n (cid:17) + C̟ n P (cid:16) α − / n ˜ Y − ˜ a + Cσ − H ν n ∩ α − / n ˜ Y ˜ b + Cσ − H ν n (cid:17) + C̟ n , where ˜ Y = Λ − H Y , ˜ a = Λ − H a and ˜ b = Λ − H b . Observe that E [( α − / n ˜ Y j ) ] > E [( σ − H,j Y B,j ) ] = 1 for 1 j d , and thus by anti-concentration inequality [10,Lemma A.1], P (cid:16) √ N U ′ n,N ∈ R (cid:17) P (cid:16) α − / n ˜ Y − ˜ a ∩ α − / n Y ˜ b (cid:17) + Cσ − H ν n log / ( d ) + C̟ n = P (cid:16) α − / n Y ∈ R (cid:17) + s log( d ) log ( dn ) r D n σ H n ∧ N + C̟ n P (cid:16) α − / n Y ∈ R (cid:17) + C̟ n , where the last inequality is due to (32). Similarly, we can show P (cid:16) √ N U ′ n,N ∈ R (cid:17) > P (cid:16) α − / n Y ∈ R (cid:17) − C̟ n , and thus ρ ( √ N U ′ n,N , α − / n Y ) . 
̟ n , which completes the proof. □

A.5. Proofs in Section 3
In this subsection, without loss of generality, we assume q

A.5.1. Proof of Theorem 3.1

Proof.
Without loss of generality, we can assume θ = E [ H ( X r , W )] = 0, sinceotherwise we can center H first. Recall the definition of Λ H in (5), ˜ H ( · ) =Λ − / H H ( · ), and Γ ˜ H , b Γ ˜ H in (20). Observe that for any integer k , there existssome constant C that depends only on k and ζ such thatlog k ( n ) n − ζ C. (33)Step 0. Define ˜ U ′ n,N := Λ − / H U ′ n,N and b ∆ B := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b N X ι ∈ I n,r Z ι (cid:16) ˜ H ( X ι , W ι ) − ˜ U ′ n,N (cid:17) (cid:16) ˜ H ( X ι , W ι ) − ˜ U ′ n,N (cid:17) T − Γ ˜ H (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . Since Γ ˜ H,jj = 1 for 1 j d , by Gaussian comparison inequality [8, C.5],sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) U n,B ∈ R (cid:17) − P ( Y B ∈ R ) (cid:12)(cid:12)(cid:12) = sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) Λ − / H U n,B ∈ R (cid:17) − P (Λ − / H Y B ∈ R ) (cid:12)(cid:12)(cid:12) . (cid:16) b ∆ B log ( d ) (cid:17) / . Thus it suffices to show that with probability at least 1 − C/n , b ∆ B log ( d ) . n − ζ/ . Define b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N − X ι ∈ I n,r ( Z ι − p n ) ˜ H ( X ι , W ι ) ˜ H ( X ι , W ι ) T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)b Γ ˜ H − Γ ˜ H (cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := | N/ b N − | k Γ ˜ H k ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N − X ι ∈ I n,r Z ι ˜ H ( X ι , W ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . Then clearly b ∆ B | N/ b N | (cid:16) b ∆ B, + b ∆ B, (cid:17) + b ∆ B, + ( N/ b N ) b ∆ B, .Without loss of generality, we can assume C n − ζ /
16, since we can alwaystake C to be large enough. Then by Lemma A.12, P ( | N/ b N | C ) > − n − ,and thus it suffices to show that P (cid:16) b ∆ B,i log ( d ) Cn − ζ/ (cid:17) > − C/n, for all i = 1 , . . . , , . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics on which we now focus.Step 1: bounding b ∆ B, . Conditional on { X ι , W ι : ι ∈ I n,r } , by Lemma A.3, P (cid:16) N b ∆ B, C (cid:16)p N V n log( dn ) + M log( dn ) (cid:17)(cid:17) > − C/n, where V n := max j,k d | I n,r | − X ι ∈ I n,r ˜ H j ( X ι , W ι ) ˜ H k ( X ι , W ι ) , M := max ι ∈ I n,r max j d ˜ H j ( X ι , W ι ) . First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) anddue to (C3’) and Lemma A.10 and A.11, k M k ψ q/ . σ − H r /q log /q ( dn ) max ι ∈ I n,r max j d k H j ( X ι , W ι ) k ψ q/ . σ − H r /q log /q ( dn ) D n . As a result, P (cid:16) M Cσ − H r /q D n log /q ( n ) log /q ( dn ) (cid:17) > − /n .Second, we will apply Lemma A.6 to bound V n with F jk ( · ) = ˜ H j ( · ) ˜ H k ( · ) and β = q/
4. Note that by Lemma A.11, for 1 j, k d , σ H,j σ H,k f jk ( x r ) := E (cid:2) H j ( x r , W ) H k ( x r , W ) (cid:3) . E (cid:2) H j ( x r , W ) + H k ( x r , W ) (cid:3) . h j ( x r ) + B n,j ( x r ) + h k ( x r ) + B n,k ( x r ) ,σ H,j σ H,k b jk ( x r ) := k H j ( x r , W ) H k ( x r , W ) − σ H,j σ H,k f jk ( x r ) k ψ q/ . h j ( x r ) + B n,j ( x r ) + h k ( x r ) + B n,k ( x r ) + σ H,j σ H,k f jk ( x r ) . As a result, due to (C5), (C3) and (C4) E [ f jk ( X r )] . ( σ − H D n ) , k f jk ( X r ) k ψ q/ . ( σ − H D n ) , k b jk ( X r ) k ψ q/ . ( σ − H D n ) . Then by Lemma A.6 and A.9, and due to (8) and (33) P ( V n Cσ − H D n ) > − /n. Finally, putting the two results together and again by (33), we have P (cid:16) b ∆ B, C (cid:16) N − / log / ( dn ) σ − H D n + N − r /q log /q ( n ) log /q +1 ( dn ) σ − H D n (cid:17)(cid:17) > − C/n.
Then by (8), P (cid:16) b ∆ B, Cσ − H N − / r /q log / ( dn ) D n (cid:17) > − C/n , which im-plies that with probability at least 1 − C/n , b ∆ B, log ( d ) Cn − ζ/ . Step 2: bounding b ∆ B, . By Lemma A.13 and A.9, and due to assumptions (8)and (33) P (cid:16) b ∆ B, Cσ − H n − / r / log / ( dn ) D n (cid:17) > − /n, . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics which implies P ( b ∆ B, log ( d ) Cn − ζ/ ) > − /n .Step 3: bounding b ∆ B, . By definition, k Γ ˜ H k ∞ = 1. Then by Lemma A.12 and (8), b ∆ B, log ( d ) N − / log / ( n ) log ( d ) Cn − ζ/ , with probability at least 1 − n − .Step 4: bounding b ∆ B, . Define b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N − X ι ∈ I n,r ( Z ι − p n ) ˜ H ( X ι , W ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | I n,r | − X ι ∈ I n,r ˜ H ( X ι , W ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . Clearly, b ∆ B, (cid:16) b ∆ B, + b ∆ B, (cid:17) . In the next two sub-steps, we will boundthese two terms separately.Step 4.1: bounding b ∆ B, . Conditional on { X ι , W ι : ι ∈ I n,r } , by Lemma A.3, P (cid:18) N b ∆ B, C (cid:18)q N e V n log( dn ) + f M log( dn ) (cid:19)(cid:19) > − C/n, where e V n := max j d | I n,r | − P ι ∈ I n,r ˜ H j ( X ι , W ι ) , f M := max ι ∈ I n,r max j d | ˜ H j ( X ι , W ι ) | .First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) anddue to (C3’), k f M k ψ q . σ − H r /q log /q ( dn ) D n . As a result, P (cid:16) f M Cσ − H r /q D n log /q ( n ) log /q ( dn ) (cid:17) > − /n .Second, we will apply Lemma A.6 to bound e V n with F j ( · ) = ˜ H j ( · ) and β = q/
2. Define for 1 j d , f j ( x r ) := E h ˜ H j ( x r , W ) i , b j ( x r ) := k ˜ H j ( x r , W ) − f j ( x r ) k ψ q/ . By the similar argument as in Step 1, E [ f j ( X r )] = 1 , k f j ( X r ) k ψ q/ . ( σ − H D n ) , k b j ( X r ) k ψ q/ . ( σ − H D n ) . Then by Lemma A.6 and A.9, and due to (8) and (33) we have P ( e V n C ) > − /n .Finally, putting the two results together, we have P (cid:16) b ∆ B, C (cid:16) N − log( dn ) + σ − H N − r /q log /q +2 ( dn ) log /q ( n ) D n (cid:17)(cid:17) > − C/n.
Then by (8), P (cid:16) b ∆ B, CN − log( dn ) (cid:17) > − C/n , which implies that withprobability at least 1 − C/n , b ∆ B, log ( d ) Cn − ζ holds. . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics Step 4.2: bounding b ∆ B, . Observe that b ∆ B, b ∆ B, + b ∆ B, , where b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | I n,r | − X ι ∈ I n,r Λ − / H ( H ( X ι , W ι ) − h ( X ι )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ , b ∆ B, := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | I n,r | − X ι ∈ I n,r Λ − / H h ( X ι ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ . By directly applying Lemma A.7 with β = q , due to (8) and Lemma A.9, P (cid:16) b ∆ B, Cσ − H D n n − log / ( dn ) (cid:17) > − /n. By directly applying Lemma A.5 with β = q and due to (8), P (cid:16) b ∆ B, Cσ − H n − / r / log / ( dn ) D n (cid:17) > − /n. Thus P (cid:16) b ∆ B, log ( d ) Cn − ζ (cid:17) > − C/n .Combining sub-step 4.1 and 4.2, we have P (cid:16) b ∆ B, log ( d ) Cn − ζ (cid:17) > − C/n . And combining Step 0-4, we finish the proof. (cid:4)
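The object controlled in Theorem 3.1 is, conditionally on the data, a Gaussian multiplier statistic whose covariance is the estimator appearing in \(\widehat\Delta_B\). Below is a generic sketch of one such multiplier draw from sampled kernel evaluations. The inputs are synthetic: `H_sampled` stands in for the evaluations \(\{H(X_\iota, W_\iota) : Z_\iota = 1\}\), and the centered multiplier sum reflects our reading of the proof, not the paper's verbatim definition of \(U_{n,B}\):

```python
import numpy as np

# One Gaussian-multiplier bootstrap draw, conditionally on the data:
# N_hat^{-1/2} * sum_i xi_i (H_i - H_bar), with xi_i i.i.d. N(0, 1).
rng = np.random.default_rng(1)
N_hat, d = 500, 6
H_sampled = rng.normal(size=(N_hat, d))   # placeholder kernel evaluations

def multiplier_bootstrap_draw(H, rng):
    """Centered multiplier sum; its conditional covariance is the
    empirical covariance of the rows of H."""
    xi = rng.normal(size=len(H))
    centered = H - H.mean(axis=0)
    return centered.T @ xi / np.sqrt(len(H))

draws = np.array([multiplier_bootstrap_draw(H_sampled, rng)
                  for _ in range(200)])
```

Repeating the draw many times (as above) is how the bootstrap distribution is formed in practice; each draw is exactly Gaussian given the data, so the only approximation error is in the covariance estimate, which is what \(\widehat\Delta_B\) quantifies.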
A.5.2. Proof of Lemma 3.2

Proof.
Without loss of generality, we can assume θ = E [ H ( X r , W )] = 0. Recallthe definition Λ g is (5). By definition, E [( σ − g,j Y A,j ) ] = 1 for 1 j d . Thenby the Gaussian comparison inequality [8, Lemma C.5],sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) U n ,A ∈ R (cid:17) − P ( Y A ∈ R ) (cid:12)(cid:12)(cid:12) = sup R ∈R (cid:12)(cid:12)(cid:12) P |D n (cid:16) Λ − / g U n ,A ∈ R (cid:17) − P (Λ − / g Y A ∈ R ) (cid:12)(cid:12)(cid:12) . ( b ∆ A log ( d )) / , where b ∆ A := max j,k d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ g,j σ g,k n X i ∈ S ( G i ,j − G j )( G i ,k − G k ) − σ g,j σ g,k Γ g,jk (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . By the same argument as in the proof of [8, Theorem 4.2], b ∆ A . b ∆ / A, + b ∆ A, + b ∆ A, + b ∆ , where b ∆ A, is defined in (9), and b ∆ A, := max j,k d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ g,j σ g,k n X i ∈ S ( g j ( X i ) g k ( X i ) − Γ g,jk ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , b ∆ A, := max j,k d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ g,j n X i ∈ S g j ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Step 1: bounding b ∆ A, . By the second part of (11), we have P (cid:16) b ∆ / A, log ( d ) C / n − ζ / (cid:17) − Cn − , P (cid:16) b ∆ A, log ( d ) C n − ζ (cid:17) − Cn − . . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics Step 2: bounding b ∆ A, . We apply Lemma A.2 with β = q/ m = n and notethat n n : P (cid:16) b ∆ A, > C (cid:16) n − σ log / ( dn ) + n − u n log /q +1 ( dn ) log /q ( n ) (cid:17)(cid:17) n − . where σ = max j,k d σ − g,j σ − g,k P i ∈ S E [( g j ( X i ) g k ( X i ) − Γ g,jk ) ] and u n = k σ − g,j σ − g,k ( g j ( X i ) g k ( X i ) − Γ g,jk ) k ψ q/ . By Lemma A.11, (C2) and (C3’), σ n (cid:0) σ − g D n (cid:1) , u n (cid:0) σ − g D n (cid:1) . Thus P (cid:16) b ∆ A, > C (cid:16) n − / σ − g D n log / ( dn ) + n − σ − g D n log /q +1 ( dn ) log /q ( n ) (cid:17)(cid:17) n − . 
Then due to the first part of (11) and (33), P ( b ∆ A, log ( d ) > Cn − ζ / ) Cn − . Step 3: bounding b ∆ A, . We apply Lemma A.2 with β = q , m = n : P ( b ∆ A, > C ( n − / log / ( dn ) + n − σ − g D n log ( dn ) log( n ) ) ) n − . Then due to the first part of (11) and (33), P ( b ∆ A, log ( d ) > Cn − ζ ) Cn − . □

A.5.3. Proof of Theorem 3.3

Proof.
Without loss of generality, we can assume θ = E [ H ( X r , W )] = 0.Step 1. Let ζ := ζ , ζ := ζ − /ν . Due to Theorem 3.1, Lemma 3.2 and usingthe same argument as in the Step 3 of the proof of [8, Theorem 4.2], it sufficesto show the second part of (11) holds. From the definition (9), b ∆ A, σ − g max j d n X i ∈ S ( G i ,j − g j ( X i )) := σ − g ∆ A, . In Step 2, we will show that E h ∆ νA, i . (cid:16) n − rD n log /q +1 ( d ) (cid:17) ν . (34)Then by Markov inequality and (12), P (cid:16) b ∆ A, log ( d ) > C n − ζ (cid:17) . n ζ ν σ − νg log ν ( d ) (cid:16) n − rD n log /q +1 ( d ) (cid:17) ν = n − (cid:16) n ζ n − σ − g rD n log /q +5 ( d ) (cid:17) ν . n − , which completes the proof. . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics Step 2. The goal is to show (34). Define F ( x r , w ) := max j d | H j ( x r , w ) | , g ( i ,k ) ( X i ) := H ( X S ( i ,k , W S ( i ,k ) for i ∈ S , k = 1 , . . . , K. By Jensen’s inequality, E [∆ νA, ] n X i ∈ S E (cid:20) max j d | G i ,j − g j ( X i ) | ν (cid:21) , and for each i ∈ S , conditional on X i , by Hoffmann-Jorgensen inequality [29,A.1.6.], E | Xi (cid:20) max j d | G i ,j − g j ( X i ) | ν (cid:21) . I i + II i := (cid:18) E | X i (cid:20) max j d | G i ,j − g j ( X i ) | (cid:21)(cid:19) ν + K − ν E | X i (cid:20) max k K max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21) . Step 2.1: bounding II i . Observe that for each 1 k K , E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21) = E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − E | X i h g ( i ,k ) j ( X i ) i(cid:12)(cid:12)(cid:12) ν (cid:21) . E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21) = E | X i (cid:20) F ν ( X S ( i ,k , W S ( i ,k ) (cid:21) = E | X i h F ν ( X S ( i , , W S ( i , ) i := b ( X i ) . Thus II i . 
K − ν +1 b ( X i ).Step 2.2: bounding I i . Observe that for each i ∈ S ,max j d K X k =1 E | X i (cid:20)(cid:16) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:17) (cid:21) . K E | X i h F ( X S ( i , , W S ( i , ) i := K e b ( X i ) . Further, by Jensen’s inequality, E | X i (cid:20) max k K max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) (cid:21) K X k =1 E | X i (cid:20) max j d (cid:12)(cid:12)(cid:12) g ( i ,k ) j ( X i ) − g j ( X i ) (cid:12)(cid:12)(cid:12) ν (cid:21)! /ν . K /ν b /ν ( X i ) , where b ( X i ) is defined in Step 1. Then by the same argument as in the proofof [8, Proposition 4.4], I i . K − ν log ν ( d ) e b ν ( X i ) + K − ν +1 log ν ( d ) b ( X i ) . . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics Step 2.3: combining 2.1 and 2.2. By Jensen’s inequality, assumption (C3’) andby the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) E he b ν ( X i ) i log ν/q ( d ) D νn , E [ b ( X i )] log ν/q ( d ) D νn . Thus combining the results from 2.1 and 2.2, we have E (cid:20) max j d | G i ,j − g j ( X i ) | ν (cid:21) . K − ν D νn log ν ( d ) (cid:0) K − ν +1 log ν ( d ) (cid:1) . (cid:16) n − rD n log /q +1 ( d ) (cid:17) ν , where the second inequality is due to (12) and that ν > / K = ⌊ ( n − / ( r − ⌋ . (cid:4) A.5.4. Proof of Corollary 3.5Proof.
We have shown in Step 0 of the proof (Subsection A.5.1) for Theorem 3.1 that P ( max j d | b σ H,j /σ H,j − | log ( d ) . Cn − ζ/ ) > − Cn − . Further, if we take ν = 7 /ζ in Theorem 3.3, then in the proofs of Lemma 3.2 and Theorem 3.3, we have shown that P ( max j d | b σ g,j /σ g,j − | log ( d ) . Cn − ζ/ ) > − Cn − . The rest of the proof is the same as the proof of [8, Corollary A.1], and is thus omitted. □
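Corollary 3.5 is what licenses the practical recipe behind these proofs: approximate the law of a studentized max-statistic by bootstrap draws, take its \((1-\alpha)\)-quantile, and form simultaneous rectangular confidence regions. A schematic version of the quantile step only, with all inputs synthetic; nothing here reproduces the paper's estimators:

```python
import numpy as np

# Simultaneous two-sided intervals theta_hat_j +/- c * s_j / sqrt(N), where
# c is the bootstrap (1 - alpha)-quantile of the studentized max-statistic.
# theta_hat, s, and boot are synthetic placeholders for the point estimate,
# scale estimates, and bootstrap draws of the scaled statistic.
rng = np.random.default_rng(7)
d, B, N = 5, 1000, 400
theta_hat = rng.normal(size=d)
s = np.ones(d)                       # scale estimates (placeholder)
boot = rng.normal(size=(B, d))       # bootstrap draws (placeholder)

max_stat = np.abs(boot / s).max(axis=1)       # studentized max over coordinates
c = float(np.quantile(max_stat, 0.95))        # simultaneous critical value
lower = theta_hat - c * s / np.sqrt(N)
upper = theta_hat + c * s / np.sqrt(N)
```

Using the max-statistic quantile, rather than a per-coordinate quantile, is what makes the resulting rectangle simultaneous over all \(d\) coordinates.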
A.6. Proof of Lemma 4.1
Proof.
Clearly, the inequality is for each dimension, and thus without loss ofgenerality, we assume d = 1 and omit the dependence on j .We denote E β and Cov β the expectation and covariance when X , . . . , X r have densities f β . Further, define g β ( x ) = E β [ h ( x , X , . . . , X r )] for x ∈ S and by definition g ( · ) = g ( · ).First, note that by interchanging the order of integration and differentiation E β [Ψ( β )] = Z r X i =1 ∇ ln f β ( x i ) ! r Y i =1 f β ( x i ) µ ( dx i ) = Z ∇ r Y i =1 f β ( x i ) ! r Y i =1 µ ( dx i ) = 0 . . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics Further, by a similar argument,Cov β ( g β ( X ) , Ψ( β )) = Z g β ( x ) r X i =1 ∇ ln f β ( x i ) ! r Y i =1 f β ( x i ) µ ( dx i )= Z g β ( x ) ∇ ln f β ( x ) f β ( x ) µ ( dx )= Z Z h ( x , x , . . . , x r ) r Y i =2 f β ( x i ) µ ( dx i ) ! ( ∇ ln f β ( x )) f β ( x ) µ ( dx )= Z h ( x , x , . . . , x r ) ∇ ln f β ( x ) r Y i =1 f β ( x i ) µ ( dx i ) , which implies that r X i =1 Cov β ( g β ( X i ) , Ψ( β )) = Z h ( x , x , . . . , x r ) r X i =1 ∇ ln f β ( x i ) ! r Y i =1 f β ( x i ) µ ( dx i )= Z h ( x , x , . . . , x r ) ∇ r Y i =1 f β ( x i ) ! r Y i =1 µ ( dx i ) = ∇ θ ( β ) . Finally, observe that0 Var β r X i =1 g β ( X i ) − ∇ θ ( β ) T ( r J ( β )) − Ψ( β ) ! = r X i =1 Var β ( g β ( X i )) − r − Cov β r X i =1 g β ( X i ) , ∇ θ ( β ) T J ( β ) − Ψ( β ) ! + r − Var β (cid:0) ∇ θ ( β ) T J ( β ) − Ψ( β ) (cid:1) = r Var β ( g β ( X )) − r − ∇ θ ( β ) T J ( β ) − ∇ θ ( β ) , which completes the proof. (cid:4) A.7. Proofs of tail probabilities in Section A.1
A.7.1. Proof of Lemma A.1

Proof.
We first define S := max j d m X i =1 Z ij , M := max i m max j d Z ij . Then by the maximal inequality [29, Lemma 2.2.2], k M k ψ β Cu n log /β ( dm ).By [10, Lemma E.4], P ( S > E [ S ] + t ) − (cid:18) tC k M k ψ β (cid:19) β ! . . Song, X. Chen, K. Kato/High-dimensional infinite-order U -statistics The right hand side is 3 /n if t = C k M k ψ β log /β ( n ) Cu n log /β ( n ) log /β ( dm ) . Further by [10, Lemma E.3], E [ S ] . max j d E " m X i =1 Z ij + log( d ) E [ M ] . max j d E " m X i =1 Z ij + u n log /β +1 ( dm ) . Combining two parts finishes the proof. (cid:4)
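The proofs of Lemmas A.1-A.4 repeatedly combine a tail bound for the sum \(S\) with the maximal inequality \(\|M\|_{\psi_\beta} \le C u_n \log^{1/\beta}(dm)\), which encodes that the maximum of \(k\) variables with \(\psi_\beta\) tails grows like \(\log^{1/\beta} k\). For \(\beta = 1\) (exponential tails) this growth rate is easy to check numerically; the sizes below are arbitrary:

```python
import numpy as np

# Max of k i.i.d. Exp(1) variables concentrates around log(k), the beta = 1
# case of the log^{1/beta}(k) growth used in the maximal inequality.
rng = np.random.default_rng(3)
sizes = (100, 10_000, 1_000_000)
maxima = [float(rng.exponential(size=k).max()) for k in sizes]
```

For heavier sub-Weibull tails (\(\beta < 1\)) the same experiment with \(|N(0,1)|^{2/\beta}\)-type variables shows the faster \(\log^{1/\beta} k\) growth, which is why the \(r^{1/q}\) and \(\log^{1/q}\) factors track the kernel's tail index \(q\) throughout the appendix.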
A.7.2. Proof of Lemma A.2

Proof. We first define
\[
S := \max_{1\le j\le d} \Big| \sum_{i=1}^m Z_{ij} \Big|, \qquad M := \max_{1\le i\le m} \max_{1\le j\le d} |Z_{ij}|.
\]
Then by the maximal inequality [29, Lemma 2.2.2], $\|M\|_{\psi_\beta} \le C u_n \log^{1/\beta}(dm)$. By [10, Lemma E.2],
\[
\mathbb{P}\big( S \ge \mathbb{E}[S] + t \big) \le \exp\big( -t^2/(3\sigma^2) \big) + 3 \exp\bigg( - \Big( \frac{t}{C \|M\|_{\psi_\beta}} \Big)^\beta \bigg).
\]
The right-hand side is $4/n$ if
\[
t = \sqrt{3}\,\sigma \log^{1/2}(n) + C \|M\|_{\psi_\beta} \log^{1/\beta}(n) \le C\big( \sigma \log^{1/2}(n) + \log^{1/\beta}(dm) \log^{1/\beta}(n)\, u_n \big).
\]
Further, by [10, Lemma E.1],
\[
\mathbb{E}[S] \lesssim \sigma \log^{1/2}(d) + \log(d)\, \sqrt{\mathbb{E}[M^2]} \lesssim \sigma \log^{1/2}(d) + \log^{1/\beta+1}(dm)\, u_n.
\]
Combining the two parts finishes the proof. $\blacksquare$
A.7.3. Proof of Lemma A.3

Proof. We first define
\[
S := \max_{1\le j\le d} \Big| \sum_{i=1}^m (Z_i - p_n) a_{ij} \Big|, \qquad \tilde{M} := \max_{1\le i\le m} \max_{1\le j\le d} |(Z_i - p_n) a_{ij}| \le \max_{1\le i\le m} \max_{1\le j\le d} |a_{ij}|,
\]
\[
\tilde{\sigma}^2 := \max_{1\le j\le d} \sum_{i=1}^m \mathbb{E}\big[ (Z_i - p_n)^2 a_{ij}^2 \big] \le p_n(1-p_n) \max_{1\le j\le d} \sum_{i=1}^m a_{ij}^2.
\]
By [10, Lemma E.2],
\[
\mathbb{P}\big( S \ge \mathbb{E}[S] + t \big) \le \exp\big( -t^2/(3\tilde{\sigma}^2) \big) + 3 \exp\Big( - \frac{t}{C \|\tilde{M}\|_{\psi_1}} \Big).
\]
The right-hand side is $4/n$ if
\[
t = \sqrt{3}\,\tilde{\sigma} \log^{1/2}(n) + C \|\tilde{M}\|_{\psi_1} \log(n) \le C\Big( \sqrt{p_n(1-p_n)}\, \sigma \log^{1/2}(n) + M \log(n) \Big).
\]
Further, by [10, Lemma E.1],
\[
\mathbb{E}[S] \lesssim \tilde{\sigma} \log^{1/2}(d) + \log(d)\, \sqrt{\mathbb{E}[\tilde{M}^2]} \lesssim \sqrt{p_n(1-p_n)}\, \sigma \log^{1/2}(d) + M \log(d).
\]
Combining the two parts finishes the proof. $\blacksquare$
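The variance proxy $\tilde{\sigma}^2 \le p_n(1-p_n)\max_j \sum_i a_{ij}^2$ used above is in fact an exact identity for a single column of weights, which is easy to verify by simulation. The sketch below uses arbitrary illustrative weights, sampling probability, and simulation sizes of our choosing; none of these values come from the lemma:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 2000, 0.1                 # number of terms and Bernoulli sampling probability
a = rng.normal(size=m)           # fixed weights a_1, ..., a_m
reps = 3000                      # Monte Carlo replications

# Centered Bernoulli sampling: S = sum_i (Z_i - p) a_i with Z_i ~ Bernoulli(p).
Z = (rng.random((reps, m)) < p).astype(float)
S = (Z - p) @ a

# Var(S) should match the proxy p(1-p) * sum_i a_i^2.
print(S.var(), p * (1 - p) * (a ** 2).sum())
```

The two printed values agree up to Monte Carlo error, matching the bound on $\tilde{\sigma}^2$ term by term.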
A.7.4. Proof of Lemma A.4

Proof. Let $m = \lfloor n/r \rfloor$, and define
\[
Z := \max_{1\le j\le d} \sum_{i=1}^m f_j\big( X_{(i-1)r+1}^{ir} \big), \qquad M := \max_{1\le i\le m} \max_{1\le j\le d} f_j\big( X_{(i-1)r+1}^{ir} \big).
\]
Then by the maximal inequality [29, Lemma 2.2.2], $\|M\|_{\psi_\beta} \le C u_n \log^{1/\beta}(dn)$. By [6, Lemma E.3],
\[
\mathbb{P}\Big( m \max_{1\le j\le d} U_{n,j} \ge \mathbb{E}[Z] + t \Big) \le 3 \exp\bigg( - \Big( \frac{t}{C \|M\|_{\psi_\beta}} \Big)^\beta \bigg).
\]
The right-hand side is $3/n$ if we set
\[
t = C \|M\|_{\psi_\beta} \log^{1/\beta}(n) \le C u_n \log^{1/\beta}(dn) \log^{1/\beta}(n).
\]
Further, by [9, Lemma 9],
\[
\mathbb{E}[Z] \le C \Big( \max_{1\le j\le d} \mathbb{E}\Big[ \sum_{i=1}^m f_j\big( X_{(i-1)r+1}^{ir} \big) \Big] + \log(d)\, \mathbb{E}[M] \Big) \le C \big( m v_n + u_n \log^{1/\beta+1}(dn) \big).
\]
Putting the two parts together, we have
\[
\mathbb{P}\Big( \max_{1\le j\le d} U_{n,j} \ge C \big( v_n + n^{-1} r u_n \log^{1/\beta+1}(dn) + n^{-1} r u_n \log^{1/\beta}(dn) \log^{1/\beta}(n) \big) \Big) \le \frac{3}{n},
\]
which completes the proof. $\blacksquare$

A.7.5. Proof of Lemma A.5

Proof.
Let $m = \lfloor n/r \rfloor$, and define
\[
Z := \max_{1\le j\le d} \Big| \sum_{i=1}^m f_j\big( X_{(i-1)r+1}^{ir} \big) \Big|, \qquad M := \max_{1\le i\le m} \max_{1\le j\le d} \Big| f_j\big( X_{(i-1)r+1}^{ir} \big) \Big|.
\]
Then by the maximal inequality [29, Lemma 2.2.2], $\|M\|_{\psi_\beta} \le C u_n \log^{1/\beta}(dn)$. By [8, Lemma C.3],
\[
\mathbb{P}\Big( m \max_{1\le j\le d} |U_{n,j}| \ge \mathbb{E}[Z] + t \Big) \le \exp\Big( - \frac{t^2}{3 m \sigma^2} \Big) + 3 \exp\bigg( - \Big( \frac{t}{C \|M\|_{\psi_\beta}} \Big)^\beta \bigg).
\]
The right-hand side is $4/n$ if we take
\[
t = \sqrt{3}\,\sigma \sqrt{m}\, \log^{1/2}(n) + C \|M\|_{\psi_\beta} \log^{1/\beta}(n) \le C \big( \sigma m^{1/2} \log^{1/2}(n) + u_n \log^{1/\beta}(dn) \log^{1/\beta}(n) \big).
\]
Further, by [9, Lemma 8],
\[
\mathbb{E}[Z] \lesssim \sqrt{\log(d)\, m}\, \sigma + \sqrt{\mathbb{E}[M^2]}\, \log(d) \lesssim m^{1/2} \log^{1/2}(d)\, \sigma + u_n \log^{1/\beta+1}(dn).
\]
Putting the two parts together completes the proof. $\blacksquare$
A.7.6. Proof of Lemma A.6

Proof. First, observe that $\|F_j(x_1^r, W)\|_{\psi_\beta} \lesssim f_j(x_1^r) + b_j(x_1^r)$. Denote
\[
Z := \max_{1\le j\le d} |I_{n,r}|^{-1} \sum_{\iota \in I_{n,r}} f_j(X_\iota), \qquad M := \max_{\iota \in I_{n,r}} \max_{1\le j\le d} \big( f_j(X_\iota) + b_j(X_\iota) \big).
\]
Then, conditionally on $X_1^n$, by Lemma A.1,
\[
\mathbb{P}_{|X_1^n}\Big( \max_{1\le j\le d} |I_{n,r}|^{-1} \sum_{\iota \in I_{n,r}} F_j(X_\iota, W_\iota) \ge C \big( Z + |I_{n,r}|^{-1} M r^{1/\beta} \log^{1/\beta+1}(dn) \log^{1/\beta-1}(n) \big) \Big) \le \frac{3}{n}.
\]
By Lemma A.4,
\[
\mathbb{P}\Big( Z \ge C \Big( \max_{1\le j\le d} \mathbb{E}[f_j(X_1^r)] + n^{-1} r \log^{1/\beta+1}(dn) \log^{1/\beta-1}(n)\, u_n \Big) \Big) \le \frac{3}{n}.
\]
Further, by the maximal inequality [29, Lemma 2.2.2],
\[
\|M\|_{\psi_\beta} \le C r^{1/\beta} \log^{1/\beta}(dn)\, u_n \;\Longrightarrow\; \mathbb{P}\big( M \ge C r^{1/\beta} \log^{1/\beta}(n) \log^{1/\beta}(dn)\, u_n \big) \le \frac{2}{n}.
\]
The proof is then complete by combining the above results. $\blacksquare$

A.7.7. Proof of Lemma A.7

Proof.
First, we define
\[
\sigma^2 := \max_{1\le j\le d} \sum_{\iota \in I_{n,r}} \mathbb{E}_{|X_1^n}\Big[ \big( F_j(X_\iota, W_\iota) - f_j(X_\iota) \big)^2 \Big] \lesssim \max_{1\le j\le d} \sum_{\iota \in I_{n,r}} b_j^2(X_\iota), \qquad M := \max_{\iota \in I_{n,r}} \max_{1\le j\le d} b_j(X_\iota).
\]
Then, by first conditioning on $X_1^n$ and applying Lemma A.2,
\[
\mathbb{P}\Big( |I_{n,r}|\, Z \ge C \big( \sigma r^{1/2} \log^{1/2}(dn) + M r^{1/\beta} \log^{1/\beta+1}(dn) \log^{1/\beta-1}(n) \big) \Big) \le \frac{4}{n}.
\]
Observe that $\| b_j^2(X_1^r) \|_{\psi_{\beta/2}} = \| b_j(X_1^r) \|_{\psi_\beta}^2 \le u_n^2$. Then by Lemma A.4 applied with $\psi_{\beta/2}$,
\[
\mathbb{P}\Big( \frac{\sigma^2}{|I_{n,r}|} \ge C u_n^2 \big( 1 + n^{-1} r \log^{2/\beta+1}(dn) \log^{2/\beta-1}(n) \big) \Big) \le \frac{3}{n}.
\]
Further, by the maximal inequality [29, Lemma 2.2.2],
\[
\|M\|_{\psi_\beta} \le C r^{1/\beta} \log^{1/\beta}(dn)\, u_n \;\Longrightarrow\; \mathbb{P}\big( M \ge C r^{1/\beta} \log^{1/\beta}(dn) \log^{1/\beta}(n)\, u_n \big) \le \frac{2}{n}.
\]
The proof is then complete by combining the above results. $\blacksquare$
A.8. Proofs of additional lemmas
The following lemma is similar to [10, Lemma C.1], and is needed in proving Lemma A.8.
Lemma A.15.
Let $q \in (0, 1]$, and let $\xi$ be a non-negative random variable such that $\|\xi\|_{\psi_q} \le D$. Then there exists a constant $C$, depending only on $q$, such that
\[
\mathbb{E}[\xi^2; \xi > t] \le C \big( t^2 + D^2 \big) e^{-(t/D)^q}, \quad \text{for all } t > 0.
\]

Proof. Since $\|\xi\|_{\psi_q} \le D$, we have for $x > 0$,
\[
\mathbb{P}(\xi > x) \le e^{-(x/D)^q}\, \mathbb{E}\big[ e^{(\xi/D)^q} \big] \le 2 e^{-(x/D)^q}.
\]
By the change of variables $u = (x/D)^q$, we have
\begin{align*}
\mathbb{E}[\xi^2; \xi > t] &\le t^2\, \mathbb{P}(\xi > t) + 2 \int_t^\infty \mathbb{P}(\xi > x)\, x\, dx \\
&\lesssim t^2 e^{-(t/D)^q} + D^2 \int_{(t/D)^q}^\infty e^{-u} u^{2/q - 1}\, du \\
&\lesssim t^2 e^{-(t/D)^q} + D^2 e^{-(t/D)^q} \int_0^\infty e^{-u} \big( u + (t/D)^q \big)^{2/q - 1}\, du \\
&\lesssim t^2 e^{-(t/D)^q} + D^2 e^{-(t/D)^q} \int_0^\infty e^{-u} \big( u^{2/q - 1} + (t/D)^{2 - q} \big)\, du \\
&\lesssim \big( t^2 + D^2 + t^{2-q} D^q \big) e^{-(t/D)^q} \lesssim \big( t^2 + D^2 \big) e^{-(t/D)^q},
\end{align*}
where the last step uses $t^{2-q} D^q \le t^2 + D^2$. $\blacksquare$

Proof of Lemma A.8.
For $q \ge 1$, the result has been established in [10, Proposition 2.1]. For $q < 1$, the proof is almost identical to that of [10, Proposition 2.1], except that we replace [10, Lemma C.1] with Lemma A.15. $\blacksquare$
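The bound of Lemma A.15 can also be checked numerically for a concrete pair $(q, D)$. The sketch below uses our illustrative choices $q = 1/2$, $D = 1$, and the (not claimed to be sharp) constant $C = 50$; it evaluates $\mathbb{E}[\xi^2; \xi > t]$ for the exact tail $\mathbb{P}(\xi > x) = e^{-(x/D)^q}$ by trapezoidal quadrature and compares it with $C(t^2 + D^2)e^{-(t/D)^q}$:

```python
import numpy as np

q, D = 0.5, 1.0  # shape q in (0, 1] and scale D, chosen for illustration

def tail_second_moment(t, n=200_000, xmax=500.0):
    # E[xi^2; xi > t] = t^2 P(xi > t) + int_t^inf 2x P(xi > x) dx,
    # with the exact tail P(xi > x) = exp(-(x/D)^q); trapezoidal quadrature.
    x = np.linspace(t, xmax, n)
    f = 2.0 * x * np.exp(-(x / D) ** q)
    integral = ((f[:-1] + f[1:]) / 2.0 * np.diff(x)).sum()
    return t ** 2 * np.exp(-(t / D) ** q) + integral

C = 50.0  # one admissible constant for this (q, D); the lemma's C is unspecified
for t in [0.5, 1.0, 2.0, 5.0, 10.0]:
    bound = C * (t ** 2 + D ** 2) * np.exp(-(t / D) ** q)
    assert tail_second_moment(t) <= bound
```

The truncation at `xmax` is harmless here since $e^{-(500)^{1/2}} = e^{-22.4}$ makes the neglected tail negligible.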
Proof of Lemma A.11. (i). Without loss of generality, we assume $0 < x := \|X\|_{\psi_\beta} < \infty$ and $0 < y := \|Y\|_{\psi_\beta} < \infty$. Observe that
\begin{align*}
\mathbb{E}\bigg[ \exp\bigg( \Big( \frac{|X+Y|}{2^{1/\beta}(x+y)} \Big)^\beta \bigg) \bigg]
&\le \mathbb{E}\bigg[ \exp\bigg( \frac{|X|^\beta + |Y|^\beta}{2(x+y)^\beta} \bigg) \bigg] \\
&\le \mathbb{E}\bigg[ \frac{1}{2} \exp\bigg( \frac{|X|^\beta}{(x+y)^\beta} \bigg) \bigg] + \mathbb{E}\bigg[ \frac{1}{2} \exp\bigg( \frac{|Y|^\beta}{(x+y)^\beta} \bigg) \bigg] \le 2,
\end{align*}
where the first inequality uses $|X+Y|^\beta \le |X|^\beta + |Y|^\beta$ for $\beta \in (0,1]$, and the second uses the convexity bound $e^{(u+v)/2} \le \frac{1}{2} e^u + \frac{1}{2} e^v$.

(ii). From Lemma 5.4, for $1 \le i \le n$,
\[
\mathbb{E}\Big[ \tilde{\psi}_\beta\Big( \frac{|\xi_i|}{D} \Big) \Big] \le \mathbb{E}\Big[ \psi_\beta\Big( \frac{|\xi_i|}{D} \Big) \Big] + 1,
\]
which, by the convexity of $\tilde{\psi}_\beta$ and the fact $\tilde{\psi}_\beta(0) = 0$, implies $\|\xi_i\|_{\tilde{\psi}_\beta} \le C D$. By the standard maximal inequality (e.g., see [29, Lemma 2.2.2]) and Lemma 5.4,
\[
\Big\| \max_{1\le i\le n} \xi_i \Big\|_{\tilde{\psi}_\beta} \le C \log^{1/\beta}(n)\, D.
\]
Thus by Lemma 5.4,
\[
\mathbb{E}\bigg[ \exp\bigg( \Big( \frac{\max_{1\le i\le n} \xi_i}{C \log^{1/\beta}(n)\, D} \Big)^\beta \bigg) \bigg] \le \mathbb{E}\bigg[ \tilde{\psi}_\beta\bigg( \frac{\max_{1\le i\le n} \xi_i}{C \log^{1/\beta}(n)\, D} \bigg) \bigg] + e^{1/\beta} \le 1 + e^{1/\beta}.
\]
Now let $m \ge 1$ be such that $\big( 1 + e^{1/\beta} \big)^{1/m} \le 2$. Then by Jensen's inequality ($\mathbb{E}[X^{1/m}] \le (\mathbb{E}[X])^{1/m}$ for $X > 0$),
\[
\mathbb{E}\bigg[ \exp\bigg( \Big( \frac{\max_{1\le i\le n} \xi_i}{C m^{1/\beta} \log^{1/\beta}(n)\, D} \Big)^\beta \bigg) \bigg] \le 2,
\]
which implies that $\| \max_{1\le i\le n} \xi_i \|_{\psi_\beta} \lesssim \log^{1/\beta}(n)\, D$. $\blacksquare$

Acknowledgements
X. Chen is supported in part by NSF DMS-1404891, NSF CAREER AwardDMS-1752614, and UIUC Research Board Awards (RB17092, RB18099).
References

[1] Gunnar Blom. Some properties of incomplete U-statistics. Biometrika, 63(3):573–580, 1976.
[2] Yu. V. Borovskikh. U-Statistics in Banach Spaces. VSP International Science, 1996.
[3] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[4] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[5] B. M. Brown and D. G. Kildea. Reduced U-statistics and the Hodges–Lehmann estimator. Annals of Statistics, 6:828–835, 1978.
[6] Xiaohui Chen. Gaussian and bootstrap approximations for high-dimensional U-statistics and their applications. The Annals of Statistics, 46(2):642–678, 2018.
[7] Xiaohui Chen and Kengo Kato. Jackknife multiplier bootstrap: finite sample approximations to the U-process supremum with applications. 2017. arXiv:1708.02705.
[8] Xiaohui Chen and Kengo Kato. Randomized incomplete U-statistics in high dimensions. The Annals of Statistics, accepted (available at arXiv:1712.00771), 2018+.
[9] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields, 162(1-2):47–70, 2015.
[10] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 45(4):2309–2352, 2017.
[11] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.
[12] Stéphan Clémençon, Gábor Lugosi, and Nicolas Vayatis. Ranking and empirical minimization of U-statistics. Annals of Statistics, 36(2):844–874, 2008.
[13] Victor de la Peña and Evarist Giné. Decoupling: From Dependence to Independence. Springer Science & Business Media, 2012.
[14] Edward W. Frees. Infinite order U-statistics. Scandinavian Journal of Statistics, 16(1):29–45, 1989.
[15] Edward W. Frees. Estimating densities of functions of observations. Journal of the American Statistical Association, 89(426):517–525, 1994.
[16] Karl O. Friedrich. A Berry–Esseen bound for functions of independent random variables. The Annals of Statistics, pages 170–183, 1989.
[17] Evarist Giné and David M. Mason. On local U-statistic processes and the estimation of densities of functions of several sample variables. The Annals of Statistics, 35(3):1105–1145, 2007.
[18] Charles Heilig and Deborah Nolan. Limit theorems for the infinite-degree U-process. Statistica Sinica, 11:289–302, 2001.
[19] Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
[20] Svante Janson. The asymptotic distributions of incomplete U-statistics. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 66:495–505, 1984.
[21] Alan J. Lee. U-Statistics: Theory and Practice. Statistics: A Series of Textbooks and Monographs (Book 110). CRC Press, 1990.
[22] P. Major. Asymptotic distributions for weighted U-statistics. Annals of Probability, 21(2):1514–1535, 1994.
[23] Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881, 2016.
[24] K. A. O'Neil and R. A. Redner. Asymptotic distributions of weighted U-statistics of degree 2. Annals of Probability, 21(2):1159–1169, 1993.
[25] M. Rifi and F. Utzet. On the asymptotic behavior of weighted U-statistics. Journal of Theoretical Probability, 13(1):141–167, 2000.
[26] C. P. Shapiro and L. Hubert. Asymptotic normality of permutation statistics derived from weighted sums of bivariate functions. Annals of Statistics, 7(4):788–794, 1979.
[27] Robert P. Sherman. Maximal inequalities for degenerate U-processes with applications to optimization estimators. The Annals of Statistics, pages 439–459, 1994.
[28] Grace S. Shieh. Infinite-order V-statistics. Statistics & Probability Letters, 20:75–80, 1994.
[29] Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
[30] A. J. van Es and R. Helmers. Elementary symmetric polynomials of increasing order. Probability Theory and Related Fields, 80(1):21–35, 1988.
[31] W. R. van Zwet. A Berry–Esseen bound for symmetric statistics.