DESIGN-BASED INCOMPLETE U-STATISTICS
Xiangshun Kong and Wei Zheng

Beijing Institute of Technology and University of Tennessee
Abstract: U-statistics are widely used in fields such as economics, machine learning, and statistics. However, while they enjoy desirable statistical properties, they have an obvious drawback in that the computation becomes impractical as the data size n increases. Specifically, the number of combinations, say m, that a U-statistic of order d has to evaluate is O(n^d). Many efforts have been made to approximate the original U-statistic using a small subset of combinations since Blom (1976), who referred to such an approximation as an incomplete U-statistic. To the best of our knowledge, all existing methods require m to grow at least faster than n, albeit more slowly than n^d, in order for the corresponding incomplete U-statistic to be asymptotically efficient in terms of the mean squared error. In this paper, we introduce a new type of incomplete U-statistic that can be asymptotically efficient, even when m grows more slowly than n. In some cases, m is only required to grow faster than √n. Our theoretical and empirical results both show significant improvements in the statistical efficiency of the new incomplete U-statistic.

Key words and phrases:
Asymptotically efficient, BIBD, big data, design of experiment, sub-sampling.

1. Introduction

The U-statistic has been a fundamental statistical estimator since the work of Hoeffding (1948), who studied its theoretical properties and established central limit theorems for non-degenerate U-statistics. Eagleson (1979) derived asymptotic distributions of some degenerate U-statistics of order two, which were then extended to all degenerate U-statistics by Lee (1979). Other extensions include a variant of U-statistics called V-statistics by von Mises (1948), U-statistics for stationary processes by Enqvist (1985), and multi-sample U-statistics by Lehmann (1951) and Sen (1974, 1977).

The theory of U-statistics admits a minimum variance unbiased estimator of an estimable parameter for a large class of probability distributions, hence its popularity in applications. However, U-statistics can also be time consuming to compute. For a U-statistic of order d, the number of combinations, say m, to be evaluated is \binom{n}{d}, that is, O(n^d), where n is the data size. Suppose n = 10^4 and d = 3. Then, listing the \binom{10^4}{3} combinations requires 667 GB of memory and a computing time of approximately 100 hours on a MacBook Pro with an Intel Core i7 2.9 GHz CPU. With n = 10^4 and d = 4, the required memory grows by several more orders of magnitude. For some U-statistics, the computational complexity can be reduced, e.g., to O(n), by exploiting the structure of the kernel function, especially when the data are univariate and consist of one sample. However, in practice, such a computational reduction is often not feasible. Note that we do not focus here on which U-statistics are candidates for a reduction in the original computational complexity of O(n^d), because our goal is to study a generic scheme for the fast approximation of U-statistics. A natural remedy is to take a sample of size m ≪ \binom{n}{d} from all possible combinations. Blom (1976) referred to the resulting estimator as an incomplete U-statistic.
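To make the contrast concrete, the following is a minimal sketch (not the authors' code; Python with NumPy is assumed, and the helper names are ours) of a complete U-statistic and its randomly subsampled incomplete counterpart:

```python
import itertools
import numpy as np

def complete_u(x, g, d):
    """Complete U-statistic: average g over all C(n, d) combinations."""
    combos = itertools.combinations(range(len(x)), d)
    vals = [g(*(x[i] for i in idx)) for idx in combos]
    return float(np.mean(vals))

def incomplete_u_random(x, g, d, m, rng):
    """ICUR: average g over m randomly sampled index combinations."""
    n = len(x)
    vals = []
    for _ in range(m):
        idx = rng.choice(n, size=d, replace=False)  # one random combination
        vals.append(g(*(x[i] for i in idx)))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
g = lambda a, b: (a - b) ** 2 / 2.0  # kernel whose U-statistic is the sample variance
u_full = complete_u(x, g, d=2)       # evaluates all C(200, 2) = 19900 combinations
u_sub = incomplete_u_random(x, g, d=2, m=500, rng=rng)  # only 500 combinations
```

For this kernel, the complete U-statistic coincides with the unbiased sample variance, while the incomplete version trades a small additional sampling variance for a 40-fold reduction in kernel evaluations.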
The problem of identifying a good incomplete U-statistic is related to the design of the sampling scheme. Of the various options, the vanilla scheme of simple random sampling by Blom (1976) has received much attention in the literature. Janson (1984) established the asymptotic distributions of incomplete U-statistics based on random sampling (ICUR), Herrndorf (1986) established the invariance principle for these statistics, and Chen and Kato (2019) studied the vector- and matrix-valued ICUR. For a more detailed discussion of incomplete U-statistics, refer to Wang (2012) and Wang and Lindsay (2014).

First, we introduce some required notation. For α > 0, we use m ≺ n^α, m ≍ n^α, and m ≻ n^α to mean m/n^α → 0, 0 < lim inf m/n^α ≤ lim sup m/n^α < ∞, and m/n^α → ∞, respectively. For a given incomplete U-statistic, say Ũ, its efficiency is defined in terms of the mean squared error (MSE): Eff(Ũ) = MSE(U)/MSE(Ũ), where U is the complete U-statistic. An incomplete U-statistic is said to be asymptotically efficient if Eff(Ũ) → 1 as n → ∞. Note that the ICUR is asymptotically efficient for the non-degenerate case when m ≻ n; see (2.6) for a theoretical verification, and Table 1 for empirical evidence.

Blom (1976) also proposed sampling schemes based on the design of an experiment. In particular, balanced incomplete block designs (BIBDs) have been examined by Brown and Kildea (1978) and Lee (1982). The latter also proved that incomplete U-statistics based on BIBDs achieve the minimum variance among all unbiased estimators for a given m. By Raghavarao (1971), a BIBD exists whenever n = 6a + 3 for any positive integer a. Unfortunately, the optimality of the BIBD does not make it practically attractive, because its construction requires m ≍ n^2; see Table 1. The same issue exists for the permanent design of Rempala and Wesolowski (2003) and the rectangular design of Rempala and Srivastav (2004). For the case of m/n → 1, Blom (1976) proposed using a Latin square and a Graeco-Latin square to guide the sampling scheme. However, the efficiency of the estimator derived in this way is essentially asymptotically the same as that of the ICUR. Moreover, the limit of the efficiency does not exceed d/(1 + d) as n → ∞; see (2.6) and the follow-up discussion.

Another method recently proposed in the literature is the divide and conquer (DC) strategy of Lin and Xi (2010), which randomly divides the data into many groups, calculates the complete U-statistic within each group, and then takes the average of these complete U-statistics. Unfortunately, the DC is even less efficient than the ICUR. Moreover, it is not available when m ≤ n; see Table 1.

We conclude that the ICUR is still the most viable of the existing choices of incomplete U-statistics. It performs as well as a design-based method when a design exists. It also possesses several advantages, such as a flexible choice of m, the availability of asymptotic properties, and being extendable to multi-sample cases.

In this paper, we introduce a new type of incomplete U-statistic that is substantially more efficient than the ICUR, while maintaining the latter's aforementioned advantages. It has three main steps: (i) Divide the data into L (≪ n) groups of homogeneous units. (ii) Judiciously select a collection of the combinations of the groups based on a design structure called an orthogonal array (OA). (iii) Randomly select a combination of inputs from each selected group combination. We call the derived estimator the incomplete U-statistic based on division and an orthogonal array (ICUDO). Our first example provides a snapshot of the performance of the major incomplete U-statistics mentioned so far.

Example 1. (The symmetry of a distribution). The kernel function g(x_1, x_2, x_3) = sign(2x_1 − x_2 − x_3) + sign(2x_2 − x_1 − x_3) + sign(2x_3 − x_1 − x_2) has mean zero when the distribution of the data is symmetric.
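As an illustrative sketch (again Python with NumPy assumed; this is our own illustration, not code from the paper), the kernel of Example 1 and a random-sampling (ICUR) estimate of its mean can be written as:

```python
import numpy as np

def g_sym(x1, x2, x3):
    """Kernel of Example 1; has expectation 0 under a symmetric distribution."""
    return (np.sign(2 * x1 - x2 - x3)
            + np.sign(2 * x2 - x1 - x3)
            + np.sign(2 * x3 - x1 - x2))

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)  # symmetric data, so E g = 0

# ICUR estimate based on m randomly sampled triples of distinct indices.
m = 5000
idx = np.array([rng.choice(len(x), size=3, replace=False) for _ in range(m)])
u_icur = g_sym(x[idx[:, 0]], x[idx[:, 1]], x[idx[:, 2]]).mean()
```

With symmetric data the estimate fluctuates around zero; under an asymmetric distribution it drifts away from zero, which is what makes this kernel usable as a symmetry test.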
The data consist of n = 10^3 independent and identically distributed (i.i.d.) observations generated from the standard normal distribution. The performance of the ICUR, BIBD, DC, and ICUDO is measured by their efficiency at different values of m.

Table 1: Comparison of efficiencies in Example 1, reporting Eff for the ICUR, BIBD, DC, and ICUDO over a range of values of m (together with the corresponding m/n and m/\binom{n}{3}).

In Table 1, the DC is not available when m ≤ n, and the BIBD does not exist in most cases, except for m = 166167. For every value of m considered, the ICUDO achieves the highest efficiency.

Here, we briefly explain why our ICUDO performs so well. Note that existing design-based methods focus on the arrangement of the indices of the units, without referring to their actual values. The ICUDO method exploits the fact that replacing a unit by another one with a similar value does not change the value of the kernel function g too much. For example, if several observations take similar values, then swapping them within a combination changes g only slightly; it therefore suffices to control which groups of similar units enter each combination, rather than the individual units themselves. Thus, while the ICUR requires m ≻ n to be asymptotically efficient, the ICUDO requires a substantially smaller m; sometimes even m ≻ √n will suffice. See Theorem 2 for the latter case. When the U-statistic is degenerate, both methods require a larger m, but the ICUDO still requires a substantially smaller m than that of the ICUR.

The rest of the paper is organized as follows. Section 2 introduces the construction of the ICUDO for univariate data and derives its asymptotic properties. Section 3 discusses the debiasing issues of the ICUDO for the degenerate case. Section 4 constructs a debiased ICUDO for multi-dimensional data. Simulations are presented in each section to support the theoretical results. Section 5 concludes the paper and points out some future research topics. All proofs are postponed to the Appendix. Additional theorems are given in the online Supplementary Material.

2. ICUDO based on univariate data

Let X_1, ..., X_n be a random sample of size n from a univariate distribution, say F.
For a given symmetric kernel function, say g: R^d → R, of order d, the uniformly minimum variance unbiased estimator (UMVUE) of the parameter Θ = ∫ g(x_1, ..., x_d) dF(x_1) ... dF(x_d) is given by the U-statistic

U = \binom{n}{d}^{-1} Σ_{η ∈ S_{n,d}} g(X_η),  (2.2)

where S_{n,d} = {η = (η_1, ..., η_d) : 1 ≤ η_1 < η_2 < ... < η_d ≤ n} and X_η = (X_{η_1}, ..., X_{η_d}). When S_{n,d} is replaced with the set of all n^d ordered combinations, the corresponding average in (2.2) is called a V-statistic (von Mises (1948)). The main difference is that V-statistics include combinations with duplicated units, in which the same index may appear more than once. Throughout, we assume Eg^2(X_1, ..., X_d) < ∞.

Unless there is some special structure of g that can be exploited to reduce the computational burden, in general, (2.2) becomes impractical to compute as n increases. To address this problem, Blom (1976) proposed using the following incomplete U-statistic as a fast approximation:

Ũ = (1/m) Σ_{η ∈ S} g(X_η),  (2.3)

where S ⊂ S_{n,d}, with its cardinality m = |S| being only a fraction of \binom{n}{d}. The statistic in (2.3) becomes an ICUR when S is a simple random sample, which we denote as U_RND.

Here, we briefly review the properties of U and U_RND. For arbitrary positive integers N and p, define Z_N = {1, ..., N} and Z_N^p = {(z_1, ..., z_p) : z_j ∈ Z_N, 1 ≤ j ≤ p}. Following Hoeffding (1948), for u ⊆ Z_d and x = (x_1, ..., x_d), denote g_u(x) = ∫ g(x) dF_{u^c}, with u^c = Z_d \ u and dF_u = Π_{j ∈ u} dF(x_j). With the conventions g_∅(x) = Θ and h_∅(x) = 0, we recursively define the projection

h_u(x) = g_u(x) − Σ_{v ⊆ Z_d : v ⊂ u} h_v(x).

Because g is symmetric, we have Eg_v = Eg_u and Eh_v = Eh_u for any pair u, v ⊆ Z_d with |v| = |u|. Hence, we can now define σ_j^2 = Var(g_u) and δ_j^2 = Var(h_u), with |u| = j.
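To make these variance components concrete, the following sketch (our own illustration in Python with NumPy, under the assumption F = N(1, 1) and the kernel g(x_1, x_2) = x_1 x_2) estimates σ_1^2 and σ_2^2 by Monte Carlo and recovers δ_1^2 and δ_2^2 from the identity σ_2^2 = \binom{2}{1} δ_1^2 + δ_2^2:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 400_000
mu = 1.0
x1 = rng.normal(mu, 1.0, N)
x2 = rng.normal(mu, 1.0, N)

g = x1 * x2                           # kernel values g(X1, X2)
g1 = x1 * mu                          # projection g_{1}(x) = E[g | X1 = x] = x * mu

sigma1_sq = g1.var()                  # variance of the one-dimensional projection
sigma2_sq = g.var()                   # variance of the kernel itself
delta1_sq = sigma1_sq                 # h_{1} = g_{1} - Theta, so delta_1^2 = sigma_1^2
delta2_sq = sigma2_sq - 2 * delta1_sq # from sigma_2^2 = C(2,1) delta_1^2 + delta_2^2
```

For this kernel the exact values are σ_1^2 = δ_1^2 = 1, σ_2^2 = 3, and δ_2^2 = 1, which the Monte Carlo estimates reproduce up to simulation error.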
Following Hoeffding (1948) and Blom (1976), we have

MSE(U) = \binom{n}{d}^{-1} Σ_{j=1}^{d} \binom{d}{j} \binom{n−d}{d−j} σ_j^2 = Σ_{j=1}^{d} \binom{d}{j}^2 \binom{n}{j}^{-1} δ_j^2,  (2.4)

MSE(U_RND) = MSE(U) + σ_d^2/m + O(1/(nm)) = MSE(U) + (1/m) Σ_{j=1}^{d} \binom{d}{j} δ_j^2 + O(1/(nm)).  (2.5)

In (2.4) and (2.5), the MSEs are expressed in terms of both σ_j^2 and δ_j^2. The equivalences are established by σ_j^2 = Σ_{j'=1}^{j} \binom{j}{j'} δ_{j'}^2, for 1 ≤ j ≤ d. The U-statistic and the kernel function g are called non-degenerate if δ_1^2 = σ_1^2 > 0, and are called order-q degenerate if σ_q^2 = 0 and σ_{q+1}^2 > 0, or equivalently δ_1^2 = ··· = δ_q^2 = 0 and δ_{q+1}^2 > 0. For the non-degenerate case, we have Var(U) ≍ n^{-1}, which together with (2.5) yields

Eff(U_RND) = 1 − O(n/m), for m ≻ n;  [1 + (n/m) σ_d^2/(d^2 δ_1^2)]^{-1} + O(1/n), for m ≍ n;  O(m/n), for m ≺ n.  (2.6)

As a result, we have Eff(U_RND) → 1 when m ≻ n, Eff(U_RND) → 0 when m ≺ n, and Eff(U_RND) → [1 + σ_d^2/(c d^2 δ_1^2)]^{-1} when m/n → c, for a constant c > 0. With c = 1, Blom (1976) proposed using Latin squares and Graeco-Latin squares to construct the incomplete U-statistics. In such a case, we can verify that its efficiency is asymptotically the same as that of U_RND, and lim_{n→∞} Eff(U_RND) ≤ d/(1 + d), from (2.6) and σ_d^2 ≥ dσ_1^2. In contrast, Theorem 1 shows that the ICUDO is asymptotically efficient when m ≍ n. Stronger results are stated in Theorem 2 in Section 2.1 and in similar theorems in the Supplementary Material under various conditions on g and F.

Recall that δ_j^2 = Var(h_u), for |u| = j, 1 ≤ j ≤ d, and note that the coefficient of δ_j^2 in (2.4) is O(n^{-j}). Hence, it is more important to capture the variability of g in its lower-dimensional projected spaces. This idea matches perfectly with the projective property of the OA. An OA, denoted by OA(m, d, L, t), is an m by d array with entries from {1, ..., L}, arranged in such a way that for any m by t subarray, all ordered t-tuples of the entries from {1, ..., L} appear λ = m/L^t times in the rows. The number t is called the strength of the OA; see the matrix A defined in (2.8) as an example of OA(9, 4, 3, 2), whose ordered 2-tuples are {(i_1, i_2) : 1 ≤ i_1, i_2 ≤ 3}. Considering any two columns of A, we can see that each of these ordered 2-tuples appears once, that is, λ = 1. For sets S_1, ..., S_q, define Π_{i=1}^{q} S_i = {(s_1, ..., s_q) : s_i ∈ S_i}. The ICUDO is constructed as follows. For ease of illustration, we assume n is a multiple of L. Actually, throughout the manuscript, we assume that L ≪ n. Thus, we may randomly draw an n' = ⌊n/L⌋ · L subsample as the new data set. The information loss in this process is negligible compared with the original size n.

Step 1. Let A_0 be an OA(m, d, L, t). Apply random level permutations {π_1, ..., π_d} to the columns of A_0 independently. Specifically, for l ∈ Z_L, change all elements l in the j-th column of A_0 to π_j(l). The new OA is denoted by A = (a_ij)_{m×d}.

Step 2.
Create the partition Z_n = ∪_{l=1}^{L} G_l such that |G_l| = n/L for l ∈ Z_L, and X_{i_1} ≤ X_{i_2} for any i_1 ∈ G_{l_1}, i_2 ∈ G_{l_2}, with l_1 < l_2.

Step 3. For i = 1, ..., m, independently draw an element, say η_i, uniformly from Π_{j=1}^{d} G_{a_ij}. The ICUDO based on the OA A is defined as

U_oa = (1/m) Σ_{i=1}^{m} g(X_{η_i}).  (2.7)

The level permutation in Step 1 ensures that each row of A takes each d-tuple with equal probability. At the same time, the projective uniformity of the beginning OA, A_0, carries over to A. Here, we ensure that A_0 is free of a coincidence defect, which means no two rows are the same in any m × (t + 1) subarray. This property is necessary for the relevant theorems to hold. Step 2 divides the data into homogeneous groups. Step 3 is built on the first two steps. It chooses representative elements from the selected groups, and the selection of groups is guided by the structure of A. Note that (2.7) is in the form of (2.3) by taking S as S_oa = {η_1, ..., η_m}.

We now give a toy example of choosing η_i, for i = 1, ..., m. Suppose d = 4, n = 9, and X_1 ≤ X_2 ≤ ... ≤ X_9. Then, we have L = 3 groups listed as G_1 = {1, 2, 3}, G_2 = {4, 5, 6}, and G_3 = {7, 8, 9}. An example of OA(m = 9, d = 4, L = 3, t = 2) in Step 1 is given as follows in transpose:

A^T = [ 1 1 1 2 2 2 3 3 3 ;
        1 2 3 1 2 3 1 2 3 ;
        1 2 3 2 3 1 3 1 2 ;
        1 3 2 2 1 3 3 2 1 ].  (2.8)

The fourth row of A, namely (2, 1, 2, 2), directs us to draw η_4 from G_2 × G_1 × G_2 × G_2. One possible outcome for η_4 could be (4, 1, 6, 5). Repeating this for every row of A, we could possibly have the X_{η_i}, for i = 1, ..., 9, used in the construction, arranged column-by-column, as follows:

{X_{η_1}, ..., X_{η_9}} = [ X_2 X_1 X_3 X_4 X_5 X_6 X_7 X_8 X_9 ;
                            X_1 X_5 X_7 X_1 X_6 X_9 X_3 X_4 X_8 ;
                            X_1 X_4 X_9 X_6 X_7 X_2 X_8 X_3 X_6 ;
                            X_3 X_8 X_6 X_5 X_2 X_7 X_9 X_5 X_1 ].  (2.9)

To proceed with the asymptotic properties of U_oa, we define

R(t) = Σ_{j>t} \binom{d}{j} δ_j^2.  (2.10)

Theorem 1.
For any (g, F), using OA(m, d, L, t) in Step 1 of the ICUDO algorithm, we have

MSE(U_oa) = MSE(U) + R(t)/m + o(1/m) + O(1/n^2).  (2.11)

We now explain the meanings of the three terms in (2.11) generated in the process of approximating the complete U-statistic U using U_oa. The term O(n^{-2}) is the squared bias of U_oa, due to the inclusion of combinations with duplicate units, such as the first column of (2.9). Essentially, U_oa is approximating the V-statistic, which is biased for Θ itself. The term o(m^{-1}) is due to the sampling variability when we draw one point from each selected group, that is, Step 3 of the algorithm. The term R(t)/m is due to the usage of the OA structure in place of a complete enumeration of all group combinations. Compared with the second term in (2.5) for the ICUR, which equals R(0)/m, we are able to eliminate all δ_j^2 with j ≤ t, owing to the projective uniformity of the OA in all t-dimensional projected spaces. If δ_j^2 = 0 for d' ≤ j ≤ d, an OA with strength t ≥ d' yields R(t) = 0. We discuss the hidden benefit of using a lower-strength OA in Example 2.

In the non-degenerate case, recall that MSE(U) ≍ n^{-1} and lim_{n→∞} Eff(U_RND) ≤ d/(1 + d) for the ICUR when m ≍ n. Under the same situation, Theorem 1 implies that U_oa is asymptotically efficient by simply taking t = d. In fact, stronger results can be derived for the ICUDO so that m is allowed to grow more slowly than n under various conditions. We give Theorem 2 here as one example; additional results can be found in the Supplementary Material.

Theorem 2.
Suppose (i) the kernel function g is Lipschitz continuous, and (ii) F has a density function f(x) > c for some fixed c > 0 and all x ∈ [a, b], and f(x) = 0 otherwise. For U_oa based on OA(m, d, L, t) with L ≤ n(log n)^{-1}, we have

MSE(U_oa) = MSE(U) + R(t)/m + O(1/(mL^2)) + O(1/n^2).  (2.12)

For t = d = 2, we automatically have R(t) = 0. If the conditions in Theorem 2 hold, we only need m ≻ √n to achieve Eff(U_oa) → 1, while the ICUR requires m ≻ n. In general, R(t) decreases in t, and could vanish if we take t large enough so that δ_j^2 = 0 for all j > t. Without knowledge of the δ_j^2, simply taking t = d will eliminate R(t) too. On the other hand, the term O(1/(mL^2)) in (2.12) is decreasing in L, meaning that the more groups we use to divide the data, the more homogeneous the units we could have in each group. However, L and t are subject to the constraint m = λL^t, where λ is the number of replicates of each t-tuple in the OA and is equal to one in all examples presented here. As a result, L and t cannot be increased simultaneously. To gain insight into the trade-off between L and t, we need to determine the constant term for O(1/(mL^2)). For this, we derive the following theorem. A more detailed discussion on how to choose L and t, given m, is provided in the Supplementary Material. Denote by U(0, 1) the uniform distribution on [0, 1].

Theorem 3.
Suppose g has a continuous first-order derivative on [0, 1]^d, X ∼ U(0, 1), and there exists some c ∈ (0, 1), such that L ≼ n^c. For U_oa based on OA(m, d, L, t),

MSE(U_oa) = MSE(U) + R(t)/m + (d/(12 m L^2)) Eγ^2(X_1, ..., X_d) + o(1/(mL^2)),  (2.13)

where γ(x_1, ..., x_d) = ∂g(x_1, ..., x_d)/∂x_1.

The assumption of a uniform distribution for X is not as strict as it seems. To see this, for X ∼ F, let Z = F(X) ∼ U(0, 1). Given g_F(Z_1, ..., Z_d) := g(F^{-1}(Z_1), ..., F^{-1}(Z_d)) = g(X_1, ..., X_d), we have the following corollary.

Corollary 1.
Suppose g_F has a continuous first-order derivative on [0, 1]^d, and there exists some c ∈ (0, 1), such that L ≼ n^c. Then, (2.13) still holds.

The term Eγ^2 in (2.13) provides a nice interpretation of the trade-off between t and L. When the kernel function g has a large variability (large Eγ^2), it is more challenging to make each group as homogeneous as possible, which enforces larger values of L. On the other hand, if g is quite flat on the domain (small Eγ^2), we prefer fewer groups to improve the strength of the OA.

Example 2.
The kernel function g(x_1, x_2, x_3) = x_1 x_2 x_3 estimates µ^3, where µ = E(X). We compare the performance of three methods: U_RND; U_oa^(1), based on OA(m, 3, √m, 2) with strength t = 2; and U_oa^(2), based on OA(m, 3, m^{1/3}, 3) with strength t = 3. The data consist of n i.i.d. observations simulated from N(µ, 1), where µ takes the values 0.5 and 2.

Table 2: Comparison of efficiencies in Example 2.

                µ = 0.5                                 µ = 2
m/n     Eff(U_RND)  Eff(U_oa^(1))  Eff(U_oa^(2))  Eff(U_RND)  Eff(U_oa^(1))  Eff(U_oa^(2))
0.005   0.133%      0.171%         —              1.110%      —              2.323%
0.01    0.290%      0.464%         —              2.485%      —              8.455%
0.05    1.291%      2.448%         —              10.31%      —              51.71%
0.1     2.936%      4.527%         —              20.13%      —              76.80%
0.5     12.58%      21.89%         —              50.78%      —              98.53%
1.0     21.05%      33.26%         —              67.51%      —              99.64%

In Table 2, both U_oa^(1) and U_oa^(2) outperform U_RND significantly. The advantage of the ICUDO over the ICUR is discussed below in additional examples. Furthermore, we find that the winning strategy changes from U_oa^(2) to U_oa^(1) as we increase the mean µ of the distribution. This observation illustrates the comments after Theorem 3 on the relevance of Eγ^2 in determining the optimal value of the strength t. That is, for a larger Eγ^2, we are more inclined to choose a smaller strength. This is validated by our second observation, together with Eγ^2 = (µ^2 + 1)^2, which increases in µ (> 0).

More generally, let φ: R → R be a one-to-one mapping. Denote by F_φ the distribution of the transformed random variable Z = φ(X), which leads to the following representation: g_φ(z_1, ..., z_d) := g(φ^{-1}(z_1), ..., φ^{-1}(z_d)) = g(x_1, ..., x_d). If (g_φ, F_φ) satisfies the conditions in these theorems, corresponding results also hold for the pair (g, F). For example, suppose g(x_1, x_2) = x_1^{-a} x_2^{-a} and F is a Pareto distribution with shape and scale parameters a and b, respectively. The Pareto distribution is neither light-tailed nor bounded, and hence violates the conditions in Theorem 2. By taking φ(x) = 1 − (b/x)^a, we have φ(X) ∼ U(0, 1), and the conditions hold for (g_φ, F_φ).

We now extend the ICUDO to multi-sample U-statistics. For k = 1, ..., K, let X_1^{(k)}, ..., X_{n_k}^{(k)} be a random sample of size n_k from the distribution F_k.
The UMVUE of

Θ = ∫ g(x_1^{(1)}, ..., x_{d_1}^{(1)}, ..., x_1^{(K)}, ..., x_{d_K}^{(K)}) dF_1(x_1^{(1)}) ... dF_K(x_{d_K}^{(K)})

is given by the generalized U-statistic

U = Π_{k=1}^{K} \binom{n_k}{d_k}^{-1} Σ_{η ∈ Π_{k=1}^{K} S_{n_k,d_k}} g(X_η),

where S_{n_k,d_k} = {η_k = (η_{k,1}, ..., η_{k,d_k}) : 1 ≤ η_{k,1} < η_{k,2} < ... < η_{k,d_k} ≤ n_k} and X_η = (X_{η_1}, ..., X_{η_K}) = (X^{(1)}_{η_{1,1}}, ..., X^{(1)}_{η_{1,d_1}}, ..., X^{(K)}_{η_{K,1}}, ..., X^{(K)}_{η_{K,d_K}}). The d (= Σ_{k=1}^{K} d_k)-dimensional kernel function g is symmetric in each d_k-dimensional sub-input {x_1^{(k)}, ..., x_{d_k}^{(k)}}. The generalized U-statistic reduces to the traditional U-statistic when K = 1. An incomplete generalized U-statistic is given by

Ũ = (1/m) Σ_{η ∈ S} g(X_η),  (2.14)

where S ⊂ Π_{k=1}^{K} S_{n_k,d_k} and m = |S|. We construct the multi-sample ICUDO as follows. For ease of illustration, we assume each n_k is a multiple of L.

Step 1. Let A_0 be an OA(m, d, L, t). Apply random level permutations {π_1, ..., π_d} to the columns of A_0 independently. Specifically, for each l ∈ Z_L, change all elements l in the j-th column of A_0 to π_j(l). The m rows of the resulting array A are denoted by {a_i = (a_{i1}, ..., a_{iK}) : i = 1, ..., m; a_{ik} ∈ Z_L^{d_k}, k = 1, ..., K}.

Step 2. For each k = 1, ..., K, create the partition Z_{n_k} = ∪_{l=1}^{L} G_l^{(k)}, such that |G_l^{(k)}| = n_k L^{-1} for l ∈ Z_L, and X_{i_1}^{(k)} ≤ X_{i_2}^{(k)} for any i_1 ∈ G_{l_1}^{(k)}, i_2 ∈ G_{l_2}^{(k)}, with l_1 < l_2. For any a = (a_1, ..., a_K) with a_k = (a_{k,1}, ..., a_{k,d_k}) ∈ Z_L^{d_k}, define

G_a = Π_{k=1}^{K} Π_{j=1}^{d_k} G^{(k)}_{a_{k,j}}.  (2.15)

Step 3. For i = 1, ..., m, independently draw an element η_i uniformly from G_{a_i}, where a_i is the i-th row of A, and set

U_oa = (1/m) Σ_{i=1}^{m} g(X_{η_i}).  (2.16)

An example is given in the Supplementary Material. For any j_{k,1}, ..., j_{k,d_k} ∈ Z_{d_k} and k ∈ Z_K, assume Eg^2(X^{(1)}_{j_{1,1}}, ..., X^{(1)}_{j_{1,d_1}}, ..., X^{(K)}_{j_{K,1}}, ..., X^{(K)}_{j_{K,d_K}}) < ∞. Let n_min = min{n_1, ..., n_K} and n_max = max{n_1, ..., n_K}. Here, we assume n_min ≍ n_max and L ≺ n_min. Let u = (u_1, ..., u_K), where u_k ⊆ Z_{d_k}. Define dF_u = Π_{k=1}^{K} Π_{j ∈ u_k} dF_k(x_j^{(k)}). For any u and x = (x_1^{(1)}, ..., x_{d_1}^{(1)}, ..., x_1^{(K)}, ..., x_{d_K}^{(K)}), we recursively define

g_u(x) = ∫ g(x) dF_{u^c},   h_u(x) = g_u(x) − Σ_{v ⊂ u} h_v(x),

where u^c = (u_1^c, ..., u_K^c) = (Z_{d_1} \ u_1, ..., Z_{d_K} \ u_K), g_∅(x) = Θ, h_∅(x) = 0, v = (v_1, ..., v_K), and v ⊂ u means v_k ⊆ u_k for all k, with v ≠ u. For each u, we can define σ_u^2 = Var(g_u) and δ_u^2 = Var(h_u). The MSE of the complete generalized U-statistic is given by Sen (1974) as

MSE(U) = Π_{k=1}^{K} \binom{n_k}{d_k}^{-1} Σ_{u = (u_1, ..., u_K)} { Π_{k=1}^{K} \binom{d_k}{|u_k|} \binom{n_k − d_k}{d_k − |u_k|} } σ_u^2.

Let |u| = Σ_{k=1}^{K} |u_k|. The generalized U-statistic and the kernel function are called order-q degenerate if σ_u^2 = Σ_{v ⊆ u} δ_v^2 = 0 for all u with |u| ≤ q, and there exists u' with |u'| = q + 1 such that σ_{u'}^2 > 0. We have MSE(U) = O(n_min^{-(q+1)}) in this case. For the non-degenerate case q = 0, we have MSE(U) ≍ n_min^{-1}. With a slight abuse of notation, let σ_{(j_1,...,j_K)}^2 = σ_u^2 and δ_{(j_1,...,j_K)}^2 = δ_u^2, for u = (u_1, ..., u_K), with |u_k| = j_k, k = 1, ..., K. For the ICUR, we have

MSE(U_RND) = MSE(U) + R(0)/m + O(1/(m n_min)),

R(t) = Σ_{u : |u| > t} δ_u^2 = Σ_{j_1=0}^{d_1} ··· Σ_{j_K=0}^{d_K} I(j_1 + ··· + j_K > t) Π_{k=1}^{K} \binom{d_k}{j_k} δ_{(j_1,...,j_K)}^2.

When K = 1, the last expression reduces to the form of R(t) in (2.10), while the first expression provides a parsimonious presentation for the multi-sample case. The corresponding properties of U_oa are given as follows.

Theorem 4.
For U_oa based on OA(m, d, L, t), for any pair (g, F), we have

MSE(U_oa) = MSE(U) + R(t)/m + o(1/m) + O(1/n_min^2).  (2.17)

Theorem 4 is basically a multi-sample version of Theorem 1, and its result can be strengthened in the same way. The details are omitted here to conserve space. We conclude this section with a machine learning example.

Example 3. (Ranking measure, Chen et al. (2009)). The ranking measure is an important topic in machine learning research. In the commonly used pairwise approach, the loss for a given classifier score function f is given by

L(f) = \binom{n}{2}^{-1} Σ_{1 ≤ i < j ≤ n} ℓ((X_i, Y_i), (X_j, Y_j); f),

for a pairwise loss ℓ, which is a U-statistic of order two. We consider two choices of the pairwise loss, denoted (i) and (ii), and compare the efficiencies of Ũ_oa, based on an OA of strength t = 2, and U_RND at different subsample sizes m. The results are summarized in Figure 1.

Figure 1: Comparison of efficiencies of Ũ_oa and U_RND with respect to subsample size m (from 2000 to 10000) for loss function (i) (left) and (ii) (right).

3. Debiased ICUDO for degenerate cases

Recall that the ICUDO procedure is actually biased, owing to the inclusion of combinations with duplicate units. The squared bias is O(n^{-2}) for any pair (g, F), which is negligible compared to Var(U) ≍ n^{-1} in the non-degenerate case. One can see that it is no longer negligible in the degenerate case. In this section, we propose a debiased version of the ICUDO.

We provide details for the multi-sample cases, where the one-sample cases are recovered by taking K = 1. To proceed, let S* = {(η_1, ..., η_K) : η_k = (η_{k,1}, ..., η_{k,d_k}) ∈ Z_{n_k}^{d_k}, η_{k,j} ≠ η_{k,j'} for any j ≠ j'}. The debiased ICUDO is constructed in the same way as the original, except that Step 3 changes as follows:

Step 3'. For i = 1, ..., m, independently draw η_i from the uniform distribution on G_{a_i} ∩ S*.

Adopting (2.14) with S*_oa = {η_1, ..., η_m}, we have the debiased ICUDO

Ũ_oa = (1/m) Σ_{i=1}^{m} ω_{η_i} g(X_{η_i}),  (3.3)

where ω_{η_i} = L^d |G_{a_i} ∩ S*| / |S*|.

Theorem 5.
Ũ_oa based on OA(m, d, L, t) is an unbiased estimator, and

MSE(Ũ_oa) = MSE(U) + R(t)/m + o(1/m).  (3.4)

Theorem 5 is analogous to Theorems 1 and 4 for the one-sample and multi-sample cases, respectively, except that the squared bias terms O(n^{-2}) and O(n_min^{-2}) are eliminated. Now, for an order-q degenerate U-statistic, the debiased ICUDO can be asymptotically efficient with m ≍ n^{q+1}, while the ICUR requires m ≻ n^{q+1}. Moreover, we can allow m to grow more slowly for the debiased ICUDO under some mild conditions on (g, F). For example, when d = 2, q = 1, and the conditions of Theorem 2 hold, the debiased ICUDO only needs m ≻ n to be asymptotically efficient, while the ICUR requires m ≻ n^2. For the general order q of degeneration, we have m*_oa = (m*_RND)^{d/(d+2)}, for all d, under the conditions in Theorem 2. Here, m*_oa and m*_RND represent the minimum m required for the ICUDO and ICUR, respectively, to be asymptotically efficient.

We conclude this section with the following multi-sample example. The kernel function is degenerate, and hence favors a debiased ICUDO. However, the highest-order δ^2-value vanishes, which encourages a lower strength of the OA. The comparison is made between the ICUR and different versions of the ICUDO.

Example 4. Let K = 2, d_1 = d_2 = 2, d = 4, and

g(x_1^{(1)}, x_2^{(1)}, x_1^{(2)}, x_2^{(2)}) = I(x_1^{(1)} < x_1^{(2)}, x_2^{(1)} < x_1^{(2)}) + I(x_1^{(2)} < x_1^{(1)}, x_2^{(2)} < x_1^{(1)}).

The construction of U_oa and the debiased Ũ_oa is based on OA(m, 4, m^{1/3}, 3) and OA(m, 4, m^{1/4}, 4). For F_1 and F_2, it can be verified that

Eg(X_1^{(1)}, X_2^{(1)}, X_1^{(2)}, X_2^{(2)}) = 2/3 + ∫ (F_1(x) − F_2(x))^2 d(F_1(x) + F_2(x))/2,

which indicates the similarity of F_1 and F_2. The null hypothesis of F_1 = F_2 is rejected when the U-statistic is significantly larger than 2/3. Note that the corresponding U-statistic is degenerate under the null hypothesis.
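As a quick numerical check (our own sketch in Python with NumPy; not code from the paper), the kernel of Example 4 indeed has mean 2/3 under the null F_1 = F_2:

```python
import numpy as np

def g(x1, x2, y1, y2):
    """Kernel of Example 4: two indicator terms, mean 2/3 under F1 = F2."""
    return (((x1 < y1) & (x2 < y1)).astype(float)
            + ((y1 < x1) & (y2 < x1)).astype(float))

rng = np.random.default_rng(4)
N = 200_000
x1, x2 = rng.standard_normal((2, N))  # inputs from sample 1 (here F1 = N(0, 1))
y1, y2 = rng.standard_normal((2, N))  # inputs from sample 2 (here F2 = N(0, 1))
est = g(x1, x2, y1, y2).mean()        # Monte Carlo estimate of E g
```

Each indicator term has probability E[F(Y)^2] = 1/3 under the null, and the two events are mutually exclusive, so the kernel takes values in {0, 1} with mean 2/3.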
See Table 3 for the simulation results, where both samples are simulated from N(0, 1) with equal sample sizes n_1 = n_2.

Note that in the g function of Example 4, the two separate parts are each functions of only three of the four inputs. Thus, R(3) = 0, and we can claim that t = 3 works better than t = 4, which is verified by the results in Table 3.

Table 3: Result of Example 4 (efficiencies at eight increasing values of m).

Eff(U_oa)         0.836%  10.9%   —      —      —      —      —      —
Eff(Ũ_oa, t = 3)  0.861%  9.50%   12.9%  25.2%  28.3%  29.8%  36.3%  39.0%
Eff(Ũ_oa, t = 4)  0.450%  4.96%   6.78%  10.6%  10.7%  11.9%  14.5%  15.6%
Eff(U_RND)        0.179%  0.701%  1.50%  2.93%  4.19%  7.84%  10.9%  13.1%

4. ICUDO for multi-dimensional data

Note that Step 2 of the ICUDO algorithm in Section 2 does not apply to multi-dimensional data because it relies on ordering the univariate data. To remedy this, we adopt a clustering algorithm to divide the data into homogeneous groups. In this regard, the clustered group sizes may vary. This necessitates a re-weighting procedure similar to the debiasing step in Section 3. To save space, we focus on the debiased ICUDO and adopt the notation of the multi-sample U-statistics in the study of multi-dimensional data. For k = 1, ..., K, let X_1^{(k)}, ..., X_{n_k}^{(k)} be a random sample of size n_k from the multi-dimensional distribution F_k. The algorithm is given as follows.

Step 1. Let A_0 be an OA(m, d, L, t). Apply random level permutations {π_1, ..., π_d} to the columns of A_0 independently. Specifically, for l ∈ Z_L, change all elements l in the j-th column of A_0 to π_j(l). The m rows of the resulting array A are denoted by {a_i = (a_{i1}, ..., a_{iK}) : i = 1, ..., m; a_{ik} ∈ Z_L^{d_k}, k = 1, ..., K}.

Step 2. Let P^{(k)} = {G_1^{(k)}, ..., G_L^{(k)}} denote an L-group partition from the clustering of {X_1^{(k)}, ..., X_{n_k}^{(k)}}. For any a = (a_1, ..., a_K), with a_k = (a_{k,1}, ..., a_{k,d_k}) ∈ Z_L^{d_k}, define

G_a = Π_{k=1}^{K} Π_{j=1}^{d_k} G^{(k)}_{a_{k,j}}.  (4.4)

Step 3. For i = 1, ..., m, independently draw an element η_i uniformly from G_{a_i} ∩ S*, where a_i is the i-th row of A. Let ω_{η_i} = L^d |G_{a_i} ∩ S*| / |S*|, and set

Ũ_oa = (1/m) Σ_{i=1}^{m} ω_{η_i} g(X_{η_i}).  (4.5)

An example of the construction is given in the Supplementary Material.

Theorem 6. Suppose ω_{η_i} → 1 uniformly as n, L → ∞. For Ũ_oa based on OA(m, d, L, t), we have

MSE(Ũ_oa) = MSE(U) + R(t)/m + o(1/m).  (4.6)

The R(t) in (4.6) is given by (2.10), except that the univariate distribution F is changed to a multi-dimensional distribution. The assumption in Theorem 6 naturally holds if we force the group sizes to be balanced in the clustering process. By applying the full-strength t = d OA to Theorem 6, we have the following corollary.

Corollary 2. For Ũ_oa based on OA(m, d, L, d), for any pair (g, F), we have

MSE(Ũ_oa) = MSE(U) + o(m^{-1}).  (4.7)

The choice of t has been discussed and illustrated in Examples 2 and 4. We do not compare different t in the following examples because d = 2 always holds, and so t ≤ 2. We always take t = 2, L = 10, 20, ..., 100, and m = L^t.

Example 5. (Kendall's tau, Chen and Kato (2019)). The kernel function is h((x_1, y_1), (x_2, y_2)) = 2I(x_1 < x_2, y_1 < y_2) + 2I(x_2 < x_1, y_2 < y_1) − 1. For simplicity, we assume that (X, Y) follows a normal distribution with µ = (0, 0) and a diagonal covariance matrix Σ. The MSE when estimating the Kendall correlation using U_RND and Ũ_oa is shown in Table 4. As a reference, MSE(U) is approximately 8 × 10^{-5}.

Table 4: Result of Example 5.

m            100   400    900    1600   2500    3600    4900    6400    8100    10000
MSE(U_RND)  .765  .191   .0903  .0515  .0260   .0195   .0167   .0137   .0098   .0089
MSE(Ũ_oa)   .075  .0096  .0032  .0015  .00063  .00035  .00023  .00014  .00011  .00009

Example 6. (Testing stochastic monotonicity, Lee et al. (2009)). Let (X, Y) be a real-valued random vector, and denote by F_{Y|X}(y|x) the conditional distribution function of Y, given X.
Consider the problem of testing the stochastic monotonicity hypothesis
\[ H_0: F_{Y|X}(y|x) \le F_{Y|X}(y|x'), \quad \forall y \in \mathbb{R} \text{ and whenever } x \ge x'. \]
This essentially tests whether an increase in $X$ would induce an increase in $Y$ (e.g., income vs. expenditure in a household). Lee et al. (2009) proposed the following test statistic:
\[ U_n(x, x') = \frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} \left( I\{Y_i \le x'\} - I\{Y_j \le x'\} \right) \mathrm{sign}(X_i - X_j)\, K(x - X_i) K(x - X_j), \tag{4.8} \]
where $K(x) = 0.75(1 - x^2)I(|x| \le 1)$. We simulate $(X, Y)$ from a normal distribution with $\mu = (0, 0)$ and diagonal $\Sigma$ with $\Sigma_{11} = 3$, and take $(x, x') = (0, \cdots)$. For $n = 10$, the comparison between $\tilde{U}_{oa}$ and $U_{RND}$ is given in Table 5. As a reference, $\mathrm{MSE}(U) = 2.\cdots$.

Table 5: Results of Example 6.

m            100    400    900    1600   2500   3600   4900   6400   8100   10000
MSE(U_RND)  302.7  69.01  38.01  17.45  12.86  8.613  7.438  6.273  4.886  4.327
MSE(Ũ_oa)   33.18  15.73  8.848  4.252  3.524  3.168  2.732  2.662  2.630  2.602

Example 7. (Clustering performance evaluation; Papa et al. (2015)). For a given distance $D: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined on $\mathcal{X}$, the performance of a partition $\mathcal{P}$ can be evaluated from the data $X_1, \ldots, X_n \in \mathcal{X}$ using
\[ W(\mathcal{P}) = \sum_{1 \le i < j \le n} D(X_i, X_j)\, I\{X_i \text{ and } X_j \text{ belong to the same cell of } \mathcal{P}\}. \]
We simulate the data from a normal distribution with $\mu = (0, 0)$ and diagonal $\Sigma$ with $\Sigma_{11} = 1$. The MSEs of $U_{RND}$ and $\tilde{U}_{oa}$ when estimating $W(\mathcal{P})$ for different $m$ are shown in Table 6. As a reference, $\mathrm{MSE}(U) = 1.\cdots \times 10^{-\cdots}$.

Table 6: Results of Example 7.

m            100   400    900    1600   2500    3600    4900    6400    8100    10000
MSE(U_RND)  .216  .0625  .0346  .0171  .0064   .0047   .0038   .0021   .0017   .0010
MSE(Ũ_oa)   .011  .0064  .0038  .0019  .00056  .00051  .00038  .00027  .00013  .00012

5. Conclusion

To tackle the computational issue of U-statistics, we have introduced a new type of incomplete U-statistic, called the ICUDO, which has much higher efficiency than existing methods. The required computational burden, as indexed by the number of combinations $m$ needed for the ICUDO to be statistically equivalent to the complete U-statistic, is of smaller magnitude than that of existing methods.
This was validated theoretically and empirically for degenerate and non-degenerate one- and multi-sample U-statistics on univariate and multi-dimensional data. In fact, $m$ is allowed to grow more slowly than the data size $n$ in the non-degenerate case.

The OA plays a critical role in the construction of the ICUDO, in light of its projective uniformity. Other space-filling design schemes exist with similar properties, such as the OA-based Latin hypercube by Tang (1993) and the strong orthogonal array by He and Tang (2012), which are used frequently in the design of computer experiments. Through exhaustive simulations, we find the improvement in efficiency offered by these design schemes over that of the ICUDO to be within 1%. This improvement is not sufficient to advocate using these structures, owing to the extra complexity of the computation. Other improvements over the OA are based on optimality criteria, such as the generalized minimum aberration OA. However, no theoretical results are available for these fixed structures.

Lastly, the following offer potential future research directions. (i) For high-dimensional data, dimension-reduction techniques need to be integrated into our current algorithm. (ii) For multi-sample cases, we may divide different samples into different numbers of groups in some optimal way; this will induce more complicated OA structures. (iii) For the purpose of statistical inference, it would be of interest to study the asymptotic distributions of the ICUDO under different conditions. (iv) Here, the dimension of the kernel function is fixed at $d$ as $n$ increases, and all data are generated independently. In one important type of U-statistic based on stochastic processes, $d$ increases with $n$ and the data can be dependent. These topics involve quite different methodologies, and hence are left to future work.

Supplementary Material

The online Supplementary Material generalizes the result of Theorem 2 under additional conditions. It also provides details on how to choose the combination of $L$ and $t$, and illustrates the generation of the ICUDO for the multi-sample and multi-dimensional cases.

Acknowledgments

Dr. Kong's research was partially supported by NSFC grant 11801033 and the Beijing Institute of Technology Research Fund Program for Young Scholars. Dr. Zheng's research was partially supported by the National Science Foundation, DMS-1830864.

Appendix. Proof of Theorems

Lemmas 1–3 contribute to the proof of Theorem 1. Theorems 4 and 6 can be proved similarly to Theorem 1, only with more tedious analysis; hence, their proofs are omitted to save space. For any $a \in \mathbb{Z}_L^d$, we call the set $G_a = \prod_{j=1}^d G_{a_j}$ a grid. Let $F_n$ be the empirical distribution of $\{X_1, \ldots, X_n\}$ and define $V = \int g(x_1, \ldots, x_d)\, dF_n(x_1) \cdots dF_n(x_d)$. For given $F_n$ and $\eta \in G_a$, define
\[ \bar{g}(X_\eta) = |G_a|^{-1} \sum_{\eta' \in G_a} g(X_{\eta'}). \]
For the same $S_{oa} = \{\eta_1, \ldots, \eta_m\}$ used in generating $U_{oa}$, define
\[ \bar{V} = \frac{1}{m} \sum_{i=1}^{m} \bar{g}(X_{\eta_i}). \]

Lemma 1. Some properties of $V$ and $\bar{V}$ are listed as follows.
(i) $\bar{V}$ is an unbiased estimator of $V$.
(ii) The bias of $V$ is of order $O(n^{-1})$ and $\mathrm{MSE}(V) = \mathrm{MSE}(U) + O(n^{-2})$.
(iii) $U_{oa}$ is an unbiased estimator of $V$, and so also has bias $O(n^{-1})$.

Proof. (i) follows from the unbiasedness of orthogonal arrays. (ii) can be found in Proposition 3.5 of Shao (2007) (page 211). (iii) follows from Owen (1992). □

Lemma 2.
\[ E(\bar{V} - V)^2 \le \frac{1}{m} \sum_{u : |u| > t} \left( \delta_u^2 + O(n^{-1}) \right). \]

Proof. Let $\delta_u = \delta_{|u|}$ and $\sigma_u = \sigma_{|u|}$. Changing the $F$ in Section 2.2 to $F_n$, we can define $dF_{n,u}$, $g_{n,u}$, $h_{n,u}$, $\sigma_{n,u}$ and $\delta_{n,u}$ analogously and sequentially. Again, by substituting $\bar{g}$ for $g$, with $F_n$, we define $\bar{g}_{n,u}$, $\bar{h}_{n,u}$, $\bar{\sigma}_{n,u}$ and $\bar{\delta}_{n,u}$.
Applying (3.5) of Owen (1992) to $\bar{g}$, we have
\[ E[(\bar{V} - V)^2 \mid F_n] \le \frac{1}{m} \sum_{u : |u| > t} \bar{\delta}_{n,u}^2 \le \frac{1}{m} \sum_{u : |u| > t} \delta_{n,u}^2, \]
which leads to $E(\bar{V} - V)^2 = E\left(E[(\bar{V} - V)^2 \mid F_n]\right) \le m^{-1} \sum_{u : |u| > t} E\delta_{n,u}^2$. Consider $\sigma_{n,u}^2 = \int g_{n,u}^2(x_1, \ldots, x_d)\, dF_n(x_1) \cdots dF_n(x_d)$, which can be further written as $\int \left( \int g_{n,u}\, dF_{n,u^c} \right)^2 dF_{n,u}$. This integral can be viewed as a V-statistic with the new kernel $g(x_1, \ldots, x_{|u|}, x_{|u|+1}, \ldots, x_d) \cdot g(x_1, \ldots, x_{|u|}, x_{d+1}, \ldots, x_{2d-|u|})$, which estimates $\sigma_u^2$ with bias $O(n^{-1})$. □

Lemma 3. (Lusin's theorem) For any measurable function $g$ on $\mathbb{R}^d$ and arbitrary $\epsilon > 0$, there exists a continuous $g_\epsilon$ defined on $\mathbb{R}^d$ with compact support such that $E|g - g_\epsilon| < \epsilon$.

Proof of Theorem 1. Define $g_F(Z_1, \ldots, Z_d) = g(F^{-1}(Z_1), \ldots, F^{-1}(Z_d))$, where $Z \sim U(0,1)$ and $F^{-1}(Z) \sim F$. With this new kernel $g_F$, the distribution of the random variables $X$ can be taken to be the uniform distribution on $[0,1]$. Decompose $U_{oa} - \Theta$ as $(U_{oa} - \bar{V}) + (\bar{V} - V) + (V - \Theta)$. Simple analysis reveals the following relationships among $U_{oa}$, $\bar{V}$ and $V$. Conditional on $F_n$, $V$ is constant, and so $E(U_{oa} - \bar{V})(V - \Theta) = 0$ and $E(\bar{V} - V)(V - \Theta) = 0$, since $E(U_{oa} - \bar{V}) = E(\bar{V} - V) = 0$. Conditional on both $V$ and $\bar{V}$, $E(U_{oa} - \bar{V}) = 0$, which indicates $E(U_{oa} - \bar{V})(\bar{V} - V) = 0$. Thus,
\[ \mathrm{MSE}(U_{oa}) = E(U_{oa} - \bar{V})^2 + E(\bar{V} - V)^2 + \mathrm{MSE}(V), \tag{6.6} \]
whose last two terms have been addressed by Lemma 2 and Lemma 1. So we need to prove $E(U_{oa} - \bar{V})^2 = o(m^{-1})$. Since $U_{oa}$ and $\bar{V}$ always use the same $S_{oa} = \{\eta_1, \ldots, \eta_m\}$,
\[ E(U_{oa} - \bar{V})^2 = E\left( \frac{1}{m} \sum_{i=1}^{m} \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right) \right)^2. \]
For $i_1 \ne i_2$ ($i_1, i_2 \in \mathbb{Z}_m$), $E\left(g(X_{\eta_{i_1}}) - \bar{g}(X_{\eta_{i_1}})\right)\left(g(X_{\eta_{i_2}}) - \bar{g}(X_{\eta_{i_2}})\right) = 0$. Denote $\eta \sim \eta'$ if $\eta$ and $\eta'$ belong to the same grid.
Then
\[ E(U_{oa} - \bar{V})^2 \le m^{-1} E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta']. \tag{6.7} \]
For any $M > 0$, define $g(x, M) = \max\{\min\{g(x), M\}, -M\}$. Obviously, $\lim_{M \to \infty} g(x, M) = g(x)$, and the dominated convergence theorem indicates
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] = \lim_{M \to \infty} E[(g(X_\eta, M) - g(X_{\eta'}, M))^2 \mid \eta \sim \eta']. \tag{6.8} \]
Thus, for arbitrary $\epsilon > 0$, we can find $M_\epsilon$ such that
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] \le E[(g(X_\eta, M_\epsilon) - g(X_{\eta'}, M_\epsilon))^2 \mid \eta \sim \eta'] + \epsilon. \tag{6.9} \]
Note that $\{X_1, \ldots, X_n\}$ are random, and so is $X_\eta$. Since $Eg^2(X_1, \ldots, X_d) < \infty$, we have $Eg^2(X_\eta) < \infty$ and so $E|g(X_\eta)| < \infty$, which indicates $Eg^2(X_\eta, M_\epsilon) < \infty$ and $E|g(X_\eta, M_\epsilon)| < \infty$. From Lusin's theorem, there exists a continuous $g^*_{\epsilon, M_\epsilon}$ with compact support such that $E|g(X_\eta, M_\epsilon) - g^*_{\epsilon, M_\epsilon}(X_\eta)| < \epsilon M_\epsilon^{-1}$. Since $|g(X_\eta, M_\epsilon)| \le M_\epsilon$,
\[
\begin{aligned}
&E[(g(X_\eta, M_\epsilon) - g(X_{\eta'}, M_\epsilon))^2 \mid \eta \sim \eta'] \\
&\quad \le 2M_\epsilon E[|g(X_\eta, M_\epsilon) - g(X_{\eta'}, M_\epsilon)| \mid \eta \sim \eta'] \\
&\quad \le 2M_\epsilon E|g(X_\eta, M_\epsilon) - g^*_{\epsilon,M_\epsilon}(X_\eta)| + 2M_\epsilon E|g(X_{\eta'}, M_\epsilon) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| + 2M_\epsilon E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta'] \\
&\quad \le 4\epsilon + 2M_\epsilon E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta'].
\end{aligned} \tag{6.10}
\]
Note that $g^*_{\epsilon, M_\epsilon}$ has compact support, and so is uniformly continuous.
There exists $\Delta(M_\epsilon^{-1}\epsilon)$ such that $|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \le \epsilon M_\epsilon^{-1}$ as long as $\|X_\eta - X_{\eta'}\| \le \Delta(M_\epsilon^{-1}\epsilon)$. Define $A = \{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon) \text{ for some } j \in \mathbb{Z}_d\}$, so that $P(A) \le \sum_{j=1}^{d} P\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}$, and $\|X_\eta - X_{\eta'}\| \le \Delta(M_\epsilon^{-1}\epsilon)$ on $A^c$. Then
\[
\begin{aligned}
&2M_\epsilon E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta'] \\
&\quad = 2M_\epsilon P(A^c) E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta', A^c] + 2M_\epsilon P(A) E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta', A] \\
&\quad \le 2\epsilon + 4M_\epsilon^2 \sum_{j=1}^{d} P\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}.
\end{aligned} \tag{6.11}
\]
Now we give the relationship among several events. For $j \in \mathbb{Z}_d$ and $\eta \sim \eta'$,
\[
\begin{aligned}
\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\} &= \{|X_{\eta_j} - F_n(X_{\eta_j}) + F_n(X_{\eta_j}) - F_n(X_{\eta'_j}) + F_n(X_{\eta'_j}) - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\} \\
&\subseteq \{\sup_{x \in (0,1)} |x - F_n(x)| \ge (3d)^{-1}\Delta(M_\epsilon^{-1}\epsilon)\} \cup \{F_n(X_{\eta_j}) - F_n(X_{\eta'_j}) \ge (3d)^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}.
\end{aligned}
\]
Since $\eta \sim \eta'$, as $L \to \infty$, $P(\{F_n(X_{\eta_j}) - F_n(X_{\eta'_j}) \ge (3d)^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}) \to 0$. The Dvoretzky–Kiefer–Wolfowitz inequality reveals $P\left( \sup_{x \in (0,1)} |F_n(x) - x| \ge \epsilon \right) \le 2\exp(-2n\epsilon^2)$.
So we immediately have $P(\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}) \to 0$ as $n, L \to \infty$, and we can find $n_\epsilon$ and $L_\epsilon$ such that
\[ P(\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}) \le (4dM_\epsilon^2)^{-1}\epsilon \tag{6.12} \]
as long as $n \ge n_\epsilon$ and $L \ge L_\epsilon$.

Finally, by combining (6.8)–(6.12), we know that for arbitrary $\epsilon > 0$, we can find $n_\epsilon$ and $L_\epsilon$ such that $E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] \le C\epsilon$ for a constant $C$ not depending on $\epsilon$, as long as $n \ge n_\epsilon$ and $L \ge L_\epsilon$. That means
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] \to 0 \quad \text{as } n, L \to \infty. \tag{6.13} \]
Theorem 1 is concluded by substituting (6.13) into (6.7) and combining (6.7) with (6.6), Lemma 1(ii) and Lemma 2. □

Proof of Theorem 2. There exist $c_2 > c_1 > 0$ with $c_2 > f(\cdot) > c_1$ on $[a, b]$, so that $|F(x_1) - F(x_2)| \ge c_1|x_1 - x_2|$ for $x_1, x_2 \in [a, b]$. In (6.6), we only analyze $E(U_{oa} - \bar{V})^2$, since the other two terms are given by Lemma 1(ii) and Lemma 2. The Dvoretzky–Kiefer–Wolfowitz inequality reveals $P(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \ge \epsilon) \le 2\exp(-2n\epsilon^2)$. By taking $\epsilon = [\log(n) n^{-1}]^{1/2}$, we have $P(A) \le 2\exp(-2\log n) = O(n^{-2})$, where $A = \{\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \ge n^{-1/2}\log^{1/2}(n)\}$. Since $g$ is continuous and $F$ has bounded support, we can find $M$ such that $|g| \le M$, and so $|U_{oa}|, |\bar{V}| \le M$. Then
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] = P(A)E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A] + P(A^c)E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A^c] \le 4M^2 O(n^{-2}) + E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A^c]. \]
The analysis of $E[(U_{oa} - \bar{V})^2 \mid A^c]$ is as follows. On $A^c$ we have, for $1 \le k_1, k_2 \le nL^{-1}$,
\[ c_1 \left| X_{((l-1)nL^{-1} + k_1)} - X_{((l-1)nL^{-1} + k_2)} \right| \le \left| F(X_{((l-1)nL^{-1} + k_1)}) - F(X_{((l-1)nL^{-1} + k_2)}) \right| \le \left| F_n(X_{((l-1)nL^{-1} + k_1)}) - F_n(X_{((l-1)nL^{-1} + k_2)}) \right| + 2n^{-1/2}\log^{1/2} n \le L^{-1} + 2n^{-1/2}\log^{1/2} n. \]
Since $g$ is Lipschitz continuous, we know that $(g(X_\eta) - g(X_{\eta'}))^2 = O(L^{-2} + n^{-1}\log n)$ for any $\eta \sim \eta'$. Then we have $E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A^c] = O(L^{-2} + n^{-1}\log n)$. With this equation, Theorem 2 is the direct result of (6.6), (6.7), Lemma 1(ii) and Lemma 2. □

Proof of Theorem 3. For convenience, we simply write $g_F$ as $g$ in this proof. In (6.6), we only analyze $E(U_{oa} - \bar{V})^2$, since the other two terms are given by Lemma 1(ii) and Lemma 2. Each row of the matrix $A$ generated in Step 1 follows the uniform distribution on $\mathbb{Z}_L^d$, since the permutations of the columns of $A$ are independent. Thus,
\[ E(U_{oa} - \bar{V})^2 = E\left( \frac{1}{m} \sum_{i=1}^m \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right) \right)^2 = \frac{1}{mL^d} \sum_{a \in \mathbb{Z}_L^d} E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a]. \]
The analysis now focuses on $E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a]$ for every $a \in \mathbb{Z}_L^d$. Let $X_{(0)} = 0$ and $X_{(n+1)} = 1$. For $l \in \mathbb{Z}_L$, given $X_{((l-1)nL^{-1})}$ and $X_{(lnL^{-1}+1)}$, the values $X_{((l-1)nL^{-1}+1)}, \ldots, X_{(lnL^{-1})}$ have the same distribution as the order statistics of $nL^{-1}$ samples following the uniform distribution on $[X_{((l-1)nL^{-1})}, X_{(lnL^{-1}+1)}]$. For $A = \{\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \ge n^{-1/2+c}\}$, the Dvoretzky–Kiefer–Wolfowitz inequality reveals $P(A) \le 2\exp(-2n^{2c})$. On $A^c$, we have $L\left( X_{(lnL^{-1}+1)} - X_{((l-1)nL^{-1})} \right) \to 1$ as $n \to \infty$. The analysis now focuses on $E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a, A^c]$. For this given $a$, define $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_d)$, where $\bar{X}_j = \frac{L}{n}\sum_{\eta_j \in G_{a_j}} X_{\eta_j}$, so that $\sum_{\eta_j \in G_{a_j}} (X_{\eta_j} - \bar{X}_j) = 0$. Adopting a Taylor expansion at $\bar{X}$, we have
\[ g(X_\eta) = g(\bar{X}) + \sum_{j=1}^{d} \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} (X_{\eta_j} - \bar{X}_j) + O(L^{-2}) \]
and $\bar{g}(X_\eta) = g(\bar{X}) + O(L^{-2})$.
Therefore,
\[
\begin{aligned}
E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a, A^c] &= E\left[ \left( \sum_{j=1}^{d} \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} (X_{\eta_j} - \bar{X}_j) + O(L^{-2}) \right)^2 \,\middle|\, \eta \in G_a, A^c \right] \\
&= o(L^{-2}) + \sum_{j=1}^{d} E\left[ \left( \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} \right)^2 (X_{\eta_j} - \bar{X}_j)^2 \,\middle|\, \eta \in G_a, A^c \right] = o(L^{-2}) + \sum_{j=1}^{d} \left( \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} \right)^2 \frac{1}{12L^2}.
\end{aligned}
\]
And then we have
\[ E(U_{oa} - \bar{V})^2 = \frac{1}{mL^d} \sum_{a \in \mathbb{Z}_L^d} E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a] = \frac{1}{12mL^2} \sum_{j=1}^{d} \frac{1}{L^d} \sum_{a \in \mathbb{Z}_L^d} \left( \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} \right)^2 + o\left( \frac{1}{mL^2} \right) = \frac{1}{12mL^2} \sum_{j=1}^{d} E\left( \frac{\partial g}{\partial x_j} \right)^2 + o\left( \frac{1}{mL^2} \right). \]
Then Theorem 3 is the direct result of (6.6), Lemma 1(ii) and Lemma 2. □

Proof of Theorem 5. Consider the $m$ rows of $A$, $a_1, \ldots, a_m$, generated in Step 1 of the construction in Section 2.1. For any $a \in \mathbb{Z}_L^d$, the random permutations used in generating $a_1, \ldots, a_m$ reveal that $P(a_i = a) = L^{-d}$. Given $F_n$,
\[ E(\tilde{U}_{oa} \mid F_n) = E\left[ \frac{1}{m} \sum_{i=1}^{m} \omega_{\eta_i} g(X_{\eta_i}) \right] = E\,\omega_\eta g(X_\eta) = \sum_{a \in \mathbb{Z}_L^d} L^{-d} E[\omega_\eta g(X_\eta) \mid \eta \in G_a] = \sum_{a \in \mathbb{Z}_L^d} \frac{|G_a \cap S^*|}{|S^*|} \left( \frac{1}{|G_a \cap S^*|} \sum_{\eta \in G_a \cap S^*} g(X_\eta) \right) = \frac{1}{|S^*|} \sum_{\eta \in S^*} g(X_\eta) = U. \]
Since $U$ is unbiased, so is $\tilde{U}_{oa}$. This proves the unbiasedness of $\tilde{U}_{oa}$. The MSE of $\tilde{U}_{oa}$ can be analyzed similarly to that in Theorem 1, and so the analysis is omitted here. □

References

Bickel, P. J. and Freedman, D. (1981). Some asymptotic theory for the bootstrap. Annals of Statistics, 1196–1217.
Blom, G. (1976). Some properties of incomplete U-statistics. Biometrika, 573–580.
Bretagnolle, J. (1983). Lois limites du bootstrap de certaines fonctionnelles. Annales de l'Institut Henri Poincaré, Section B, XIX(3), 281–296.
Brown, B. M. and Kildea, D. G. (1978). Reduced U-statistics and the Hodges-Lehmann estimator.
Annals of Statistics, 828–835.
Chen, X. H. and Kato, K. (2019). Randomized incomplete U-statistics in high dimensions. Annals of Statistics, to appear.
Chen, W., Liu, T. Y., Lan, Y. Y., Ma, Z. M. and Li, H. (2009). Ranking measures and loss functions in learning to rank. Advances in Neural Information Processing Systems 22 (NIPS 2009).
Dehling, H. and Mikosch, T. (1994). Random quadratic forms and the bootstrap for U-statistics. Journal of Multivariate Analysis, 392–413.
Eagleson, G. K. (1979). Orthogonal expansions and U-statistics. Australian & New Zealand Journal of Statistics, 221–237.
Enqvist, E. (1985). A note on incomplete U-statistics for stationary absolutely regular processes. Contributions to Probability and Statistics in Honour of Gunnar Blom, University of Lund, 97–103.
He, Y. Z. and Tang, B. X. (2012). Strong orthogonal arrays and associated Latin hypercubes for computer experiments. Biometrika, 254–260.
Herrndorf, N. (1986). An invariance principle for reduced U-statistics. Metrika, 179–188.
Hilbert, M. and López, P. (2011). The world's technological capacity to store, communicate, and compute information. Science, 60–65.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 293–325.
Huskova, M. and Janssen, P. (1993a). Consistency of the generalized bootstrap for degenerate U-statistics. Annals of Statistics, 1811–1823.
Huskova, M. and Janssen, P. (1993b). Generalized bootstrap for studentized U-statistics: a rank statistic approach. Statistics and Probability Letters, 225–233.
Janson, S. (1984). The asymptotic distributions of incomplete U-statistics. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1159–1169.
Lee, A. J. (1979). On the asymptotic distribution of U-statistics. Institute of Statistics Mimeo Series, University of North Carolina at Chapel Hill.
Lee, A. J. (1982). On incomplete U-statistics having minimum variance. Australian & New Zealand Journal of Statistics, 275–282.
Lee, A. J. (1990). U-Statistics: Theory and Practice. CRC Press.
Lee, S., Linton, O. and Whang, Y. (2009). Testing for stochastic monotonicity. Econometrica, 585–602.
Lehmann, E. L. (1951). Consistency and unbiasedness of certain nonparametric tests. Annals of Statistics, 165–179.
Lin, N. and Xi, R. (2010). Fast surrogates of U-statistics. Computational Statistics and Data Analysis.
Owen, A. B. (1992). Orthogonal arrays for computer experiments. Statistica Sinica, 439–452.
Papa, G., Clémençon, S. and Bellet, A. (2015). SGD algorithms based on incomplete U-statistics: large-scale minimization of empirical risk. Advances in Neural Information Processing Systems 28 (NIPS 2015).
Rempala, G. and Wesolowski, J. (2003). Incomplete U-statistics of permanent design. Journal of Nonparametric Statistics, 221–236.
Rempala, G. and Srivastav, S. (2004). Minimum variance rectangular designs for U-statistics. Journal of Statistical Planning and Inference, 103–118.
Sen, P. K. (1974). Weak convergence of generalized U-statistics. Annals of Probability, 90–102.
Sen, P. K. (1977). Almost sure convergence of generalized U-statistics. Annals of Probability, 287–290.
Shao, J. (2007). Mathematical Statistics (second edition). Springer.
Tang, B. X. (1993). Orthogonal array-based Latin hypercubes. Journal of the American Statistical Association, 1392–1397.
von Mises, R. (1948). On the asymptotic distribution of differentiable statistical functions. Annals of Mathematical Statistics, 309–348.
Wang, Q. (2012). Investigation of topics in U-statistics and their applications in risk estimation and cross-validation. Ph.D. dissertation, Penn State University, 1–187.
Wang, Q. and Lindsay, B. (2014). Variance estimation of a general U-statistic with application to cross-validation.
Statistica Sinica, 1117–1141.

School of Mathematics and Statistics, Beijing Institute of Technology. E-mail: [email protected]
Department of Business Analytics and Statistics, University of Tennessee. E-mail: [email protected]

Supplementary Material

Generalization of Theorem 2.

The following conditions on $g$ or $F$ will be needed by Theorem 2 in Section 2 and Theorems 7–9 in this section.

(g.1) Lipschitz continuity: The function $g: \mathbb{R}^d \to \mathbb{R}$ is said to be Lipschitz continuous if there exists a constant $c_0 > 0$ such that $|g(a_1) - g(a_2)| \le c_0 \|a_1 - a_2\|$ for any $a_1, a_2 \in \mathbb{R}^d$. Example: first-order polynomial functions.

(g.2) Order-$p$ continuity: The function $g: \mathbb{R}^d \to \mathbb{R}$ is said to be order-$p$ continuous if there exist a constant $c_0 > 0$ and a function $\phi$ with $\phi(a_1, a_2) \le c_0 + \max^p(\|a_1\|, \|a_2\|)$ such that $|g(a_1) - g(a_2)| \le \phi(a_1, a_2)\|a_1 - a_2\|$ for any $a_1, a_2 \in \mathbb{R}^d$. Example: all polynomial functions.

(g.3) Uniformly bounded variation: For a real-valued function $f: \mathbb{R} \to \mathbb{R}$, the total variation of $f$ is defined as $V_{\mathbb{R}}(f) = \sup_{p > 0} \sup_{-\infty < x_0 < \cdots < x_p < \infty} \sum_{i=1}^{p} |f(x_i) - f(x_{i-1})|$.

Lemma 4. Suppose $F$ is light-tailed. Let $X_{\max} = \max\{|X_1|, \ldots, |X_n|\}$. Then, for arbitrary $a > 0$, as $n \to \infty$, we have $EX_{\max}^a = O(\log n)^a$.

Proof. Since the distribution is light-tailed, we have $P(|X| > x) \le e^{-c_2 x}$ for any $|x| > c_1$, where $c_1$ and $c_2$ are two fixed positive numbers. Then
\[ E(X_{\max})^a = \int_{x > 0} a x^{a-1} P(X_{\max} > x)\, dx \le \int_0^{c_2^{-1}\log n} a x^{a-1}\, dx + \int_{c_2^{-1}\log n}^{\infty} a x^{a-1} P(X_{\max} > x)\, dx = O(\log n)^a + \int_{c_2^{-1}\log n}^{\infty} a x^{a-1} n e^{-c_2 x}\, dx = O(\log n)^a + O(1). \] □

Lemma 5. Suppose (i) $g$ is order-$p$ continuous, and (ii) $F$ is light-tailed. We have
\[ E(U_{oa} - \bar{V})^2 = O\left( \frac{(\log n)^{p+2}}{mL} \right). \]

Proof. Let $X_{\max} = \max\{|X_1|, \ldots, |X_n|\}$. For $l \in \mathbb{Z}_L$, define $d_l = \max\{|X_{i_1} - X_{i_2}| : i_1, i_2 \in G_l\}$.
Since $g$ is order-$p$ continuous, for $\eta \sim \eta'$ in $G_a$ we have $|g(X_\eta) - g(X_{\eta'})| \le (c_0 + X_{\max}^p) \sum_{j=1}^{d} d_{a_j}$, and so $(g(X_\eta) - g(X_{\eta'}))^2 \le (c_0 + X_{\max}^p)^2 \cdot d \cdot \sum_{j=1}^{d} d_{a_j}^2$. Since $U_{oa}$ and $\bar{V}$ always use the same $S_{oa} = \{\eta_1, \ldots, \eta_m\}$, we have
\[ E(U_{oa} - \bar{V})^2 = E\left( \frac{1}{m} \sum_{i=1}^{m} \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right) \right)^2. \]
For $i_1 \ne i_2$, $E\left(g(X_{\eta_{i_1}}) - \bar{g}(X_{\eta_{i_1}})\right)\left(g(X_{\eta_{i_2}}) - \bar{g}(X_{\eta_{i_2}})\right) = 0$. Hence
\[ E(U_{oa} - \bar{V})^2 = m^{-2} E \sum_{i=1}^{m} \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right)^2 \le m^{-2} E \sum_{i=1}^{m} (c_0 + X_{\max}^p)^2 \cdot d \cdot \sum_{j=1}^{d} d_{a_{ij}}^2. \]
Since $\sum_{l=1}^{L} d_l \le 2X_{\max}$, we have $\sum_{l=1}^{L} d_l^2 \le 4X_{\max}^2$. Using Lemma 4 and the fact that each level appears $mL^{-1}$ times in each column, we have
\[ E(U_{oa} - \bar{V})^2 \le m^{-2} d\, E\left( (c_0 + X_{\max}^p)^2 \sum_{j=1}^{d} \sum_{i=1}^{m} d_{a_{ij}}^2 \right) = m^{-2} d\, E\left( (c_0 + X_{\max}^p)^2 \sum_{j=1}^{d} \frac{m}{L}\, 4X_{\max}^2 \right) = O\left( \frac{(\log n)^{p+2}}{mL} \right). \] □

Theorem 7. Suppose (i) the kernel function $g$ is order-$p$ continuous, and (ii) $F$ is light-tailed. For $U_{oa}$ based on an $OA(m, d, L, t)$, we have
\[ \mathrm{MSE}(U_{oa}) = \mathrm{MSE}(U) + \frac{R(t)}{m} + O\left( \frac{(\log n)^{p+2}}{mL} \right) + O\left( \frac{1}{n^2} \right). \tag{6.14} \]
Proof. This is the direct result of (6.6), Lemma 1(ii), and Lemmas 2 and 5. □

Theorem 8. Suppose the kernel function $g$ has uniformly bounded variation. For $U_{oa}$ based on an $OA(m, d, L, t)$, we have
\[ \mathrm{MSE}(U_{oa}) = \mathrm{MSE}(U) + \frac{R(t)}{m} + O\left( \frac{1}{mL} \right) + O\left( \frac{1}{n^2} \right). \tag{6.15} \]
Proof. From (6.6), Lemma 1(ii) and Lemma 2, we only need to prove $E(U_{oa} - \bar{V})^2 = O(m^{-1}L^{-1})$. First, we introduce some notation that will be used only in the proof of this theorem. Given the order statistics of $\{X_1, \ldots, X_n\}$, denoted by $X_{(1)}, \ldots, X_{(n)}$, for $l = 1, \ldots, L$ and $(x_2, \ldots, x_d) \in \mathbb{R}^{d-1}$, define
\[ D(l \mid x_2, \ldots, x_d) = \max\left\{ \left| g(X_{(i_1)}, x_2, \ldots, x_d) - g(X_{(i_2)}, x_2, \ldots, x_d) \right| : (l-1)nL^{-1} < i_1, i_2 \le lnL^{-1} \right\}, \]
so that $\sum_{l=1}^{L} D(l \mid x_2, \ldots, x_d)$ is bounded by a constant $c > 0$ under uniformly bounded variation.
So we have
\[ \sum_{a \in \mathbb{Z}_L^d} \sum_{\eta \in G_a} \sum_{\eta' \in G_a} D(a_1 \mid X_{\eta_2}, \ldots, X_{\eta_d}) \le c L^{d-1} |G_a|^2. \]
Analyzing $D(a_2 \mid X_{\eta'_1}, X_{\eta_3}, \ldots, X_{\eta_d}), \ldots, D(a_d \mid X_{\eta'_1}, \ldots, X_{\eta'_{d-1}})$ similarly, we have $E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] = O(L^{-1})$, and so $E(U_{oa} - \bar{V})^2 = O(m^{-1}L^{-1})$. Theorem 8 is the direct result of (6.6), Lemma 1(ii) and Lemma 2. □

Theorem 9. Suppose (i) the kernel function $g$ is a linear combination of some order-$p$ continuous functions and some uniformly bounded-variation functions, and (ii) $F$ is light-tailed. Then (6.14) still holds with $L \le n(\log n)^{-1}$.

Proof. This is the direct result of Theorems 7 and 8. □

Choosing $L$ and $t$.

From (2.13) of Theorem 3 in the manuscript and the relation $m = \lambda L^t$, we know that the trade-off between $L$ and $t$ depends on the variance of each component in the Hoeffding decomposition, i.e., $\delta_j^2$, $j = 1, \ldots, d$. We shall give estimators $\hat{\delta}_j^2$ of these variances. Using (2.13), with $R(t)$ and $E\gamma(X_1, \ldots, X_d)$ estimated as functions of the $\hat{\delta}_j^2$, we should choose the combination of $L$ and $t$ that minimizes
\[ \phi(L, t) = \frac{\hat{R}(t)}{m} + \frac{d}{mL^2}\, \hat{E}\gamma(X_1, \ldots, X_d), \]
where $\hat{R}(t)$ and $\hat{E}\gamma(X_1, \ldots, X_d)$ are functions of the $\hat{\delta}_j^2$.

Now we provide two methods for generating the $\hat{\delta}_j^2$. (1) When the Hoeffding decomposition is easy to calculate, one can write down the analytical expression and estimate the $\delta_j^2$ directly. (2) We can use a bootstrap approach. With a small sample size $n' \ll n$, it is easy to bootstrap $\mathrm{MSE}(U)$ of the complete U-statistic. For details of the bootstrap approach, we refer to Huskova and Janssen (1993a,b). Now, let us review the formula for $\mathrm{MSE}(U)$:
\[ \mathrm{MSE}(U) = \binom{n}{d}^{-1} \sum_{j=1}^{d} \binom{d}{j} \binom{n-d}{d-j} \sigma_j^2 = \sum_{j=1}^{d} \binom{d}{j}^2 \binom{n}{j}^{-1} \delta_j^2. \]
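The identity above expresses $\mathrm{MSE}(U)$ as a linear combination of the $\delta_j^2$, so bootstrapped values of $\mathrm{MSE}(U)$ at a few small subsample sizes $n'$ determine the $\delta_j^2$ through a linear system. The recovery step can be sketched numerically as follows; this is our own minimal illustration, with hypothetical helper names, and it takes the bootstrapped MSE values as given.

```python
from math import comb

def mse_coefficients(n, d):
    """Coefficients of (delta_1^2, ..., delta_d^2) in
    MSE(U) = sum_j C(d, j)^2 C(n, j)^{-1} delta_j^2."""
    return [comb(d, j) ** 2 / comb(n, j) for j in range(1, d + 1)]

def solve(A, b):
    """Tiny Gauss-Jordan elimination with partial pivoting
    for the small d x d system arising here."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def estimate_deltas(subsample_sizes, bootstrapped_mses, d):
    """Recover (delta_j^2) from d bootstrapped MSE(U) values
    computed at d small subsample sizes n' > d."""
    A = [mse_coefficients(n, d) for n in subsample_sizes]
    return solve(A, bootstrapped_mses)
```

For $d = 3$ and $n' = 4, 5, 6$ (the setting used in the illustration below), the $3 \times 3$ coefficient matrix is invertible, so the three bootstrapped MSE values pin down $(\hat{\delta}_1^2, \hat{\delta}_2^2, \hat{\delta}_3^2)$ uniquely.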
Usually, with at most $d$ different $n'$ ($> d$), we can generate linear equations in the $\delta_j^2$ from the $d$ different bootstrapped values of $\widehat{\mathrm{MSE}}(U)$, and the solution of these linear equations can be used as the estimates $\hat{\delta}_j^2$.

For the second method, we now use the setup in Example 1 for illustration. For convenience, we set $n = 10$ and $m = 10^6$, so that the two candidate combinations of $L$ and $t$ are $(L = 100, t = 3)$ and $(L = 1000, t = 2)$. We use the bootstrap method to estimate the variance of the complete U-statistic with $n' = 4, 5, 6$. The subsample size $n'$ is so small that the computational burden of the bootstrapped complete U-statistic, i.e., $\binom{n'}{d}$ evaluations, is negligible. Simulation yields $\hat{\delta}_1^2 = 0.\cdots$, $\hat{\delta}_2^2 = 0.\cdots$ and $\hat{\delta}_3^2 = 1.\cdots$, which suggests that $t = 3$ shall work better than $t = 2$; this is verified by the simulation results. Actually, with $m = 10^6$, the efficiency of $U_{oa}$ is 100.0% when $t = 3$ and 97.88% when $t = 2$.

Examples for the multi-sample and multi-dimensional cases.

Consider the multi-sample case. Suppose $d_1 = d_2 = 2$, $n_1 = n_2 = 9$, and the two samples are ordered as
\[ X^{(1)}_6 \le X^{(1)}_8 \le X^{(1)}_2 \le X^{(1)}_4 \le X^{(1)}_7 \le X^{(1)}_5 \le X^{(1)}_3 \le X^{(1)}_9 \le X^{(1)}_1, \]
\[ X^{(2)}_2 \le X^{(2)}_7 \le X^{(2)}_3 \le X^{(2)}_6 \le X^{(2)}_1 \le X^{(2)}_4 \le X^{(2)}_5 \le X^{(2)}_9 \le X^{(2)}_8. \]
Then we have $L = 3$ groups, listed as $G^{(1)}_1 = \{6, 8, 2\}$, $G^{(1)}_2 = \{4, 7, 5\}$, $G^{(1)}_3 = \{3, 9, 1\}$ and $G^{(2)}_1 = \{2, 7, 3\}$, $G^{(2)}_2 = \{6, 1, 4\}$, $G^{(2)}_3 = \{5, 9, 8\}$. An example of the $OA(m = 9, d = 4, L = 3, t = 2)$ in Step 1 is given as follows in transposed form:
\[ A^T = \begin{pmatrix} 1 & 1 & 1 & 2 & 2 & 2 & 3 & 3 & 3 \\ 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 \\ 1 & 2 & 3 & 2 & 3 & 1 & 3 & 1 & 2 \\ 1 & 2 & 3 & 3 & 1 & 2 & 2 & 3 & 1 \end{pmatrix}. \]
Then we could possibly have the $X_{\eta_i}$, $i = 1, \ldots, 9$, used in the 9-run multi-sample construction as follows:
\[ \{X_{\eta_1}, \ldots, X_{\eta_9}\} = \begin{pmatrix} X^{(1)}_8 & X^{(1)}_2 & X^{(1)}_6 & X^{(1)}_4 & X^{(1)}_4 & X^{(1)}_5 & X^{(1)}_9 & X^{(1)}_1 & X^{(1)}_9 \\ X^{(1)}_6 & X^{(1)}_7 & X^{(1)}_3 & X^{(1)}_8 & X^{(1)}_7 & X^{(1)}_1 & X^{(1)}_6 & X^{(1)}_7 & X^{(1)}_3 \\ X^{(2)}_7 & X^{(2)}_1 & X^{(2)}_5 & X^{(2)}_4 & X^{(2)}_8 & X^{(2)}_3 & X^{(2)}_9 & X^{(2)}_2 & X^{(2)}_6 \\ X^{(2)}_3 & X^{(2)}_6 & X^{(2)}_9 & X^{(2)}_5 & X^{(2)}_3 & X^{(2)}_1 & X^{(2)}_6 & X^{(2)}_8 & X^{(2)}_3 \end{pmatrix}. \]

Consider the multi-dimensional case. Suppose $X_1, \ldots, X_9 \in \mathbb{R}^2$, and suppose a clustering algorithm divides them into the $L = 3$ groups $G_1$, $G_2$ and $G_3$. The choice of the $\eta_i$, $i = 1, \ldots$