DESIGN-BASED INCOMPLETE U-STATISTICS
Xiangshun Kong and Wei Zheng

Beijing Institute of Technology and University of Tennessee
Abstract: U-statistics are widely used in fields such as economics, machine learning, and statistics. However, while they enjoy desirable statistical properties, they have an obvious drawback in that the computation becomes impractical as the data size n increases. Specifically, the number of combinations, say m, that a U-statistic of order d has to evaluate is O(n^d). Many efforts have been made to approximate the original U-statistic using a small subset of combinations since Blom (1976), who referred to such an approximation as an incomplete U-statistic. To the best of our knowledge, all existing methods require m to grow at least faster than n, albeit more slowly than n^d, in order for the corresponding incomplete U-statistic to be asymptotically efficient in terms of the mean squared error. In this paper, we introduce a new type of incomplete U-statistic that can be asymptotically efficient, even when m grows more slowly than n. In some cases, m is only required to grow faster than √n. Our theoretical and empirical results both show significant improvements in the statistical efficiency of the new incomplete U-statistic.

Key words and phrases:
Asymptotically efficient, BIBD, big data, design of experiment, sub-sampling.

1. Introduction

The U-statistic has been a fundamental statistical estimator since the work of Hoeffding (1948), who studied its theoretical properties and established central limit theorems for non-degenerate U-statistics. Eagleson (1979) derived asymptotic distributions of some degenerate U-statistics of order two, which were then extended to all degenerate U-statistics by Lee (1979). Other extensions include a variant of U-statistics called V-statistics by von Mises (1948), U-statistics for stationary processes by Enqvist (1985), and multi-sample U-statistics by Lehmann (1951) and Sen (1974, 1977).

The theory of U-statistics admits a minimum variance unbiased estimator of an estimable parameter for a large class of probability distributions, hence its popularity in applications. However, U-statistics can also be time consuming to compute. For a U-statistic of order d, the number of combinations, say m, to be evaluated is \binom{n}{d}, that is, O(n^d), where n is the data size. Suppose n = 10^4 and d = 3. Then, listing the \binom{10^4}{3} combinations requires 667 GB of memory and a computing time of approximately 100 hours on a MacBook Pro with an Intel Core i7 2.9 GHz CPU. With n = 10^4 and d = 4, the required memory grows by several more orders of magnitude. For some U-statistics, the computational complexity can be reduced, e.g., to O(n), by exploiting the structure of the kernel function, especially when the data are univariate and consist of one sample. However, in practice, such a computational reduction is often not feasible. Note that we do not focus here on which U-statistics are candidates for a reduction in the original computational complexity of O(n^d), because our goal is to study a generic scheme for the fast approximation of U-statistics. A natural remedy is to take a sample of size m ≪ \binom{n}{d} from all possible combinations. Blom (1976) referred to the resulting estimator as an incomplete U-statistic.
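To make the contrast concrete, the following is a minimal sketch (not the authors' code; Python with NumPy is assumed, and the helper names are ours) of a complete U-statistic and its randomly subsampled incomplete counterpart:

```python
import itertools
import numpy as np

def complete_u(x, g, d):
    """Complete U-statistic: average g over all C(n, d) combinations."""
    combos = itertools.combinations(range(len(x)), d)
    vals = [g(*(x[i] for i in idx)) for idx in combos]
    return float(np.mean(vals))

def incomplete_u_random(x, g, d, m, rng):
    """ICUR: average g over m randomly sampled index combinations."""
    n = len(x)
    vals = []
    for _ in range(m):
        idx = rng.choice(n, size=d, replace=False)  # one random combination
        vals.append(g(*(x[i] for i in idx)))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
g = lambda a, b: (a - b) ** 2 / 2.0  # kernel whose U-statistic is the sample variance
u_full = complete_u(x, g, d=2)       # evaluates all C(200, 2) = 19900 combinations
u_sub = incomplete_u_random(x, g, d=2, m=500, rng=rng)  # only 500 combinations
```

For this kernel, the complete U-statistic coincides with the unbiased sample variance, while the incomplete version trades a small additional sampling variance for a 40-fold reduction in kernel evaluations.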
The problem of identifying a good incomplete U-statistic is related to the design of the sampling scheme. Of the various options, the vanilla scheme of simple random sampling by Blom (1976) has received much attention in the literature. Janson (1984) established the asymptotic distributions of incomplete U-statistics based on random sampling (ICUR), Herrndorf (1986) established the invariance principle for these statistics, and Chen and Kato (2019) studied the vector- and matrix-valued ICUR. For a more detailed discussion of incomplete U-statistics, refer to Wang (2012) and Wang and Lindsay (2014).

First, we introduce some required notation. For α > 0, we use m ≺ n^α, m ≍ n^α, and m ≻ n^α to mean m/n^α → 0, 0 < lim inf m/n^α ≤ lim sup m/n^α < ∞, and m/n^α → ∞, respectively. For a given incomplete U-statistic, say Ũ, its efficiency is defined in terms of the mean squared error (MSE): Eff(Ũ) = MSE(U)/MSE(Ũ), where U is the complete U-statistic. An incomplete U-statistic is said to be asymptotically efficient if Eff(Ũ) → 1 as n → ∞. Note that the ICUR is asymptotically efficient for the non-degenerate case when m ≻ n; see (2.6) for a theoretical verification, and Table 1 for empirical evidence.

Blom (1976) also proposed sampling schemes based on the design of an experiment. In particular, balanced incomplete block designs (BIBDs) have been examined by Brown and Kildea (1978) and Lee (1982). The latter also proved that incomplete U-statistics based on BIBDs achieve the minimum variance among all unbiased estimators for a given m. By Raghavarao (1971), a BIBD exists whenever n = 6a + 3 for any positive integer a. Unfortunately, the optimality of the BIBD does not make it practically attractive, because its construction requires m ≍ n^2; see Table 1. The same issue exists for the permanent design of Rempala and Wesolowski (2003) and the rectangular design of Rempala and Srivastav (2004). For the case of m/n → 1, Blom (1976) proposed using a Latin square and a Graeco-Latin square to guide the sampling scheme. However, the efficiency of the estimator derived in this way is essentially asymptotically the same as that of the ICUR. Moreover, the limit of the efficiency does not exceed d/(1 + d) as n → ∞; see (2.6) and the follow-up discussion.

Another method recently proposed in the literature is the divide and conquer (DC) strategy of Lin and Xi (2010), which randomly divides the data into many groups, calculates the complete U-statistic within each group, and then takes the average of these complete U-statistics. Unfortunately, the DC is even less efficient than the ICUR. Moreover, it is not available when m ≤ n; see Table 1.

We conclude that the ICUR is still the most viable of the existing choices of incomplete U-statistics. It performs as well as a design-based method when a design exists. It also possesses several advantages, such as a flexible choice of m, the availability of asymptotic properties, and being extendable to multi-sample cases.

In this paper, we introduce a new type of incomplete U-statistic that is substantially more efficient than the ICUR, while maintaining the latter's aforementioned advantages. It has three main steps: (i) Divide the data into L (≪ n) groups of homogeneous units. (ii) Judiciously select a collection of the combinations of the groups based on a design structure called an orthogonal array (OA). (iii) Randomly select a combination of inputs from each selected group combination. We call the derived estimator the incomplete U-statistic based on division and an orthogonal array (ICUDO). Our first example provides a snapshot of the performance of the major incomplete U-statistics mentioned so far.

Example 1. (The symmetry of a distribution). The kernel function g(x_1, x_2, x_3) = sign(2x_1 − x_2 − x_3) + sign(2x_2 − x_1 − x_3) + sign(2x_3 − x_1 − x_2) has mean zero when the distribution of the data is symmetric.
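As an illustrative sketch (again Python with NumPy assumed; this is our own illustration, not code from the paper), the kernel of Example 1 and a random-sampling (ICUR) estimate of its mean can be written as:

```python
import numpy as np

def g_sym(x1, x2, x3):
    """Kernel of Example 1; has expectation 0 under a symmetric distribution."""
    return (np.sign(2 * x1 - x2 - x3)
            + np.sign(2 * x2 - x1 - x3)
            + np.sign(2 * x3 - x1 - x2))

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)  # symmetric data, so E g = 0

# ICUR estimate based on m randomly sampled triples of distinct indices.
m = 5000
idx = np.array([rng.choice(len(x), size=3, replace=False) for _ in range(m)])
u_icur = g_sym(x[idx[:, 0]], x[idx[:, 1]], x[idx[:, 2]]).mean()
```

With symmetric data the estimate fluctuates around zero; under an asymmetric distribution it drifts away from zero, which is what makes this kernel usable as a symmetry test.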
The data consist of n = 10^3 independent and identically distributed (i.i.d.) observations generated from the standard normal distribution. The performance of the ICUR, BIBD, DC, and ICUDO is measured by their efficiency at different values of m.

Table 1: Comparison of efficiencies in Example 1, reporting Eff for the ICUR, BIBD, DC, and ICUDO over a range of values of m (together with the corresponding m/n and m/\binom{n}{3}).

In Table 1, the DC is not available when m ≤ n, and the BIBD does not exist in most cases, except for m = 166167. For every value of m considered, the ICUDO achieves the highest efficiency.

Here, we briefly explain why our ICUDO performs so well. Note that existing design-based methods focus on the arrangement of the indices of the units, without referring to their actual values. The ICUDO method exploits the fact that replacing a unit by another one with a similar value does not change the value of the kernel function g too much. For example, if several observations take similar values, then swapping them within a combination changes g only slightly; it therefore suffices to control which groups of similar units enter each combination, rather than the individual units themselves. Thus, while the ICUR requires m ≻ n to be asymptotically efficient, the ICUDO requires a substantially smaller m; sometimes even m ≻ √n will suffice. See Theorem 2 for the latter case. When the U-statistic is degenerate, both methods require a larger m, but the ICUDO still requires a substantially smaller m than that of the ICUR.

The rest of the paper is organized as follows. Section 2 introduces the construction of the ICUDO for univariate data and derives its asymptotic properties. Section 3 discusses the debiasing issues of the ICUDO for the degenerate case. Section 4 constructs a debiased ICUDO for multi-dimensional data. Simulations are presented in each section to support the theoretical results. Section 5 concludes the paper and points out some future research topics. All proofs are postponed to the Appendix. Additional theorems are given in the online Supplementary Material.

2. ICUDO based on univariate data

Let X_1, ..., X_n be a random sample of size n from a univariate distribution, say F.
For a given symmetric kernel function, say g: R^d → R, of order d, the uniformly minimum variance unbiased estimator (UMVUE) of the parameter Θ = ∫ g(x_1, ..., x_d) dF(x_1) ... dF(x_d) is given by the U-statistic

U = \binom{n}{d}^{-1} Σ_{η ∈ S_{n,d}} g(X_η),  (2.2)

where S_{n,d} = {η = (η_1, ..., η_d) : 1 ≤ η_1 < η_2 < ... < η_d ≤ n} and X_η = (X_{η_1}, ..., X_{η_d}). When S_{n,d} is replaced with the set of all n^d ordered combinations, the corresponding average in (2.2) is called a V-statistic (von Mises (1948)). The main difference is that V-statistics include combinations with duplicated units, in which the same index may appear more than once. Throughout, we assume Eg^2(X_1, ..., X_d) < ∞.

Unless there is some special structure of g that can be exploited to reduce the computational burden, in general, (2.2) becomes impractical to compute as n increases. To address this problem, Blom (1976) proposed using the following incomplete U-statistic as a fast approximation:

Ũ = (1/m) Σ_{η ∈ S} g(X_η),  (2.3)

where S ⊂ S_{n,d}, with its cardinality m = |S| being only a fraction of \binom{n}{d}. The statistic in (2.3) becomes an ICUR when S is a simple random sample, which we denote as U_RND.

Here, we briefly review the properties of U and U_RND. For arbitrary positive integers N and p, define Z_N = {1, ..., N} and Z_N^p = {(z_1, ..., z_p) : z_j ∈ Z_N, 1 ≤ j ≤ p}. Following Hoeffding (1948), for u ⊆ Z_d and x = (x_1, ..., x_d), denote g_u(x) = ∫ g(x) dF_{u^c}, with u^c = Z_d \ u and dF_u = Π_{j ∈ u} dF(x_j). With the conventions g_∅(x) = Θ and h_∅(x) = 0, we recursively define the projection

h_u(x) = g_u(x) − Σ_{v ⊆ Z_d : v ⊂ u} h_v(x).

Because g is symmetric, we have Eg_v = Eg_u and Eh_v = Eh_u for any pair u, v ⊆ Z_d with |v| = |u|. Hence, we can now define σ_j^2 = Var(g_u) and δ_j^2 = Var(h_u), with |u| = j.
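To make these variance components concrete, the following sketch (our own illustration in Python with NumPy, under the assumption F = N(1, 1) and the kernel g(x_1, x_2) = x_1 x_2) estimates σ_1^2 and σ_2^2 by Monte Carlo and recovers δ_1^2 and δ_2^2 from the identity σ_2^2 = \binom{2}{1} δ_1^2 + δ_2^2:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 400_000
mu = 1.0
x1 = rng.normal(mu, 1.0, N)
x2 = rng.normal(mu, 1.0, N)

g = x1 * x2                           # kernel values g(X1, X2)
g1 = x1 * mu                          # projection g_{1}(x) = E[g | X1 = x] = x * mu

sigma1_sq = g1.var()                  # variance of the one-dimensional projection
sigma2_sq = g.var()                   # variance of the kernel itself
delta1_sq = sigma1_sq                 # h_{1} = g_{1} - Theta, so delta_1^2 = sigma_1^2
delta2_sq = sigma2_sq - 2 * delta1_sq # from sigma_2^2 = C(2,1) delta_1^2 + delta_2^2
```

For this kernel the exact values are σ_1^2 = δ_1^2 = 1, σ_2^2 = 3, and δ_2^2 = 1, which the Monte Carlo estimates reproduce up to simulation error.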
Following Hoeffding (1948) and Blom (1976), we have

MSE(U) = \binom{n}{d}^{-1} Σ_{j=1}^{d} \binom{d}{j} \binom{n−d}{d−j} σ_j^2 = Σ_{j=1}^{d} \binom{d}{j}^2 \binom{n}{j}^{-1} δ_j^2,  (2.4)

MSE(U_RND) = MSE(U) + σ_d^2/m + O(1/(nm)) = MSE(U) + (1/m) Σ_{j=1}^{d} \binom{d}{j} δ_j^2 + O(1/(nm)).  (2.5)

In (2.4) and (2.5), the MSEs are expressed in terms of both σ_j^2 and δ_j^2. The equivalences are established by σ_j^2 = Σ_{j'=1}^{j} \binom{j}{j'} δ_{j'}^2, for 1 ≤ j ≤ d. The U-statistic and the kernel function g are called non-degenerate if δ_1^2 = σ_1^2 > 0, and are called order-q degenerate if σ_q^2 = 0 and σ_{q+1}^2 > 0, or equivalently δ_1^2 = ··· = δ_q^2 = 0 and δ_{q+1}^2 > 0. For the non-degenerate case, we have Var(U) ≍ n^{-1}, which together with (2.5) yields

Eff(U_RND) = 1 − O(n/m), for m ≻ n;  [1 + (n/m) σ_d^2/(d^2 δ_1^2)]^{-1} + O(1/n), for m ≍ n;  O(m/n), for m ≺ n.  (2.6)

As a result, we have Eff(U_RND) → 1 when m ≻ n, Eff(U_RND) → 0 when m ≺ n, and Eff(U_RND) → [1 + σ_d^2/(c d^2 δ_1^2)]^{-1} when m/n → c, for a constant c > 0. With c = 1, Blom (1976) proposed using Latin squares and Graeco-Latin squares to construct the incomplete U-statistics. In such a case, we can verify that its efficiency is asymptotically the same as that of U_RND, and lim_{n→∞} Eff(U_RND) ≤ d/(1 + d), from (2.6) and σ_d^2 ≥ dσ_1^2. In contrast, Theorem 1 shows that the ICUDO is asymptotically efficient when m ≍ n. Stronger results are stated in Theorem 2 in Section 2.1 and in similar theorems in the Supplementary Material under various conditions on g and F.

Recall that δ_j^2 = Var(h_u), for |u| = j, 1 ≤ j ≤ d, and note that the coefficient of δ_j^2 in (2.4) is O(n^{-j}). Hence, it is more important to capture the variability of g in its lower-dimensional projected spaces. This idea matches perfectly with the projective property of the OA. An OA, denoted by OA(m, d, L, t), is an m by d array with entries from {1, ..., L}, arranged in such a way that for any m by t subarray, all ordered t-tuples of the entries from {1, ..., L} appear λ = m/L^t times in the rows. The number t is called the strength of the OA; see the matrix A defined in (2.8) as an example of OA(9, 4, 3, 2), whose ordered 2-tuples are {(i_1, i_2) : 1 ≤ i_1, i_2 ≤ 3}. Considering any two columns of A, we can see that each of these ordered 2-tuples appears once, that is, λ = 1. For sets S_1, ..., S_q, define Π_{i=1}^{q} S_i = {(s_1, ..., s_q) : s_i ∈ S_i}. The ICUDO is constructed as follows. For ease of illustration, we assume n is a multiple of L. Actually, throughout the manuscript, we assume that L ≪ n. Thus, we may randomly draw an n' = ⌊n/L⌋ · L subsample as the new data set. The information loss in this process is negligible compared with the original size n.

Step 1. Let A_0 be an OA(m, d, L, t). Apply random level permutations {π_1, ..., π_d} to the columns of A_0 independently. Specifically, for l ∈ Z_L, change all elements l in the j-th column of A_0 to π_j(l). The new OA is denoted by A = (a_ij)_{m×d}.

Step 2.
Create the partition Z_n = ∪_{l=1}^{L} G_l such that |G_l| = n/L for l ∈ Z_L, and X_{i_1} ≤ X_{i_2} for any i_1 ∈ G_{l_1}, i_2 ∈ G_{l_2}, with l_1 < l_2.

Step 3. For i = 1, ..., m, independently draw an element, say η_i, uniformly from Π_{j=1}^{d} G_{a_ij}. The ICUDO based on the OA A is defined as

U_oa = (1/m) Σ_{i=1}^{m} g(X_{η_i}).  (2.7)

The level permutation in Step 1 ensures that each row of A takes each d-tuple with equal probability. At the same time, the projective uniformity of the beginning OA, A_0, carries over to A. Here, we ensure that A_0 is free of a coincidence defect, which means no two rows are the same in any m × (t + 1) subarray. This property is necessary for the relevant theorems to hold. Step 2 divides the data into homogeneous groups. Step 3 is built on the first two steps. It chooses representative elements from the selected groups, and the selection of groups is guided by the structure of A. Note that (2.7) is in the form of (2.3) by taking S as S_oa = {η_1, ..., η_m}.

We now give a toy example of choosing η_i, for i = 1, ..., m. Suppose d = 4, n = 9, and X_1 ≤ X_2 ≤ ... ≤ X_9. Then, we have L = 3 groups listed as G_1 = {1, 2, 3}, G_2 = {4, 5, 6}, and G_3 = {7, 8, 9}. An example of OA(m = 9, d = 4, L = 3, t = 2) in Step 1 is given as follows in transpose:

A^T = [ 1 1 1 2 2 2 3 3 3 ;
        1 2 3 1 2 3 1 2 3 ;
        1 2 3 2 3 1 3 1 2 ;
        1 3 2 2 1 3 3 2 1 ].  (2.8)

The fourth row of A, namely (2, 1, 2, 2), directs us to draw η_4 from G_2 × G_1 × G_2 × G_2. One possible outcome for η_4 could be (4, 1, 6, 5). Repeating this for every row of A, we could possibly have the X_{η_i}, for i = 1, ..., 9, used in the construction, arranged column-by-column, as follows:

{X_{η_1}, ..., X_{η_9}} = [ X_2 X_1 X_3 X_4 X_5 X_6 X_7 X_8 X_9 ;
                            X_1 X_5 X_7 X_1 X_6 X_9 X_3 X_4 X_8 ;
                            X_1 X_4 X_9 X_6 X_7 X_2 X_8 X_3 X_6 ;
                            X_3 X_8 X_6 X_5 X_2 X_7 X_9 X_5 X_1 ].  (2.9)

To proceed with the asymptotic properties of U_oa, we define

R(t) = Σ_{j>t} \binom{d}{j} δ_j^2.  (2.10)

Theorem 1.
For any (g, F), using OA(m, d, L, t) in Step 1 of the ICUDO algorithm, we have

MSE(U_oa) = MSE(U) + R(t)/m + o(1/m) + O(1/n^2).  (2.11)

We now explain the meanings of the three terms in (2.11) generated in the process of approximating the complete U-statistic U using U_oa. The term O(n^{-2}) is the squared bias of U_oa, due to the inclusion of combinations with duplicate units, such as the first column of (2.9). Essentially, U_oa is approximating the V-statistic, which is biased for Θ itself. The term o(m^{-1}) is due to the sampling variability when we draw one point from each selected group, that is, Step 3 of the algorithm. The term R(t)/m is due to the usage of the OA structure in place of a complete enumeration of all group combinations. Compared with the second term in (2.5) for the ICUR, which equals R(0)/m, we are able to eliminate all δ_j^2 with j ≤ t, owing to the projective uniformity of the OA in all t-dimensional projected spaces. If δ_j^2 = 0 for d' ≤ j ≤ d, an OA with strength t ≥ d' yields R(t) = 0. We discuss the hidden benefit of using a lower-strength OA in Example 2.

In the non-degenerate case, recall that MSE(U) ≍ n^{-1} and lim_{n→∞} Eff(U_RND) ≤ d/(1 + d) for the ICUR when m ≍ n. Under the same situation, Theorem 1 implies that U_oa is asymptotically efficient by simply taking t = d. In fact, stronger results can be derived for the ICUDO so that m is allowed to grow more slowly than n under various conditions. We give Theorem 2 here as one example; additional results can be found in the Supplementary Material.

Theorem 2.
Suppose (i) the kernel function g is Lipschitz continuous, and (ii) F has a density function f(x) > c for some fixed c > 0 and all x ∈ [a, b], and f(x) = 0 otherwise. For U_oa based on OA(m, d, L, t) with L ≤ n(log n)^{-1}, we have

MSE(U_oa) = MSE(U) + R(t)/m + O(1/(mL^2)) + O(1/n^2).  (2.12)

For t = d = 2, we automatically have R(t) = 0. If the conditions in Theorem 2 hold, we only need m ≻ √n to achieve Eff(U_oa) → 1, while the ICUR requires m ≻ n. In general, R(t) decreases in t, and could vanish if we take t large enough so that δ_j^2 = 0 for all j > t. Without knowledge of the δ_j^2, simply taking t = d will eliminate R(t) too. On the other hand, the term O(1/(mL^2)) in (2.12) is decreasing in L, meaning that the more groups we use to divide the data, the more homogeneous the units we could have in each group. However, L and t are subject to the constraint m = λL^t, where λ is the number of replicates of each t-tuple in the OA and is equal to one in all examples presented here. As a result, L and t cannot be increased simultaneously. To gain insight into the trade-off between L and t, we need to determine the constant term for O(1/(mL^2)). For this, we derive the following theorem. A more detailed discussion on how to choose L and t, given m, is provided in the Supplementary Material. Denote by U(0, 1) the uniform distribution on [0, 1].

Theorem 3.
Suppose g has a continuous first-order derivative on [0, 1]^d, X ∼ U(0, 1), and there exists some c ∈ (0, 1), such that L ≼ n^c. For U_oa based on OA(m, d, L, t),

MSE(U_oa) = MSE(U) + R(t)/m + (d/(12 m L^2)) Eγ^2(X_1, ..., X_d) + o(1/(mL^2)),  (2.13)

where γ(x_1, ..., x_d) = ∂g(x_1, ..., x_d)/∂x_1.

The assumption of a uniform distribution for X is not as strict as it seems. To see this, for X ∼ F, let Z = F(X) ∼ U(0, 1). Given g_F(Z_1, ..., Z_d) := g(F^{-1}(Z_1), ..., F^{-1}(Z_d)) = g(X_1, ..., X_d), we have the following corollary.

Corollary 1.
Suppose g_F has a continuous first-order derivative on [0, 1]^d, and there exists some c ∈ (0, 1), such that L ≼ n^c. Then, (2.13) still holds.

The term Eγ^2 in (2.13) provides a nice interpretation of the trade-off between t and L. When the kernel function g has a large variability (large Eγ^2), it is more challenging to make each group as homogeneous as possible, which enforces larger values of L. On the other hand, if g is quite flat on the domain (small Eγ^2), we prefer fewer groups to improve the strength of the OA.

Example 2.
The kernel function g(x_1, x_2, x_3) = x_1 x_2 x_3 estimates µ^3, where µ = E(X). We compare the performance of three methods: U_RND; U_oa^(1), based on OA(m, 3, √m, 2) with strength t = 2; and U_oa^(2), based on OA(m, 3, m^{1/3}, 3) with strength t = 3. The data consist of n i.i.d. observations simulated from N(µ, 1), where µ takes the values 0.5 and 2.

Table 2: Comparison of efficiencies in Example 2.

                µ = 0.5                                 µ = 2
m/n     Eff(U_RND)  Eff(U_oa^(1))  Eff(U_oa^(2))  Eff(U_RND)  Eff(U_oa^(1))  Eff(U_oa^(2))
0.005   0.133%      0.171%         —              1.110%      —              2.323%
0.01    0.290%      0.464%         —              2.485%      —              8.455%
0.05    1.291%      2.448%         —              10.31%      —              51.71%
0.1     2.936%      4.527%         —              20.13%      —              76.80%
0.5     12.58%      21.89%         —              50.78%      —              98.53%
1.0     21.05%      33.26%         —              67.51%      —              99.64%

In Table 2, both U_oa^(1) and U_oa^(2) outperform U_RND significantly. The advantage of the ICUDO over the ICUR is discussed below in additional examples. Furthermore, we find that the winning strategy changes from U_oa^(2) to U_oa^(1) as we increase the mean µ of the distribution. This observation illustrates the comments after Theorem 3 on the relevance of Eγ^2 in determining the optimal value of the strength t. That is, for a larger Eγ^2, we are more inclined to choose a smaller strength. This is validated by our second observation, together with Eγ^2 = (µ^2 + 1)^2, which increases in µ (> 0).

More generally, let φ: R → R be a one-to-one mapping. Denote by F_φ the distribution of the transformed random variable Z = φ(X), which leads to the following representation: g_φ(z_1, ..., z_d) := g(φ^{-1}(z_1), ..., φ^{-1}(z_d)) = g(x_1, ..., x_d). If (g_φ, F_φ) satisfies the conditions in these theorems, corresponding results also hold for the pair (g, F). For example, suppose g(x_1, x_2) = x_1^{-a} x_2^{-a} and F is a Pareto distribution with shape and scale parameters a and b, respectively. The Pareto distribution is neither light-tailed nor bounded, and hence violates the conditions in Theorem 2. By taking φ(x) = 1 − (b/x)^a, we have φ(X) ∼ U(0, 1), and the conditions hold for (g_φ, F_φ).

We now extend the ICUDO to multi-sample U-statistics. For k = 1, ..., K, let X_1^{(k)}, ..., X_{n_k}^{(k)} be a random sample of size n_k from the distribution F_k.
The UMVUE of

Θ = ∫ g(x_1^{(1)}, ..., x_{d_1}^{(1)}, ..., x_1^{(K)}, ..., x_{d_K}^{(K)}) dF_1(x_1^{(1)}) ... dF_K(x_{d_K}^{(K)})

is given by the generalized U-statistic

U = Π_{k=1}^{K} \binom{n_k}{d_k}^{-1} Σ_{η ∈ Π_{k=1}^{K} S_{n_k,d_k}} g(X_η),

where S_{n_k,d_k} = {η_k = (η_{k,1}, ..., η_{k,d_k}) : 1 ≤ η_{k,1} < η_{k,2} < ... < η_{k,d_k} ≤ n_k} and X_η = (X_{η_1}, ..., X_{η_K}) = (X^{(1)}_{η_{1,1}}, ..., X^{(1)}_{η_{1,d_1}}, ..., X^{(K)}_{η_{K,1}}, ..., X^{(K)}_{η_{K,d_K}}). The d (= Σ_{k=1}^{K} d_k)-dimensional kernel function g is symmetric in each d_k-dimensional sub-input {x_1^{(k)}, ..., x_{d_k}^{(k)}}. The generalized U-statistic reduces to the traditional U-statistic when K = 1. An incomplete generalized U-statistic is given by

Ũ = (1/m) Σ_{η ∈ S} g(X_η),  (2.14)

where S ⊂ Π_{k=1}^{K} S_{n_k,d_k} and m = |S|. We construct the multi-sample ICUDO as follows. For ease of illustration, we assume each n_k is a multiple of L.

Step 1. Let A_0 be an OA(m, d, L, t). Apply random level permutations {π_1, ..., π_d} to the columns of A_0 independently. Specifically, for each l ∈ Z_L, change all elements l in the j-th column of A_0 to π_j(l). The m rows of the resulting array A are denoted by {a_i = (a_{i1}, ..., a_{iK}) : i = 1, ..., m; a_{ik} ∈ Z_L^{d_k}, k = 1, ..., K}.

Step 2. For each k = 1, ..., K, create the partition Z_{n_k} = ∪_{l=1}^{L} G_l^{(k)}, such that |G_l^{(k)}| = n_k L^{-1} for l ∈ Z_L, and X_{i_1}^{(k)} ≤ X_{i_2}^{(k)} for any i_1 ∈ G_{l_1}^{(k)}, i_2 ∈ G_{l_2}^{(k)}, with l_1 < l_2. For any a = (a_1, ..., a_K) with a_k = (a_{k,1}, ..., a_{k,d_k}) ∈ Z_L^{d_k}, define

G_a = Π_{k=1}^{K} Π_{j=1}^{d_k} G^{(k)}_{a_{k,j}}.  (2.15)

Step 3. For i = 1, ..., m, independently draw an element η_i uniformly from G_{a_i}, where a_i is the i-th row of A, and set

U_oa = (1/m) Σ_{i=1}^{m} g(X_{η_i}).  (2.16)

An example is given in the Supplementary Material. For any j_{k,1}, ..., j_{k,d_k} ∈ Z_{d_k} and k ∈ Z_K, assume Eg^2(X^{(1)}_{j_{1,1}}, ..., X^{(1)}_{j_{1,d_1}}, ..., X^{(K)}_{j_{K,1}}, ..., X^{(K)}_{j_{K,d_K}}) < ∞. Let n_min = min{n_1, ..., n_K} and n_max = max{n_1, ..., n_K}. Here, we assume n_min ≍ n_max and L ≺ n_min. Let u = (u_1, ..., u_K), where u_k ⊆ Z_{d_k}. Define dF_u = Π_{k=1}^{K} Π_{j ∈ u_k} dF_k(x_j^{(k)}). For any u and x = (x_1^{(1)}, ..., x_{d_1}^{(1)}, ..., x_1^{(K)}, ..., x_{d_K}^{(K)}), we recursively define

g_u(x) = ∫ g(x) dF_{u^c},   h_u(x) = g_u(x) − Σ_{v ⊂ u} h_v(x),

where u^c = (u_1^c, ..., u_K^c) = (Z_{d_1} \ u_1, ..., Z_{d_K} \ u_K), g_∅(x) = Θ, h_∅(x) = 0, v = (v_1, ..., v_K), and v ⊂ u means v_k ⊆ u_k for all k, with v ≠ u. For each u, we can define σ_u^2 = Var(g_u) and δ_u^2 = Var(h_u). The MSE of the complete generalized U-statistic is given by Sen (1974) as

MSE(U) = Π_{k=1}^{K} \binom{n_k}{d_k}^{-1} Σ_{u = (u_1, ..., u_K)} { Π_{k=1}^{K} \binom{d_k}{|u_k|} \binom{n_k − d_k}{d_k − |u_k|} } σ_u^2.

Let |u| = Σ_{k=1}^{K} |u_k|. The generalized U-statistic and the kernel function are called order-q degenerate if σ_u^2 = Σ_{v ⊆ u} δ_v^2 = 0 for all u with |u| ≤ q, and there exists u' with |u'| = q + 1 such that σ_{u'}^2 > 0. We have MSE(U) = O(n_min^{-(q+1)}) in this case. For the non-degenerate case q = 0, we have MSE(U) ≍ n_min^{-1}. With a slight abuse of notation, let σ_{(j_1,...,j_K)}^2 = σ_u^2 and δ_{(j_1,...,j_K)}^2 = δ_u^2, for u = (u_1, ..., u_K), with |u_k| = j_k, k = 1, ..., K. For the ICUR, we have

MSE(U_RND) = MSE(U) + R(0)/m + O(1/(m n_min)),

R(t) = Σ_{u : |u| > t} δ_u^2 = Σ_{j_1=0}^{d_1} ··· Σ_{j_K=0}^{d_K} I(j_1 + ··· + j_K > t) Π_{k=1}^{K} \binom{d_k}{j_k} δ_{(j_1,...,j_K)}^2.

When K = 1, the last expression reduces to the form of R(t) in (2.10), while the first expression provides a parsimonious presentation for the multi-sample case. The corresponding properties of U_oa are given as follows.

Theorem 4.
For U_oa based on OA(m, d, L, t), for any pair (g, F), we have

MSE(U_oa) = MSE(U) + R(t)/m + o(1/m) + O(1/n_min^2).  (2.17)

Theorem 4 is basically a multi-sample version of Theorem 1, and its result can be strengthened in the same way. The details are omitted here to conserve space. We conclude this section with a machine learning example.

Example 3. (Ranking measure, Chen et al. (2009)). The ranking measure is an important topic in machine learning research. In the commonly used pairwise approach, the loss for a given classifier score function f is given by

L(f) = \binom{n}{2}^{-1} Σ_{1 ≤ i < j ≤ n} ℓ((X_i, Y_i), (X_j, Y_j); f),

for a pairwise loss ℓ, which is a U-statistic of order two. We consider two choices of the pairwise loss, denoted (i) and (ii), and compare the efficiencies of Ũ_oa, based on an OA of strength t = 2, and U_RND at different subsample sizes m. The results are summarized in Figure 1.

Figure 1: Comparison of efficiencies of Ũ_oa and U_RND with respect to subsample size m (from 2000 to 10000) for loss function (i) (left) and (ii) (right).

3. Debiased ICUDO for degenerate cases

Recall that the ICUDO procedure is actually biased, owing to the inclusion of combinations with duplicate units. The squared bias is O(n^{-2}) for any pair (g, F), which is negligible compared to Var(U) ≍ n^{-1} in the non-degenerate case. One can see that it is no longer negligible in the degenerate case. In this section, we propose a debiased version of the ICUDO.

We provide details for the multi-sample cases, where the one-sample cases are recovered by taking K = 1. To proceed, let S* = {(η_1, ..., η_K) : η_k = (η_{k,1}, ..., η_{k,d_k}) ∈ Z_{n_k}^{d_k}, η_{k,j} ≠ η_{k,j'} for any j ≠ j'}. The debiased ICUDO is constructed in the same way as the original, except that Step 3 changes as follows:

Step 3'. For i = 1, ..., m, independently draw η_i from the uniform distribution on G_{a_i} ∩ S*.

Adopting (2.14) with S*_oa = {η_1, ..., η_m}, we have the debiased ICUDO

Ũ_oa = (1/m) Σ_{i=1}^{m} ω_{η_i} g(X_{η_i}),  (3.3)

where ω_{η_i} = L^d |G_{a_i} ∩ S*| / |S*|.

Theorem 5.
Ũ_oa based on OA(m, d, L, t) is an unbiased estimator, and

MSE(Ũ_oa) = MSE(U) + R(t)/m + o(1/m).  (3.4)

Theorem 5 is analogous to Theorems 1 and 4 for the one-sample and multi-sample cases, respectively, except that the squared bias terms O(n^{-2}) and O(n_min^{-2}) are eliminated. Now, for an order-q degenerate U-statistic, the debiased ICUDO can be asymptotically efficient with m ≍ n^{q+1}, while the ICUR requires m ≻ n^{q+1}. Moreover, we can allow m to grow more slowly for the debiased ICUDO under some mild conditions on (g, F). For example, when d = 2, q = 1, and the conditions of Theorem 2 hold, the debiased ICUDO only needs m ≻ n to be asymptotically efficient, while the ICUR requires m ≻ n^2. For the general order q of degeneration, we have m*_oa = (m*_RND)^{d/(d+2)}, for all d, under the conditions in Theorem 2. Here, m*_oa and m*_RND represent the minimum m required for the ICUDO and ICUR, respectively, to be asymptotically efficient.

We conclude this section with the following multi-sample example. The kernel function is degenerate, and hence favors a debiased ICUDO. However, the highest-order δ^2-value vanishes, which encourages a lower strength of the OA. The comparison is made between the ICUR and different versions of the ICUDO.

Example 4. Let K = 2, d_1 = d_2 = 2, d = 4, and

g(x_1^{(1)}, x_2^{(1)}, x_1^{(2)}, x_2^{(2)}) = I(x_1^{(1)} < x_1^{(2)}, x_2^{(1)} < x_1^{(2)}) + I(x_1^{(2)} < x_1^{(1)}, x_2^{(2)} < x_1^{(1)}).

The construction of U_oa and the debiased Ũ_oa is based on OA(m, 4, m^{1/3}, 3) and OA(m, 4, m^{1/4}, 4). For F_1 and F_2, it can be verified that

Eg(X_1^{(1)}, X_2^{(1)}, X_1^{(2)}, X_2^{(2)}) = 2/3 + ∫ (F_1(x) − F_2(x))^2 d(F_1(x) + F_2(x))/2,

which indicates the similarity of F_1 and F_2. The null hypothesis of F_1 = F_2 is rejected when the U-statistic is significantly larger than 2/3. Note that the corresponding U-statistic is degenerate under the null hypothesis.
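As a quick numerical check (our own sketch in Python with NumPy; not code from the paper), the kernel of Example 4 indeed has mean 2/3 under the null F_1 = F_2:

```python
import numpy as np

def g(x1, x2, y1, y2):
    """Kernel of Example 4: two indicator terms, mean 2/3 under F1 = F2."""
    return (((x1 < y1) & (x2 < y1)).astype(float)
            + ((y1 < x1) & (y2 < x1)).astype(float))

rng = np.random.default_rng(4)
N = 200_000
x1, x2 = rng.standard_normal((2, N))  # inputs from sample 1 (here F1 = N(0, 1))
y1, y2 = rng.standard_normal((2, N))  # inputs from sample 2 (here F2 = N(0, 1))
est = g(x1, x2, y1, y2).mean()        # Monte Carlo estimate of E g
```

Each indicator term has probability E[F(Y)^2] = 1/3 under the null, and the two events are mutually exclusive, so the kernel takes values in {0, 1} with mean 2/3.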
See Table 3 for the simulation results, where both samples are simulated from N(0, 1) with equal sample sizes n_1 = n_2.

Note that in the g function of Example 4, the two separate parts are each functions of only three of the four inputs. Thus, R(3) = 0, and we can claim that t = 3 works better than t = 4, which is verified by the results in Table 3.

Table 3: Result of Example 4 (efficiencies at eight increasing values of m).

Eff(U_oa)         0.836%  10.9%   —      —      —      —      —      —
Eff(Ũ_oa, t = 3)  0.861%  9.50%   12.9%  25.2%  28.3%  29.8%  36.3%  39.0%
Eff(Ũ_oa, t = 4)  0.450%  4.96%   6.78%  10.6%  10.7%  11.9%  14.5%  15.6%
Eff(U_RND)        0.179%  0.701%  1.50%  2.93%  4.19%  7.84%  10.9%  13.1%

4. ICUDO for multi-dimensional data

Note that Step 2 of the ICUDO algorithm in Section 2 does not apply to multi-dimensional data because it relies on ordering the univariate data. To remedy this, we adopt a clustering algorithm to divide the data into homogeneous groups. In this regard, the clustered group sizes may vary. This necessitates a re-weighting procedure similar to the debiasing step in Section 3. To save space, we focus on the debiased ICUDO and adopt the notation of the multi-sample U-statistics in the study of multi-dimensional data. For k = 1, ..., K, let X_1^{(k)}, ..., X_{n_k}^{(k)} be a random sample of size n_k from the multi-dimensional distribution F_k. The algorithm is given as follows.

Step 1. Let A_0 be an OA(m, d, L, t). Apply random level permutations {π_1, ..., π_d} to the columns of A_0 independently. Specifically, for l ∈ Z_L, change all elements l in the j-th column of A_0 to π_j(l). The m rows of the resulting array A are denoted by {a_i = (a_{i1}, ..., a_{iK}) : i = 1, ..., m; a_{ik} ∈ Z_L^{d_k}, k = 1, ..., K}.

Step 2. Let P^{(k)} = {G_1^{(k)}, ..., G_L^{(k)}} denote an L-group partition from the clustering of {X_1^{(k)}, ..., X_{n_k}^{(k)}}. For any a = (a_1, ..., a_K), with a_k = (a_{k,1}, ..., a_{k,d_k}) ∈ Z_L^{d_k}, define

G_a = Π_{k=1}^{K} Π_{j=1}^{d_k} G^{(k)}_{a_{k,j}}.  (4.4)

Step 3. For i = 1, ..., m, independently draw an element η_i uniformly from G_{a_i} ∩ S*, where a_i is the i-th row of A. Let ω_{η_i} = L^d |G_{a_i} ∩ S*| / |S*|, and set

Ũ_oa = (1/m) Σ_{i=1}^{m} ω_{η_i} g(X_{η_i}).  (4.5)

An example of the construction is given in the Supplementary Material.

Theorem 6. Suppose ω_{η_i} → 1 uniformly as n, L → ∞. For Ũ_oa based on OA(m, d, L, t), we have

MSE(Ũ_oa) = MSE(U) + R(t)/m + o(1/m).  (4.6)

The R(t) in (4.6) is given by (2.10), except that the univariate distribution F is changed to a multi-dimensional distribution. The assumption in Theorem 6 naturally holds if we force the group sizes to be balanced in the clustering process. By applying the full-strength t = d OA to Theorem 6, we have the following corollary.

Corollary 2. For Ũ_oa based on OA(m, d, L, d), for any pair (g, F), we have

MSE(Ũ_oa) = MSE(U) + o(m^{-1}).  (4.7)

The choice of t has been discussed and illustrated in Examples 2 and 4. We do not compare different t in the following examples because d = 2 always holds, and so t ≤ 2. We always take t = 2, L = 10, 20, ..., 100, and m = L^t.

Example 5. (Kendall's tau, Chen and Kato (2019)). The kernel function is h((x_1, y_1), (x_2, y_2)) = 2I(x_1 < x_2, y_1 < y_2) + 2I(x_2 < x_1, y_2 < y_1) − 1. For simplicity, we assume that (X, Y) follows a normal distribution with µ = (0, 0) and a diagonal covariance matrix Σ. The MSE when estimating the Kendall correlation using U_RND and Ũ_oa is shown in Table 4. As a reference, MSE(U) is approximately 8 × 10^{-5}.

Table 4: Result of Example 5.

m            100   400    900    1600   2500    3600    4900    6400    8100    10000
MSE(U_RND)  .765  .191   .0903  .0515  .0260   .0195   .0167   .0137   .0098   .0089
MSE(Ũ_oa)   .075  .0096  .0032  .0015  .00063  .00035  .00023  .00014  .00011  .00009

Example 6. (Testing stochastic monotonicity, Lee et al. (2009)). Let (X, Y) be a real-valued random vector, and denote by F_{Y|X}(y|x) the conditional distribution function of Y, given X.
Consider the problem of testing the stochastic monotonicity hypothesis
\[ H_0: F_{Y|X}(y|x) \le F_{Y|X}(y|x'), \quad \forall y \in \mathbb{R} \text{ and whenever } x \ge x'. \]
This essentially tests whether an increase in $X$ would induce an increase in $Y$ (e.g., income vs. expenditure in a household). Lee et al. (2009) proposed the following test statistic:
\[ U_n(x, x') = \frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} \left( I\{Y_i \le x'\} - I\{Y_j \le x'\} \right) \mathrm{sign}(X_i - X_j)\, K(x - X_i) K(x - X_j), \tag{4.8} \]
where $K(x) = 0.75(1 - x^2)I(|x| \le 1)$. We simulate $(X, Y)$ from a normal distribution with $\mu = (0, 0)$ and diagonal $\Sigma$ with $\Sigma_{11} = 3$, and take $(x, x') = (0, \cdots)$. For $n = 10$, the comparison between $\tilde{U}_{oa}$ and $U_{RND}$ is given in Table 5. As a reference, $\mathrm{MSE}(U) = 2.\cdots$.

Table 5: Results of Example 6.

m            100    400    900    1600   2500   3600   4900   6400   8100   10000
MSE(U_RND)  302.7  69.01  38.01  17.45  12.86  8.613  7.438  6.273  4.886  4.327
MSE(Ũ_oa)   33.18  15.73  8.848  4.252  3.524  3.168  2.732  2.662  2.630  2.602

Example 7. (Clustering performance evaluation; Papa et al. (2015)). For a given distance $D: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined on $\mathcal{X}$, the performance of a partition $\mathcal{P}$ can be evaluated from the data $X_1, \ldots, X_n \in \mathcal{X}$ using
\[ W(\mathcal{P}) = \sum_{1 \le i < j \le n} D(X_i, X_j)\, I\{X_i \text{ and } X_j \text{ belong to the same cell of } \mathcal{P}\}. \]
We simulate the data from a normal distribution with $\mu = (0, 0)$ and diagonal $\Sigma$ with $\Sigma_{11} = 1$. The MSEs of $U_{RND}$ and $\tilde{U}_{oa}$ when estimating $W(\mathcal{P})$ for different $m$ are shown in Table 6. As a reference, $\mathrm{MSE}(U) = 1.\cdots \times 10^{-\cdots}$.

Table 6: Results of Example 7.

m            100   400    900    1600   2500    3600    4900    6400    8100    10000
MSE(U_RND)  .216  .0625  .0346  .0171  .0064   .0047   .0038   .0021   .0017   .0010
MSE(Ũ_oa)   .011  .0064  .0038  .0019  .00056  .00051  .00038  .00027  .00013  .00012

5. Conclusion

To tackle the computational issue of U-statistics, we have introduced a new type of incomplete U-statistic, called the ICUDO, which has much higher efficiency than existing methods. The required computational burden, as indexed by the number of combinations $m$ needed for the ICUDO to be statistically equivalent to the complete U-statistic, is of smaller magnitude than that of existing methods.
This was validated theoretically and empirically for degenerate and non-degenerate one- and multi-sample U-statistics on univariate and multi-dimensional data. In fact, $m$ is allowed to grow more slowly than the data size $n$ in the non-degenerate case.

The OA plays a critical role in the construction of the ICUDO, in light of its projective uniformity. Other space-filling design schemes exist with similar properties, such as the OA-based Latin hypercube by Tang (1993) and the strong orthogonal array by He and Tang (2012), which are used frequently in the design of computer experiments. Through exhaustive simulations, we find the improvement in efficiency offered by these design schemes over that of the ICUDO to be within 1%. This improvement is not sufficient to advocate using these structures, owing to the extra complexity of the computation. Other improvements over the OA are based on optimality criteria, such as the generalized minimum aberration OA. However, no theoretical results are available for these fixed structures.

Lastly, the following offer potential future research directions. (i) For high-dimensional data, dimension-reduction techniques need to be integrated into our current algorithm. (ii) For multi-sample cases, we may divide different samples into different numbers of groups in some optimal way; this will induce more complicated OA structures. (iii) For the purpose of statistical inference, it would be of interest to study the asymptotic distributions of the ICUDO under different conditions. (iv) Here, the dimension of the kernel function is fixed at $d$ as $n$ increases, and all data are generated independently. In one important type of U-statistic based on stochastic processes, $d$ increases with $n$ and the data can be dependent. These topics involve quite different methodologies, and hence are left to future work.

Supplementary Material

The online Supplementary Material generalizes the result of Theorem 2 under additional conditions. It also provides details on how to choose the combination of $L$ and $t$, and illustrates the generation of the ICUDO for the multi-sample and multi-dimensional cases.

Acknowledgments

Dr. Kong's research was partially supported by NSFC grant 11801033 and the Beijing Institute of Technology Research Fund Program for Young Scholars. Dr. Zheng's research was partially supported by the National Science Foundation, DMS-1830864.

Appendix. Proof of Theorems

Lemmas 1–3 contribute to the proof of Theorem 1. Theorems 4 and 6 can be proved similarly to Theorem 1, only with more tedious analysis; hence, their proofs are omitted to save space. For any $a \in \mathbb{Z}_L^d$, we call the set $G_a = \prod_{j=1}^d G_{a_j}$ a grid. Let $F_n$ be the empirical distribution of $\{X_1, \ldots, X_n\}$ and define $V = \int g(x_1, \ldots, x_d)\, dF_n(x_1) \cdots dF_n(x_d)$. For given $F_n$ and $\eta \in G_a$, define
\[ \bar{g}(X_\eta) = |G_a|^{-1} \sum_{\eta' \in G_a} g(X_{\eta'}). \]
For the same $S_{oa} = \{\eta_1, \ldots, \eta_m\}$ used in generating $U_{oa}$, define
\[ \bar{V} = \frac{1}{m} \sum_{i=1}^{m} \bar{g}(X_{\eta_i}). \]

Lemma 1. Some properties of $V$ and $\bar{V}$ are listed as follows.
(i) $\bar{V}$ is an unbiased estimator of $V$.
(ii) The bias of $V$ is of order $O(n^{-1})$ and $\mathrm{MSE}(V) = \mathrm{MSE}(U) + O(n^{-2})$.
(iii) $U_{oa}$ is an unbiased estimator of $V$, and so also has bias $O(n^{-1})$.

Proof. (i) follows from the unbiasedness of orthogonal arrays. (ii) can be found in Proposition 3.5 of Shao (2007) (page 211). (iii) follows from Owen (1992). □

Lemma 2.
\[ E(\bar{V} - V)^2 \le \frac{1}{m} \sum_{u : |u| > t} \left( \delta_u^2 + O(n^{-1}) \right). \]

Proof. Let $\delta_u = \delta_{|u|}$ and $\sigma_u = \sigma_{|u|}$. Changing the $F$ in Section 2.2 to $F_n$, we can define $dF_{n,u}$, $g_{n,u}$, $h_{n,u}$, $\sigma_{n,u}$ and $\delta_{n,u}$ analogously and sequentially. Again, by substituting $\bar{g}$ for $g$, with $F_n$, we define $\bar{g}_{n,u}$, $\bar{h}_{n,u}$, $\bar{\sigma}_{n,u}$ and $\bar{\delta}_{n,u}$.
Applying (3.5) of Owen (1992) to $\bar{g}$, we have
\[ E[(\bar{V} - V)^2 \mid F_n] \le \frac{1}{m} \sum_{u : |u| > t} \bar{\delta}_{n,u}^2 \le \frac{1}{m} \sum_{u : |u| > t} \delta_{n,u}^2, \]
which leads to $E(\bar{V} - V)^2 = E\left(E[(\bar{V} - V)^2 \mid F_n]\right) \le m^{-1} \sum_{u : |u| > t} E\delta_{n,u}^2$. Consider $\sigma_{n,u}^2 = \int g_{n,u}^2(x_1, \ldots, x_d)\, dF_n(x_1) \cdots dF_n(x_d)$, which can be further written as $\int \left( \int g_{n,u}\, dF_{n,u^c} \right)^2 dF_{n,u}$. This integral can be viewed as a V-statistic with the new kernel $g(x_1, \ldots, x_{|u|}, x_{|u|+1}, \ldots, x_d) \cdot g(x_1, \ldots, x_{|u|}, x_{d+1}, \ldots, x_{2d-|u|})$, which estimates $\sigma_u^2$ with bias $O(n^{-1})$. □

Lemma 3. (Lusin's theorem) For any measurable function $g$ on $\mathbb{R}^d$ and arbitrary $\epsilon > 0$, there exists a continuous $g_\epsilon$ defined on $\mathbb{R}^d$ with compact support such that $E|g - g_\epsilon| < \epsilon$.

Proof of Theorem 1. Define $g_F(Z_1, \ldots, Z_d) = g(F^{-1}(Z_1), \ldots, F^{-1}(Z_d))$, where $Z \sim U(0,1)$ and $F^{-1}(Z) \sim F$. With this new kernel $g_F$, the distribution of the random variables $X$ can be taken to be the uniform distribution on $[0,1]$. Decompose $U_{oa} - \Theta$ as $(U_{oa} - \bar{V}) + (\bar{V} - V) + (V - \Theta)$. Simple analysis reveals the following relationships among $U_{oa}$, $\bar{V}$ and $V$. Conditional on $F_n$, $V$ is constant, and so $E(U_{oa} - \bar{V})(V - \Theta) = 0$ and $E(\bar{V} - V)(V - \Theta) = 0$, since $E(U_{oa} - \bar{V}) = E(\bar{V} - V) = 0$. Conditional on both $V$ and $\bar{V}$, $E(U_{oa} - \bar{V}) = 0$, which indicates $E(U_{oa} - \bar{V})(\bar{V} - V) = 0$. Thus,
\[ \mathrm{MSE}(U_{oa}) = E(U_{oa} - \bar{V})^2 + E(\bar{V} - V)^2 + \mathrm{MSE}(V), \tag{6.6} \]
whose last two terms have been addressed by Lemma 2 and Lemma 1. So we need to prove $E(U_{oa} - \bar{V})^2 = o(m^{-1})$. Since $U_{oa}$ and $\bar{V}$ always use the same $S_{oa} = \{\eta_1, \ldots, \eta_m\}$,
\[ E(U_{oa} - \bar{V})^2 = E\left( \frac{1}{m} \sum_{i=1}^{m} \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right) \right)^2. \]
For $i_1 \ne i_2$ ($i_1, i_2 \in \mathbb{Z}_m$), $E\left(g(X_{\eta_{i_1}}) - \bar{g}(X_{\eta_{i_1}})\right)\left(g(X_{\eta_{i_2}}) - \bar{g}(X_{\eta_{i_2}})\right) = 0$. Denote $\eta \sim \eta'$ if $\eta$ and $\eta'$ belong to the same grid.
Then
\[ E(U_{oa} - \bar{V})^2 \le m^{-1} E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta']. \tag{6.7} \]
For any $M > 0$, define $g(x, M) = \max\{\min\{g(x), M\}, -M\}$. Obviously, $\lim_{M \to \infty} g(x, M) = g(x)$, and the dominated convergence theorem indicates
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] = \lim_{M \to \infty} E[(g(X_\eta, M) - g(X_{\eta'}, M))^2 \mid \eta \sim \eta']. \tag{6.8} \]
Thus, for arbitrary $\epsilon > 0$, we can find $M_\epsilon$ such that
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] \le E[(g(X_\eta, M_\epsilon) - g(X_{\eta'}, M_\epsilon))^2 \mid \eta \sim \eta'] + \epsilon. \tag{6.9} \]
Note that $\{X_1, \ldots, X_n\}$ are random, and so is $X_\eta$. Since $Eg^2(X_1, \ldots, X_d) < \infty$, we have $Eg^2(X_\eta) < \infty$ and so $E|g(X_\eta)| < \infty$, which indicates $Eg^2(X_\eta, M_\epsilon) < \infty$ and $E|g(X_\eta, M_\epsilon)| < \infty$. From Lusin's theorem, there exists a continuous $g^*_{\epsilon, M_\epsilon}$ with compact support such that $E|g(X_\eta, M_\epsilon) - g^*_{\epsilon, M_\epsilon}(X_\eta)| < \epsilon M_\epsilon^{-1}$. Since $|g(X_\eta, M_\epsilon)| \le M_\epsilon$,
\[
\begin{aligned}
&E[(g(X_\eta, M_\epsilon) - g(X_{\eta'}, M_\epsilon))^2 \mid \eta \sim \eta'] \\
&\quad \le 2M_\epsilon E[|g(X_\eta, M_\epsilon) - g(X_{\eta'}, M_\epsilon)| \mid \eta \sim \eta'] \\
&\quad \le 2M_\epsilon E|g(X_\eta, M_\epsilon) - g^*_{\epsilon,M_\epsilon}(X_\eta)| + 2M_\epsilon E|g(X_{\eta'}, M_\epsilon) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| + 2M_\epsilon E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta'] \\
&\quad \le 4\epsilon + 2M_\epsilon E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta'].
\end{aligned} \tag{6.10}
\]
Note that $g^*_{\epsilon, M_\epsilon}$ has compact support, and so is uniformly continuous.
There exists $\Delta(M_\epsilon^{-1}\epsilon)$ such that $|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \le \epsilon M_\epsilon^{-1}$ as long as $\|X_\eta - X_{\eta'}\| \le \Delta(M_\epsilon^{-1}\epsilon)$. Define $A = \{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon) \text{ for some } j \in \mathbb{Z}_d\}$, so that $P(A) \le \sum_{j=1}^{d} P\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}$, and $\|X_\eta - X_{\eta'}\| \le \Delta(M_\epsilon^{-1}\epsilon)$ on $A^c$. Then
\[
\begin{aligned}
&2M_\epsilon E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta'] \\
&\quad = 2M_\epsilon P(A^c) E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta', A^c] + 2M_\epsilon P(A) E[|g^*_{\epsilon,M_\epsilon}(X_\eta) - g^*_{\epsilon,M_\epsilon}(X_{\eta'})| \mid \eta \sim \eta', A] \\
&\quad \le 2\epsilon + 4M_\epsilon^2 \sum_{j=1}^{d} P\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}.
\end{aligned} \tag{6.11}
\]
Now we give the relationship among several events. For $j \in \mathbb{Z}_d$ and $\eta \sim \eta'$,
\[
\begin{aligned}
\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\} &= \{|X_{\eta_j} - F_n(X_{\eta_j}) + F_n(X_{\eta_j}) - F_n(X_{\eta'_j}) + F_n(X_{\eta'_j}) - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\} \\
&\subseteq \{\sup_{x \in (0,1)} |x - F_n(x)| \ge (3d)^{-1}\Delta(M_\epsilon^{-1}\epsilon)\} \cup \{F_n(X_{\eta_j}) - F_n(X_{\eta'_j}) \ge (3d)^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}.
\end{aligned}
\]
Since $\eta \sim \eta'$, as $L \to \infty$, $P(\{F_n(X_{\eta_j}) - F_n(X_{\eta'_j}) \ge (3d)^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}) \to 0$. The Dvoretzky–Kiefer–Wolfowitz inequality reveals $P\left( \sup_{x \in (0,1)} |F_n(x) - x| \ge \epsilon \right) \le 2\exp(-2n\epsilon^2)$.
So we immediately have $P(\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}) \to 0$ as $n, L \to \infty$, and we can find $n_\epsilon$ and $L_\epsilon$ such that
\[ P(\{|X_{\eta_j} - X_{\eta'_j}| \ge d^{-1}\Delta(M_\epsilon^{-1}\epsilon)\}) \le (4dM_\epsilon^2)^{-1}\epsilon \tag{6.12} \]
as long as $n \ge n_\epsilon$ and $L \ge L_\epsilon$.

Finally, by combining (6.8)–(6.12), we know that for arbitrary $\epsilon > 0$, we can find $n_\epsilon$ and $L_\epsilon$ such that $E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] \le C\epsilon$ for a constant $C$ not depending on $\epsilon$, as long as $n \ge n_\epsilon$ and $L \ge L_\epsilon$. That means
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] \to 0 \quad \text{as } n, L \to \infty. \tag{6.13} \]
Theorem 1 is concluded by substituting (6.13) into (6.7) and combining (6.7) with (6.6), Lemma 1(ii) and Lemma 2. □

Proof of Theorem 2. There exist $c_2 > c_1 > 0$ with $c_2 > f(\cdot) > c_1$ on $[a, b]$, so that $|F(x_1) - F(x_2)| \ge c_1|x_1 - x_2|$ for $x_1, x_2 \in [a, b]$. In (6.6), we only analyze $E(U_{oa} - \bar{V})^2$, since the other two terms are given by Lemma 1(ii) and Lemma 2. The Dvoretzky–Kiefer–Wolfowitz inequality reveals $P(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \ge \epsilon) \le 2\exp(-2n\epsilon^2)$. By taking $\epsilon = [\log(n) n^{-1}]^{1/2}$, we have $P(A) \le 2\exp(-2\log n) = O(n^{-2})$, where $A = \{\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \ge n^{-1/2}\log^{1/2}(n)\}$. Since $g$ is continuous and $F$ has bounded support, we can find $M$ such that $|g| \le M$, and so $|U_{oa}|, |\bar{V}| \le M$. Then
\[ E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] = P(A)E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A] + P(A^c)E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A^c] \le 4M^2 O(n^{-2}) + E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A^c]. \]
The analysis of $E[(U_{oa} - \bar{V})^2 \mid A^c]$ is as follows. On $A^c$ we have, for $1 \le k_1, k_2 \le nL^{-1}$,
\[ c_1 \left| X_{((l-1)nL^{-1} + k_1)} - X_{((l-1)nL^{-1} + k_2)} \right| \le \left| F(X_{((l-1)nL^{-1} + k_1)}) - F(X_{((l-1)nL^{-1} + k_2)}) \right| \le \left| F_n(X_{((l-1)nL^{-1} + k_1)}) - F_n(X_{((l-1)nL^{-1} + k_2)}) \right| + 2n^{-1/2}\log^{1/2} n \le L^{-1} + 2n^{-1/2}\log^{1/2} n. \]
Since $g$ is Lipschitz continuous, we know that $(g(X_\eta) - g(X_{\eta'}))^2 = O(L^{-2} + n^{-1}\log n)$ for any $\eta \sim \eta'$. Then we have $E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta', A^c] = O(L^{-2} + n^{-1}\log n)$. With this equation, Theorem 2 is the direct result of (6.6), (6.7), Lemma 1(ii) and Lemma 2. □

Proof of Theorem 3. For convenience, we simply write $g_F$ as $g$ in this proof. In (6.6), we only analyze $E(U_{oa} - \bar{V})^2$, since the other two terms are given by Lemma 1(ii) and Lemma 2. Each row of the matrix $A$ generated in Step 1 follows the uniform distribution on $\mathbb{Z}_L^d$, since the permutations of the columns of $A$ are independent. Thus,
\[ E(U_{oa} - \bar{V})^2 = E\left( \frac{1}{m} \sum_{i=1}^m \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right) \right)^2 = \frac{1}{mL^d} \sum_{a \in \mathbb{Z}_L^d} E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a]. \]
The analysis now focuses on $E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a]$ for every $a \in \mathbb{Z}_L^d$. Let $X_{(0)} = 0$ and $X_{(n+1)} = 1$. For $l \in \mathbb{Z}_L$, given $X_{((l-1)nL^{-1})}$ and $X_{(lnL^{-1}+1)}$, the values $X_{((l-1)nL^{-1}+1)}, \ldots, X_{(lnL^{-1})}$ have the same distribution as the order statistics of $nL^{-1}$ samples following the uniform distribution on $[X_{((l-1)nL^{-1})}, X_{(lnL^{-1}+1)}]$. For $A = \{\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \ge n^{-1/2+c}\}$, the Dvoretzky–Kiefer–Wolfowitz inequality reveals $P(A) \le 2\exp(-2n^{2c})$. On $A^c$, we have $L\left( X_{(lnL^{-1}+1)} - X_{((l-1)nL^{-1})} \right) \to 1$ as $n \to \infty$. The analysis now focuses on $E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a, A^c]$. For this given $a$, define $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_d)$, where $\bar{X}_j = \frac{L}{n}\sum_{\eta_j \in G_{a_j}} X_{\eta_j}$, so that $\sum_{\eta_j \in G_{a_j}} (X_{\eta_j} - \bar{X}_j) = 0$. Adopting a Taylor expansion at $\bar{X}$, we have
\[ g(X_\eta) = g(\bar{X}) + \sum_{j=1}^{d} \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} (X_{\eta_j} - \bar{X}_j) + O(L^{-2}) \]
and $\bar{g}(X_\eta) = g(\bar{X}) + O(L^{-2})$.
Therefore,
\[
\begin{aligned}
E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a, A^c] &= E\left[ \left( \sum_{j=1}^{d} \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} (X_{\eta_j} - \bar{X}_j) + O(L^{-2}) \right)^2 \,\middle|\, \eta \in G_a, A^c \right] \\
&= o(L^{-2}) + \sum_{j=1}^{d} E\left[ \left( \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} \right)^2 (X_{\eta_j} - \bar{X}_j)^2 \,\middle|\, \eta \in G_a, A^c \right] = o(L^{-2}) + \sum_{j=1}^{d} \left( \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} \right)^2 \frac{1}{12L^2}.
\end{aligned}
\]
And then we have
\[ E(U_{oa} - \bar{V})^2 = \frac{1}{mL^d} \sum_{a \in \mathbb{Z}_L^d} E[(g(X_\eta) - \bar{g}(X_\eta))^2 \mid \eta \in G_a] = \frac{1}{12mL^2} \sum_{j=1}^{d} \frac{1}{L^d} \sum_{a \in \mathbb{Z}_L^d} \left( \left. \frac{\partial g}{\partial x_j} \right|_{\bar{X}} \right)^2 + o\left( \frac{1}{mL^2} \right) = \frac{1}{12mL^2} \sum_{j=1}^{d} E\left( \frac{\partial g}{\partial x_j} \right)^2 + o\left( \frac{1}{mL^2} \right). \]
Then Theorem 3 is the direct result of (6.6), Lemma 1(ii) and Lemma 2. □

Proof of Theorem 5. Consider the $m$ rows of $A$, $a_1, \ldots, a_m$, generated in Step 1 of the construction in Section 2.1. For any $a \in \mathbb{Z}_L^d$, the random permutations used in generating $a_1, \ldots, a_m$ reveal that $P(a_i = a) = L^{-d}$. Given $F_n$,
\[ E(\tilde{U}_{oa} \mid F_n) = E\left[ \frac{1}{m} \sum_{i=1}^{m} \omega_{\eta_i} g(X_{\eta_i}) \right] = E\,\omega_\eta g(X_\eta) = \sum_{a \in \mathbb{Z}_L^d} L^{-d} E[\omega_\eta g(X_\eta) \mid \eta \in G_a] = \sum_{a \in \mathbb{Z}_L^d} \frac{|G_a \cap S^*|}{|S^*|} \left( \frac{1}{|G_a \cap S^*|} \sum_{\eta \in G_a \cap S^*} g(X_\eta) \right) = \frac{1}{|S^*|} \sum_{\eta \in S^*} g(X_\eta) = U. \]
Since $U$ is unbiased, so is $\tilde{U}_{oa}$. This proves the unbiasedness of $\tilde{U}_{oa}$. The MSE of $\tilde{U}_{oa}$ can be analyzed similarly to that in Theorem 1, and so the analysis is omitted here. □

References

Bickel, P. J. and Freedman, D. (1981). Some asymptotic theory for the bootstrap. Annals of Statistics, 1196–1217.
Blom, G. (1976). Some properties of incomplete U-statistics. Biometrika, 573–580.
Bretagnolle, J. (1983). Lois limites du bootstrap de certaines fonctionnelles. Annales de l'Institut Henri Poincaré, Section B, XIX(3), 281–296.
Brown, B. M. and Kildea, D. G. (1978). Reduced U-statistics and the Hodges-Lehmann estimator.
Annals of Statistics, 828–835.
Chen, X. H. and Kato, K. (2019). Randomized incomplete U-statistics in high dimensions. Annals of Statistics, to appear.
Chen, W., Liu, T. Y., Lan, Y. Y., Ma, Z. M. and Li, H. (2009). Ranking measures and loss functions in learning to rank. Advances in Neural Information Processing Systems 22 (NIPS 2009).
Dehling, H. and Mikosch, T. (1994). Random quadratic forms and the bootstrap for U-statistics. Journal of Multivariate Analysis, 392–413.
Eagleson, G. K. (1979). Orthogonal expansions and U-statistics. Australian & New Zealand Journal of Statistics, 221–237.
Enqvist, E. (1985). A note on incomplete U-statistics for stationary absolutely regular processes. Contributions to Probability and Statistics in Honour of Gunnar Blom, University of Lund, 97–103.
He, Y. Z. and Tang, B. X. (2012). Strong orthogonal arrays and associated Latin hypercubes for computer experiments. Biometrika, 254–260.
Herrndorf, N. (1986). An invariance principle for reduced U-statistics. Metrika, 179–188.
Hilbert, M. and López, P. (2011). The world's technological capacity to store, communicate, and compute information. Science, 60–65.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 293–325.
Huskova, M. and Janssen, P. (1993a). Consistency of the generalized bootstrap for degenerate U-statistics. Annals of Statistics, 1811–1823.
Huskova, M. and Janssen, P. (1993b). Generalized bootstrap for studentized U-statistics: a rank statistic approach. Statistics and Probability Letters, 225–233.
Janson, S. (1984). The asymptotic distributions of incomplete U-statistics. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1159–1169.
Lee, A. J. (1979). On the asymptotic distribution of U-statistics. Institute of Statistics Mimeo Series, University of North Carolina at Chapel Hill.
Lee, A. J. (1982). On incomplete U-statistics having minimum variance. Australian & New Zealand Journal of Statistics, 275–282.
Lee, A. J. (1990). U-Statistics: Theory and Practice. CRC Press.
Lee, S., Linton, O. and Whang, Y. (2009). Testing for stochastic monotonicity. Econometrica, 585–602.
Lehmann, E. L. (1951). Consistency and unbiasedness of certain nonparametric tests. Annals of Statistics, 165–179.
Lin, N. and Xi, R. (2010). Fast surrogates of U-statistics. Computational Statistics and Data Analysis.
Owen, A. B. (1992). Orthogonal arrays for computer experiments. Statistica Sinica, 439–452.
Papa, G., Clémençon, S. and Bellet, A. (2015). SGD algorithms based on incomplete U-statistics: large-scale minimization of empirical risk. Advances in Neural Information Processing Systems 28 (NIPS 2015).
Rempala, G. and Wesolowski, J. (2003). Incomplete U-statistics of permanent design. Journal of Nonparametric Statistics, 221–236.
Rempala, G. and Srivastav, S. (2004). Minimum variance rectangular designs for U-statistics. Journal of Statistical Planning and Inference, 103–118.
Sen, P. K. (1974). Weak convergence of generalized U-statistics. Annals of Probability, 90–102.
Sen, P. K. (1977). Almost sure convergence of generalized U-statistics. Annals of Probability, 287–290.
Shao, J. (2007). Mathematical Statistics (second edition). Springer.
Tang, B. X. (1993). Orthogonal array-based Latin hypercubes. Journal of the American Statistical Association, 1392–1397.
von Mises, R. (1948). On the asymptotic distribution of differentiable statistical functions. Annals of Mathematical Statistics, 309–348.
Wang, Q. (2012). Investigation of topics in U-statistics and their applications in risk estimation and cross-validation. Ph.D. dissertation, Penn State University, 1–187.
Wang, Q. and Lindsay, B. (2014). Variance estimation of a general U-statistic with application to cross-validation.
Statistica Sinica, 1117–1141.

School of Mathematics and Statistics, Beijing Institute of Technology. E-mail: [email protected]
Department of Business Analytics and Statistics, University of Tennessee. E-mail: [email protected]

Supplementary Material

Generalization of Theorem 2.

The following conditions on $g$ or $F$ will be needed by Theorem 2 in Section 2 and Theorems 7–9 in this section.

(g.1) Lipschitz continuity: The function $g: \mathbb{R}^d \to \mathbb{R}$ is said to be Lipschitz continuous if there exists a constant $c_0 > 0$ such that $|g(a_1) - g(a_2)| \le c_0 \|a_1 - a_2\|$ for any $a_1, a_2 \in \mathbb{R}^d$. Example: first-order polynomial functions.

(g.2) Order-$p$ continuity: The function $g: \mathbb{R}^d \to \mathbb{R}$ is said to be order-$p$ continuous if there exist a constant $c_0 > 0$ and a function $\phi$ with $\phi(a_1, a_2) \le c_0 + \max^p(\|a_1\|, \|a_2\|)$ such that $|g(a_1) - g(a_2)| \le \phi(a_1, a_2)\|a_1 - a_2\|$ for any $a_1, a_2 \in \mathbb{R}^d$. Example: all polynomial functions.

(g.3) Uniformly bounded variation: For a real-valued function $f: \mathbb{R} \to \mathbb{R}$, the total variation of $f$ is defined as $V_{\mathbb{R}}(f) = \sup_{p > 0} \sup_{-\infty < x_0 < \cdots < x_p < \infty} \sum_{i=1}^{p} |f(x_i) - f(x_{i-1})|$.

Lemma 4. Suppose $F$ is light-tailed. Let $X_{\max} = \max\{|X_1|, \ldots, |X_n|\}$. Then, for arbitrary $a > 0$, as $n \to \infty$, we have $EX_{\max}^a = O(\log n)^a$.

Proof. Since the distribution is light-tailed, we have $P(|X| > x) \le e^{-c_2 x}$ for any $|x| > c_1$, where $c_1$ and $c_2$ are two fixed positive numbers. Then
\[ E(X_{\max})^a = \int_{x > 0} a x^{a-1} P(X_{\max} > x)\, dx \le \int_0^{c_2^{-1}\log n} a x^{a-1}\, dx + \int_{c_2^{-1}\log n}^{\infty} a x^{a-1} P(X_{\max} > x)\, dx = O(\log n)^a + \int_{c_2^{-1}\log n}^{\infty} a x^{a-1} n e^{-c_2 x}\, dx = O(\log n)^a + O(1). \] □

Lemma 5. Suppose (i) $g$ is order-$p$ continuous, and (ii) $F$ is light-tailed. We have
\[ E(U_{oa} - \bar{V})^2 = O\left( \frac{(\log n)^{p+2}}{mL} \right). \]

Proof. Let $X_{\max} = \max\{|X_1|, \ldots, |X_n|\}$. For $l \in \mathbb{Z}_L$, define $d_l = \max\{|X_{i_1} - X_{i_2}| : i_1, i_2 \in G_l\}$.
Since $g$ is order-$p$ continuous, for $\eta \sim \eta'$ in $G_a$ we have $|g(X_\eta) - g(X_{\eta'})| \le (c_0 + X_{\max}^p) \sum_{j=1}^{d} d_{a_j}$, and so $(g(X_\eta) - g(X_{\eta'}))^2 \le (c_0 + X_{\max}^p)^2 \cdot d \cdot \sum_{j=1}^{d} d_{a_j}^2$. Since $U_{oa}$ and $\bar{V}$ always use the same $S_{oa} = \{\eta_1, \ldots, \eta_m\}$, we have
\[ E(U_{oa} - \bar{V})^2 = E\left( \frac{1}{m} \sum_{i=1}^{m} \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right) \right)^2. \]
For $i_1 \ne i_2$, $E\left(g(X_{\eta_{i_1}}) - \bar{g}(X_{\eta_{i_1}})\right)\left(g(X_{\eta_{i_2}}) - \bar{g}(X_{\eta_{i_2}})\right) = 0$. Hence
\[ E(U_{oa} - \bar{V})^2 = m^{-2} E \sum_{i=1}^{m} \left( g(X_{\eta_i}) - \bar{g}(X_{\eta_i}) \right)^2 \le m^{-2} E \sum_{i=1}^{m} (c_0 + X_{\max}^p)^2 \cdot d \cdot \sum_{j=1}^{d} d_{a_{ij}}^2. \]
Since $\sum_{l=1}^{L} d_l \le 2X_{\max}$, we have $\sum_{l=1}^{L} d_l^2 \le 4X_{\max}^2$. Using Lemma 4 and the fact that each level appears $mL^{-1}$ times in each column, we have
\[ E(U_{oa} - \bar{V})^2 \le m^{-2} d\, E\left( (c_0 + X_{\max}^p)^2 \sum_{j=1}^{d} \sum_{i=1}^{m} d_{a_{ij}}^2 \right) = m^{-2} d\, E\left( (c_0 + X_{\max}^p)^2 \sum_{j=1}^{d} \frac{m}{L}\, 4X_{\max}^2 \right) = O\left( \frac{(\log n)^{p+2}}{mL} \right). \] □

Theorem 7. Suppose (i) the kernel function $g$ is order-$p$ continuous, and (ii) $F$ is light-tailed. For $U_{oa}$ based on an $OA(m, d, L, t)$, we have
\[ \mathrm{MSE}(U_{oa}) = \mathrm{MSE}(U) + \frac{R(t)}{m} + O\left( \frac{(\log n)^{p+2}}{mL} \right) + O\left( \frac{1}{n^2} \right). \tag{6.14} \]
Proof. This is the direct result of (6.6), Lemma 1(ii), and Lemmas 2 and 5. □

Theorem 8. Suppose the kernel function $g$ has uniformly bounded variation. For $U_{oa}$ based on an $OA(m, d, L, t)$, we have
\[ \mathrm{MSE}(U_{oa}) = \mathrm{MSE}(U) + \frac{R(t)}{m} + O\left( \frac{1}{mL} \right) + O\left( \frac{1}{n^2} \right). \tag{6.15} \]
Proof. From (6.6), Lemma 1(ii) and Lemma 2, we only need to prove $E(U_{oa} - \bar{V})^2 = O(m^{-1}L^{-1})$. First, we introduce some notation that will be used only in the proof of this theorem. Given the order statistics of $\{X_1, \ldots, X_n\}$, denoted by $X_{(1)}, \ldots, X_{(n)}$, for $l = 1, \ldots, L$ and $(x_2, \ldots, x_d) \in \mathbb{R}^{d-1}$, define
\[ D(l \mid x_2, \ldots, x_d) = \max\left\{ \left| g(X_{(i_1)}, x_2, \ldots, x_d) - g(X_{(i_2)}, x_2, \ldots, x_d) \right| : (l-1)nL^{-1} < i_1, i_2 \le lnL^{-1} \right\}, \]
so that $\sum_{l=1}^{L} D(l \mid x_2, \ldots, x_d)$ is bounded by a constant $c > 0$ under uniformly bounded variation.
So we have
\[ \sum_{a \in \mathbb{Z}_L^d} \sum_{\eta \in G_a} \sum_{\eta' \in G_a} D(a_1 \mid X_{\eta_2}, \ldots, X_{\eta_d}) \le c L^{d-1} |G_a|^2. \]
Analyzing $D(a_2 \mid X_{\eta'_1}, X_{\eta_3}, \ldots, X_{\eta_d}), \ldots, D(a_d \mid X_{\eta'_1}, \ldots, X_{\eta'_{d-1}})$ similarly, we have $E[(g(X_\eta) - g(X_{\eta'}))^2 \mid \eta \sim \eta'] = O(L^{-1})$, and so $E(U_{oa} - \bar{V})^2 = O(m^{-1}L^{-1})$. Theorem 8 is the direct result of (6.6), Lemma 1(ii) and Lemma 2. □

Theorem 9. Suppose (i) the kernel function $g$ is a linear combination of some order-$p$ continuous functions and some uniformly bounded-variation functions, and (ii) $F$ is light-tailed. Then (6.14) still holds with $L \le n(\log n)^{-1}$.

Proof. This is the direct result of Theorems 7 and 8. □

Choosing $L$ and $t$.

From (2.13) of Theorem 3 in the manuscript and the relation $m = \lambda L^t$, we know that the trade-off between $L$ and $t$ depends on the variance of each component in the Hoeffding decomposition, i.e., $\delta_j^2$, $j = 1, \ldots, d$. We shall give estimators $\hat{\delta}_j^2$ of these variances. Using (2.13), with $R(t)$ and $E\gamma(X_1, \ldots, X_d)$ estimated as functions of the $\hat{\delta}_j^2$, we should choose the combination of $L$ and $t$ that minimizes
\[ \phi(L, t) = \frac{\hat{R}(t)}{m} + \frac{d}{mL^2}\, \hat{E}\gamma(X_1, \ldots, X_d), \]
where $\hat{R}(t)$ and $\hat{E}\gamma(X_1, \ldots, X_d)$ are functions of the $\hat{\delta}_j^2$.

Now we provide two methods for generating the $\hat{\delta}_j^2$. (1) When the Hoeffding decomposition is easy to calculate, one can write down the analytical expression and estimate the $\delta_j^2$ directly. (2) We can use a bootstrap approach. With a small sample size $n' \ll n$, it is easy to bootstrap $\mathrm{MSE}(U)$ of the complete U-statistic. For details of the bootstrap approach, we refer to Huskova and Janssen (1993a,b). Now, let us review the formula for $\mathrm{MSE}(U)$:
\[ \mathrm{MSE}(U) = \binom{n}{d}^{-1} \sum_{j=1}^{d} \binom{d}{j} \binom{n-d}{d-j} \sigma_j^2 = \sum_{j=1}^{d} \binom{d}{j}^2 \binom{n}{j}^{-1} \delta_j^2. \]
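The identity above expresses $\mathrm{MSE}(U)$ as a linear combination of the $\delta_j^2$, so bootstrapped values of $\mathrm{MSE}(U)$ at a few small subsample sizes $n'$ determine the $\delta_j^2$ through a linear system. The recovery step can be sketched numerically as follows; this is our own minimal illustration, with hypothetical helper names, and it takes the bootstrapped MSE values as given.

```python
from math import comb

def mse_coefficients(n, d):
    """Coefficients of (delta_1^2, ..., delta_d^2) in
    MSE(U) = sum_j C(d, j)^2 C(n, j)^{-1} delta_j^2."""
    return [comb(d, j) ** 2 / comb(n, j) for j in range(1, d + 1)]

def solve(A, b):
    """Tiny Gauss-Jordan elimination with partial pivoting
    for the small d x d system arising here."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def estimate_deltas(subsample_sizes, bootstrapped_mses, d):
    """Recover (delta_j^2) from d bootstrapped MSE(U) values
    computed at d small subsample sizes n' > d."""
    A = [mse_coefficients(n, d) for n in subsample_sizes]
    return solve(A, bootstrapped_mses)
```

For $d = 3$ and $n' = 4, 5, 6$ (the setting used in the illustration below), the $3 \times 3$ coefficient matrix is invertible, so the three bootstrapped MSE values pin down $(\hat{\delta}_1^2, \hat{\delta}_2^2, \hat{\delta}_3^2)$ uniquely.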
Usually, with at most $d$ different $n'$ ($> d$), we can generate linear equations in the $\delta_j^2$ from the $d$ different bootstrapped values of $\widehat{\mathrm{MSE}}(U)$, and the solution of these linear equations can be used as the estimates $\hat{\delta}_j^2$.

For the second method, we now use the setup in Example 1 for illustration. For convenience, we set $n = 10$ and $m = 10^6$, so that the two candidate combinations of $L$ and $t$ are $(L = 100, t = 3)$ and $(L = 1000, t = 2)$. We use the bootstrap method to estimate the variance of the complete U-statistic with $n' = 4, 5, 6$. The subsample size $n'$ is so small that the computational burden of the bootstrapped complete U-statistic, i.e., $\binom{n'}{d}$ evaluations, is negligible. Simulation yields $\hat{\delta}_1^2 = 0.\cdots$, $\hat{\delta}_2^2 = 0.\cdots$ and $\hat{\delta}_3^2 = 1.\cdots$, which suggests that $t = 3$ shall work better than $t = 2$; this is verified by the simulation results. Actually, with $m = 10^6$, the efficiency of $U_{oa}$ is 100.0% when $t = 3$ and 97.88% when $t = 2$.

Examples for the multi-sample and multi-dimensional cases.

Consider the multi-sample case. Suppose $d_1 = d_2 = 2$, $n_1 = n_2 = 9$, and the two samples are ordered as
\[ X^{(1)}_6 \le X^{(1)}_8 \le X^{(1)}_2 \le X^{(1)}_4 \le X^{(1)}_7 \le X^{(1)}_5 \le X^{(1)}_3 \le X^{(1)}_9 \le X^{(1)}_1, \]
\[ X^{(2)}_2 \le X^{(2)}_7 \le X^{(2)}_3 \le X^{(2)}_6 \le X^{(2)}_1 \le X^{(2)}_4 \le X^{(2)}_5 \le X^{(2)}_9 \le X^{(2)}_8. \]
Then we have $L = 3$ groups, listed as $G^{(1)}_1 = \{6, 8, 2\}$, $G^{(1)}_2 = \{4, 7, 5\}$, $G^{(1)}_3 = \{3, 9, 1\}$ and $G^{(2)}_1 = \{2, 7, 3\}$, $G^{(2)}_2 = \{6, 1, 4\}$, $G^{(2)}_3 = \{5, 9, 8\}$. An example of the $OA(m = 9, d = 4, L = 3, t = 2)$ in Step 1 is given as follows in transposed form:
\[ A^T = \begin{pmatrix} 1 & 1 & 1 & 2 & 2 & 2 & 3 & 3 & 3 \\ 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 \\ 1 & 2 & 3 & 2 & 3 & 1 & 3 & 1 & 2 \\ 1 & 2 & 3 & 3 & 1 & 2 & 2 & 3 & 1 \end{pmatrix}. \]
Then we could possibly have the $X_{\eta_i}$, $i = 1, \ldots, 9$, used in the 9-run multi-sample construction as follows:
\[ \{X_{\eta_1}, \ldots, X_{\eta_9}\} = \begin{pmatrix} X^{(1)}_8 & X^{(1)}_2 & X^{(1)}_6 & X^{(1)}_4 & X^{(1)}_4 & X^{(1)}_5 & X^{(1)}_9 & X^{(1)}_1 & X^{(1)}_9 \\ X^{(1)}_6 & X^{(1)}_7 & X^{(1)}_3 & X^{(1)}_8 & X^{(1)}_7 & X^{(1)}_1 & X^{(1)}_6 & X^{(1)}_7 & X^{(1)}_3 \\ X^{(2)}_7 & X^{(2)}_1 & X^{(2)}_5 & X^{(2)}_4 & X^{(2)}_8 & X^{(2)}_3 & X^{(2)}_9 & X^{(2)}_2 & X^{(2)}_6 \\ X^{(2)}_3 & X^{(2)}_6 & X^{(2)}_9 & X^{(2)}_5 & X^{(2)}_3 & X^{(2)}_1 & X^{(2)}_6 & X^{(2)}_8 & X^{(2)}_3 \end{pmatrix}. \]

Consider the multi-dimensional case. Suppose $X_1, \ldots, X_9 \in \mathbb{R}^2$, and suppose a clustering algorithm divides them into the $L = 3$ groups $G_1$, $G_2$ and $G_3$. The choice of the $\eta_i$, $i = 1, \ldots$