Generalized Spacing-Statistics and a New Family of Non-Parametric Tests
Dan D. Erdmann-Pham, Jonathan Terhorst, and Yun S. Song

Department of Mathematics, University of California, Berkeley, CA 94720
Department of Statistics, University of Michigan, Ann Arbor, MI 48109
Department of Statistics, University of California, Berkeley, CA 94720
Computer Science Division, University of California, Berkeley, CA 94720
Chan Zuckerberg Biohub, San Francisco, CA 94158

August 18, 2020
Abstract
Random divisions of an interval arise in various contexts, including statistics, physics, and geometric analysis. For testing the uniformity of a random partition of the unit interval $[0,1]$ into $k$ disjoint subintervals of size $(S_k[1], \ldots, S_k[k])$, Greenwood (1946) suggested using the squared $\ell_2$-norm of this size vector as a test statistic, prompting a number of subsequent studies. Despite much progress on understanding its power and asymptotic properties, attempts to find its exact distribution have succeeded so far only for small values of $k$. Here, we develop an efficient method to compute the distribution of the Greenwood statistic and more general spacing-statistics for an arbitrary value of $k$. Specifically, we consider random divisions of $\{1, 2, \ldots, n\}$ into $k$ subsets of consecutive integers and study $\|S_{n,k}\|_{p,w}^p$, the $p$th power of the weighted $\ell_p$-norm of the subset size vector $S_{n,k} = (S_{n,k}[1], \ldots, S_{n,k}[k])$ for arbitrary weights $w = (w_1, \ldots, w_k)$. We present an exact and quickly computable formula for its moments, as well as a simple algorithm to accurately reconstruct a probability distribution using the moment sequence. We also study various scaling limits, one of which corresponds to the Greenwood statistic in the case of $p = 2$ and $w = (1, \ldots, 1)$. Based on these results, we propose a new family of non-parametric tests using $\|S_{n,k}\|_{p,w}^p$ and demonstrate that they exhibit substantially improved power for a large class of alternatives, compared to existing popular methods such as the Kolmogorov-Smirnov, Cramér-von Mises, and Mann-Whitney/Wilcoxon rank-sum tests.

1 Introduction

In an address to the Royal Statistical Society, epidemiologist and statistician Greenwood (1946) introduced what became known as Greenwood's statistic in order to study the randomness of infection times $T_1, \ldots, T_k$ of a given number $k$ of patients.
More precisely, he was interested in testing the hypothesis of whether these infection times were generated by a homogeneous Poisson process; that is, whether $(T_1, T_2, \ldots, T_k) / (\sum_{j=1}^k T_j)$ is distributed uniformly on the $(k-1)$-dimensional simplex. To this end he studied the squared $\ell_2$-norm of $(T_1, T_2, \ldots, T_k) / (\sum_{j=1}^k T_j)$, for which he was able to provide a complete description for $k = 2$, but none for $k >$
2. This sparked a flurry of studies attempting both to better understand Greenwood's statistic in higher dimensions, and to clarify its power and efficiency as a hypothesis test. In a series of papers, Moran (1947, 1951, 1953) computed the test statistic's first four moments for an arbitrary $k$, proved a central limit theorem as $k \to \infty$ (albeit with very slow convergence), and estimated the test's efficiency against a specific class of alternatives. Around the same time, Gardner (1952) fully described the case $k = 3$ by fruitfully interpreting the distribution function of the test statistic as the volume of intersection between a sphere and the simplex. Subsequently, Darling (1953) found a closed-form, but difficult to invert, characteristic function, while Weiss (1956) and Sethuraman and Rao (1970) investigated the role of Greenwood's statistic in the context of the general goodness-of-fit test of whether a sample $X_1, \ldots, X_n$ follows a given arbitrary distribution $F$. There they proved that a test based on Greenwood's statistic is both most powerful against symmetric linear alternatives, and enjoys the greatest asymptotic relative efficiency against a wide class of tests and alternative hypotheses. Given these favorable properties, but analytically intractable nature, of Greenwood's statistic, approximations and numerical schemes were devised by Burrows (1979), Currie (1981) and Stephens (1981) to tabulate scores for certain significance levels up to $k = 20$.
Most recently, Schechtman and Zinn (2000) exploited the geometric interpretation of Gardner (1952) to derive an explicit rate function characterizing the large deviations of Greenwood's statistic as $k \to \infty$.

Closely connected to Greenwood's statistic, and applicable in an equal variety of settings, is a discretized version of it: instead of sampling uniformly from the simplex $\Delta^{k-1}$, the null distribution may be uniform over all integral points in the scaled simplex $n \cdot \Delta^{k-1}$ for some $n \in \mathbb{Z}_+$. Tests and computations based on such measures occur in various contexts in computational biology (e.g., Palamara et al., 2018; Riley et al., 2007), physics (where it is known as the Bose-Einstein distribution) and theoretical statistics, where it emerges naturally as an urn model (e.g., Holst, 1979). Greenwood's statistic has a natural analogue in this discretized scenario, where it is known as Dixon's statistic after its proposed use by Dixon (1940) for performing non-parametric two-sample testing. However, although its asymptotic behavior has been well studied by Holst and Rao (1980), a description in the non-asymptotic regime, as well as convergence rates, has, to the best of our knowledge, remained elusive.

Here we fill this gap by studying a generalized family of Greenwood's statistics (including Dixon's statistic) for finite $n$ and $k$, for which we are able to exactly and efficiently compute the moments. Using this knowledge, we examine various scaling limits, proving CLT results as well as identifying novel limiting distributions. We then quantify the connection between Dixon's and Greenwood's statistics through precise convergence rates and monotonicity results, while using our understanding from the discrete setting to offer new insights into the moment sequence, smoothness and monotonicity of Greenwood's statistic.
Finally, we propose a simple and efficient algorithm that recovers an underlying continuous distribution from its first $m$ moments up to $O(m^{-1})$ accuracy, and use it to devise a powerful hypothesis test of whether two data sets $\{X_1, \ldots, X_{k-1}\}$ and $\{Y_1, \ldots, Y_n\}$ were sampled from the same distribution. We demonstrate the test's suitability for a large class of alternatives through extensive power studies, and compare it with the classical Kolmogorov-Smirnov, Cramér-von Mises and Mann-Whitney tests. We illustrate how the same principles employed in designing our two-sample test can be applied equally successfully to the settings of one-sample tests and tests using paired data.

2 Notation and preliminaries

For a positive integer $n$, we use $[n]$ to denote the set $\{1, \ldots, n\}$. The two probability spaces underlying most of our discussion will consist of the $(k-1)$-dimensional simplex
\[
\Delta^{k-1} = \bigg\{ (x_1, x_2, \ldots, x_k) \in [0,1]^k : \sum_{i=1}^k x_i = 1 \bigg\}
\]
equipped with its uniform measure $\mu_{\Delta^{k-1}} = \sigma/\sigma(\Delta^{k-1}) = (k-1)!\,\sigma/\sqrt{k}$, where $\sigma$ is surface measure in $\mathbb{R}^k$, and its discretized version
\[
D_{n,k} = (n\Delta^{k-1}) \cap \mathbb{Z}^k = \bigg\{ (z_1, z_2, \ldots, z_k) \in \{0, \ldots, n\}^k : \sum_{i=1}^k z_i = n \bigg\}
\]
with its uniform measure $\mu_{D_{n,k}} = |D_{n,k}|^{-1} = \binom{n+k-1}{k-1}^{-1}$. In other words, $\mu_{\Delta^{k-1}}$ is the law of a Dirichlet$(1, \ldots,$
1) variable, while a $k$-part weak composition of $n$ chosen uniformly at random is distributed according to $\mu_{D_{n,k}}$. Occasionally we will refer to this latter distribution as a uniform configuration of $n$ balls distributed over $k$ bins. To test the hypothesis of whether a sampled random variable $X = (X_1, \ldots, X_k)$ in $\mathbb{Z}^k$ or $\mathbb{R}^k$ has distribution $\mu_{D_{n,k}}$ or $\mu_{\Delta^{k-1}}$, respectively, we are interested in comparing the distribution of its weighted $\ell_p$-norms
\[
\|X\|_{p,w}^p = \sum_{i=1}^k w_i X_i^p,
\]
for some fixed weight vector $w = (w_1, \ldots, w_k) \in \mathbb{R}^k$, against its null distribution. That is, if $S_k \sim \mu_{\Delta^{k-1}}$ and $S_{n,k} \sim \mu_{D_{n,k}}$, we are interested in studying the distributions of $\|S_k\|_{p,w}^p$ and $\|S_{n,k}\|_{p,w}^p$ (to prevent confusion of powers and vector indices, we will denote the entries of $S_k$ and $S_{n,k}$ by $S_k[j]$ and $S_{n,k}[j]$ for $j \in [k]$). For $p = 2$ and $w = \mathbf{1}_k$, with $\mathbf{1}_k := (1, \ldots,$
1) being the all-ones vector of length $k$, these are precisely Greenwood's and Dixon's statistics, respectively.

A primary reason why understanding these statistics is important is their application to non-parametric testing. If $Z_1, \ldots, Z_N$ are independently sampled from the same continuous distribution $F$, then the spacings (that is, the differences between consecutive order statistics) of $0, F(Z_{(1)}), \ldots, F(Z_{(N)}), 1$ are distributed according to $\mu_{\Delta^N}$, so characterizing the distribution of Greenwood's statistic allows one to test whether the sample is distributed according to $F$. Similarly, if $X_1, \ldots, X_{k-1}, Y_1, \ldots, Y_n$ are i.i.d. samples from the same continuous distribution $G$, and we define the order statistics $-\infty =: X_{(0)} < X_{(1)} < \cdots < X_{(k-1)} < X_{(k)} := \infty$, and the number of $Y_i$ sandwiched between every two consecutive order statistics
\[
S_{n,k}[j] := \big| \big\{ i : X_{(j-1)} \le Y_i < X_{(j)} \big\} \big|, \tag{1}
\]
for $j \in [k]$, then $(S_{n,k}[1], \ldots, S_{n,k}[k])$ is distributed according to $\mu_{D_{n,k}}$.

These two hypothesis tests will be our main application, considered in Section 5. Before coming to those, in Section 3 we present our main theoretical results about $\|S_{n,k}\|_{p,w}^p$ and $\|S_k\|_{p,w}^p$. In particular, we will detail a simple and efficient way to compute their moment sequences exactly. Recovering the distributions of $\|S_{n,k}\|_{p,w}^p$ and $\|S_k\|_{p,w}^p$ from these moments seems analytically intractable, but can be done algorithmically in an efficient and exact manner. Section 4 is dedicated to describing such an algorithm.

3 Generalized spacing-statistics

We start by investigating the discrete generalized spacing-statistics $\|S_{n,k}\|_{p,w}^p$, from which their continuous analogues will follow. Our point of departure is the well-known Wilcoxon-Mann-Whitney $U$ statistic (Mann and Whitney, 1947; Wilcoxon, 1945).
It is easy to see that in the special case of $p = 1$ and $w = (k-1)_\downarrow := (k-1, k-2, \ldots, 1, 0)$, the statistic $\|S_{n,k}\|_{1,(k-1)_\downarrow}$ recovers, up to an additive constant, the $U$ statistic. Indeed, for two samples $X_1, \ldots, X_{k-1}$ and $Y_1, \ldots, Y_n$, with $R_1, \ldots, R_{k-1}$ the ranks of $X_1, \ldots, X_{k-1}$ computed in the joint ensemble $\{X_1, \ldots, X_{k-1}, Y_1, \ldots, Y_n\}$, their $U$ statistic is given by $U = \sum_{j=1}^{k-1} R_j$. In our notation introduced in (1), we then have $R_j = j + \sum_{i=1}^j S_{n,k}[i]$, and consequently
\[
U = \sum_{j=1}^{k-1} R_j = \sum_{j=1}^{k-1} \bigg[ j + \sum_{i=1}^j S_{n,k}[i] \bigg] = \sum_{j=1}^{k-1} j + \sum_{i=1}^{k-1} \sum_{j=i}^{k-1} S_{n,k}[i] = \binom{k}{2} + \sum_{i=1}^{k-1} (k-i)\, S_{n,k}[i] = \binom{k}{2} + \|S_{n,k}\|_{1,(k-1)_\downarrow},
\]
as desired. In order to compute both the exact distribution of $U$ under the null hypothesis of all $X_i, Y_i$ being generated i.i.d., as well as its asymptotic normality, Mann and Whitney exploited a recurrence relation which, in our language, consists of conditioning on the occupancy of the very last entry of $S_{n,k}$. Conditional on the event $\{S_{n,k}[k] > 0\}$, that is, conditional on the last bin containing at least one ball, we may remove one such ball to find that
\[
\Big( \|S_{n,k}\|_{1,(k-1)_\downarrow} \,\Big|\, \{S_{n,k}[k] > 0\} \Big) \overset{d}{=} \|S_{n-1,k}\|_{1,(k-1)_\downarrow}.
\]
Similarly, on $\{S_{n,k}[k] = 0\}$ we may omit the very last bin to arrive at
\[
\Big( \|S_{n,k}\|_{1,(k-1)_\downarrow} \,\Big|\, \{S_{n,k}[k] = 0\} \Big) \overset{d}{=} \|S_{n,k-1}\|_{1,(k-2)_\downarrow} + n.
\]
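The rank identity above is easy to verify numerically. The following is our own sketch (not code from the paper), with hypothetical helper names, using 1-based ranks in the pooled sample:

```python
# Numerical check (our own sketch) of U = binom(k,2) + ||S_{n,k}||_{1,(k-1)_down},
# where S_{n,k}[j] counts Y-values between consecutive X order statistics.
import random
from math import comb

def u_statistic(xs, ys):
    """Sum of pooled-sample ranks of the x-sample (1-based, distinct values)."""
    pooled = sorted(xs + ys)
    return sum(pooled.index(x) + 1 for x in xs)

def spacing_counts(xs, ys):
    """S_{n,k}[j] = #{i : X_(j-1) <= Y_i < X_(j)}, with X_(0) = -inf, X_(k) = +inf."""
    k = len(xs) + 1
    xo = [float("-inf")] + sorted(xs) + [float("inf")]
    return [sum(xo[j - 1] <= y < xo[j] for y in ys) for j in range(1, k + 1)]

random.seed(0)
for _ in range(100):
    k, n = 6, 10                          # k-1 = 5 x-samples, n = 10 y-samples
    xs = [random.random() for _ in range(k - 1)]
    ys = [random.random() for _ in range(n)]
    S = spacing_counts(xs, ys)
    weighted = sum((k - i) * S[i - 1] for i in range(1, k + 1))  # weights (k-1,...,1,0)
    assert u_statistic(xs, ys) == comb(k, 2) + weighted
```

The check passes for all simulated sample pairs, confirming that the weighted spacing-statistic and the rank-sum differ only by the constant $\binom{k}{2}$.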
Combining these two observations, and writing $q_{n,k}(x) = \mathbb{P}\big( \|S_{n,k}\|_{1,(k-1)_\downarrow} = x \big)$, yields the two-term recursion
\[
q_{n,k}(x) = \frac{n}{n+k-1}\, q_{n-1,k}(x) + \frac{k-1}{n+k-1}\, q_{n,k-1}(x-n), \tag{2}
\]
from which the whole probability mass function of $\|S_{n,k}\|_{1,(k-1)_\downarrow}$ can be computed in $O(n^2 k^2)$ time. Moreover, this recursion directly translates into a recurrence for the moments of $\|S_{n,k}\|_{1,(k-1)_\downarrow}$, which allowed Mann and Whitney to prove a central limit theorem as $n, k \to \infty$, thus rendering $U$ a versatile and quickly computable two-sample test statistic.

Unfortunately, the recurrence relation (2) lacks robustness with respect to varying either $p$ or $w = (w_1, \ldots, w_k)$: for $p > 1$ or non-constant weights, removing a ball from any bin changes $\|S_{n,k}\|_{p,w}^p$ by an amount depending on the total number of balls in that bin, and removing a bin may result in weights that are in no way related to the original weights. This can be fixed by conditioning not only on the vacancy of $S_{n,k}[k]$, but on its precise occupancy. Defining $q_{n,k}^{p,w}(x) := \mathbb{P}(\|S_{n,k}\|_{p,w}^p = x)$ and $w_{-k} := (w_1, \ldots, w_{k-1})$, we have
\[
q_{n,k}^{p,w}(x) = \sum_{j=0}^n \mathbb{P}\big( S_{n,k}[k] = j \big)\, q_{n-j,k-1}^{p,w_{-k}}(x - w_k j^p) = \binom{n+k-1}{k-1}^{-1} \sum_{j=0}^n \binom{n-j+k-2}{k-2}\, q_{n-j,k-1}^{p,w_{-k}}(x - w_k j^p), \tag{3}
\]
with initial conditions
\[
q_{n,k}^{p,w}(x) =
\begin{cases}
0, & \text{if } x \notin [x_{\min}, x_{\max}] \text{ or } n < 0, \\
\mathbb{1}\{x = w_1 n^p\}, & \text{if } k = 1, \\
\mathbb{1}\{x = 0\}, & \text{if } n = 0,
\end{cases} \tag{4}
\]
where $x_{\min} = \min_{S_{n,k} \in D_{n,k}} \|S_{n,k}\|_{p,w}^p$ and $x_{\max} = \|w\|_\infty n^p$ bound the range of $\|S_{n,k}\|_{p,w}^p$. However, solving (3) and (4) generically requires $O\big( n^{p+1} k \|w\|_\infty \big)$ time, thus rendering it unfeasible to compute for even modestly large $n$ and $p$. To circumvent this problem, in what follows we will instead use a moment-based approach.
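The two-term recursion (2) is straightforward to implement; the sketch below (our own illustration, with hypothetical function names) memoizes $q_{n,k}(x)$ and checks it against brute-force enumeration of $D_{n,k}$:

```python
# A sketch of recursion (2): q_{n,k}(x) is the null pmf of
# ||S_{n,k}||_{1,(k-1)_down} = sum_i (k-i) * S_{n,k}[i].
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)
def q(n, k, x):
    if n == 0:
        return 1.0 if x == 0 else 0.0   # no balls: statistic is 0
    if k == 1:
        return 1.0 if x == 0 else 0.0   # single remaining bin carries weight 0
    if x < 0:
        return 0.0
    return (n / (n + k - 1)) * q(n - 1, k, x) \
         + ((k - 1) / (n + k - 1)) * q(n, k - 1, x - n)

def brute_pmf(n, k):
    """Exact pmf by enumerating all weak compositions in D_{n,k}."""
    pmf = {}
    states = [s for s in product(range(n + 1), repeat=k) if sum(s) == n]
    for s in states:
        x = sum((k - i) * s[i - 1] for i in range(1, k + 1))
        pmf[x] = pmf.get(x, 0.0) + 1.0 / len(states)
    return pmf

n, k = 5, 4
exact = brute_pmf(n, k)
for x in range(n * (k - 1) + 1):
    assert abs(q(n, k, x) - exact.get(x, 0.0)) < 1e-12
```

The memoized recursion touches each state $(n', k', x)$ once, which is where the polynomial runtime quoted in the text comes from.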
3.1 Moments of $\|S_{n,k}\|_{p,w}^p$ and $\|S_k\|_{p,w}^p$

The following theorem, together with Proposition 6, allows us to efficiently compute the distribution of $\|S_{n,k}\|_{p,w}^p$:

Theorem 1.
Let $G(x,y) = \sum_{m=0}^\infty \mathrm{Li}_{-pm}(x)\, y^m/m!$, where $\mathrm{Li}_s(x) = \sum_{j=1}^\infty j^{-s} x^j$ is the polylogarithm function. Denoting by $[x^n y^m] P(x,y)$ the $(n,m)$th coefficient of a power series $P$ in $x$ and $y$, we have
\[
\mathbb{E}\big( \|S_{n,k}\|_{p,w}^p \big)^m = \frac{m!}{\binom{n+k-1}{k-1}}\, [x^n y^m] \prod_{i=1}^k G(x, w_i y). \tag{5}
\]
In particular, the first $m$ moments of $\|S_{i,j}\|_{p,w}^p$ for $(i,j) \in \{1,\ldots,n\} \times \{1,\ldots,k\}$ can be computed in $O(nm \cdot \log(nm) \cdot \log k)$ time.

Proof. We first expand the left-hand side of (5) to find
\[
\mathbb{E}\big( \|S_{n,k}\|_{p,w}^p \big)^m = \sum_{\sigma \in D_{n,k}} \mathbb{P}(S_{n,k} = \sigma) \bigg( \sum_{j=1}^k w_j \sigma_j^p \bigg)^m = \binom{n+k-1}{k-1}^{-1} \sum_{\sigma \in D_{n,k}} \sum_{\eta \in D_{m,k}} \binom{m}{\eta_1, \ldots, \eta_k} \prod_{j=1}^k w_j^{\eta_j} \sigma_j^{\eta_j p} = \frac{m!}{\binom{n+k-1}{k-1}} \underbrace{\sum_{\eta \in D_{m,k}} \sum_{\sigma \in D_{n,k}} \prod_{j=1}^k \frac{(w_j \sigma_j^p)^{\eta_j}}{\eta_j!}}_{A_{n,k,m,w}}, \tag{6}
\]
so it remains to show that $A_{n,k,m,w} = [x^n y^m] \prod_{j=1}^k G(x, w_j y)$. By definition of $\mathrm{Li}_s(x)$, we have for every fixed $\eta \in D_{m,k}$
\[
\sum_{\sigma \in D_{n,k}} \prod_{j=1}^k \frac{w_j^{\eta_j} \sigma_j^{p\eta_j}}{\eta_j!} = [x^n] \prod_{j=1}^k \frac{\mathrm{Li}_{-p\eta_j}(x)}{\eta_j!}\, w_j^{\eta_j}, \tag{7}
\]
and so
\[
A_{n,k,m,w} = [x^n] \sum_{\eta \in D_{m,k}} \prod_{j=1}^k \frac{\mathrm{Li}_{-p\eta_j}(x)}{\eta_j!}\, w_j^{\eta_j} = [x^n][y^m] \prod_{j=1}^k \bigg( \sum_{i=0}^\infty \frac{\mathrm{Li}_{-pi}(x)}{i!} (w_j y)^i \bigg) = [x^n y^m] \prod_{j=1}^k G(x, w_j y), \tag{8}
\]
as desired. The $O(nm \cdot \log(nm) \cdot \log k)$ runtime is now a direct consequence of computing the Cauchy product of $k$ bivariate degree-$(n,m)$ polynomials using the Fast Fourier Transform.

The above theorem will be our main tool for devising an efficient, powerful, and general two-sample test in Section 5, if the two samples are modestly sized. In the case where one sample is significantly larger than the other, the following result, paired with Proposition 2, proves more useful.
Proposition 1.
For fixed $k$, $S_{n,k}/n$ converges in distribution to $S_k \sim \mu_{\Delta^{k-1}}$. In particular, we have
\[
\bigg\| \frac{S_{n,k}}{n} \bigg\|_{p,w}^p \overset{d}{\longrightarrow} \|S_k\|_{p,w}^p \quad \text{as } n \to \infty. \tag{9}
\]

Proof.
It suffices to show that
\[
\lim_{n\to\infty} \mathbb{P}\bigg( \frac{S_{n,k}}{n} \in E \bigg) = \mathbb{P}(S_k \in E) \tag{10}
\]
for any Lebesgue-measurable set $E \subset \Delta^{k-1}$. Moreover, by Dynkin's $\pi$-$\lambda$ theorem, we may without loss of generality assume $E$ to be a box with rational vertex points inside $\Delta^{k-1}$ (see e.g. Theorem 1.1 in Kallenberg, 2006), which will allow us to count the number of lattice points in $E$ as $n \to \infty$. To wit, let $L(E,n)$ be the cardinality of the set $nE \cap \mathbb{Z}^k$; then Ehrhart theory informs us that (cf. Ehrhart, 1967)
\[
L(E,n) = \mathrm{Vol}_\Lambda(E) \cdot n^{k-1} + O(n^{k-2}), \tag{11}
\]
where $\mathrm{Vol}_\Lambda(E) = \lambda^{k-1}(E)/\mathrm{d}(\Lambda)$ is the $(k-1)$-dimensional volume of $E$, normalized by the co-volume of the lattice $\Lambda = \mathbb{Z}^k \cap H$ induced by $\mathbb{Z}^k$ on the hyperplane $H = \{ x \in \mathbb{R}^k : \sum_{j=1}^k x_j = 0 \}$. But since the fundamental region of $\Lambda$ is the parallelepiped formed by $\{ e_1 - e_j \}_{j\in\{2,\ldots,k\}}$, where $e_j \in \mathbb{R}^k$ is the $j$th standard basis vector, the co-volume can be computed to be
\[
\mathrm{d}(\Lambda) = \sqrt{\det V^t V} = \sqrt{\det\big( I_{k-1} + \mathbf{1}_{k-1} \mathbf{1}_{k-1}^T \big)}, \tag{12}
\]
with the columns of $V$ given by $\{ e_1 - e_j \}_{j\in\{2,\ldots,k\}}$. It is straightforward to check that $V^t V$ has eigenvalue 1 of multiplicity $k-2$ (with eigenvectors $\{ e_1 - e_j \}_{j\in\{2,\ldots,k-1\}}$), and eigenvalue $k$ of multiplicity 1 (with eigenvector $\mathbf{1}_{k-1} = (1,\ldots,1)$), so that
\[
\mathrm{d}(\Lambda) = \sqrt{k}. \tag{13}
\]
Since $\sqrt{k}/(k-1)!$ is the $(k-1)$-dimensional volume of $\Delta^{k-1}$, we finally arrive at
\[
\mathbb{P}\bigg( \frac{S_{n,k}}{n} \in E \bigg) = \frac{L(E,n)}{\binom{n+k-1}{k-1}} = \frac{L(E,n)}{n^{k-1}/(k-1)! + O(n^{k-2})} = \frac{(k-1)!\,\mathrm{Vol}_\Lambda(E) + O(n^{-1})}{1 + O(n^{-1})} \xrightarrow{n\to\infty} \frac{(k-1)!}{\sqrt{k}}\, \lambda^{k-1}(E) = \mathbb{P}(S_k \in E), \tag{14}
\]
which proves the first part of our proposition. The result in (9) is now a direct consequence of the continuous mapping theorem.

We note in particular that for $p = 2$ and $w = \mathbf{1}_k = (1, \ldots, 1)$, the statistics $\|S_{n,k}/n\|_{p,w}^p$ converge to Greenwood's statistic, justifying the designation of $\|S_{n,k}\|_{p,w}^p$ as generalized Greenwood statistics. In order for this connection between two-sample testing and tests of uniformity to be of any use, a clear understanding of both the limiting distributions, as well as the convergence rates, is necessary.
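Before examining these limits, we note that formula (5) can be checked numerically on small cases. The sketch below is our own (not the paper's implementation): it replaces the FFT-based Cauchy product with a naive truncated polynomial product, and reads the polylogarithm convention in (5) so that an empty bin contributes exactly 1 to the $x^0 y^0$ coefficient (our interpretation, so that weak compositions with zero entries are counted); the moments are compared against brute-force enumeration over $D_{n,k}$:

```python
# Sketch of Theorem 1's moment formula (5), checked by brute-force enumeration.
# A bin of occupancy s contributes s^(p*m) * w^m / m! to the x^s y^m coefficient,
# with the s = 0, m = 0 term set to 1 so that empty bins are counted.
from itertools import product
from math import comb, factorial

def moments_via_formula(n, k, p, w, M):
    """First M moments of ||S_{n,k}||^p_{p,w} via coefficient extraction."""
    def bin_poly(wi):
        g = [[0.0] * (M + 1) for _ in range(n + 1)]
        g[0][0] = 1.0                           # empty bin, exponent 0
        for s in range(1, n + 1):
            for m in range(M + 1):
                g[s][m] = (s ** (p * m)) * wi ** m / factorial(m)
        return g

    def mult(a, b):                             # truncated Cauchy product
        c = [[0.0] * (M + 1) for _ in range(n + 1)]
        for s1, m1 in product(range(n + 1), range(M + 1)):
            if a[s1][m1]:
                for s2 in range(n + 1 - s1):
                    for m2 in range(M + 1 - m1):
                        c[s1 + s2][m1 + m2] += a[s1][m1] * b[s2][m2]
        return c

    poly = bin_poly(w[0])
    for wi in w[1:]:
        poly = mult(poly, bin_poly(wi))
    norm = comb(n + k - 1, k - 1)
    return [factorial(m) * poly[n][m] / norm for m in range(1, M + 1)]

def moments_brute(n, k, p, w, M):
    states = [s for s in product(range(n + 1), repeat=k) if sum(s) == n]
    vals = [sum(wi * si ** p for wi, si in zip(w, s)) for s in states]
    return [sum(v ** m for v in vals) / len(states) for m in range(1, M + 1)]

n, k, p, w, M = 6, 3, 2, (1.0, 0.5, 2.0), 4
a = moments_via_formula(n, k, p, w, M)
b = moments_brute(n, k, p, w, M)
assert all(abs(x - y) < 1e-9 * max(1.0, abs(y)) for x, y in zip(a, b))
```

The naive product already agrees with enumeration to machine precision; the FFT only changes the runtime, not the arithmetic.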
We begin with the former, for which reasoning akin to Theorem 1 yields the following result.

Theorem 2.
Let $Q_p(x) = \sum_{m=0}^\infty (pm)!\, x^m/m!$. Then,
\[
\mathbb{E}\big( \|S_k\|_{p,w}^p \big)^m = \frac{(k-1)!\, m!}{(pm+k-1)!}\, [x^m] \prod_{j=1}^k Q_p(w_j x). \tag{15}
\]
In particular, the first $m$ moments of $\|S_j\|_{p,w}^p$ for $j \in \{1,\ldots,k\}$ can be computed in $O(m \cdot \log m \cdot \log k)$ time.

Proof. As in (6), we expand the left-hand side of (15) to obtain
\[
\begin{aligned}
\mathbb{E}\big( \|S_k\|_{p,w}^p \big)^m &= \int_{\Delta^{k-1}} \big( \|x\|_{p,w}^p \big)^m \,\mathrm{d}\mu_{\Delta^{k-1}}(x) = \sum_{\eta \in D_{m,k}} \binom{m}{\eta_1,\ldots,\eta_k} \int_{\Delta^{k-1}} \prod_{j=1}^k w_j^{\eta_j} x_j^{p\eta_j} \,\mathrm{d}\mu_{\Delta^{k-1}}(x) \\
&= \frac{(k-1)!\, m!}{\sqrt{k}} \sum_{\eta \in D_{m,k}} \bigg( \prod_{j=1}^k \frac{w_j^{\eta_j}}{\eta_j!} \bigg) \int_{\Delta^{k-1}} \prod_{i=1}^k x_i^{p\eta_i} \,\mathrm{d}\sigma(x) \\
&= \frac{(k-1)!\, m!}{\sqrt{k}} \sum_{\eta \in D_{m,k}} \bigg( \prod_{j=1}^k \frac{w_j^{\eta_j}}{\eta_j!} \bigg) \times \int_{\Pi\Delta^{k-1}} \bigg( \prod_{i=1}^{k-1} x_i^{p\eta_i} \bigg) (1 - x_1 - \cdots - x_{k-1})^{p\eta_k}\, \sqrt{k}\, \mathrm{d}\lambda^{k-1}(x) \qquad (16) \\
&= \frac{(k-1)!\, m!}{(pm+k-1)!} \sum_{\eta \in D_{m,k}} \prod_{j=1}^k \frac{(p\eta_j)!}{\eta_j!}\, w_j^{\eta_j} \qquad (17) \\
&= \frac{(k-1)!\, m!}{(pm+k-1)!}\, [x^m] \prod_{j=1}^k \bigg( \sum_{i=0}^\infty \frac{(pi)!}{i!} (w_j x)^i \bigg), \qquad (18)
\end{aligned}
\]
where $\sigma(\mathrm{d}x)$ is (unnormalized) surface measure on $\Delta^{k-1}$, $\Pi\Delta^{k-1}$ is the projection of $\Delta^{k-1}$ onto the hyperplane spanned by the first $k-1$ coordinates, and the final integral is the Dirichlet integral with parameters $(p\eta_1, \ldots, p\eta_k)$. We identify (18) as (15), and thus complete the first part of the proof. The second part now follows as in Theorem 1.
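Formula (15) can likewise be sanity-checked in the smallest non-trivial case $k = 2$, where $S_2 = (U, 1-U)$ with $U$ uniform on $[0,1]$, so the moments are independently computable by quadrature. A minimal sketch (ours, not the paper's code):

```python
# Check of (15) for k = 2: m!/(pm+1)! * [x^m] Q_p(w1 x) Q_p(w2 x),
# against midpoint-rule integration of E (w1 U^p + w2 (1-U)^p)^m.
from math import factorial

def moment_formula(p, w1, w2, m):
    coeff = sum(
        factorial(p * i) // factorial(i) * w1 ** i
        * factorial(p * (m - i)) // factorial(m - i) * w2 ** (m - i)
        for i in range(m + 1)
    )
    return factorial(m) * coeff / factorial(p * m + 1)

def moment_quadrature(p, w1, w2, m, steps=200000):
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) * h
        total += (w1 * u ** p + w2 * (1 - u) ** p) ** m
    return total * h

p, w1, w2 = 2, 1.0, 1.0          # Greenwood's statistic for k = 2
for m in (1, 2, 3):
    assert abs(moment_formula(p, w1, w2, m) - moment_quadrature(p, w1, w2, m)) < 1e-6
```

For instance, $m = 1$ gives $\mathbb{E}(U^2 + (1-U)^2) = 2/3$ from both routes.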
Remark 1.
The generating function $Q_p(x)$ in Theorem 2 has previously appeared in the study of $\mu_{\Delta^{k-1}}$ in various applications in physics (where it is known as the Bose-Einstein distribution). More precisely, $Q_p(x)$ can be expressed as the generalized hypergeometric series
\[
Q_p(x) = {}_pF_0\Big[ 1, \tfrac{1}{p}, \tfrac{2}{p}, \ldots, \tfrac{p-1}{p};\; ;\; p^p x \Big].
\]
In particular, for $p = 2$ (i.e., including the Greenwood statistic $\|S_k\|_{2,\mathbf{1}_k}^2$), we have
\[
Q_2(x) = {}_2F_0\Big[ 1, \tfrac{1}{2};\; ;\; 4x \Big] = \frac{1}{\sqrt{x}}\, D\bigg( \frac{1}{2\sqrt{x}} \bigg),
\]
where Dawson's integral
\[
D(x) = e^{-x^2} \int_0^x e^{t^2} \,\mathrm{d}t \tag{19}
\]
is interpreted through its asymptotic expansion (cf. Abramowitz and Stegun, 1965, formula 7.1.23).

Proposition 1 and Theorem 2 together provide good approximations of $\|S_{n,k}\|_{p,w}^p$ for large $n$ (while also satisfactorily answering Greenwood's question of describing the distributional properties of $\|S_k\|_{2,\mathbf{1}_k}^2$, cf. Section 3.2). The following proposition guarantees the quality of these approximations:

Proposition 2.
Let $F_{n,k}^{p,w}, F_k^{p,w} : [0,1] \to [0,1]$ be the cumulative distribution functions of $\big\| \frac{S_{n,k}}{n} \big\|_{p,w}^p$ and $\|S_k\|_{p,w}^p$, respectively. Then we have
\[
\big\| F_{n,k}^{p,w} - F_k^{p,w} \big\|_\infty = O(n^{-1}), \tag{20}
\]
for every fixed $k \ge 2$.

Proof. As in the proof of
Proposition
1, let $\Lambda = \mathbb{Z}^k \cap H$, where $H = \{ x \in \mathbb{R}^k : \sum_{j=1}^k x_j = 0 \}$. Denoting by
\[
E_t = \{ x \in \Delta^{k-1} : \|x\|_{p,w}^p \le t \} = \{ \|S_k\|_{p,w}^p \le t \} \tag{21}
\]
the $t$-level set of $F_k^{p,w}$, we observe that since the fundamental domain of $\Lambda$ has diameter $\|(k-1, -1, \ldots, -1)\|_2 = \sqrt{k(k-1)}$, the number $L(E_t, n)$ of lattice points in $nE_t$ is bounded by
\[
\big( n - \sqrt{k(k-1)} \big)^{k-1}\, \mathrm{Vol}_\Lambda(E_t) \le L(E_t, n) \le \big( n + \sqrt{k(k-1)} \big)^{k-1}\, \mathrm{Vol}_\Lambda(E_t). \tag{22}
\]
Thus, in particular,
\[
\mu_{D_{n,k}}(nE_t) - \mu_{\Delta^{k-1}}(E_t) = \frac{L(E_t,n)}{\binom{n+k-1}{k-1}} - (k-1)!\,\mathrm{Vol}_\Lambda(E_t) \le (k-1)!\,\mathrm{Vol}_\Lambda(E_t) \bigg[ \Big( 1 + \frac{k}{n} \Big)^{k-1} - 1 \bigg] \le \sum_{j=1}^{k-1} \binom{k-1}{j} \Big( \frac{k}{n} \Big)^j, \tag{23}
\]
where using $\mathrm{Vol}_\Lambda(E_1) = \sqrt{k}/\big( (k-1)!\,\mathrm{d}(\Lambda) \big)$ to bound $\mathrm{Vol}_\Lambda(E_t)$ turns (23) independent of $t$. Similarly, a uniform lower bound is given by
\[
\mu_{D_{n,k}}(nE_t) - \mu_{\Delta^{k-1}}(E_t) \ge (k-1)!\,\mathrm{Vol}_\Lambda(E_t) \bigg[ \Big( 1 - \frac{k}{n+k-1} \Big)^{k-1} - 1 \bigg] \ge \sum_{j=1}^{k-1} \binom{k-1}{j} \Big( \frac{-k}{n+k-1} \Big)^j. \tag{24}
\]
Combining (23) and (24) gives (20) as desired.

The results presented so far cover a wide range of scenarios encountered in practice when performing two-sample tests. Before considering other limiting situations, we first examine the insight that Theorem 2 offers into the limiting distributions $\|S_k\|_{p,w}^p$, with particular emphasis on the Greenwood statistic $\|S_k\|_{2,\mathbf{1}_k}^2$ itself.

3.2 Analysis of $\|S_k\|_{p,w}^p$ and the right tail probability

While providing an efficient means of computing large-$n$ limits of any generalized Greenwood statistics $\|S_k\|_{p,w}^p$, Theorem 2 (together with
Proposition
6) also satisfactorily answers Greenwood's question of describing the distribution of $\|S_k\|_{2,\mathbf{1}_k}^2$, which had remained open since Greenwood (1946). In particular, using the formulas given in (15) and Proposition
6, the approximate $z$-score tabulations of Burrows (1979), Currie (1981) and Stephens (1981) can be extended to arbitrary $k$ and arbitrary accuracy at reasonable runtime. In practice, this should be useful both for moderate and large $k$, for even though a central limit theorem (in $k$) exists for $\|S_k\|_{2,\mathbf{1}_k}^2$ (Moran, 1947), its convergence rate is unfeasibly slow. Apart from computational improvements, Theorem 2 also offers structural insight into the distributions of $\|S_k\|_{p,w}^p$, and of $\|S_k\|_{2,\mathbf{1}_k}^2$ more specifically. We begin by controlling the decay of moments, which in turn will inform us about the tail behavior near $x = 1$.

Proposition 3.
For $p \ge 2$ and $k \ge 2$, and fixed weights $w_i \in [0,1]$ for all $i \in [k]$, we have
\[
\lim_{m\to\infty} m^{k-1}\, \mathbb{E}\big( \|S_k\|_{p,w}^p \big)^m = \frac{(k-1)!}{p^{k-1}} \cdot W_w, \tag{25}
\]
where $W_w = |\{ 1 \le i \le k : w_i = 1 \}|$ is the number of weights taking value 1. In particular, the Greenwood statistic satisfies
\[
\lim_{m\to\infty} m^{k-1}\, \mathbb{E}\big( \|S_k\|_{2,\mathbf{1}_k}^2 \big)^m = \frac{k!}{2^{k-1}}. \tag{26}
\]

Proof.
We first rewrite (17) as
\[
\mathbb{E}\big( \|S_k\|_{p,w}^p \big)^m = \frac{1}{\binom{pm+k-1}{k-1}} \sum_{\eta \in D_{m,k}} \frac{\binom{m}{\eta_1,\ldots,\eta_k}}{\binom{pm}{p\eta_1,\ldots,p\eta_k}} \prod_{j=1}^k w_j^{\eta_j} =: \frac{1}{\binom{pm+k-1}{k-1}}\, s_m^w, \tag{27}
\]
which has leading order $O(m^{-(k-1)})$ provided we can show that $s_m^w$ is $\Omega(1)$. To do so, we proceed by induction on $k$, the length of $w$, proving that in fact $\lim_{m\to\infty} s_m^w = W_w$. It is straightforward to check that for $\eta \in \{2, \ldots, m-2\}$, $\binom{m}{\eta}/\binom{pm}{p\eta}$ is bounded above by $\binom{m}{2}/\binom{pm}{2p}$, and thus for the base case $k = 2$ we have
\[
s_m^{(w_1,w_2)} = \sum_{\eta=0}^m \frac{\binom{m}{\eta}}{\binom{pm}{p\eta}}\, w_1^\eta w_2^{m-\eta} \le w_1^m + w_2^m + 2\,\frac{\binom{m}{1}}{\binom{pm}{p}} + (m-3)\,\frac{\binom{m}{2}}{\binom{pm}{2p}} \xrightarrow{m\to\infty} \mathbb{1}_{w_1=1} + \mathbb{1}_{w_2=1} = W_{(w_1,w_2)}, \tag{28}
\]
as desired. For the inductive step, we condition on the first entry of $\eta$ to obtain
\[
s_m^{(w_1,\ldots,w_k)} = \sum_{\ell=0}^m \frac{\binom{m}{\ell}}{\binom{pm}{p\ell}}\, w_1^\ell \sum_{\eta \in D_{m-\ell,k-1}} \frac{\binom{m-\ell}{\eta_1,\ldots,\eta_{k-1}}}{\binom{p(m-\ell)}{p\eta_1,\ldots,p\eta_{k-1}}} \prod_{j=1}^{k-1} w_{j+1}^{\eta_j} = s_m^{(w_2,\ldots,w_k)} + w_1^m + O(m^{-1}) \xrightarrow{m\to\infty} W_{(w_2,\ldots,w_k)} + \mathbb{1}_{w_1=1} = W_{(w_1,\ldots,w_k)}, \tag{29}
\]
where we used the inductive hypothesis on $s_m^{(w_2,\ldots,w_k)}$, and, as in (28), bounded the summands corresponding to $\ell \in \{2,\ldots,m-2\}$ by $\binom{m}{2}/\binom{pm}{2p}$. The limits (25) and (26) now follow from taking $m \to \infty$ in (27).

The above result is useful primarily for describing the right tail of $\|S_k\|_{p,w}^p$. Despite the obvious utility of such a result to hypothesis testing, such a description has not been available so far, even for $\|S_k\|_{2,\mathbf{1}_k}^2$.

Corollary 1.
For $w = (w_1, \ldots, w_k)$ such that $W_w \ge 1$, the density $f_k^{p,w}$ of $\|S_k\|_{p,w}^p$ is analytic at $x = 1$, and the first non-zero term in its Taylor expansion is $\frac{(k-1)\,W_w}{p^{k-1}} (1-x)^{k-2}$. That is, for $x$ close to $x = 1$, we have
\[
f_k^{p,w}(x) = \frac{(k-1)\,W_w}{p^{k-1}}\, (1-x)^{k-2} + O\big( (1-x)^{k-1} \big). \tag{30}
\]
In particular, Greenwood's statistic satisfies
\[
f_k^{2,\mathbf{1}_k}(x) = \frac{\binom{k}{2}}{2^{k-2}}\, (1-x)^{k-2} + O\big( (1-x)^{k-1} \big). \tag{31}
\]

Proof.
Let $f_k^{p,w}(x) = \sum_{j=0}^\infty c_j (1-x)^j$ be the Taylor expansion of $f_k^{p,w}$ around $x = 1$. We first notice that for any $r \ge 0$,
\[
\int_0^1 x^m (1-x)^r \,\mathrm{d}x = \frac{1}{m+r+1} \cdot \frac{1}{\binom{m+r}{r}}, \tag{32}
\]
and hence, using the fact that $f_k^{p,w}$ is bounded,
\[
\mathbb{E}\big( \|S_k\|_{p,w}^p \big)^m = \int_0^1 x^m f_k^{p,w}(x) \,\mathrm{d}x + O(e^{-m}) = \sum_{j=0}^\infty c_j \int_0^1 x^m (1-x)^j \,\mathrm{d}x + O(e^{-m}) = \sum_{j=0}^\infty \frac{c_j}{m+j+1} \cdot \frac{1}{\binom{m+j}{j}} + O(e^{-m}). \tag{33}
\]
Identifying the $(k-2)$nd term with (25) immediately yields (30).

While Corollary 1 describes only the leading term, there is nothing special about the $(k-2)$nd one in the Taylor expansion of $f_k^{p,w}$ around $x = 1$. For instance, the next coefficients $c_{k-1}$ and $c_k$ are straightforward to compute by hand, and more generally, $c_r$ for arbitrary $r \in \mathbb{N}$ can be efficiently computed in $O\big( rp \log(rp) \log k + r \log r \big)$ time. See Appendix A for the detailed algorithm. This distributional understanding of the right tail complements an explicit description of $f_k^{p,w}$ on the far left worked out by Moran (1953) for the Greenwood statistic, valid for $x \in [0, 1/(k-1))$.

Lemma 1.
For $k, p > 1$, the density $f_k^{p,\mathbf{1}_k}(x)$ of $\|S_k\|_{p,\mathbf{1}_k}^p$ is analytic on the intervals $\big\{ \big( j^{1-p}, (j-1)^{1-p} \big) \big\}_{j\in\{2,\ldots,k\}}$. In particular, the Taylor expansion of $f_k^{p,\mathbf{1}_k}(x)$ around $x = 1$ has radius of convergence $1 - 2^{1-p}$.

Proof. As we increase the radius $r$ of an $\ell_p$ ball centered at the origin, the ball will intersect the $j < k$ dimensional faces of $\Delta^{k-1}$ for the first time at $r_j^p = \|\mathbf{1}_j/j\|_p^p = j^{1-p}$, which can be seen by projecting the $\ell_p$ ball onto the $j$-dimensional coordinate-hyperplanes. These are the only points where $f_k^{p,\mathbf{1}_k}$ is not smooth, and therefore $f_k^{p,\mathbf{1}_k}$ must be analytic on $\big( j^{1-p}, (j-1)^{1-p} \big)$ for $j \in \{2,\ldots,k\}$.

In light of Lemma
1, it is clear that the narrow applicability of the left-tail formulas in Moran (1953) is due to the quickly decreasing volume $\lambda^{k-1}(\Delta^{k-1})$ in $k$: the only regimes in which the intersection of an $\ell_2$ and an $\ell_1$ ball is straightforward to compute are when this intersection is empty (i.e., $\|S_k\|_{2,\mathbf{1}_k}^2 \le 1/k$) or restricted to the interior of $\Delta^{k-1}$ (in which case this computation reduces to a calculation of the volume of spherical caps). However, the volumes of these regimes are exhausted quickly, highlighting the importance of radius-independent descriptions like Theorem
2 and Proposition 2. The following proposition shows, moreover, that small-$k$ approximations provide conservative estimates to large-$k$ instances.

Proposition 4.
For $p > 1$, the c.d.f. $F_k^{p,\mathbf{1}_k}$ of $\|S_k\|_{p,\mathbf{1}_k}^p$ is increasing in $k$. That is, $F_{k'}^{p,\mathbf{1}_{k'}}(x) \ge F_k^{p,\mathbf{1}_k}(x)$ for all $x \in [0,1]$ and $k' > k$.

Proof. We let $B_k^p(r)$ be the $\ell_p$ ball of radius $r$ in $\mathbb{R}^k$, and recall that
\[
\mu_{\Delta^{k-1}}\big( \|S_k\|_{p,\mathbf{1}_k}^p \le r^p \big) = \frac{\lambda^{k-1}\big( B_k^p(r) \cap \Delta^{k-1} \big)}{\lambda^{k-1}(\Delta^{k-1})}. \tag{34}
\]
From Lemma 1, the claim is immediate for $x \le (k-1)^{1-p}$. In order to relate $\lambda^{k-1}\big( B_k^p(r) \cap \Delta^{k-1} \big)$ to $\lambda^{k-2}\big( B_{k-1}^p(r) \cap \Delta^{k-2} \big)$ for $x > (k-1)^{1-p}$, we define $K_i(r)$ for $i \in \{1,\ldots,k\}$ to be the cone with apex $\mathbf{1}_k/k$ and base formed by the intersection of $B_k^p(r)$ with the $i$th $(k-2)$-dimensional face of $\Delta^{k-1}$ (for some fixed enumeration of the $k$ faces). Since $\bigcup_{i=1}^k K_i(r) \subset B_k^p(r) \cap \Delta^{k-1}$, it follows that
\[
\lambda^{k-1}\big( B_k^p(r) \cap \Delta^{k-1} \big) > \sum_{i=1}^k \lambda^{k-1}(K_i) \tag{35}
\]
\[
= \frac{k}{k-1} \cdot \frac{1}{\sqrt{k(k-1)}}\, \lambda^{k-2}\big( B_k^p(r) \cap \Delta^{k-2} \big) = \frac{\sqrt{k}}{(k-1)^{3/2}}\, \lambda^{k-2}\big( B_{k-1}^p(r) \cap \Delta^{k-2} \big), \tag{36}
\]
where by slight abuse of notation we write $\lambda^{k-2}\big( B_k^p(r) \cap \Delta^{k-2} \big)$ for the $(k-2)$-dimensional volume of $B_k^p(r)$ intersected with a $(k-2)$-dimensional face of $\Delta^{k-1}$. The fact that $\lambda^{k-2}\big( B_k^p(r) \cap \Delta^{k-2} \big) = \lambda^{k-2}\big( B_{k-1}^p(r) \cap \Delta^{k-2} \big)$ follows from the same projection argument as used in Lemma 1.

Our treatment up to now has primarily focused on the large-$n$ limit of $\|S_{n,k}\|_{p,w}^p$ while keeping $k$ fixed, for which we saw a rich limiting distribution with fruitful connections to the previously studied Greenwood statistic emerge. However, in the setting of two-sample testing, it may very well happen that $k$ and $n$ are of comparable order, in which case similar behavior cannot be expected to govern the distribution of $\|S_{n,k}\|_{p,w}^p$. What happens in these cases is much simpler, as the following proposition demonstrates.

Proposition 5.
Assume $n, k \to \infty$, and define $\mu_{n,k,p} = k^{-1}\, \mathbb{E}\|S_{n,k}\|_{p,\mathbf{1}_k}^p$ and $\sigma_{n,k,p}^2 = k^{-1}\, \mathrm{Var}\big( \|S_{n,k}\|_{p,\mathbf{1}_k}^p \big)$. If $0 < W_{\min} \le w_i \le W_{\max}$ for all $i \in \{1,\ldots,k\}$, then
\[
Z_{n,k,p,w} = \frac{\|S_{n,k}\|_{p,w}^p - \mu_{n,k,p} \sum_{j=1}^k w_j}{\sigma_{n,k,p} \big( \sum_{j=1}^k w_j^2 \big)^{1/2}} \overset{d}{\longrightarrow} \mathcal{N}(0,1), \quad \text{if } \frac{k}{n} \to \alpha \le 1; \text{ in particular, if } k = o(n). \tag{37}
\]
Moreover, in the case of $k \to \infty$ while $n$ remains fixed, if for every $k$ the weights $\{w_i\}_{i\in[k]}$ are the discretization $w_i = w(i/k)$ of some function $w : [0,1] \to [W_{\min}, W_{\max}]$ that is continuous (Lebesgue) almost everywhere, then
\[
\|S_{n,k}\|_{p,w}^p \overset{d}{\longrightarrow} \sum_{j=1}^n w(U_j), \tag{38}
\]
where $\{U_j\}_{j\in[n]}$ are i.i.d. Uniform$([0,1])$.

The proof of this central limit theorem is by the method of moments, where explicit combinatorial expressions like (27) and known large deviation results allow for precise quantification of the decorrelation in $S_{n,k}$. The full details are presented in Appendix B. We point out here that in the two-sample setting, constraining $k$ to grow at most linearly in $n$ is no restriction, as the roles of balls and bins turn out to be easily exchanged. Before elaborating on the application to two-sample testing and making this statement precise, we give an efficient algorithm to reconstruct a distribution from its moments.

4 Reconstructing a distribution from its moments

Reconstructing a probability measure from its moments is a task that has received attention both in theoretical settings (see e.g. Akhiezer and Kemmer, 1965), where existence and uniqueness questions are addressed, and in applied statistical problems, where existence and uniqueness are typically taken for granted, and efficient algorithms for computing the distribution in question are sought. For a discrete distribution on $n$ atoms, the latter can be done by solving an $n \times n$ Vandermonde system, which Björck and Pereyra (1970) showed is solvable in $O(n^2)$ time.
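To make the Vandermonde route concrete, here is a small self-contained sketch (our own illustration; a dense Gaussian-elimination solve rather than the structured $O(n^2)$ Björck-Pereyra algorithm) recovering atom probabilities from moments when the atom locations are known:

```python
# Recover atom probabilities p_i from moments mu_m = sum_i p_i x_i^m,
# m = 0..n-1, by solving the associated n x n Vandermonde system.
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting on a dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

atoms = [0.0, 0.25, 0.5, 1.0]
probs = [0.1, 0.4, 0.3, 0.2]
mu = [sum(p * a ** m for a, p in zip(atoms, probs)) for m in range(len(atoms))]
A = [[a ** m for a in atoms] for m in range(len(atoms))]
rec = solve_linear(A, mu)
assert all(abs(p - q) < 1e-10 for p, q in zip(rec, probs))
```

Distinct atom locations make the Vandermonde matrix invertible, so the recovery is exact up to floating-point error; the obstacle discussed next is that in our setting neither the number nor the locations of the atoms are practically accessible.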
However, in our setting $\|S_{n,k}\|_{p,w}^p$ generically has $O(n^{k-1})$ atoms, whose precise locations within $\{x_{\min}, \dots, \|w\|_\infty n^p\}$ are typically unknown. That is, solving the moment problem for $\|S_{n,k}\|_{p,w}^p$ exactly via its associated Vandermonde system requires $O\bigl(\min\{\|w\|_\infty n^p,\, n^{k-1}\}^2\bigr)$ operations, which becomes prohibitively large already for small values of $p$ or $k$. Moreover, aside from this computational intractability in our discrete setting, it is clear that such a direct approach is infeasible in the corresponding infinite-dimensional scenario required for $\|S_k\|_{p,w}^p$, and hence new algorithms are needed.

A commonly proposed alternative consists of forfeiting the demand for exact recovery and focusing on approximate reconstruction instead, trading accuracy for accelerated runtimes. Examples of such approximation schemes include maximum-entropy-based algorithms (see, e.g., Mead and Papanicolaou, 1984) as well as various applications of the method of moments. While the latter relies to a large extent on strong parametric assumptions, which are not available for our generalized Greenwood statistics, the former is primarily useful for density estimation from only a few explicit (or estimated) moments. Theorems 1 and 2, however, give us quick access to a vast number of moments, so a maximum entropy ansatz could waste valuable information. To remedy this situation, we recall a fact that is mentioned in Feller (2008) (p. 227, Theorem 2) but that, to the best of our knowledge, has not found widespread use in applied statistics.
Fact 1.
Let $X \in [0,1]$ be a (not necessarily continuous) random variable with cumulative distribution function $F$ and moments $\mu_m = \mathbb{E}X^m$. Then at every continuity point $x$ of $F$ we have $\lim_{n\to\infty} \hat F_n(x) = F(x)$, where
$$\hat F_n(x) := \sum_{j=0}^{\lfloor nx \rfloor} \binom{n}{j} (-1)^{n-j} \bigl(\delta^{n-j}\mu\bigr)_j, \tag{39}$$
with $\delta : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}^{\mathbb{N}}$ being the difference operator $\delta : (a_j)_{j\in\mathbb{N}} \mapsto (a_{j+1} - a_j)_{j\in\mathbb{N}}$.

Part of the reason for the modest popularity of Fact 1 is its numerical instability when the moments $\mu_m$ are not known exactly. In our situation, this instability does not present any limitations, since Theorems 1 and 2 yield $\mu_m = \mathbb{E}\bigl(\|S_{n,k}\|_{p,w}^p\bigr)^m$ (or $\mu_m = \mathbb{E}\bigl(\|S_k\|_{p,w}^p\bigr)^m$, respectively) to arbitrary precision, thus rendering (39) a promising candidate for reconstructing the distributions in question. It remains to clarify its convergence speed:

Proposition 6.
Let $X \in [0,1]$ be a random variable which is either (i) absolutely continuous with respect to $\lambda$, with density $f \in C^1([0,1])$, or (ii) discrete with support $\operatorname{supp} X = \{x_1, \dots, x_N\}$. Then for any resolution $\varepsilon_n \to 0$ with $\varepsilon_n \ge n^{-1/2+\delta}$ for some $\delta > 0$, there exists $n_0(f, \varepsilon) \in \mathbb{N}$ so that for all $n \ge n_0$,
$$\sup_{x \in [0,1]} \bigl|\hat F_n(x) - F(x)\bigr| \le \frac{\|f\|_\infty + 2\|f'\|_\infty + 2}{n+1}, \tag{i}$$
$$\sup_{x \in [0,1] \setminus \operatorname{supp}^{\varepsilon_n} X} \bigl|\hat F_n(x) - F(x)\bigr| \le 2e^{-2n\varepsilon_n^2} + \bigl(|\operatorname{supp} X| - 2\bigr) e^{-2nh^2}, \tag{ii}$$
where $\hat F_n(x)$ is defined in (39), $\operatorname{supp}^{\varepsilon} X = \{x \in [0,1] : d(x, \operatorname{supp} X) < \varepsilon\}$ is the $\varepsilon$-fattening, and $h = \min_{i \ne j} |x_i - x_j|$ is the mesh of $\operatorname{supp} X$.

Proof. We first tackle (i) by recalling from Feller (2008) that the summation in (39) is nothing but
$$\mathbb{E}\,B_{n,x}(X) = \mathbb{E} \sum_{j=0}^{\lfloor nx \rfloor} \binom{n}{j} X^j (1-X)^{n-j}, \tag{40}$$
where $B_{n,x}$ is the degree-$n$ approximation of $\mathbb{1}_{[0,x]}$ by Bernstein polynomials (see Bernstein, 1912). To compute its approximation error, we choose a threshold $\varepsilon_n \to 0$ and split
$$F(x) - \mathbb{E}B_{n,x}(X) = \mathbb{E}\bigl(\mathbb{1}_{[0,x]}(X) - B_{n,x}(X)\bigr) = \underbrace{\int_{[0,1] \setminus \{x\}^{\varepsilon_n}} \bigl(\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr) f(y)\,\mathrm{d}y}_{A_{n,x}} + \underbrace{\int_{\{x\}^{\varepsilon_n}} \bigl(\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr) f(y)\,\mathrm{d}y}_{A'_{n,x}}, \tag{41}$$
and treat the term $A_{n,x}$ first. Interpreting $B_{n,x}(y)$ as $P(S_{n,y} \le nx)$, where $S_{n,y} \sim \text{Binomial}(n, y)$, we see by standard large deviation estimates and Pinsker's inequality that
$$|A_{n,x}| \le (x - \varepsilon_n)\,\|f\|_\infty\, e^{-n D_{\mathrm{KL}}(x \,\|\, x - \varepsilon_n)} + \|f\|_\infty (1 - x + \varepsilon_n)\, e^{-n D_{\mathrm{KL}}(x \,\|\, x + \varepsilon_n)} \le 2\|f\|_\infty\, e^{-2n\varepsilon_n^2}, \tag{42}$$
where $D_{\mathrm{KL}}(p \,\|\, q)$ is the Kullback-Leibler divergence (or relative entropy) between a Bernoulli($p$) and a Bernoulli($q$) distribution. To control $A'_{n,x}$, we then Taylor expand $f$ to rewrite the corresponding integral in (41) as
$$A'_{n,x} = \int_{\{x\}^{\varepsilon_n}} \bigl(\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr)\bigl(f(x) + f'(\xi_{y,x})(y - x)\bigr)\,\mathrm{d}y = \underbrace{\bigl(f(x) - M_n x\bigr) \int_{\{x\}^{\varepsilon_n}} \bigl(\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr)\,\mathrm{d}y}_{A''_{n,x}} + \underbrace{M_n \int_{\{x\}^{\varepsilon_n}} \bigl(\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr)\, y\,\mathrm{d}y}_{A'''_{n,x}}, \tag{43}$$
where $\min_{y \in \{x\}^{\varepsilon_n}} f'(y) \le M_n \le \max_{y \in \{x\}^{\varepsilon_n}} f'(y)$. In particular, since we assumed $f \in C^1([0,1])$ and $\varepsilon_n \to 0$, there must exist $n'$ so that $f'(x) - 1 \le M_n \le f'(x) + 1$ for all $n \ge n'$. So it remains to control $A''_{n,x}$ and $A'''_{n,x}$, which can be done in a manner similar to (42):
$$|A''_{n,x}| \le \Bigl|\int_{[0,1]} \bigl(\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr)\,\mathrm{d}y\Bigr| + 2e^{-2n\varepsilon_n^2} = \Bigl|x - \frac{\lfloor nx \rfloor + 1}{n+1}\Bigr| + 2e^{-2n\varepsilon_n^2} \le \frac{1}{n+1} + 2e^{-2n\varepsilon_n^2}, \tag{44}$$
$$|A'''_{n,x}| \le \Bigl|\int_{[0,1]} \bigl(\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr)\, y\,\mathrm{d}y\Bigr| + 2e^{-2n\varepsilon_n^2} = \Bigl|\frac{x^2}{2} - \frac{(\lfloor nx \rfloor + 1)(\lfloor nx \rfloor + 2)}{2(n+1)(n+2)}\Bigr| + 2e^{-2n\varepsilon_n^2} \le \frac{2}{n+1} + 2e^{-2n\varepsilon_n^2}, \tag{45}$$
provided $n \ge 4$. Finally, combining (41)-(45), we obtain
$$\bigl|\hat F_n(x) - F(x)\bigr| \le \frac{\|f\|_\infty + 2\|f'\|_\infty + 2}{n+1} + 2\bigl(\|f\|_\infty + \|f'\|_\infty\bigr) e^{-2n\varepsilon_n^2}, \tag{46}$$
independently of $x$. Choosing $\varepsilon_n \ge n^{-1/2+\delta}$ and $n$ so large that the first term dominates the second yields (i). (ii) follows in a very similar manner by observing that for $n$ such that $\varepsilon_n < h$, any $x \in [0,1] \setminus \operatorname{supp}^{\varepsilon_n} X$ satisfies
$$\bigl|\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr| \le \begin{cases} e^{-2n\varepsilon_n^2}, & \text{if } y \in \{y_{\min}(x), y_{\max}(x)\}, \\ e^{-2nh^2}, & \text{for all other } y \in \operatorname{supp} X, \end{cases} \tag{47}$$
where $y_{\min}(x) = \min\{y' \in \operatorname{supp} X : y' > x\}$ and $y_{\max}(x) = \max\{y' \in \operatorname{supp} X : y' < x\}$ are the two atoms of $X$ immediately right and left of $x$. Therefore,
$$\bigl|F(x) - \mathbb{E}B_{n,x}(X)\bigr| \le \sum_{y \in \operatorname{supp} X} P(X = y)\,\bigl|\mathbb{1}_{[0,x]}(y) - B_{n,x}(y)\bigr| \le 2e^{-2n\varepsilon_n^2} + \bigl(|\operatorname{supp} X| - 2\bigr) e^{-2nh^2}, \tag{48}$$
which is (ii).

We remark that the proof works equally well for distributions that have both an absolutely continuous and a singular part (with respect to $\lambda$), in which case the continuous component presents the bottleneck, resulting in an $O(n^{-1})$ bound as in (i). For purely discrete measures, however, we notice that by setting $\varepsilon_n = \varepsilon < h/2$, one can recover $F(x)$ for $x \in \operatorname{supp} X$ up to exponentially decreasing error (in the number of moments) by computing $\hat F(x + 2\varepsilon)$. Moreover, the bounds (i)-(ii) represent worst-case errors, attained at $x$ for which $f(x)$ and $|f'(x)|$ are large, or at which atoms of $X$ are densely packed, respectively. Away from these bottlenecks, and in particular in the tails of $\|S_{n,k}\|_{p,w}$ and $\|S_k\|_{p,w}$, these guarantees should improve significantly.
Lastly, we may replace each use of Bernstein polynomials throughout the entire analysis with any other expedient polynomial approximation of $\mathbb{1}_{[0,x]}$ in order to impose desired properties on the reconstructed density. If, e.g., one-sided reconstructions are preferable (for instance, in order to give rise to conservative hypothesis tests in Section 5), then resorting to appropriate one-sided polynomial approximations (the optimal of which is worked out in Bustamante et al., 2012) will enforce this preference.

To summarize our situation:
1. We can approximate the distribution of $\|S_k\|_{p,w}^p$ on $[a_{p,w}, 1]$, for an explicit threshold $a_{p,w} < 1$, with error decaying exponentially in the number $r$ of Taylor coefficients around $x = 1$, in $O\bigl(\frac{r}{p}\log\frac{r}{p}\log k + [r \log r]^2\bigr)$ time (see Proposition 7 in the Appendix).
2. On $[0, a_{p,w}]$, we can achieve a uniform approximation error of $\varepsilon$ in $O\bigl(\varepsilon^{-1}\log \varepsilon^{-1}\log k\bigr)$ time (see Proposition 6); for $p = 2$ and $w = \mathbf{1}_k$, exact formulas are moreover available near the minimum $k^{1-p}$ by the lemma above.
3. For $n$ large, we can approximate the distribution of $\|S_{n,k}\|_{p,w}^p$ by that of $\|S_k\|_{p,w}^p$ using the two points above; the additional error incurred is of order $O(n^{-1})$.
4. For general $n$ and $k$, Proposition 6 yields an $\varepsilon$-approximation of the distribution of $\|S_{n,k}\|_{p,w}^p$ in $O(\log \varepsilon^{-1})$ time.

These observations form the computational basis of the hypothesis tests developed next. Having developed a thorough distributional understanding of both Greenwood's and Dixon's statistics as well as their various generalizations, we are now in a position to illuminate their role in the hypothesis tests that motivated them. We begin by carrying out the original test of uniformity proposed by Greenwood (1946), demonstrating its power in comparison to other commonly used test statistics and describing its implications for the wider class of one-sample tests. Our second application is then devoted to clarifying completely the two-sample test described in Dixon (1940), replacing low-order approximations and lifting the limiting sample-size constraints assumed therein, in addition to illustrating how the flexibility that comes with our family of generalized test statistics can substantially improve power.
Recall from Section 1 that the null hypothesis queried in Greenwood (1946) concerned the distributional family of $k$ sample times $T_1, \dots, T_k$. Namely,
$$H_0 : \{T_j\}_{j\in[k]} \overset{\text{i.i.d.}}{\sim} \mathcal{E}(\lambda), \quad \text{for some } \lambda \in \mathbb{R}_+, \tag{49}$$
where $\mathcal{E}(\lambda)$ denotes the exponential distribution of rate $\lambda$. It is of particular interest to be able to detect alternatives consisting of point processes whose inter-arrival times show either under- or overdispersion with respect to this homogeneous Poisson($\lambda$) process; that is, it is desirable to maximize power against
$$H_1 : \{T_j\}_{j\in[k]} \overset{\text{i.i.d.}}{\sim} X, \quad \text{where } c_V = \frac{\sqrt{\operatorname{Var}X}}{\mathbb{E}X} \ne 1. \tag{50}$$
By normalizing by $\sum_j T_j$, Greenwood noticed that this decision problem is tantamount to the task of distinguishing the null hypothesis
$$H_0 : \Bigl(T_1,\ T_1 + T_2,\ \dots,\ \textstyle\sum_{j=1}^{k-1} T_j\Bigr) \Bigm/ \Bigl(\textstyle\sum_{j=1}^{k} T_j\Bigr) \sim \bigl(U_{(1)}, \dots, U_{(k-1)}\bigr), \tag{51}$$
where $\{U_j\}_{j\in[k-1]} \overset{\text{i.i.d.}}{\sim} \text{Uniform}([0,1])$ and $U_{(j)}$ is the $j$th order statistic, from a class of alternatives where points in $[0,1]$ tend to, intuitively, be overly equi-spaced (corresponding to $c_V < 1$) or overly clustered (resulting from $c_V > 1$). Equivalently, in terms of spacings,
$$H_0 : \bigl(T_1, \dots, T_k\bigr) \Bigm/ \Bigl(\textstyle\sum_{j=1}^k T_j\Bigr) \sim \mu_{\Delta_{k-1}}, \tag{52}$$
with spacings in the alternative class exhibiting either smaller ($c_V < 1$) or larger ($c_V > 1$) variances than under the null. It is this last formulation (52) that motivated Greenwood to introduce his eponymous statistic $\bigl(\sum_{j=1}^k T_j^2\bigr)/\bigl(\sum_{j=1}^k T_j\bigr)^2$, whose law under the null is simply that of $\|S_k\|_{2,\mathbf{1}_k}^2$ in our notation above. Greenwood successfully treated the case $k = 2$, but was unable to extend his results to larger sample sizes. The results of the preceding sections now enable us to compute $p$-values efficiently and accurately for arbitrary $k$; indeed, our algorithm is fast enough to run large-scale power studies over an extensive range of $k$. As Figure 1 illustrates, the resulting power gains over other commonly used tests of uniformity (Pearson's $\chi^2$, Kolmogorov-Smirnov, Cramér-von Mises) are uniformly high (and particularly pronounced in the case of underdispersed data), rendering the Greenwood statistic a suitable hypothesis test to decide (49) against (50).

This performance is especially encouraging in light of the role that tests of uniformity play in the larger context of one-sample testing, where one is given a sample $Z_1, \dots, Z_k$ of size $k$ and wants to ascertain whether these $k$ samples all arose in an i.i.d. fashion from the same continuous distribution $F$. To ask whether $\{Z_i\}_{i\in[k]}$ are i.i.d. $F$, however, is the same as to ask whether $\{F(Z_i)\}_{i\in[k]}$ are distributed i.i.d. as $\text{Uniform}([0,1])$. A further advantage of the family of generalized Greenwood statistics $\{\|S_{n,k}\|_{p,w}^p\}_{p\in\mathbb{N},\, w\in\mathbb{R}^k}$ is its ability to accommodate various, even strongly disparate, alternatives by means of adjusting $p$ and $w$. As this is best illustrated in the framework of two-sample tests, and readily reduced to one-sample tests from there, we will not delineate the details here, but rather develop them in the following subsection on two-sample tests, while being careful to point out any particular adjustments that may be necessary for application to one-sample tests.
Figure 1: One-sample test results. ROC and power curves of tests based on the Greenwood statistic $\|S_k\|_{2,\mathbf{1}_k}^2$ on under- and overdispersed data for $k = 10$, compared to three other commonly used tests of uniformity (Pearson's $\chi^2$, Kolmogorov-Smirnov, Cramér-von Mises). Each experiment consists of 1000 independently drawn samples from the null ($\mu_{\Delta_{k-1}}$) and alternative (Erlang and Hyperexponential, either as individual classes as in the top two panels, or mixed into one class and presented at equal probability as in the bottom left panel) distributions matching the stated coefficients of variation; $\alpha$ denotes the type I error, while $\beta$ is the type II error (and $1 - \beta$ is power).

One-sample tests are, in a concrete sense (namely, that of (38) in Proposition 5), large sample-size limits of two-sample tests: instead of judging whether the generating distribution of one given sample $\{Z_i\}_{i\in[k]}$ matches a suspected continuous $F$, the task is to decide whether two drawn samples $\{X_i\}_{i\in[k-1]}$ and $\{Y_j\}_{j\in[n]}$ have identical generating mechanisms. That is, assuming that $\{X_i\}_{i\in[k-1]}$ and $\{Y_j\}_{j\in[n]}$ are generated i.i.d. from $F$ and $G$, respectively, the null hypothesis to be tested is
$$H_0 : F = G, \tag{53}$$
which indeed reduces to the one-sample setting in the $n \to \infty$ limit (where $G$ becomes fully known). The analogue of the uniformizing transformation $\{Z_i\}_{i\in[k]} \to \{F(Z_i)\}_{i\in[k]}$ of the one-sample setting is now the discrete uniformization $\{X_i\}_{i\in[k-1]}, \{Y_j\}_{j\in[n]} \to \{S_{n,k}\llbracket j\rrbracket\}_{j\in[k]}$, where $S_{n,k}\llbracket j\rrbracket$ is as defined in (1). It is straightforward to verify that $S_{n,k} = (S_{n,k}\llbracket 1\rrbracket, \dots, S_{n,k}\llbracket k\rrbracket)$ is distributed as $\text{Multinomial}(n, \alpha)$ with $\alpha \sim \text{Dirichlet}(1, \dots, 1)$; we denote this law by $\mu_{D_{n,k}}$. Consequently, to probe $H_0$ is really to probe whether $S_{n,k}$ is distributed according to $\mu_{D_{n,k}}$ or not. An array of test statistics devised to be sensitive against either arbitrary (e.g. the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1948) or the Cramér-von Mises test (Cramér, 1928; Von Mises, 2013)) or specific (e.g. the Mann-Whitney test (Mann and Whitney, 1947)) families of alternatives have surfaced over the last century; yet, to the best of our knowledge, a clear understanding of their relative power against each such family of alternatives has remained elusive. In other words, given various classes of alternatives $\mathcal{A}_1, \dots, \mathcal{A}_d$ that a practitioner might deem likely to present themselves, it is often unclear how an appropriate test statistic is to be chosen. Having access to fast numerical evaluations of $\|S_{n,k}\|_{p,w}^p$ and their laws for arbitrary $p$ and $w$ offers one attractive solution to this problem: it allows the user to optimize any quantity of interest (like power) over this family of generalized Greenwood statistics quickly. To wit, assuming for now a sufficiently well-behaved class of alternatives $\mathcal{A}$, i.e. $H_1 : G \in \mathcal{A}$, and denoting by $H_{n,k}^{p,w}$ the law of $\|S_{n,k}\|_{p,w}^p$ on $\mathbb{R}$ induced by discretely uniformizing $\{Y_j\}_{j\in[n]} \overset{\text{i.i.d.}}{\sim} H \in \mathcal{A}$ through $\{X_i\}_{i\in[k-1]}$, the (two-sided) power at significance threshold $\alpha$ is computed as
$$\min_{H \in \mathcal{A}} \bigl\{1 - \beta^\alpha_{p,w}(H)\bigr\} = \min_{H \in \mathcal{A}} \Bigl\{1 - H_{n,k}^{p,w}\bigl([z^-_{p,w}(\alpha),\, z^+_{p,w}(\alpha)]\bigr)\Bigr\}, \tag{54}$$
where $z^\pm_{p,w}(\alpha)$ are the $\alpha/2$- and $(1 - \alpha/2)$-quantiles of $\|S_{n,k}\|_{p,w}^p$ under $\mu_{D_{n,k}}$ (i.e. $\mu_{D_{n,k}}\bigl([z^-_{p,w}(\alpha), z^+_{p,w}(\alpha)]\bigr) = 1 - \alpha$). The quantiles $z^\pm_{p,w}(\alpha)$ are efficiently computed from the moments of $\mu_{D_{n,k}}$, so in principle, if $\mathcal{A}$ is tractable enough (relative to $F$) to allow for an explicit characterization of $H_{n,k}^{p,w}$ for every $H \in \mathcal{A}$, (54) is amenable to fast numerical optimization.
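To make the discrete uniformization and the statistic concrete, here is a minimal sketch; a Monte Carlo null stands in for the moment-based quantiles $z^\pm_{p,w}(\alpha)$ of (54), and the sample values are illustrative.

```python
import random
from bisect import bisect_left

def spacing_counts(x, y):
    """Discrete uniformization: the k-1 order statistics of x split the real
    line into k intervals; S_{n,k}[j] counts the y-values landing in interval j.
    (Ties would require independent random breaking, as noted in the text.)"""
    xs = sorted(x)
    counts = [0] * (len(xs) + 1)
    for v in y:
        counts[bisect_left(xs, v)] += 1
    return counts

def spacing_statistic(counts, p, w):
    """Generalized Greenwood statistic ||S_{n,k}||_{p,w}^p = sum_j w_j S[j]^p."""
    return sum(wj * c ** p for wj, c in zip(w, counts))

def mc_pvalue(x, y, p, w, trials=2000, seed=0):
    """Two-sided Monte Carlo p-value under H0: F = G. Uniform resampling
    realizes the null law mu_{D_{n,k}}; a stand-in for the exact
    moment-based quantile computation described in the text."""
    rng = random.Random(seed)
    obs = spacing_statistic(spacing_counts(x, y), p, w)
    null = [spacing_statistic(spacing_counts([rng.random() for _ in x],
                                             [rng.random() for _ in y]), p, w)
            for _ in range(trials)]
    lo = sum(s <= obs for s in null) / trials
    hi = sum(s >= obs for s in null) / trials
    return min(1.0, 2 * min(lo, hi))

counts = spacing_counts([0.3, 0.6], [0.1, 0.2, 0.4, 0.7, 0.9])
print(counts, spacing_statistic(counts, 2, [1, 1, 1]))  # [2, 1, 2] 9
```

Optimizing over $p$ and $w$ then amounts to re-running the same pipeline with different parameters, which is what makes the fast null quantiles of the main text valuable.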
Alas, in practice we can hardly expect $\|S_{n,k}\|_{p,w}^p$ under $H_1$ to be as accessible as under the null, so computing (54) to arbitrarily high accuracy will likely prove infeasible. Nevertheless, reasonable approximations to (54) are often available under mild assumptions on $\mathcal{A}$. The following two examples illustrate the process of performing such approximate optimization, its impact on statistical power, as well as how to extend this selection process to composite hypotheses.

Example 1 (Detecting heteroskedasticity). Assume without loss of generality that $\mathbb{E}X = 0$, $\operatorname{Var}X = 1$, and that we would like to test against alternatives of the form $G = F \circ (y \mapsto y/\sigma)$, i.e. $Y = \sigma X$, for any constant $\sigma \in \mathbb{R}_+$. It is straightforward to see that as $\sigma \to \infty$, $S_{n,k}$ will be concentrated mostly on its far ends $S_{n,k}\llbracket 1\rrbracket$ and $S_{n,k}\llbracket k\rrbracket$, weighted by $F(0)$; i.e., $S_{n,k} \xrightarrow{d} S^\infty_{n,k} := (N_\infty, 0, \dots, 0, n - N_\infty)$, where $N_\infty \sim \text{Binomial}(n, F(0))$ and $P\bigl(S_{n,k} \ne S^\infty_{n,k}\bigr) = O(\sigma^{-1})$. Likewise, the limiting law of $S_{n,k}$ as $\sigma \to 0$ is quickly verified to be that of $S^0_{n,k} := n\,\mathbb{1}_{\{N_0\}}$, in which all $n$ points fall into the single bin containing the origin, whose index satisfies $N_0 - 1 \sim \text{Binomial}(k-1, F(0))$; again $P\bigl(S_{n,k} \ne S^0_{n,k}\bigr) = O(\sigma)$. It is therefore plausible to assume that most mass of $\|S_{n,k}\|_{p,w}^p$ is tightly concentrated around $\|S^\infty_{n,k}\|_{p,w}^p = w_1 N_\infty^p + w_k (n - N_\infty)^p$ and $\|S^0_{n,k}\|_{p,w}^p = n^p\, w_{N_0}$, respectively, and that thus (54) is appreciably large whenever
$$\mu_{D_{n,k}}\Bigl(\|S_{n,k}\|_{p,w}^p \notin \bigl[\|S^0_{n,k}\|_{p,w}^p,\, \|S^\infty_{n,k}\|_{p,w}^p\bigr]\Bigr) = \sum_{i=0}^{n} \sum_{j=1}^{k} P(N_\infty = i)\, P(N_0 = j)\; \mu_{D_{n,k}}\Bigl(\|S_{n,k}\|_{p,w}^p \notin \bigl[\|S^0_{n,k}\|_{p,w}^p,\, \|S^\infty_{n,k}\|_{p,w}^p\bigr] \Bigm| N_\infty = i,\, N_0 = j\Bigr) \lesssim \alpha, \tag{55}$$
where by a slight abuse of notation we use $[a, b]$ to denote the interval $[\min\{a,b\}, \max\{a,b\}]$. This motivates solving the approximate, yet computationally tractable, optimization problem of finding
$$\arg\min_{p,w}\ \mu_{D_{n,k}}\Bigl(\|S_{n,k}\|_{p,w}^p \notin \bigl[\|S^0_{n,k}\|_{p,w}^p,\, \|S^\infty_{n,k}\|_{p,w}^p\bigr]\Bigr), \tag{56}$$
for any given $F(0)$, instead of optimizing (54) directly. To verify empirically that any such minimizers do indeed give rise to a powerful test of heteroskedasticity, we ran large-scale simulations for various $F$ and $G$ in the assumed family of distributions. As Figure 2A illustrates, where performing the optimization in (56) yielded $p = 1$ together with a weight vector $w$ concentrating its mass on the two boundary coordinates $w_1$ and $w_k$, our new two-sample test based on the generalized spacing-statistics $\|S_{n,k}\|_{p,w}^p$ compares favorably with other non-parametric tests (Mann-Whitney, Kolmogorov-Smirnov, and Cramér-von Mises) commonly used for such tasks.

Figure 2: Two-sample test results. The null hypothesis tested is $F = G$ using samples $X_1, \dots, X_{k-1} \overset{\text{i.i.d.}}{\sim} F = \text{Normal}(0, 1)$ and $Y_1, \dots, Y_n \overset{\text{i.i.d.}}{\sim} G = \text{Normal}(\mu, \sigma^2)$, for $k = 10$ and $n = 30$ (different parameter choices led to minor qualitative changes only). The experimental setup is similar to that of Figure 1. A. ROC and power curves for detecting heteroskedasticity. Our new test based on the generalized spacing-statistics $\|S_{n,k}\|_{p,w}^p$ exhibits substantially improved performance over the other non-parametric two-sample tests. B. Power against joint variations in location and scale: $G = \text{Normal}(\mu, \sigma^2)$, with $\mu$ ranging over five location shifts and $\sigma$ over five scales. Bubble color indicates the test statistic used (generalized spacing-statistics in red, Mann-Whitney in black, Kolmogorov-Smirnov in dark gray, Cramér-von Mises in light gray), while bubble radius indicates power $1 - \beta$. The column for $\mu = 0$ corresponds to the results illustrated in the bottom panel of A.

Although the alternatives in Example 1 are composite, the laws they induce on $D_{n,k}$ are all tightly clustered around two universal ones, thereby effectively reducing the decision task to a semi-simple hypothesis test. The extension to truly composite settings is standard:

Example 2 (Sensing location and scale). We enrich the class of alternatives in Example 1 by location shifts, i.e. we consider $G$ of the form $G = F \circ (x \mapsto (x - \mu)/\sigma)$ for $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}_+$. The laws $G_{\mu,\sigma}$ induced on $D_{n,k}$ now exhibit infinitely many accumulation points, barring any simple optimization of the kind we performed before. Indeed, even if we had the capacity to identify maximizers of (54) explicitly, they likely would not deliver satisfactory power, for we have
$$\max_{p,w} \min_{H \in \mathcal{A}} \bigl(1 - \beta^\alpha_{p,w}(H)\bigr) \le \max_{p,w} \min_{H \in \mathcal{A}_\mu \cup \mathcal{A}_\sigma} \bigl(1 - \beta^\alpha_{p,w}(H)\bigr), \tag{57}$$
where $\mathcal{A}_\sigma = \{G : G = F \circ (x \mapsto x/\sigma) \text{ for some } \sigma \in \mathbb{R}_+\} \subset \mathcal{A}$ and $\mathcal{A}_\mu = \{G : G = F \circ (x \mapsto x - \mu) \text{ for some } \mu \in \mathbb{R}\} \subset \mathcal{A}$ are the pure scales and pure location shifts, respectively. Computations as in Example 1 successfully yield powerful parameter choices against $\mathcal{A}_\mu$ and $\mathcal{A}_\sigma$ individually, yet fail to do so for $\mathcal{A}_\mu$ and $\mathcal{A}_\sigma$ jointly, producing unsatisfactorily small values of $1 - \beta$. To mend this shortcoming, we can capitalize on the successful optimizers $(p_\mu, w_\mu)$ and $(p_\sigma, w_\sigma)$ against $\mathcal{A}_\mu$ and $\mathcal{A}_\sigma$ individually by considering their ensemble; that is, we optimize
$$\arg\max_{p_\mu, w_\mu, p_\sigma, w_\sigma}\ \max\Bigl\{ \min_{H_\mu \in \mathcal{A}_\mu} \bigl(1 - \beta^\alpha_{p_\mu, w_\mu}(H_\mu)\bigr),\ \min_{H_\sigma \in \mathcal{A}_\sigma} \bigl(1 - \beta^\alpha_{p_\sigma, w_\sigma}(H_\sigma)\bigr) \Bigr\}, \tag{58}$$
which, again employing strategies as in Example 1, is done efficiently. To verify the utility of $(p_\mu, w_\mu)$ and $(p_\sigma, w_\sigma)$ we, as before, resorted to extensive numerical simulations, a typical outcome of which is depicted in Figure 2B. Since the choices of $n$ and $k$ in this particular instance match those of Figure 2A, the resulting $(p_\sigma, w_\sigma)$ are identical to $(p, w)$ of Example 1. The optimizer $p_\mu = 1$ with a monotone weight vector $w_\mu$, on the other hand, turned out to closely resemble the parameter configuration that gives rise to Mann-Whitney's $U$ statistic, which is not surprising since the latter was designed with location shifts in mind. As a consequence, our test based on $\|S_{n,k}\|_{p_\mu, w_\mu}^{p_\mu}$ and $\|S_{n,k}\|_{p_\sigma, w_\sigma}^{p_\sigma}$ boasts power comparable to the Mann-Whitney test when sensing location shifts, while extending sensitivity to heteroskedastic alternatives as well. Notably, this comparative advantage in detecting scale changes persists even against tests like Kolmogorov-Smirnov and Cramér-von Mises whose design is not centered around mean shifts.

Examples 1 and 2 provide manifestations of (54) in two concrete two-sample instances, for which the optimal choice of $p$ happened to be $p = 1$. As mentioned before, these optimization tools are easily adapted to the one-sample setting, which often requires mere replacement of $S_{n,k}$ with $S_k$.
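A minimal sketch of that replacement: the classical one-sample Greenwood statistic $\|S_k\|_{2,\mathbf{1}_k}^2$, obtained by uniformizing a sample through a hypothesized $F$ and summing squared spacings, again with a Monte Carlo null standing in for the exact moment-based quantiles of the text.

```python
import random

def greenwood(z, cdf=lambda t: t):
    """Greenwood's statistic ||S_k||_{2,1}^2: uniformize the k-1 points via
    u = F(z), then sum the squared spacings of the k subintervals of [0, 1]
    (endpoints 0 and 1 included)."""
    pts = [0.0] + sorted(cdf(v) for v in z) + [1.0]
    return sum((b - a) ** 2 for a, b in zip(pts, pts[1:]))

def mc_pvalue(stat, k, trials=4000, seed=1):
    """One-sided Monte Carlo p-value under the uniform null (large values of
    the statistic indicate clustering); a stand-in for the exact moment-based
    quantiles developed in the main text."""
    rng = random.Random(seed)
    null = [greenwood([rng.random() for _ in range(k - 1)]) for _ in range(trials)]
    return sum(s >= stat for s in null) / trials

k = 10
print(greenwood([j / k for j in range(1, k)]))  # equi-spaced: minimum value 1/k
print(mc_pvalue(greenwood([0.5] * (k - 1)), k) < 0.05)  # clustered: rejected
```

Equi-spaced points attain the minimum $1/k$ (strong evidence of underdispersion in a one-sided test against clustering), while a tightly clustered sample yields a large statistic and a small p-value.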
The following example illustrates such an adaptation in practice, while also supplying a family of circumstances in which choices of $p$ greater than 2 lead to greater power.

Example 3 (Spiked spacing model and higher $p$-norms). We revisit the one-sample test in (49), where we sought to distinguish Markovian arrival times from over- or underdispersed alternatives. In our present case, however, we investigate an alternative hypothesis $H_1$ consisting of a distribution $G_k$ of $T_1, \dots, T_k$ that is both over- and underdispersed in the following sense: under $G_k$, arrival times are again drawn i.i.d. from an underdispersed distribution $H_-$ (that is, $c_V(H_-) < 1$), with the exception of a single randomly chosen $T_K$ (i.e., $K \sim \text{Uniform}([k])$) whose law $H_+$ now exhibits overdispersion ($c_V(H_+) > 1$). In other words, $G_k$ mixes $k - 1$ underdispersed arrivals with one uniformly chosen overdispersed spacing. We will call this overdispersed $T_K$ the spiked or outlier arrival time, and refer to the just-described model of $G_k$ as the spiked spacing model. Though the subsequent analysis is phrased in terms of this spiked spacing model, much of its reasoning pertains to similar outlier or correlation models of this kind as well.

To design a test capable of reliably detecting this spiked spacing model, we first observe that the symmetry in $T_1, \dots, T_k$ (induced by the uniform choice of $K$) suggests little benefit from choices of $w$ other than $\mathbf{1}_k$, leaving $p$ as the sole parameter to optimize in (54). To choose among the candidates for $p$, it is useful to clarify and compare the geometry that various $\ell^p$ balls give rise to when intersected with $\Delta_{k-1}$: as the low-dimensional illustrations of Figure 3A demonstrate, the growth of the (normalized) intersection volume $V_k^p(s) = \mu_{\Delta_{k-1}}\bigl(\|S_k\|_{p,\mathbf{1}_k}^p \le \|s\|_{p,\mathbf{1}_k}^p\bigr)$ depends noticeably on the precise location of our observation $s$.
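To make the role of $p$ concrete, the following sketch compares a spiked spacing vector with the barycenter of the simplex: the ratio of their statistics $\|\cdot\|_{p,\mathbf{1}_k}^p$ grows rapidly with $p$, illustrating why larger $p$ separates corner-concentrated alternatives more sharply (the vectors are illustrative, not taken from the paper's simulations).

```python
def pnorm_stat(s, p):
    """||s||_{p, 1_k}^p = sum_j s_j^p for a spacing vector s on the simplex."""
    return sum(v ** p for v in s)

k = 10
uniform = [1 / k] * k                        # barycenter: perfectly even spacings
spiked = [0.5] + [0.5 / (k - 1)] * (k - 1)   # one "spiked" (overdispersed) spacing

ratios = [pnorm_stat(spiked, p) / pnorm_stat(uniform, p) for p in (2, 4, 6)]
print([round(r, 1) for r in ratios])  # [2.8, 62.6, 1562.5]
```

A spiked sample thus drifts into the far right tail of the null distribution much faster under a $p = 6$ statistic than under $p = 2$, which is the mechanism behind the power gains reported in Figure 3B.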
If $s$ localizes exactly along any of the line segments $L_+ = \bigl\{\overline{k^{-1}\mathbf{1}_k,\ e_i}\bigr\}_{i\in[k]}$, where $e_i$ is the $i$th standard basis vector, then the $\ell^p$ sub-level set of $s$ is contained in its $\ell^q$ counterpart whenever $p < q$, so that $V_k^p(s) \le V_k^q(s)$; this containment reverses in case $s$ falls precisely on any of the line segments $L_- = \bigl\{\overline{k^{-1}\mathbf{1}_k,\ m_i}\bigr\}_{i\in[k]}$, where $m_i = (k-1)^{-1}(\mathbf{1}_k - e_i)$ is the barycenter of the $(k-2)$-dimensional face opposite vertex $e_i$. Since $p$-values are nothing but $1 - V_k^p(s)$, it follows that tests based on $\|S_k\|_{\infty,\mathbf{1}_k}^\infty$ should be most powerful in the former scenario, while tests based on small $p$ shine in the latter scenario, with intermediate localizations giving rise to an optimal $p^*$ in between. In our spiked spacing model at hand, the support of $G_k$ gravitates towards the line segments $L_+$, and so we expect choices of $p$ larger than 2 to be profitable. Indeed, carrying out simulations as in Figure 3B reveals this to be true, with precise values of $p^*$ depending on the distributional details of $H_+$ and $H_-$. Generally, $p^*$ is attained at moderate values for modest amplitudes of the spiked $T_K$ and/or moderate degrees of underdispersion in the remaining arrival times, and stabilizes for more pronounced levels of spiking and/or underdispersion; past $p = 6$, ROC and power curves tend to change only slightly.

Figure 3: Analysis of the spiked spacing model (described in detail in Example 3). A. Illustration of tail probabilities on $\Delta_2$ in the cases of $p = 2$ and $p = \infty$, and the samples (denoted by solid dots) giving rise to them. While a fixed sample near line segments in $L_+$ (purple, dashed lines in the top panel) produces smaller sub-level sets in $\ell^2$ than in $\ell^\infty$ (thereby increasing the $p$-value of said sample, which corresponds to 1 minus the shaded area), this trend reverses for observations near line segments in $L_-$ (orange, dashed lines in the bottom panel). B. ROC and power curves. The spiked spacing model largely concentrates around $L_+$ in $\Delta_{k-1}$, with the degree of this concentration increasing with spike size. As a consequence, $p$-norms of samples generated under such an alternative tend to separate more markedly for larger $p$, which in turn affords increases in the power of $\|S_k\|_{p,\mathbf{1}_k}^p$ when $p > 2$. The experimental design and choice of under- and overdispersed distributions match those of Figure 1; in particular, $k = 10$.

We close this section with a few remarks on the scope and availability of our proposed hypothesis tests:
1. Even though our entire discussion is phrased around continuous null and alternative distributions $F$ and $G$, the extension to discrete variables is straightforward: it merely requires recourse to a source of independent noise to randomly break ties when forming $S_{n,k}$.
2. Due to their widespread use, our primary focus lies on applications of the generalized Greenwood statistics $\|S_{n,k}\|_{p,w}^p$ to unpaired one- and two-sample tests. However, they can naturally be deployed in any other goodness-of-fit context in which null distributions effectively reduce to $\mu_{D_{n,k}}$ or $\mu_{\Delta_{k-1}}$, e.g. paired two-sample tests.
3. (54) and its derived optimization problems are stated so as to incorporate rare events (under $H_0$) in both the left and right tail of $\|S_{n,k}\|_{p,w}^p$. Of course, a one-sided hypothesis test can be enforced by considering only one such tail.
4. The significance threshold adjustment used when combining several statistics (such as the ensemble in Example 2) is of conservative Bonferroni type. In principle, one could instead compute the mixed moments $\mathbb{E}\bigl(\|S_{n,k}\|_{p,w}^p \cdot \|S_{n,k}\|_{q,v}^q\bigr)^m$ and recover from those the joint distribution $P\bigl(\|S_{n,k}\|_{p,w}^p \le s,\ \|S_{n,k}\|_{q,v}^q \le t\bigr)$, which would allow for more refined adjustment of significance thresholds.
5.
An implementation of both the one- and two-sample tests in Mathematica, together with precomputed parameter configurations optimal against shifts in location, scale, skewness and kurtosis (as well as combinations thereof), is available at https://github.com/songlab-cal/mochis.

Since early on, Greenwood's statistic and its relatives were theorized to be powerful candidates for a variety of goodness-of-fit tasks, yet proving them to be such, either rigorously or empirically, has largely remained open due to a lack of distributional understanding. Here we contribute to such understanding by embedding Greenwood's statistic into a larger family of laws, the generalized Greenwood statistics $\|S_{n,k}\|_{p,w}^p$, whose distributional properties are more amenable to analysis. In particular, we were able to obtain explicit, efficiently computable expressions for their associated moment sequences, and to glean from them both qualitative (e.g. convergence, regularity and monotonicity results) and quantitative (convergence rates, tail behaviour, CLT) insights. By providing an algorithmic procedure to recover a given distribution to arbitrary accuracy from its truncated moment sequence, we are able to quickly compute quantiles and $p$-values, which in turn enables accurate and adaptive hypothesis tests based on said generalized Greenwood statistics. As a consequence, we were in a position to empirically verify the gains in power in two such goodness-of-fit settings, namely one- and two-sample tests, compared to conventional non-parametric test statistics widely used for these tasks.

Appendix

A Computing Taylor coefficients

Proposition 7 (Taylor coefficients). Let $f_k^{p,w}$ be the density (with respect to Lebesgue measure) of $\|S_k\|_{p,w}^p$. Then on $[1/2, 1]$ we have
$$f_k^{p,w}(x) = \sum_{j=k-2}^{\infty} c_j^w (1 - x)^j, \tag{59}$$
where $c_r^w$ can be computed in $O\bigl(\frac{r}{p}\log\frac{r}{p}\log k + [r \log r]^2\bigr)$ time.

Proof.
We recall from (27) that $\mu_m = \mathbb{E}\big(\|S_k\|^p_{p,w}\big)^m$ can be written as
\[ \mu_m = \frac{1}{\binom{pm+k-1}{k-1}} \sum_{\eta \in D_{m,k}} \frac{\binom{m}{\eta_1,\ldots,\eta_k}}{\binom{pm}{p\eta_1,\ldots,p\eta_k}} \prod_{j=1}^{k} w_j^{\eta_j} = \frac{s^w_m}{\binom{pm+k-1}{k-1}}, \tag{60} \]
where $s^w_m = \sum_{j=0}^{\infty} \sigma^w_j(m)\, m^{-j}$, with $\sigma^w_j(m)$ remaining constant (equal to $\sigma^w_j$) past some threshold $m^w_j$. By Lemma 1, $f^{p,w}_k$ is analytic on $[1/k^{p-1}, 1)$, so
\[ \mu_m = \int x^m f^{p,w}_k(x)\,\mathrm{d}x = \sum_{j=0}^{\infty} c^w_j \int x^m (1-x)^j\,\mathrm{d}x + O\big(e^{-m}\big) = \sum_{j=0}^{\infty} c^w_j \left[(m+j+1)\binom{m+j}{j}\right]^{-1} + O\big(e^{-m}\big), \tag{61} \]
which suggests that, by matching coefficients in (60) and (61), we should be able to translate between $\sigma^w_j$ and $c^w_j$. For this to be helpful, we first need to understand $\sigma^w_j$:

Lemma 2 ($\sigma^w_j$ recursion). Defining $b^j_r = [m^{-r}]\,\binom{m}{j}/\binom{pm}{pj}$ and employing notation as in (3), we have
\[ \sigma^w_r = \sum_{j=0}^{r'} \Big( \sigma^{(w_k)}_j \cdot \sigma^{w_{-k}}_{r-j} + \mathbf{1}_{\{w_k = 1\}}\, s^{w_{-k}}_j \cdot b^j_r \Big), \tag{62} \]
with initial condition $\sigma^{(w_1,w_2)}_r = \sum_{j=0}^{r'} b^j_r \big( \mathbf{1}_{\{w_1=1\}} w_2^j + \mathbf{1}_{\{w_2=1\}} w_1^j \big)$, where $r' = \lfloor r/(p-1) \rfloor$. In particular, we can compute $\sigma^w_r$ in $O\big(r'\log r'\log k + [r\log r]^2\big)$ time.
Proof of Lemma 2. Slightly abusing notation, we have
\[\begin{aligned}
\sigma^w_r = [m^{-r}]\, s^w_m &= [m^{-r}] \sum_{\eta \in D_{m,k}} \frac{\binom{m}{\eta_1,\ldots,\eta_k}}{\binom{pm}{p\eta_1,\ldots,p\eta_k}} \prod_{j=1}^{k} w_j^{\eta_j} \\
&= [m^{-r}] \sum_{\omega=0}^{m} \frac{\binom{m}{\omega}}{\binom{pm}{p\omega}}\, w_k^{\omega} \sum_{\eta \in D_{m-\omega,k-1}} \frac{\binom{m-\omega}{\eta_1,\ldots,\eta_{k-1}}}{\binom{p(m-\omega)}{p\eta_1,\ldots,p\eta_{k-1}}} \prod_{j=1}^{k-1} w_j^{\eta_j}
= [m^{-r}] \sum_{\omega=0}^{m} \frac{\binom{m}{\omega}}{\binom{pm}{p\omega}}\, w_k^{\omega} \cdot s^{w_{-k}}_{m-\omega} \\
&= [m^{-r}] \sum_{\omega=0}^{r'} \frac{\binom{m}{\omega}}{\binom{pm}{p\omega}}\, w_k^{\omega} \cdot s^{w_{-k}}_{m-\omega} + [m^{-r}] \sum_{\omega=0}^{r'} \frac{\binom{m}{\omega}}{\binom{pm}{p\omega}}\, w_k^{m-\omega} \cdot s^{w_{-k}}_{\omega} \\
&= \sum_{j=0}^{r'} \sigma^{(w_k)}_j \cdot \sigma^{w_{-k}}_{r-j} + \mathbf{1}_{\{w_k=1\}} \sum_{j=0}^{r'} s^{w_{-k}}_j \cdot b^j_r
= \sum_{j=0}^{r'} \Big( \sigma^{(w_k)}_j \cdot \sigma^{w_{-k}}_{r-j} + \mathbf{1}_{\{w_k=1\}}\, s^{w_{-k}}_j \cdot b^j_r \Big),
\end{aligned}\]
as desired; here the fourth equality uses that $\binom{m}{\omega}/\binom{pm}{p\omega} = O\big(m^{-(p-1)\omega}\big)$, so that only $\omega \leq r'$ or $\omega \geq m - r'$ contribute at order $m^{-r}$, and terms of the latter kind carry a factor $w_k^{m-\omega}$, which is exponentially small unless $w_k = 1$. To see that (62) can be computed in $O\big(r'\log r'\log k + [r\log r]^2\big)$ time, we notice that calculation of $s^{w_{-k}}_r$ is $O(r'\log r'\log k)$ by the same reasoning as in Proposition 2, and that $b^j_r$, written as
\[ b^j_r = [m^{-r}]\, \frac{\binom{m}{j}}{\binom{pm}{pj}} = \frac{(pj)!}{p^{pj}\, j!}\, [m^{-r}]\; m^{-(p-1)j} \prod_{\substack{s=1 \\ p \nmid s}}^{pj-1} R\Big(\frac{s}{pm}\Big), \tag{63} \]
where $R(x) = \sum_{i=0}^{\infty} x^i$, is again a convolution of $(p-1)\,r' = r$ polynomials and hence computable in $O\big([r\log r]^2\big)$ time.

With a proper understanding of $\sigma^w_j$ at hand from Lemma 2, we may rewrite (60) as
\[ \mu_m = \sum_{j=0}^{\infty} \Bigg( \sum_{\omega=0}^{j} a^k_{\omega} \cdot \sigma^w_{j-\omega} \Bigg)\, m^{-j} + O\big(e^{-m}\big), \tag{64} \]
where $a^k_{\omega} = [m^{-\omega}]\, \binom{pm+k-1}{k-1}^{-1}$. Similarly, expanding (61) yields
\[ \mu_m = \sum_{j=0}^{\infty} \Bigg( \sum_{\omega=0}^{j-1} d^{\omega}_{j} \cdot c^w_{\omega} \Bigg)\, m^{-j} + O\big(e^{-m}\big), \tag{65} \]
where $d^{\omega}_{j} = [m^{-j}]\, \big[(m+\omega+1)\binom{m+\omega}{\omega}\big]^{-1}$. Consequently, matching the coefficients of $m^{-(r+1)}$ in (64) and (65) allows us to solve for $c^w_r$:
\[ c^w_r = \frac{1}{r!} \Bigg( \sum_{j=k-1}^{r+1} a^k_j \cdot \sigma^w_{r+1-j} \;-\; \sum_{j=k-2}^{r-1} d^{j}_{r+1}\, c^w_j \Bigg), \tag{66} \]
where in the choice of summation indices we have used the fact that $a^k_j = 0$ for $j \in \{0,\ldots,k-2\}$ and $c^w_j = 0$ for $j \in \{0,\ldots,k-3\}$ by Proposition 4. Now $\{d^{k-2}_{r+1}, \ldots, d^{r-1}_{r+1}\}$ can be found in $O\big(r(\log r)^2\big)$ time, and given $a$, $b$ and $d$, the recursion is solved in $O(r^2)$ steps, amounting to a total complexity of
\[ O\big(r(\log r)^2 + r^2 + r'\log r'\log k + [r\log r]^2\big) = O\big(r'\log r'\log k + [r\log r]^2\big). \]

B Alternative scaling limits
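The scaling limits studied in this appendix concern $\|S_{n,k}\|^p_{p,w}$ under the uniform law on divisions of $\{1,\ldots,n\}$ into $k$ blocks. Before turning to the proofs, the following minimal Python sketch (our illustration, not part of the paper's mochis package; all function names are ours) samples such divisions by stars and bars and evaluates the statistic, which is handy for Monte Carlo spot checks of the limit theorems below:

```python
import random
from math import comb

def sample_division(n, k, rng):
    """Draw a uniform weak composition of n into k ordered nonnegative parts:
    choosing k-1 of the n+k-1 divider slots uniformly (stars and bars) makes
    every division equally likely."""
    dividers = sorted(rng.sample(range(n + k - 1), k - 1))
    ext = [-1] + dividers + [n + k - 1]
    return [ext[i + 1] - ext[i] - 1 for i in range(k)]

def greenwood_statistic(parts, p=2, w=None):
    """Compute ||S||_{p,w}^p = sum_j w_j * S_j^p for a division S (w defaults to 1)."""
    if w is None:
        w = [1.0] * len(parts)
    return sum(wj * s ** p for wj, s in zip(w, parts))

rng = random.Random(7)
n, k, p = 200, 40, 2
draws = [greenwood_statistic(sample_division(n, k, rng), p) for _ in range(2000)]
mc_mean = sum(draws) / len(draws)

# Exact mean via the marginal moment formula (69): E||S||_p^p = k * E[S_1^p].
exact = k * sum(j ** p * comb(n - j + k - 2, k - 2)
                for j in range(n + 1)) / comb(n + k - 1, k - 1)
```

With the seed fixed, the Monte Carlo mean lands within a few percent of the exact value, a quick consistency check between the sampler and the moment formula.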
Proof of Proposition 5. We will show that the moments of $Z_{n,k,p,w}$ converge to the respective limiting moments of a $N(0,1)$ variable, which together with Theorem 30.2 of Billingsley (1995) implies the desired result. Hence we investigate
\[\begin{aligned}
\mathbb{E}\big(Z^m_{n,k,p,w}\big) &= \sigma^{-m/2}_{n,k,p,w}\, \mathbb{E}\big( \|S_{n,k}\|^p_{p,w} - \mu_{n,k,p,w} \big)^m
= \sigma^{-m/2}_{n,k,p,w}\, \mathbb{E}\Bigg( \sum_{j=1}^{k} w_j \underbrace{\big[ (S_{n,k}(j))^p - \mu_{n,k,p} \big]}_{=:\,X_j} \Bigg)^{m} \\
&= \sigma^{-m/2}_{n,k,p,w} \sum_{t=1}^{m} \sum_{a \in D^{\leq}_{m,t}} C_a \sum_{\substack{i_1,\ldots,i_t \\ \text{distinct}}} \mathbb{E}\big[ (w_{i_1} X_{i_1})^{a_1} \cdots (w_{i_t} X_{i_t})^{a_t} \big],
\end{aligned}\tag{67}\]
where $C_a$ is a combinatorial factor to be determined later, and $D^{\leq}_{m,t}$ is the set of ordered (strong) compositions of $m$ into $t$ non-decreasing parts (which is in bijection with the set of partitions of $m$ into $t$ parts). We will argue that for case 1 of (37) all summands in (67) vanish in the limit $n \to \infty$, while only the $t = m/2$, $a_1 = a_2 = \cdots = a_t = 2$ term survives in the $k = \alpha n$ regime. We begin by determining the asymptotics of $\mu_{n,k,p}$:

Lemma 3 (Asymptotics of $\mu_{n,k,p}$). We have
\[ \mu_{n,k,p} = \frac{n \cdot q_p(n,k)}{\langle k \rangle_p} = O\Bigg( \frac{n^p}{k^p} \Bigg), \tag{68} \]
where $q_p(n,k) = O(n^{p-1})$ is a polynomial in $n$ and $k$, and $\langle k \rangle_p = k \cdot (k+1) \cdots (k+p-1)$ is the rising factorial.

Proof of Lemma 3. Let us first rewrite $\mu_{n,k,p}$ into a form more amenable to extracting asymptotics:
\[ \mu_{n,k,p} = k^{-1}\, \mathbb{E}\, \|S_{n,k}\|^p_{p,\mathbf{1}} = k^{-1} \sum_{j=1}^{k} \mathbb{E}\big( S_{n,k}(j) \big)^p = \mathbb{E}\big( S_{n,k}(1) \big)^p = \frac{1}{\binom{n+k-1}{k-1}} \sum_{j=0}^{n} j^p \binom{n-j+k-2}{k-2}, \tag{69} \]
whose RHS sum, we claim, can in general form be expressed as
\[ \sum_{j=0}^{n} j^p \binom{n-j+\ell}{\ell} = \sum_{j=\ell}^{n+\ell} (n-j+\ell)^p \binom{j}{\ell} = \binom{n+\ell-1}{\ell}\, \frac{(n+\ell)(n+\ell+1)}{\langle \ell+1 \rangle_{p+1}}\, q'_p(n,\ell), \tag{70} \]
for some polynomial $q'_p(n,\ell) = O\big(n^{p-1}\big)$ if $p \geq 1$, and $q'_0(n,\ell) = n^{-1}$. We prove (70) by induction on $p$.

Base case:
When $p = 0$, (70) simply becomes the hockey stick identity:
\[ \sum_{j=\ell}^{n+\ell} \binom{j}{\ell} = \binom{n+\ell+1}{\ell+1} = \binom{n+\ell-1}{\ell}\, \frac{(n+\ell)(n+\ell+1)}{\ell+1} \cdot n^{-1} = \binom{n+\ell-1}{\ell}\, \frac{(n+\ell)(n+\ell+1)}{\langle \ell+1 \rangle_{1}}\, q'_0(n,\ell). \tag{71} \]

Inductive step:
Using the binomial recursion $\binom{n+1}{k+1} = \binom{n}{k} + \binom{n}{k+1}$, we find for $p \geq 1$:
\[\begin{aligned}
\sum_{j=\ell}^{n+\ell} (n-j+\ell)^p \binom{j}{\ell} &= \sum_{j=\ell}^{n+\ell} (n-j+\ell)^p \binom{j+1}{\ell+1} - \sum_{j=\ell}^{n+\ell} (n-j+\ell)^p \binom{j}{\ell+1} \\
&= \sum_{j=\ell+1}^{n+\ell+1} (n+\ell+1-j)^p \binom{j}{\ell+1} - \sum_{j=\ell+1}^{n+\ell} \Bigg[ \sum_{i=0}^{p} \binom{p}{i} (n+\ell+1-j)^i (-1)^{p-i} \Bigg] \binom{j}{\ell+1} \\
&= \sum_{i=1}^{p-1} \binom{p}{i} (-1)^{p-i-1} \sum_{j=\ell+1}^{n+\ell+1} (n+\ell+1-j)^i \binom{j}{\ell+1} + (-1)^{p-1} \Bigg[ \binom{n+\ell+2}{\ell+2} - \binom{n+\ell+1}{\ell+1} \Bigg] \\
&= \sum_{i=1}^{p-1} \binom{p}{i} (-1)^{p-i-1} \binom{n+\ell}{\ell+1}\, \frac{(n+\ell+1)(n+\ell+2)}{\langle \ell+2 \rangle_{i+1}}\, q'_i(n,\ell+1) + (-1)^{p-1} \Bigg[ \binom{n+\ell+2}{\ell+2} - \binom{n+\ell+1}{\ell+1} \Bigg] \\
&= \binom{n+\ell-1}{\ell}\, \frac{(n+\ell)(n+\ell+1)}{\langle \ell+1 \rangle_{p+1}} \Bigg[ (-1)^{p-1} \langle \ell+3 \rangle_{p-1} + \sum_{i=1}^{p-1} \binom{p}{i} (-1)^{p-i-1} (n+\ell+2)\, \langle \ell+3+i \rangle_{p-1-i}\, q'_i(n,\ell+1) \Bigg] \\
&=: \binom{n+\ell-1}{\ell}\, \frac{(n+\ell)(n+\ell+1)}{\langle \ell+1 \rangle_{p+1}}\, q'_p(n,\ell),
\end{aligned}\tag{72}\]
where the second equality follows from the binomial expansion of $(n-j+\ell)^p = (n+\ell+1-j-1)^p$, the third from collecting the $i = p$ and $i = 0$ terms, and the fourth from applying the inductive hypothesis.
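The identities above are easy to sanity-check numerically. The following short Python sketch (ours, not part of the paper's implementation) verifies the marginal moment formula (69) against brute-force enumeration of all divisions for small parameter values, and the hockey stick identity (71) driving the base case:

```python
from itertools import combinations
from math import comb

def compositions(n, k):
    """All weak compositions of n into k ordered nonnegative parts, via
    stars and bars: each choice of k-1 divider slots among n+k-1 gives one."""
    for dividers in combinations(range(n + k - 1), k - 1):
        ext = (-1,) + dividers + (n + k - 1,)
        yield tuple(ext[i + 1] - ext[i] - 1 for i in range(k))

def moment_enum(n, k, p):
    """E[(S_{n,k}(1))^p] under the uniform law on divisions, by full enumeration."""
    comps = list(compositions(n, k))
    return sum(c[0] ** p for c in comps) / len(comps)

def moment_formula(n, k, p):
    """E[(S_{n,k}(1))^p] via the RHS of (69)."""
    return sum(j ** p * comb(n - j + k - 2, k - 2)
               for j in range(n + 1)) / comb(n + k - 1, k - 1)

def hockey_stick(n, l):
    """The base-case identity (71): sum_{j=l}^{n+l} C(j,l) == C(n+l+1, l+1)."""
    return sum(comb(j, l) for j in range(l, n + l + 1)) == comb(n + l + 1, l + 1)

assert abs(moment_enum(6, 4, 3) - moment_formula(6, 4, 3)) < 1e-9
assert abs(moment_formula(6, 4, 1) - 6 / 4) < 1e-12   # E[S_1] = n/k, as expected
assert all(hockey_stick(nn, ll) for nn in range(1, 8) for ll in range(5))
```

Enumeration is exponential in n and k, of course; it is meant only as a check of the closed forms, not as a computational route to the moments.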
Note that $q'_p(n,\ell)$ is a sum of polynomials of order $O(n^{p-1})$, and hence is itself a polynomial of order $O(n^{p-1})$. To finish the proof of the lemma, it suffices to observe from (69) and (70) with $\ell = k-2$ that
\[ \mu_{n,k,p} = \frac{\binom{n+k-3}{k-2}}{\binom{n+k-1}{k-1}}\, \frac{(n+k-2)(n+k-1)}{\langle k-1 \rangle_{p+1}}\, q'_p(n,k-2) = \frac{n \cdot q_p(n,k)}{\langle k \rangle_p}, \tag{73} \]
where $q_p(x,y) = q'_p(x,y-2)$.

With the dependence of $\mu_{n,k,p}$ on $n$ and $k$ in hand, we can elucidate the asymptotics of the summands in (67):

Lemma 4 (Asymptotics of summands). For $\ell \leq t \leq m$ and $a \in D^{\leq}_{m,t}$ such that $a_1 = \cdots = a_\ell = 1$ and $a_j \geq 2$ for all $j \in \{\ell+1,\ldots,t\}$, we have
\[ \sigma^{-m/2}_{n,k,p,w} \sum_{\substack{i_1,\ldots,i_t \\ \text{distinct}}} \mathbb{E}\Bigg[ \prod_{j=1}^{\ell} w_{i_j} X_{i_j} \prod_{j=\ell+1}^{t} (w_{i_j} X_{i_j})^{a_j} \Bigg] = O\Bigg[ \Big(\frac{k}{n}\Big)^{p(m-t)} k^{\,t-m/2}\, n^{-\ell} \Bigg]. \tag{74} \]

Proof.
Three cumbersome applications of (70) to the computation of the variance show that $\sigma_{n,k,p,w} = O\big( (n/k)^{2p}\, k \big)$, i.e. $\sigma^{1/2}_{n,k,p,w} = O\big( (n/k)^p \sqrt{k} \big)$, which takes care of the denominator on the LHS of (74). To treat the numerator, we use exchangeability of the $X_i$ as well as the boundedness of the weights $w_j$ to upper bound the magnitude of the sum on the LHS by the magnitude of
\[ k^t\, W_{\max}^m \cdot \mathbb{E}\Bigg[ \prod_{j=1}^{\ell} X_j \prod_{j=\ell+1}^{t} X_j^{a_j} \Bigg] = k^t\, W_{\max}^m\, \mathbb{E}\Bigg[ \mathbb{E}\big[ X_1 \mid X_2, \ldots, X_t \big] \prod_{j=2}^{\ell} X_j \prod_{j=\ell+1}^{t} X_j^{a_j} \Bigg], \tag{75} \]
where $W_{\max} = \max_j |w_j|$. For fixed $n$ and growing $k$, we expect the bin occupations to decorrelate, and hence the conditional expectation on the RHS of (75) to vanish. Indeed, referring once more to (70) and writing $\bar{X}_t := \sum_{j=2}^{t} S_{n,k}(j)$, we have
\[ \mathbb{E}\big[ X_1 \mid X_2, \ldots, X_t \big] = \mathbb{E}\big[ X_1 \mid \bar{X}_t \big] = \mu_{n-\bar{X}_t,\,k-t+1,\,p} - \mu_{n,k,p} = \frac{(n-\bar{X}_t)\, q_p(n-\bar{X}_t,\, k-t+1)}{\langle k-t+1 \rangle_p} - \frac{n\, q_p(n,k)}{\langle k \rangle_p} = O\Bigg( \frac{n^{p-1}}{k^p} \Bigg), \tag{76} \]
as long as $\bar{X}_t = o(n)$. Whence the magnitude of (75) is bounded above by
\[\begin{aligned}
&k^t\, W_{\max}^m\, \mathbb{E}\Bigg[ \mathbf{1}_{\{\bar{X}_t = o(n)\}}\, O\Big(\frac{n^{p-1}}{k^p}\Big) \prod_{j=2}^{\ell} X_j \prod_{j=\ell+1}^{t} X_j^{a_j} + \mathbf{1}_{\{\bar{X}_t \neq o(n)\}} \prod_{j=1}^{\ell} X_j \prod_{j=\ell+1}^{t} X_j^{a_j} \Bigg] \\
&\qquad \leq k^t\, W_{\max}^m\, O\Big(\frac{n^{p-1}}{k^p}\Big)\, \underbrace{\mathbb{E}\Bigg[ \prod_{j=2}^{\ell} X_j \prod_{j=\ell+1}^{t} X_j^{a_j} \Bigg]}_{A} + 2\, n^{pm}\, k^t\, W_{\max}^m\, \mathbb{P}\big[ \bar{X}_t \neq o(n) \big].
\end{aligned}\tag{77}\]
Repeating the same reasoning used in (75) and (77) $\ell$ times on $A$ yields an upper bound for the magnitude of (74) of
\[ k^t\, W_{\max}^m\, O\Bigg( \frac{n^{\ell(p-1)}}{k^{\ell p}} \Bigg)\, \underbrace{\mathbb{E}\Bigg[ \prod_{j=\ell+1}^{t} X_j^{a_j} \Bigg]}_{B} + 2^{\ell}\, k^t\, n^{pm}\, \underbrace{\mathbb{P}\big[ S_{n,k}(1) \neq o(n) \big]}_{C}. \tag{78} \]
We will argue below in Lemma 5 that individual bin sizes concentrate around $n/k$, which implies that $C$ is exponentially small in $k$, while $B$ scales like $(n/k)^{p(t-\ell)}$. Combining these with the $O\big(\sqrt{k}\,(n/k)^p\big)$ scaling of $\sigma^{1/2}_{n,k,p,w}$ gives the final upper bound
\[ O\Bigg[ k^t\, k^{-m/2}\, \Big(\frac{k}{n}\Big)^{pm}\, \Big(\frac{k}{n}\Big)^{\ell - p\ell} k^{-\ell}\, \Big(\frac{n}{k}\Big)^{p(t-\ell)} \Bigg], \tag{79} \]
which simplifies to (74) as desired.

To substantiate the claims about $B$ and $C$ in the proof of Lemma 4, we establish the following lemma:
Lemma 5 (Large deviations for bin sizes). If $k = \Theta(n^\beta)$ for some $0 < \beta \leq 1$, then for all $0 < \varepsilon < \beta$ there exists $C_\varepsilon > 0$ so that
\[ \mathbb{P}\big[ S_{n,k}(j) \geq n^{1-\varepsilon} \big] < C_\varepsilon\, e^{-n^{\beta-\varepsilon}}. \tag{80} \]

Proof.
We can compute explicitly
\[ \mathbb{P}\big[ S_{n,k}(1) \geq n^{1-\varepsilon} \big] = \frac{\binom{n - \lceil n^{1-\varepsilon} \rceil + k - 1}{k-1}}{\binom{n+k-1}{k-1}} = \prod_{j = n - \lceil n^{1-\varepsilon} \rceil + 1}^{n} \Bigg( 1 + \frac{k-1}{j} \Bigg)^{-1} \leq C \exp\Bigg[ -\lceil n^{1-\varepsilon} \rceil \log\Big( 1 + \frac{k-1}{n} \Big) \Bigg] \leq C \exp\Big[ -n^{\beta-\varepsilon} + O\big( n^{\beta-2\varepsilon} + n^{2\beta-1-\varepsilon} \big) \Big], \tag{81} \]
which is dominated by the $n^{\beta-\varepsilon}$ term as long as $\varepsilon < \beta$.

At last, we are now in shape to establish Proposition 5 from (74), for we see that:

1. if $t < m/2$, the RHS is of $O\big[ (k/n)^{p(m-t)}\, k^{t-m/2} \big]$ and hence vanishes as $n \to \infty$;

2. if $t > m/2$, then since $\ell \geq 2t - m$ (because $\ell + 2(t-\ell) \leq m$), the RHS is of $O\big[ k^{t-m/2}\, n^{-(2t-m)} \big]$ and vanishes as $n \to \infty$;

3. if $t = m/2$ and $k = o(n)$, we obtain asymptotics of $O\big[ (k/n)^{p m/2} \big]$, which vanishes once again as $n \to \infty$.

Hence, when $k = \Theta(n)$, the only terms in (67) contributing to the limiting moments are those for which $t = m/2$ and $a_1 = \cdots = a_t = 2$, with $C_{2,\ldots,2} = (m-1)!!$ (in particular, $m$ must be even). That is,
\[ \mathbb{E}\big( Z^m_{n,\alpha n,p,w} \big) = (m-1)!! \cdot \mathbb{E} \prod_{j=1}^{m/2} \Bigg[ \frac{\big( (S_{n,k}(j))^p - \mu_{n,k,p} \big)^2}{k^{-1}\, \sigma_{n,k,p,w}} \Bigg], \tag{82} \]
which by decorrelation computations very similar to (76) is readily seen to converge to $(m-1)!!$, the $m$-th moment of a standard normal variable, as desired.

For the remaining scaling limit, in which $n$ stays fixed while $k \to \infty$, we compute the Laplace transform of $\|S_{n,k}\|^p_{p,w}$ directly:
\[\begin{aligned}
\mathbb{E}\, e^{-t \|S_{n,k}\|^p_{p,w}} &= \mathbb{P}\big( \|S_{n,k}\|_\infty \leq 1 \big)\, \mathbb{E}\Big[ e^{-t \|S_{n,k}\|^p_{p,w}} \,\Big|\, \|S_{n,k}\|_\infty \leq 1 \Big] + \mathbb{P}\big( \|S_{n,k}\|_\infty > 1 \big)\, \mathbb{E}\Big[ e^{-t \|S_{n,k}\|^p_{p,w}} \,\Big|\, \|S_{n,k}\|_\infty > 1 \Big] \\
&= \frac{1}{\binom{k}{n}} \sum_{\substack{S \subset [k] \\ |S| = n}} e^{-t \sum_{j \in S} w_j} + O\big( k^{-1} \big) \\
&= \frac{1}{k(k-1)\cdots(k-n+1)} \sum_{s_1 = 1}^{k} \sum_{\substack{s_2 = 1 \\ s_2 \neq s_1}}^{k} \cdots \sum_{\substack{s_n = 1 \\ s_n \notin \{s_1,\ldots,s_{n-1}\}}}^{k} e^{-t \sum_{j=1}^{n} w\left( \frac{s_j}{k} \right)} + O\big( k^{-1} \big),
\end{aligned}\tag{83}\]
which, since $w$ is bounded and continuous almost everywhere (and hence Riemann integrable), converges as $k \to \infty$ to
\[ \Bigg( \int_0^1 e^{-t w(x)}\, \mathrm{d}x \Bigg)^{n} = \mathbb{E}\, e^{-t \sum_{j=1}^{n} w(U_j)}, \tag{84} \]
as desired.

Acknowledgments
We thank Ben Wormleighton for acquainting the authors with Ehrhart's work, and Jonathan Fischer for helpful comments on software implementation. This research is supported in part by NIH grant R35-GM134922. YSS is a Chan Zuckerberg Biohub investigator.

References
Abramowitz, M. and Stegun, I. A. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, volume 55. Courier Corporation, 1965.
Akhiezer, N. I. and Kemmer, N. The Classical Moment Problem: and Some Related Questions in Analysis, volume 5. Oliver & Boyd, Edinburgh, 1965.
Bernstein, S. 1912. Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités. Communications de la Société Mathématique, (1) 1–2.
Billingsley, P. Probability and Measure. Wiley Series in Probability and Statistics. Wiley, 1995.
Björck, Å. and Pereyra, V. 1970. Solution of Vandermonde systems of equations. Mathematics of Computation, (112) 893–903.
Burrows, P. M. 1979. Selected percentage points of Greenwood's statistics. Journal of the Royal Statistical Society. Series A (General), (2) 256–258.
Bustamante, J., Quesada, J. M., and Martínez-Cruz, R. 2012. Best one-sided L1 approximation to the Heaviside and sign functions. Journal of Approximation Theory, (6) 791–802.
Cramér, H. 1928. On the composition of elementary errors: First paper: Mathematical deductions. Scandinavian Actuarial Journal, (1) 13–74.
Currie, I. D. 1981. Further percentage points of Greenwood's statistic. Journal of the Royal Statistical Society. Series A (General), (3) 360–363.
Darling, D. 1953. On a class of problems related to the random division of an interval. The Annals of Mathematical Statistics, (2) 239–253.
Dixon, W. J. 1940. A criterion for testing the hypothesis that two samples are from the same population. The Annals of Mathematical Statistics, (2) 199–204.
Ehrhart, E. 1967. Sur un problème de géométrie diophantienne linéaire II. J. reine angew. Math., (25) C49.
Feller, W. An Introduction to Probability Theory and its Applications, volume 2. John Wiley & Sons, 2008.
Gardner, A. 1952. Greenwood's "problem of intervals": An exact solution for n = 3. Journal of the Royal Statistical Society: Series B (Methodological), (1) 135–139.
Greenwood, M. 1946. The statistical study of infectious diseases. Journal of the Royal Statistical Society, (2) 85–110.
Holst, L. 1979. A unified approach to limit theorems for urn models. Journal of Applied Probability, (1) 154–162.
Holst, L. and Rao, J. 1980. Asymptotic theory for some families of two-sample nonparametric statistics. Sankhyā: The Indian Journal of Statistics, Series A.
Kallenberg, O. Foundations of Modern Probability. Springer Science & Business Media, 2006.
Kolmogorov, A. 1933. Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn.
Mann, H. B. and Whitney, D. R. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, (1) 50–60.
Mead, L. R. and Papanicolaou, N. 1984. Maximum entropy in the problem of moments. Journal of Mathematical Physics, (8) 2404–2417.
Moran, P. 1947. The random division of an interval. Supplement to the Journal of the Royal Statistical Society, (1) 92–98.
Moran, P. 1951. The random division of an interval–Part II. Journal of the Royal Statistical Society. Series B (Methodological), (1) 147–150.
Moran, P. 1953. The random division of an interval–Part III. Journal of the Royal Statistical Society. Series B (Methodological), (1) 77–80.
Palamara, P., Terhorst, J., Song, Y., and Price, A. 2018. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nature Genetics, (9) 1311–1317.
Riley, M. C., Clare, A., and King, R. D. 2007. Locational distribution of gene functional classes in Arabidopsis thaliana. BMC Bioinformatics, (1) 112.
Schechtman, G. and Zinn, J. Concentration on the $\ell^n_p$ ball. In Geometric Aspects of Functional Analysis, pages 245–256. Springer, 2000.
Sethuraman, J. and Rao, J. Pitman efficiencies of tests based on spacings. In Puri, M. L., editor, Nonparametric Techniques in Statistical Inference, pages 405–416. Cambridge University Press, 1970.
Smirnov, N. 1948. Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, (2) 279–281.
Stephens, M. A. 1981. Further percentage points for Greenwood's statistic. Journal of the Royal Statistical Society. Series A (General), (3) 364–366.
Von Mises, R. Wahrscheinlichkeit, Statistik und Wahrheit: Einführung in d. neue Wahrscheinlichkeitslehre u. ihre Anwendung, volume 3. Springer-Verlag, 2013.
Weiss, L. 1956. A certain class of tests of fit. The Annals of Mathematical Statistics, (4) 1165–1170.
Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6) 80–83.