Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test
NNumerically more stable computation of the p-values for thetwo-sample Kolmogorov-Smirnov test
Thomas Viehmann ∗ MathInf Technical Report 2021-1, February 2021
Abstract
The two-sample Kolmogorov-Smirnov test is a widely used statistical test for detecting whethertwo samples are likely to come from the same distribution. Implementations typically recur on anarticle of Hodges from 1957. The advances in computation speed make it feasible to compute exactp-values for a much larger range of problem sizes, but these run into numerical stability problemsfrom floating point operations. We provide a simple transformation of the defining recurrence for thetwo-side two-sample KS test that avoids this.
The Kolmogorov-Smirnov two sample test (KS test) is perhaps the go-to statistical statistical test ofwhether two samples originate form the same distribution.To make things precise, we consider samples x , . . . , x m ∈ R and y , . . . , y n ∈ R drawn from two con-tinuous distribution functions F and G , respectively. We form the empirical distribution functions F m ( x ) := |{ x i : x i ≤ x }| /m and G n ( x ) = |{ y i : y i ≤ x }| /n . The KS test then tests the null hypothesis F = G .The test-statistic is computed from the empirical distributions as D = sup x | F m ( x ) − G n ( x ) | . Following the usual notation, we write the supremum even though in the cases we consider, it is actuallya maximum. Working with only the ranks, the test does not make assumptions on the distributionsthemselves. To operationalize the test, we need to compute p-values.Smirnov famously gave the asymptotic formula that if m, n → ∞ such that n/m → q ∈ R , P = P rob (cid:20)(cid:114) mnm + n D ≥ x (cid:21) → − K ( x ) = 2 ∞ (cid:88) k =1 ( − k − exp( − k x ) , where K is the cumulative distribution function of the Kolmogorov distribution. We use Hodges’ name P for the two-sided problem.However, as noted by Hodges, it is unclear how well they work in practice: After studying the problemnumerically for m = 12 , n = 13 , . . . , he writes “The Smirnov approximation is seen to be highlyinaccurate for values of m and n which are already large enough for direct computations to be arduous.”The amount of computation that is possible with ease has dramatically increased since the 1950s and soit is natural to revisit direct computation. ∗ MathInf GmbH, [email protected] a r X i v : . [ s t a t . C O ] F e b Direct computation of p-values
The key observation for the computation of p-values is that we can compute the distribution of ‘D‘ underthe null hypothesis in purely combinatorical terms. Under the assumptions, all values are distinct withprobability 1. We may order the joint sequence of the x i and y j . We ignore the indices and write an x in positions where a x i occurs, and y where y i occurs, obtaining a random sequence of m x es and n y s.Under the null hypothesis, all possible (cid:0) m + nm (cid:1) distinct sequences have equal probability.We may map these to paths P between (0 , and (1 , where for each x we move to the right by /m and for each y we move up by /n . Given a path P , the statistic D is the maximum sup ( x,y ) ∈ P | x − y | . .
000 0 .
067 0 .
133 0 .
200 0 .
267 0 .
333 0 .
400 0 .
467 0 .
533 0 .
600 0 .
667 0 .
733 0 .
800 0 .
867 0 .
933 1 . To compute the probability that D ≤ d we thus need to count the ( n, m ) -step paths such that all gridpoints are in the corridor | x − y | < d and devide this number A m,n by the total number of paths (cid:0) m + nn (cid:1) .The classical iteration ([2]) is A i,j = if | i/n − j/m | ≥ d, if | i/n − j/m | < d and ( i = 0 or j = 0) ,A i − ,j + A i,j − otherwise.As we are iterating over the nodes inside the corridor (because we will not compute the zero valuesexcept perhaps from rounding up to a fixed number of nodes), this is called the inside method . Hodgesnotes that the number of additions increases as n as problem sizes increases and that operands increaseexponentially. With today’s computational resources, we are only mildly concerned about the numberof operations. However, even in today’s software packages the size of operands translates into difficultieswith numerical accuracy. Taking SciPy [3] as an example, When the probability P is computed, theyfirst compute − P using the inside method (with appropriate stabilization) but it can happen that − P is larger than the smallest number smaller than representable by the floating point numbersused.We therefore propose to use a different scheme to compute P that avoids computing − P . We define C i,j = 1 − A i,j / (cid:18) i + ji (cid:19) as the proportion of paths from (0 , to the grid point ( i/n, j/m ) that do not stay inside the d -corridor.2t the top right corner, C ( m, n ) = P . For ( i, j ) inside the corridor we have C i,j = 1 − A i,j / (cid:18) i + ji (cid:19) = 1 − ( A i − ,j + A i,j − ) / (cid:18) i + ji (cid:19) = 1 − (cid:18) (1 − C i − ,j ) (cid:18) i + j − i − (cid:19) + (1 − C i,j − ) (cid:18) i + j − i (cid:19)(cid:19) / (cid:18) i + ji (cid:19) = 1 − (cid:18) (1 − C i − ,j ) ii + j + (1 − C i,j − ) ji + j (cid:19) = C i − ,j ) ii + j + C i,j − ) ji + j . This entails C i,j = if | i/n − j/m | ≥ d, if | i/n − j/m | < d and ( i = 0 or j = 0) ,C i − ,j ii + j + C i,j − ji + j otherwise,a structurally very simple recursion.Note that in the corridor, C i,j is a weighted average of its neighbours, so we may expect favourable errorpropagation. When the value decreases in the iteration, it typically is because one of the predecessor is or smaller than the other, i.e. through scaling rather than cancellation. We implemented the recursion on C in Python, Numba[1] and C++. The (naively implemented) com-putation is considerably slower (approximately 4x) than the path-counting algorithm implemented bySciPy (when both are either run in Python or through the numba compiler), likely at least in part dueto the additional multiplications and divisions in the retursion for C . It is possible to exploit GPU-typeparallelism using the scanning pattern also used by operations like cumulative sum.As noted by Hodges, it suffices to keep track of the values in a d -corridor along the diagonal, so anaive Python implementation (that also can be compiled to native code using Numba). There are otherpossibilities for speedup, not lest that due to the rotational symmetry outlined by Hodges, which we donot implement. Listing 1: Stabilized Inner Methodd e f compute_p2 ( n , m, d ) :s i z e = i n t ( 2 ∗m∗d+2)l a s t r o w , row = numpy . z e r o s ( ( 2 , s i z e ) , dtype=numpy . f l o a t 6 4 )l a s t _ s t a r t _ j = 0f o r i i n r a n g e ( n + 1 ) :s t a r t _ j = max( i n t (m ∗ ( i /n + d ) ) + 1− s i z e , 0 )l a s t r o w , row = row , l a s t r o wv a l = 0 . 0f o r j j i n r a n g e ( s i z e ) :j = j j + s t a r t _ j 3 i s t = i /n − j /mi f d i s t > d o r d i s t < −d :v a l = 1 . 0e l i f i == 0 o r j == 0 :v a l = 0 . 0e l i f j j + s t a r t _ j − l a s t _ s t a r t _ j >= s i z e :v a l = ( i + v a l ∗ j ) / ( i + j )e l s e :v a l = ( l a s t r o w [ j j + s t a r t _ j − l a s t _ s t a r t _ j ] ∗ i + v a l ∗ j ) / ( i + j )row [ j j ] = v a ljjmax = min ( s i z e , m + 1 − s t a r t _ j )l a s t _ s t a r t _ j = s t a r t _ jr e t u r n row [m − s t a r t _ j ] We have outlined an alternative iteration for computing p-values for the 2-sample KS test for the insidemethod that exibits better numerical stability. It would be desirable to also obtain an analogue for theoutside method.