On the asymptotics of Ajtai-Komlós-Tusnády statistics
Lídia Rejtő    Gábor Tusnády

Abstract
In our days there is a widespread analysis of Wasserstein distances between theoretical and empirical measures. One of the first investigations of the topic was given in the paper written by Ajtai, Komlós and Tusnády in 1984. Interestingly, all the neighboring questions posed by that paper have been settled already, while the original one remains open. In this paper we delineate the limit behavior of the original statistic with the help of computer simulations. At the same time we keep an eye on a theoretical grasp of the problem. Based on our computer simulations, our opinion is that the limit distribution is Gaussian.
Let ((X_i, Y_i), i = 1, …, n) be independent points uniformly distributed on the d-dimensional unit cube. The AKT statistic W_n is the minimum of

    ∑_{i=1}^n |X_i − Y_{π(i)}|    (1)

taken over all permutations π of (1, …, n). We call the statistic a Wasserstein distance. Using computer experiments we investigate the asymptotic behavior of W_n for d = 2 as n goes to infinity. Based on our computer simulations we think that, in the simulated range of n, the hazard rate function of the distribution of W_n − β log(n) is proportional to an appropriately scaled Gaussian distribution function with β = 0.….

Department of Applied Economics and Statistics, University of Delaware, Newark, Delaware, [email protected]
Alfréd Rényi Mathematical Institute of the Hungarian Academy of Sciences, Budapest, [email protected]
arXiv [stat.OT], Sep

History
The AKT statistics were introduced in [1]. In that paper the authors proved that there exist positive constants c_1, c_2 such that with probability going to 1

    c_1 √(n log n) < W_n < c_2 √(n log n)    (2)

holds true. The origin of this statistic goes back to the one-dimensional Shapiro–Wilk statistic for testing normality. In [2] the case d = 1 was investigated. The AKT statistics were generalized in [10], while a new proof was given for (2) in [12]. Talagrand next investigated the case d ≥ 3 in [11]. The results for d = 2 were sharpened in [7]. A broad introduction to transportation theory is presented in [14]. The first application of the AKT theorem was in bin packing [5]. A numerical investigation of the distribution of the statistic was given in [8]. Matchings on the Euclidean ball were investigated in [3]. The paper [13] investigates the rate of convergence in abstract settings. An investigation of different types of matchings was given in [6]. In [4] distributions supported on a manifold embedded in a Hilbert space were investigated.

Starting with n = 2 points we evolve n = 2^k, k = 1, 2, … point pairs in such a way that in each step the original points remain and simultaneously a new set of point pairs is added. We call the original kids "old" and the new ones "young". Having two types of kids, boys and girls, there are four possibilities for matching:
– old girls with old boys (OO_k),
– old girls with young boys (OY_k),
– young girls with old boys (YO_k),
– young girls with young boys (YY_k).
The matchings OO_k and YY_k are independent, and similarly OY_k and YO_k are independent, while the costs OO_k and OY_k are correlated. The cost is the Wasserstein distance W_n between the corresponding sets of points. For small values of n the approximation EW_n = β log(n + α) + γ is more accurate. Figure 1 shows the averages of the costs of 296 repetitions up to k = 11, together with the above theoretical function. Here α = 0.…, β = 0.…, γ = 0.….
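For moderate n the minimum in (1) can be computed exactly with an optimal-assignment solver; the paper's experiments use the Hungarian algorithm for this purpose. The sketch below is our own illustration with SciPy's Hungarian-type routine (the function name `akt_statistic` and the choice n = 256 are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def akt_statistic(X, Y):
    """Minimum of sum_i |X_i - Y_pi(i)| over all permutations pi, as in (1)."""
    cost = cdist(X, Y)                        # n x n Euclidean distance matrix
    rows, cols = linear_sum_assignment(cost)  # optimal assignment (Hungarian-type)
    return cost[rows, cols].sum()

rng = np.random.default_rng(0)
n = 256
X = rng.random((n, 2))   # n independent uniform points in the unit square
Y = rng.random((n, 2))
W = akt_statistic(X, Y)
```

Run repeatedly over doubling n, the same routine produces the kind of averages plotted in Figure 1.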
The densities of the normalized differences W_n − (β log(n + α) + γ) are given in Figure 2. Next generation costs OO_{k+1} are close to the averages (OO_k + OY_k + YO_k + YY_k)/4. The quadratic recursion

    OO_{k+1} = (OO_k + OY_k + YO_k + YY_k)/4 + V_{k+1}    (3)

with an iid noise V_{k+1} agrees with our conjecture, indicating that the quadratic recursion might be valid in two dimensions only.

In Euclidean space, for arbitrary points a, b, c, d,

    |(a + b) − (c + d)|^2 = |a − c|^2 + |a − d|^2 + |b − c|^2 + |b − d|^2 − |a − b|^2 − |c − d|^2    (4)

holds true. Similarly, in our case

    OO_{k+1} = a (OO_k + OY_k + YO_k + YY_k) − b (GG_k + BB_k) + V_{k+1}    (5)

holds true, where GG and BB are the girls–girls and boys–boys distances. For stationarity

    4a − 2b = 1    (6)

must hold. According to our computer experiences a = 0.…, b = 0.41, and the error term is 0.…. The distribution of the error term V seems to be double exponential.

The engine of the Hungarian algorithm is the system of shadow prices a(i), b(j) such that b(j) − a(i) ≤ c(i, j), where c(i, j) is the cost of the marriage of the pair (i, j). In Figures 3 and 4 we give a pictorial presentation of the prices for n = 1024 and n = 2048. Here we used toroidal distances, and the colour of a point of the unit square shows the price of the kid closest to the point, i.e. a(i) in case the closest kid is a girl and b(j) otherwise. The keys of the colours are given on the margin. Pixel statistics are given for the colours; the first number stands for wives, the second for husbands. E.g. in Figure 3 there are 32 yellow wives and 69 husbands. Dark blue represents the minimal value, which is zero, for the value of the last wife is always zero. The maximal value for n = 1024 is 0.02499 and it is 0.01654 for n = 2048; these are represented by yellow. The kids themselves are complementarily coloured in order to make them visible. Circles represent girls, squares represent boys. Marriages are shown by gray lines. To avoid the complications of the toroidal topology, girls having a husband on the other side of the unit square are marked with a St Andrew's cross, and their husband is not assigned. Interestingly, there is usually one lake and one hill.

The actual total costs of the marriages are W_{1024} = …, W_{2048} = …; thus the average distance in a marriage is around 0.02, causing the same or neighboring colours for wife and husband. We know that for a married couple (i, j) we have b(j) − a(i) = c(i, j). The pictures for n = 1024 are simpler than for n = 2048.

For the distribution of W_n the best approximation achieved by us is a distribution determined by its hazard rate function. Let us denote the tail distribution of a one-dimensional random variable by Q(t). The hazard rate r(t) is the derivative of − log Q(t). Customarily, hazard rates are defined for survival functions belonging to positive random variables, but the concept extends naturally to random variables taking values between −∞ and ∞ as well. Although W_n is positive, the best approximation comes from this broader class of distributions. The hazard rate is an arbitrary non-negative function having infinite integral on the real line. In our case it is proportional to Φ(t), the distribution function of a standard normal random variable. Its integral up to x equals xΦ(x) + φ(x), where φ(x) is the standard normal density function, and by definition this equals − log Q(x). Applying the relation r(x) = f(x)/Q(x), where f(x) stands for the density function, we get

    f(x) = Φ(x) exp(−xΦ(x) − φ(x)).    (7)

The derivative of log f(x) is φ(x)/Φ(x) − Φ(x), which is close to −x as x goes to −∞, goes to −1 as x goes to ∞, and is monotone decreasing. Thus for negative x the distribution resembles a double exponential distribution, while for positive x it turns out to be a single exponential. That is why f first increases sharply to its mode around 0.…. Let λ be an arbitrary positive number; then ρ_λ(t) = λΦ(t) is the hazard rate of the distribution with tail Q(t)^λ, and for any real µ and positive σ the linear transformation yields the density

    f(x | µ, σ, λ) = (λ/σ) Φ(y) exp(−λ(yΦ(y) + φ(y))), where y = (x − µ)/σ.    (8)

Let us denote by X_λ a random variable corresponding to the hazard rate ρ_λ. Let λ > 0, σ > 0 and µ be arbitrary real numbers; then f(x | µ, σ, λ) is the density of σX_λ + µ. Thus µ is the location parameter, σ the scale parameter and λ the shape parameter of the distribution. Figure 5 shows the density of the distribution of X_λ/λ^α with α = 0.4, for eleven different values of λ from 0.… on. For small λ one can see that the distribution of X_λ/λ^α has an intensively increasing first phase, after which it becomes similar to the exponential distribution; for large λ it is just the opposite.

The estimates of the parameters σ and λ are highly correlated; unfortunately our present sample size is not large enough to decide whether the value of the parameter λ differs significantly from 1 or not. If it does, then the question is the tendency of λ as n goes to infinity. With λ = 1 the tendency of σ is not clear: it increases from 0.08 up to 0.14 and perhaps remains bounded.

The dynamic (5) has a second condition (not presented here) in addition to (6); it ensures the boundedness of the variance. For that condition we have to know the covariances of the costs of the different matchings. At first it was an unsettled riddle for us what joint distribution of the six transportation costs plus the error term results in a Gaussian hazard rate with a stationary distribution. In settling this question the ideas of the paper [9] turned out to be useful: even the reference to the Euclidean relation (4) comes from the possibility that Euclidean relations might be generalized into transportation equations. In the next section we discuss a partial solution of the riddle.
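The family (8) and its integrated hazard are easy to handle numerically. The following sketch is our own illustration (SciPy's `norm`, `quad` and `brentq`; standardized case µ = 0, σ = 1, λ = 1): it checks that the density integrates to one, and draws samples by inverting the integrated hazard R(x) = xΦ(x) + φ(x), so that exp(−R(X)) should be uniform on (0, 1).

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import kstest, norm

def R(x, lam=1.0):
    """Integrated hazard of the Gaussian-hazard-rate law: lam*(x*Phi(x) + phi(x))."""
    return lam * (x * norm.cdf(x) + norm.pdf(x))

def density(x, lam=1.0):
    """Density f(x) = lam * Phi(x) * exp(-R(x)), the standardized case of (8)."""
    return lam * norm.cdf(x) * np.exp(-R(x, lam))

total, _ = quad(density, -np.inf, np.inf)   # should be 1

# Inversion sampling: solve R(X) = E with E standard exponential.
rng = np.random.default_rng(0)
e = rng.exponential(size=2000)
xs = np.array([brentq(lambda x: R(x) - ei, -40.0, 100.0) for ei in e])

u = np.exp(-R(xs))   # probability integral transform, ~ Uniform(0, 1)
ks = kstest(u, "uniform").statistic
```

Since Q(x) = exp(−R(x)) by construction, exp(−R(X)) = Q(X) is exactly uniform; the Kolmogorov statistic of u against the uniform law is the kind of goodness-of-fit check used below.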
For fixed k we generate four times 2^k points: two sets of girls and two sets of boys. Let us label them A, B, C, D. (A is the set of old girls, B is the set of young girls, C is the set of old boys, and D is the set of young boys.) There are six Wasserstein distances among them:

    W_1 = (A, C), W_2 = (A, D), W_3 = (B, C), W_4 = (B, D), W_5 = (A, B), W_6 = (C, D).

W_5 is independent of W_6, and their relation to the other distances is symmetric. In building a joint distribution for the six variables we use an autoregressive model: the conditional distribution of W_{i+1} given (W_1, …, W_i) is a distribution with Gaussian hazard rate with arbitrary σ_i, λ = 1 and

    µ_i = γ_i + ∑_{s=1}^{i} γ_{is} W_s.

The pairs (γ_i, µ_i) for k = 10 are (1.…, 0.…), …, (0.…, 0.…). For smaller k the tendency is similar: the leading terms γ_i follow the general logarithmic trend and the σ_i are practically the same. It goes without saying that they differ in i. The autoregressive coefficients are the following:

Table 1.
γ = 0.…  γ = 0.…  γ = 0.…  γ = −0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = 0.…  γ = −0.…

The model of W for k + 1 is similar: the gammas follow the original pattern found with linear regression and the standard deviations are around 0.….

We can test the model in the following way. Having a well parametrized 67-dimensional distribution, we can generate independent 67-dimensional random vectors as many times as many samples we have in the unit square; presently it is 817. Next we use the Hungarian algorithm to determine the Wasserstein distance of the two point systems: it is around 1440. Finally we generate new random samples and determine their distances. Interestingly, this is larger than 1440: its average is 1470 with deviance 10. It means that the structure of the unit-square sample is a bit tighter than our autoregressive scheme. But a simple trick settles the trouble: if we multiply all standard deviation parameters σ by 0.975, then the unit-square Wasserstein distance gets into the middle of the model Wassersteins.

If X is an arbitrary real random variable and Q(x) is its tail probability P(X > x), then the distribution of Q(X) is uniform in the interval (0, 1). The integrated hazard rate

    R(x) = ∫_{−∞}^{x} r(t) dt

equals − log(Q(x)), thus the distribution of R(X) is standard exponential and the distribution of exp(−R(X)) is again uniform in (0, 1). At the levels 0.05 and 0.01 the critical values are 0.…, 0.…. The value of the actual statistic is 1.00, hence this new Wasserstein statistic might have a Gaussian hazard rate. The corresponding statistics for points in the unit square are the following.

Table 2.
 k    No  Kolmogorov      R1      R2      R3      R4      R5
 0  4902      3.0578  0.6148  0.6036  0.5354  0.6367  0.3776
 1  4902      1.8130  0.7537  0.8154  0.7425  0.6165  0.5595
 2  4902      1.2752  0.6765  3.6339  0.6176  0.7559  0.3969
 3  4902      1.1030  0.4839  0.9168  0.5846  0.6992  5.7871
 4  4902      0.9454  0.9077  0.8388  0.7162  0.5211  0.7878
 5  4902      1.1828  0.5653  0.6978  0.7649  0.5019  0.5467
 6  4902      1.2998  0.5306  0.5985  0.7461  0.5844  0.9778
 7  4902      0.9876  0.8416  0.7448  0.5140  0.6265  0.7459
 8  4902      0.9327  0.7223  0.7323  0.6713  0.9064  0.5782
 9  4902      0.9697  0.8291  0.6565  0.9028  0.5890  0.7438
10  4902      1.0793  0.6249  0.6609  0.7140  0.5499  0.6290
11   817      0.6508  0.8411  0.7843  1.0557  1.1890  0.9961

Here k is the number of doublings in the dynamics and No is the sample size: we have 817 different runs for k = 11, and for k < 11 the sample size is 6 × 817 = 4902 (the Kolmogorov statistics are normalized with √No). For k = 10 we have a borderline result. Of course we are aware of the fact that for k = 0 the Wasserstein statistic has a different distribution, and perhaps the situation is similar for small k; this tendency is clearly shown by the Kolmogorov statistics. We generated the data matrix five times by the theoretical model; the columns headed by R1, …, R5 give the corresponding Kolmogorov statistics. As one can observe, the statistics have a rather long tail, arising perhaps from the interwoven dependence structure.

In paper [1] the saddle point method is used to prove inequality (2). Appropriate Lipschitzian functions are developed to prove the left hand side, while for the right hand side an ad-hoc matching algorithm due to Miklós Ajtai is used. The algorithm is based on the following elementary concept.
Definition
Let s be a positive integer, t = 2s, and A = (a_1, …, a_t) arbitrary reals. The median bits B = (b_1, …, b_t) of A are defined by b_k ∈ {0, 1},

    ∑_{k=1}^{t} b_k = s;  if b_i = 0 and b_j = 1, then a_i ≤ a_j.

In case there are no ties in
A, B is uniquely defined.

Let k be an arbitrary natural number, n = 4^k, and Z = (Z_i, i = 1, …, n) be a system of arbitrary two-dimensional points. In a step-by-step procedure we order a 2k-long bit sequence to each of the points of Z. As an initial step we construct a set A from the first coordinates of Z. Having the corresponding B, we divide Z into two subsets according to the bits in B, and for each of these subsets we form the median bits from the second coordinates.

From these initializations we proceed in the same manner. Using the bit sequences ordered to the points so far, first we develop the next bits from the subsets formed of the first coordinates having the same bit sequences generated so far, and next we turn to the second coordinates. Applying the procedure independently to two iid two-dimensional sets X and Y in the role of Z, the matching is supplied by the identical bit sequences. In the course of the algorithm the size of the subsets is halved step by step and finally reduced to one; thus for every possible 2k-long bit sequence we have exactly one point in X and one in Y, and this makes the matching. In a certain sense we construct a two-dimensional ordering merging the orders according to the two coordinates. Let b be an arbitrary 2k-long bit sequence and

    c = ∑_{i=1}^{k} 2^{k−i} b_{2i−1},  d = ∑_{i=1}^{k} 2^{k−i} b_{2i}.

The expected value of the first coordinate of the corresponding points of X or Y labelled by b is (2c + 1)/2^{k+1}, and for the second coordinate it is (2d + 1)/2^{k+1}. (Let us note here that in our notation X and Y are two sets of two-dimensional points and the coordinates are (x_{i1}, x_{i2}), (y_{i1}, y_{i2}).)

Let us observe that the marginal quantile transform applies for the algorithm: the matching itself does not depend on any monotone transform. (The marginal quantile transform is F(x_{i1}), i = 1, …, n for the first coordinates and G(x_{i2}), i = 1, …, n for the second coordinates, if the coordinates are independent, the distribution of the first coordinates is F and that of the second is G.) Utilizing the mean limits offered by the algorithm, we are deeply convinced that the limit distribution for any distribution has the form

    ∑_{i=1}^{n} c_i x_i^2,

where the x_i are independent standard normals and the c_i are appropriate coefficients. We guess that such random variables have monotone increasing hazard rates, and for certain coefficients the hazard rate function is bounded while for others it goes to infinity. If the multiplicity of the maximal coefficients is high, then the leading term of the distribution has a chi-square distribution with a large degree of freedom, which is approximately normal. Thus it is possible that the limit distributions for
– Wasserstein distances for the uniform distribution,
– Wasserstein distances for the standard normal distribution,
– Ajtai distances for the uniform distribution,
– Ajtai distances for the standard normal distribution
are all normal, but the speed of the convergence is considerably slow. Using the marginal quantile transformation one can ask what the relation is between the distances for the standard normal and uniform distributions. In the next table we give the basic statistics for Ajtai distances up to 4^8. Here D = 1 stands for the standard normal and D = 2 for the uniform distribution, and the correlations come from the marginal quantile transformation.
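The median-bit construction and the matching it induces can be sketched in a few lines. This is our own reading of the algorithm (the names `median_bits`, `ajtai_codes` and `ajtai_match` are ours); it assumes the number of points is a power of two and that there are no ties.

```python
import numpy as np

def median_bits(a):
    """Median bits of a = (a_1, ..., a_t), t = 2s: exactly s ones, and
    b_i = 0, b_j = 1 implies a_i <= a_j (the ones sit on the upper half)."""
    a = np.asarray(a)
    b = np.zeros(len(a), dtype=int)
    b[np.argsort(a, kind="stable")[len(a) // 2:]] = 1
    return b

def ajtai_codes(P):
    """Order a bit sequence to every point of P by repeated median splits,
    alternating between the first and the second coordinate."""
    n = len(P)
    codes = [[] for _ in range(n)]
    groups, axis = [np.arange(n)], 0
    while len(groups) < n:
        next_groups = []
        for g in groups:
            bits = median_bits(P[g, axis])
            for i, bit in zip(g, bits):
                codes[i].append(bit)
            next_groups += [g[bits == 0], g[bits == 1]]
        groups, axis = next_groups, 1 - axis
    return [tuple(c) for c in codes]

def ajtai_match(X, Y):
    """Match points of X and Y with identical bit sequences; return total cost."""
    ix = dict(zip(ajtai_codes(X), range(len(X))))
    iy = dict(zip(ajtai_codes(Y), range(len(Y))))
    return sum(np.linalg.norm(X[ix[c]] - Y[iy[c]]) for c in ix)

rng = np.random.default_rng(1)
X = rng.random((16, 2))   # 16 = 4^2 points in each sample
Y = rng.random((16, 2))
cost = ajtai_match(X, Y)
```

Because every split depends only on the order of the coordinates, the matching is invariant under monotone transforms, which is exactly the marginal quantile property.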
Table 3.
k  D  Sample Size     Average      Stdev  Skewness  Kolmogorov  Locus  Correlation
1  1        42161    9.951413   5.606661  1.261530   16.717425  23626
2  1        40184   19.597105   6.658678  0.877757   11.497639  22992
3  1        40711   34.650952   7.678610  0.663583    8.510061  21130
4  1        49367   58.719088   8.880549  0.507017    7.463537  26939
5  1        20089   97.984671  10.266948  0.334140    3.379643   8789
6  1        12988  163.848339  12.198350  0.277057    2.205296   4800
7  1         5739  276.390900  14.609538  0.130521    0.986046   2560
8  1         2333  471.869129  18.169085  0.130229    0.658314   1104
1  2        42161    0.251985   0.154344  1.114916   14.115933  21684  0.891637
2  2        40184    0.655468   0.217523  0.539991    7.161248  21916  0.858883
3  2        40711    1.428708   0.294253  0.420951    5.765374  21836  0.872925
4  2        49367    2.519040   0.360076  0.433037    6.159368  29373  0.858270
5  2        20089    3.809825   0.400688  0.401140    3.978578  12257  0.807309
6  2        12988    5.210258   0.421534  0.390273    3.757436   7405  0.734815
7  2         5739    6.669811   0.431007  0.373776    2.231697   2844  0.645670
8  2         2333    8.138648   0.438881  0.363513    1.112873    898  0.558825

Because the number of points 4^8 is rather large in the case of the Hungarian algorithm, we reduced it to 4^5. In the next table we present the basic statistics for sample size 309.

Table 4.
Label  Sample type     Algorithm            Average  St. deviation
NH     normal sample   Hungarian algorithm   39.184          5.453
NA     normal sample   Ajtai algorithm       97.497         10.485
UA     uniform sample  Ajtai algorithm        2.902          0.541
UH     uniform sample  Hungarian algorithm    1.756          0.451
UB     uniform sample  ad-hoc improving       2.224          0.500

and the correlations are the following:

Table 5.
      NH     NA     UA     UH     UB
NH  1.000  0.665  0.681  0.766  0.717
NA  0.665  1.000  0.569  0.469  0.530
UA  0.681  0.569  1.000  0.896  0.981
UH  0.766  0.469  0.896  1.000  0.942
UB  0.717  0.530  0.981  0.942  1.000

The correlation between the Hungarian and Ajtai algorithms for the uniform distribution is 0.896. The correlation of the improved matching with the Ajtai one is 0.981, because our ad-hoc improving starts with the Ajtai coupling. But its correlation with the optimum is also improved: it is now 0.942.

The original Ajtai algorithm uses numerical medians and linear transformations fitting the medians to the interval halving. First we apply the median of the whole set of first coordinates and imagine a vertical line dividing the unit square accordingly; the numbers of points in the two rectangles are equal. Next we divide the left hand side and right hand side rectangles independently with appropriate horizontal lines, in such a way that the numbers of points in the four rectangles are equal. Roughly, what happens in the four rectangles is independent from rectangle to rectangle if it is scaled accordingly. This concept leads to a dynamical equation similar to (3), but in this case the four terms are independent. This type of recursion immediately leads to a normal limit distribution. When we believed that the limit distribution is not
Gaussian, we desperately struggled against such kind of dynamics, but now we are content that the limit distribution is Gaussian for both the Ajtai and the Wasserstein distance. Of course for the Wasserstein distance the situation is more complicated. The Ajtai algorithm works for an arbitrary number of points; we used powers of 4 only for didactical reasons.

References

[1] M. Ajtai, J. Komlós, and G. Tusnády, On optimal matchings,
Combinatorica (1984), 259–264.
[2] E. del Barrio, E. Giné, and C. Matrán, Central limit theorems for the Wasserstein distance between the empirical and the true distributions, The Annals of Probability (1999), 1009–1071.
[3] F. Barthe and N. O'Connell, Matchings and variance of Lipschitz functions, ESAIM: Probab. Statist. (2010), 400–408.
[4] G. D. Canas and L. R. Rosasco, Learning probability measures with respect to optimal transport metrics, arXiv:1209.1077v1 [cs.LG], 5 Sep 2012.
[5] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, Approximation algorithms for bin packing: an updated survey, in Algorithm design for computer systems design (G. Ausiello, M. Lucertini, and P. Serafini, eds.), pp. 49–106, Springer, 1984.
[6] A. E. Holroyd, R. Pemantle, Y. Peres, and O. Schramm, Poisson matchings, Ann. Inst. Henri Poincaré Probab. Stat. (2009), 266–289.
[7] P. Major, On the estimation of multiple random integrals and U-statistics (Lecture Notes in Mathematics), Springer, 2013.
[8] C. Noszály, Experiments on the distance of two-dimensional samples, Annales Mathematicae et Informaticae.
[9] …, Communications on Pure and Applied Mathematics (2005), 923–940.
[10] M. Talagrand, The Ajtai–Komlós–Tusnády matching theorem for general measures, in Probability in Banach spaces, 8 (Brunswick, ME, 1991), Volume 30 of Progress in Probability, pp. 39–54, Birkhäuser, Boston, 1992.
[11] M. Talagrand, The transportation cost from the uniform measure to the empirical measure in dimension ≥ 3, The Annals of Probability (1994), 919–959.
[12] M. Talagrand, The Generic Chaining (Springer Monographs in Mathematics), Springer, 2005.
[13] G. L. Torrisi, Asymptotic analysis of the optimal cost in some transportation problems with random locations, Stochastic Processes and their Applications (2012), 305–333.
[14] C. Villani, Optimal transport: old and new, Springer, 2008.

Figure 1. Optimal cost as a function of the number of pairs.
Figure 2. Densities of optimal costs.
Figure 3. Pictorial representation of shadow prices for n = 1024.
Figure 4. Pictorial representation of shadow prices for n = 2048.
Figure 5. Densities of X_λ/λ^0.4 for eleven values of λ.