A generalization for the expected value of the earth mover's distance
WILLIAM Q. ERICKSON
Abstract.
The earth mover's distance (EMD), also called the first Wasserstein distance, can be naturally extended to compare arbitrarily many probability distributions, rather than only two, on the set [n] = {1, . . . , n}. We present the details for this generalization, along with a highly efficient algorithm inspired by combinatorics; it turns out that in the special case of three distributions, the EMD is half the sum of the pairwise EMD's. Extending the methods of Bourn and Willenbring (2020), we compute the expected value of this generalized EMD on random d-tuples of distributions, using a generating function which coincides with the Hilbert series of the Segre embedding. We then use the EMD to analyze a real-world data set of grade distributions.

1. Introduction
We generalize a result appearing in [4], in which the authors compute the expected value of the earth mover's distance (EMD) between two probability distributions, by means of a generating function. The EMD can be viewed as the solution to a problem in transport theory, first considered in [19] by French geometer Gaspard Monge in 1781. (Although the term "earth mover" seems to have been coined only in the 1990s, it is pointed out in Villani's monumental reference [22] that the title of Monge's original treatise translates, more or less, as "On the theory of material extracted from the earth and input to a new construction." Monge, then, truly was the original earth mover.) Nearly 200 years later, in [14], Monge's name was given to a critical property of certain cost arrays for which his problem can be solved by a greedy algorithm. Throughout the late 1980s and 1990s, in [2] and [3], this Monge property was generalized to higher-dimensional arrays. (See also [17] for a more recent treatment.) It is an essential fact in this paper (whose proof is reserved for Section 8 at the end) that the d-dimensional cost array associated with our EMD_d has this Monge property.

Section 2, written for those readers unfamiliar with the EMD, presents a simple example and points out all the relevant details which will reappear in our generalization. We begin in Section 3 by defining an earth mover's "distance" EMD_d between d distributions; the classical EMD treated in [4] coincides with EMD_2. We actually find that on three distributions, EMD_3 equals half the sum of the three pairwise EMD_2 values, although no such relationship holds for d > 3. In Section 4 we pass to a discrete version of EMD_d which compares histograms instead of probability distributions, and we describe an efficient computational method using a generalization of the RSK correspondence from combinatorics. In Section 5, we encode the values of the discrete EMD_d in a generating function, which we manipulate in order to extract the expected value. Translating this discrete result back into the continuous setting, we prove the main theorem of this paper (Theorem 7), which is a recursive formula to compute the expected value of EMD_d. We then apply our theory in Section 6 to analyze a real-world data set of grade distributions. Finally, in Section 7, we mention a connection between our generating function and the Segre embedding in algebraic geometry. We also exhibit a certain infinite-dimensional representation of the Lie algebra su(p, q), whose action corresponds to manipulating the distributions compared by our EMD.

Mathematics Subject Classification.
Primary 13P25, 05E14; Secondary 05E40.
Since the appearance of [4], the problem of finding the expected value of EMD_2 has been solved from an analytical approach in [10]. The setup has also been specialized in [18] to a data set of distributions with a fixed average value.

We believe that the result in this paper — a method to evaluate the "closeness" of arbitrarily many distributions — has great potential as a tool in data analysis. In evaluating teaching and assessment practices at the university level, for instance, we can now assign a single value to an entire course by evaluating the EMD between the individual sections, and then track the behavior of that course's EMD for different groups of instructors, different course coordinators, fall vs. spring semesters, and other variables. We can even assign EMD values to individual exams and other assessments using the grade distributions in various sections; or in the other direction, we can compare different courses to each other, both within and outside a given department. In all of these settings, we believe that the generalized EMD can contribute to an interesting cluster analysis of the kind proposed in [4].

Acknowledgements:
The author would like to thank Rebecca Bourn and Jeb Willenbring, the authors of [4], for the conversations about their original paper. Jeb's observations about the connections to representation theory were especially vital to Section 7.3.
2. EMD between two distributions: summary and an example
For readers unfamiliar with the classical EMD, we summarize the idea here. Consider two probability distributions on the finite set of integers [n] = {1, . . . , n}. (More vividly, in place of a "probability distribution," imagine n bins of earth whose combined mass is one unit, located at 1, . . . , n on the number line.) Intuitively, the EMD between the two distributions measures the "cheapest" cost of moving earth between the bins so as to equalize the distributions, where the "cost" of moving one unit of earth is the distance of the move. For example, the cost of moving 0.25 units of earth from bin 2 to bin 5 is 0.25 · (5 − 2) = 0.75. To make this precise, we define the cost function C : [n] × [n] → Z_{≥0}, where C(i, j) is the cost of moving one unit of earth from bin i to bin j. In this case, clearly C(i, j) = |i − j|.

Any solution which equalizes the two distributions — whether or not it is the optimal solution — can be encoded in an n × n matrix J. Necessarily, the row sums of J will correspond to the first distribution, and the column sums to the second, so the entries of J must sum to 1.

We present a brief example to show how the entries of J give (possibly ambiguous, but equivalent) step-by-step instructions to equalize the two distributions. The procedure we give here is not the most direct (see Section 2 of [4]), but it will provide the best intuition when we generalize to d distributions in the next section. The less-than-rigorous descriptions below will be formalized in the next section in terms of the taxicab metric.

Example.
Consider the two distributions µ_1 = (0.3, 0.3, 0.4) and µ_2 = (0.1, 0, 0.9), so that n = 3. Then one matrix (among infinitely many) with the prescribed row and column sums is

J = [ 0.10  0  0.20 ]
    [ 0     0  0.30 ]
    [ 0     0  0.40 ].

The nonzero entries of J correspond to moving earth as follows:

• J_{1,1} = 0.1. Note that the coordinates (1, 1) are already equal to each other, so we do not have to move the 0.1 at all.
• J_{1,3} = 0.2. Now the coordinates (1, 3) are not equal; in order to make them equal with as little cost as possible, we have three valid options, all of which have cost 2:
  – In the first coordinate, we could add 2 to make the change 1 → 3. This corresponds to moving the 0.2, in µ_1, from bin 1 to bin 3.
  – In the second coordinate, we could subtract 2 to make the change 3 → 1. This corresponds to moving the 0.2, in µ_2, from bin 3 to bin 1.
  – We could add 1 to the first coordinate (1 → 2) and subtract 1 from the second coordinate (3 → 2). This corresponds to moving 0.2 in µ_1 from bin 1 to bin 2, and then moving 0.2 in µ_2 from bin 3 to bin 2.
• J_{2,3} = 0.3. The cheapest ways to equalize the coordinates (2, 3) are the following two options, each with cost 1:
  – In the first coordinate, we could add 1 to make the change 2 → 3. This corresponds to moving the 0.3, in µ_1, from bin 2 to bin 3.
  – In the second coordinate, we could subtract 1 to make the change 3 → 2. This corresponds to moving the 0.3, in µ_2, from bin 3 to bin 2.
• J_{3,3} = 0.4. Since the coordinates (3, 3) are already equal, we do not have to move the 0.4 at all.

Depending on which options we choose above, we may pass through different intermediate distributions µ′_1 and µ′_2. But within each possible pair, as the reader can check, we always finish with µ′_1 = µ′_2, as desired. Furthermore, the total cost of all the earth moved is independent of our choices, since all options above minimized the cost at each step. (Also note that the cost at each step was always equal to |i − j|, coinciding with the cost function C we defined earlier.) In this case, the total cost of the earth moved was

0.1(0) + 0.2(2) + 0.3(1) + 0.4(0) = 0.7.
The EMD between µ_1 and µ_2 is, by definition, the infimum (actually the minimum) of the set of total costs, taken over all possible matrices J with the prescribed row and column sums. In this example, although not obvious at first glance, 0.7 is indeed the minimum possible total cost, and so EMD(µ_1, µ_2) = 0.7. This turns out to be a consequence of the fact that the support of J lies in a chain: in other words, if we put the product order ⪯ on [n] × [n], we see that (1, 1) ⪯ (1, 3) ⪯ (2, 3) ⪯ (3, 3) is a chain in [n] × [n]. This fact — that support in a chain implies minimality — is equivalent, on a deeper level (see [14]), to the fact that our cost function C, if considered as an n × n array, has the "Monge property" alluded to in the introduction; in this case, the greedy algorithm to solve the earth mover's problem (known as the "northwest corner rule"; see [3]) eliminates one row or column at each step, meaning the support of the solution matrix J is always a chain.

There is one phenomenon here in the d = 2 case which will not generalize to d > 2: in the above example, we could have removed any ambiguity by deciding that we would move earth within µ_1 exclusively, so that both final distributions would equal µ_2. Therefore, we could interpret the problem as finding the cheapest way to transport material from a "source" or "supply vector" (µ_1) to a "sink" or "demand vector" (µ_2). For d > 2, however, the optimal solution at each step may require moving earth in any or all of the distributions, and so we lose the binary supply-demand interpretation of the problem.

Having presented the big picture, without details, in the d = 2 case, we now proceed to build up the general case for arbitrary d. Throughout the next section, the reader can verify that the definitions and results coincide with those found in this simple example where d = 2.

3. Extending EMD to d distributions

3.1. Definitions and notation.
Let P_n denote the set of probability distributions on [n]. Assume the uniform probability measure on the d-fold product P_n × · · · × P_n, defined by its embedding into R^{dn}. Our goal is to compare an arbitrary number of elements of P_n, written as the d-tuple µ := (µ_1, . . . , µ_d). We should keep in mind that each µ_i is itself an n-tuple whose components sum to 1. Throughout this paper, we write the sum of a vector's components using absolute value bars, so in this case, |µ_i| = 1. We will denote the kth component of µ_i by µ_i(k), which is just the value of the distribution µ_i at k ∈ [n]. To each µ there corresponds the set J_µ of joint distribution arrays, defined as follows.

For an array J, we will write J(m_1, . . . , m_d) for the entry at position (m_1, . . . , m_d). Now, we define J_µ as the set containing all those arrays J ∈ R^{n×···×n}_{≥0} whose sums within the coordinate hyperplanes coincide with µ. Specifically, fixing m_i = k, we must have

(1)   \sum_{m_1, \ldots, \widehat{m_i}, \ldots, m_d = 1}^{n} J(m_1, \ldots, \underbrace{k}_{m_i}, \ldots, m_d) = \mu_i(k).

In other words, summing all the entries whose positions in the array have k as their ith coordinate, we obtain the kth component of µ_i. In the familiar d = 2 case, i = 1 gives us the row sums, and i = 2 the column sums. For d = 3, see Figure 1 for an illustration.

Figure 1.
An illustration of the conditions in equation (1), in the case where d = 3 and n = 4. Given some µ = (µ_1, µ_2, µ_3), every array in J_µ satisfies the above relations, where each arrow represents the sum of the entries in the designated plane.

Any array J ∈ J_µ can be thought of as a solution to the earth mover's problem for n bins, determined by the distributions in µ. This means we need a d-dimensional analog of the "cost" function from Section 2, and the natural candidate arises from the taxicab metric on [n]^d. Specifically, for each position in a d-dimensional array, we want the associated cost to equal the taxicab distance to the main diagonal, i.e., to the nearest position in the array whose coordinates are all equal. (This "equality of coordinates" property of the main diagonal, as we recall from Section 2, corresponded to zero earth being moved.) Roughly speaking, this cost is the fewest number of ±1 changes to individual coordinates required to make all the coordinates equal. For example, the cheapest way to equalize the coordinates of the position (5, 4, 5, 5, 5, 7, 5) is to add 1 to the 4, and then to subtract 2 from the 7, for a total cost of 3. This is precisely the taxicab distance to the main diagonal, specifically to the position (5, 5, 5, 5, 5, 5, 5). In terms of moving earth:

• When we added 1 to the 2nd coordinate to make the change 4 → 5, we moved a unit of earth in the 2nd distribution µ_2 from bin 4 to bin 5.
• When we subtracted 2 from the 6th coordinate to make the change 7 → 5, we moved a unit of earth in the 6th distribution µ_6 from bin 7 to bin 5.

Example.
Consider three distributions µ_1, µ_2, µ_3 ∈ P_3, so that n = 3. Then one array in J_µ is, for instance, an array of the form

(2)   J = [ a, 0, 0 ] [ 0, 0, 0 ] [ 0, 0, 0 ]
          [ 0, 0, 0 ] [ b, 0, 0 ] [ 0, 0, 0 ]
          [ 0, 0, 0 ] [ c, 0, 0 ] [ 0, e, f ],

flattened so that the first coordinate specifies the row, the second coordinate specifies one of the three main columns, and the third coordinate specifies the position inside the triple at that position. The nonzero entries are

J(1, 1, 1) = a,   J(2, 2, 1) = b,   J(3, 2, 1) = c,   J(3, 3, 2) = e,   J(3, 3, 3) = f,

and the marginal conditions (1) force µ_1 = (a, b, c + e + f), µ_2 = (a, b + c, e + f), and µ_3 = (a + b + c, e, f). This information tells us how to arrive at the solution corresponding to J:

• The cost of (1, 1, 1) is 0 since it is already on the main diagonal, so we do not move the amount a at all.
• The cost of (2, 2, 1) is 1, since in the 3rd coordinate we must make the change 1 → 2. In the 3rd distribution µ_3, we move the amount b from bin 1 to bin 2, obtaining an updated distribution µ′_3.
• The cost of (3, 2, 1) is 2, since we equalize the coordinates most efficiently by subtracting 1 from the 1st coordinate (3 → 2) and adding 1 to the 3rd coordinate (1 → 2). This corresponds to moving the amount c from bin 3 to bin 2 in µ_1, and from bin 1 to bin 2 in µ_3, updating µ′_1 and µ′_3.
• The cost of (3, 3, 2) is 1, by adding 1 to the 3rd coordinate. This corresponds to moving the amount e from bin 2 to bin 3 in µ_3, again updating µ′_3.
• The cost of (3, 3, 3) is 0, so we do not move the amount f at all. At this point µ′_1 = µ′_2 = µ′_3.

(Just as in Section 2, some steps above admit other equally cheap options, and therefore other sequences of intermediate distributions, all arriving at the solution prescribed by J. But of course in each case the total cost is the same.) The natural computation now is to find that total cost, by multiplying the amount of earth moved at each step by the number of bins it was moved; in other words, multiply each entry in J by the cost of its position, then add these products together:

a · 0 + b · 1 + c · 2 + e · 1 + f · 0.

This completes the example.

Of course, there is no guarantee that this is the least costly way to equalize the three distributions; this is simply the solution corresponding to one particular array J, and a different array in J_µ might give a different total cost. When we finally define our generalized EMD, it will be defined as the least possible cost for any J ∈ J_µ. First, however, we should record a formula for the cost of an array position, to improve upon the somewhat sloppy method by inspection we have used so far.

The formula for the d-dimensional taxicab distance from a point to a line is derived in [7]. In our case, the line of interest is the main diagonal, which passes through (1, . . . , 1) in the direction ⟨1, . . . , 1⟩. This distance, and therefore our cost function C, turns out to be

(3)   C(m_1, \ldots, m_d) = \min_{i \in [d]} \sum_{j \neq i} |m_i - m_j|.
This cost function C can also naturally be thought of as an n × · · · × n array, so we will occasionally refer to the "cost array" in this paper.

There is also a more direct way to compute C, which will be convenient later. Let m := (m_1, . . . , m_d), and let m̃ denote the vector whose components are those of m rearranged in ascending order; e.g., if m = (7, 3, 5, 5, 1), then m̃ = (1, 3, 5, 5, 7).

Proposition 1. Equation (3) can be computed as

C(m) = \sum_{i=1}^{\lfloor d/2 \rfloor} \left( \widetilde{m}_{d-i+1} - \widetilde{m}_i \right).

As an example before the proof, take m = (7, 3, 5, 5, 1) as above. Then by the proposition, to compute C(m), we instead look at m̃ = (1, 3, 5, 5, 7) and sum up the pairwise differences working outside-in: the largest minus the smallest gives 7 − 1 = 6, the second-largest minus the second-smallest gives 5 − 3 = 2 (the middle component is unused), and therefore C(m) = 6 + 2 = 8.

Proof.
For fixed i ∈ [d], we may assume that the coordinates of m are already arranged in ascending order (the sum below depends only on the multiset of coordinates), so that m_i = m̃_i. Then

\sum_{j \neq i} |m_i - m_j| = (\widetilde{m}_2 - \widetilde{m}_1) + 2(\widetilde{m}_3 - \widetilde{m}_2) + 3(\widetilde{m}_4 - \widetilde{m}_3) + \cdots + (i-1)(\widetilde{m}_i - \widetilde{m}_{i-1})
      + (\widetilde{m}_d - \widetilde{m}_{d-1}) + 2(\widetilde{m}_{d-1} - \widetilde{m}_{d-2}) + 3(\widetilde{m}_{d-2} - \widetilde{m}_{d-3}) + \cdots + (d-i)(\widetilde{m}_{i+1} - \widetilde{m}_i),

which is minimized when i = ⌊(d+1)/2⌋. Making this evaluation in the displayed sum, we find that the sum telescopes; when d is even, we obtain

-\widetilde{m}_1 - \widetilde{m}_2 - \cdots - \widetilde{m}_{\lfloor (d+1)/2 \rfloor} + \widetilde{m}_{\lfloor (d+1)/2 \rfloor + 1} + \cdots + \widetilde{m}_d,

and when d is odd, we obtain

-\widetilde{m}_1 - \widetilde{m}_2 - \cdots - \widetilde{m}_{\lfloor (d+1)/2 \rfloor - 1} + \widetilde{m}_{\lfloor (d+1)/2 \rfloor + 1} + \cdots + \widetilde{m}_d.

In either case, this simplifies as C(m) = \sum_{i=1}^{\lfloor d/2 \rfloor} (\widetilde{m}_{d-i+1} - \widetilde{m}_i). □
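To make Proposition 1 concrete, here is a minimal Python sketch (ours, not from the paper) that evaluates C(m) both directly from definition (3) and via the sorted, outside-in formula of the proposition, and checks that the two agree on a small grid. The function names and the grid size are our own choices.

```python
from itertools import product

def cost_def(m):
    """C(m) as in (3): minimize over i the sum of |m_i - m_j| over j != i."""
    return min(sum(abs(m[i] - m[j]) for j in range(len(m)) if j != i)
               for i in range(len(m)))

def cost_sorted(m):
    """C(m) as in Proposition 1: sort m, then add (largest - smallest),
    (second largest - second smallest), ..., for floor(d/2) pairs."""
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

# Check agreement on every point of the grid [1, n]^d for small n and d.
n, d = 4, 3
assert all(cost_def(m) == cost_sorted(m)
           for m in product(range(1, n + 1), repeat=d))
```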
Remark. A recent paper [17] defines a different cost function than ours for the earth mover's problem, namely C′(m) := max{m_i} − min{m_i}. We can see from Proposition 1 that C′ agrees with our C for d = 2 and d = 3, but not for d > 3. For example, letting m = (1, 1, 2, 2), we have C(m) = 2 but C′(m) = 1. For our purposes, we have chosen our C because it counts every earth-movement required to equalize the distributions. For example, keeping m = (1, 1, 2, 2), suppose that µ_1 = µ_2 = (1, 0) and µ_3 = µ_4 = (0, 1), so that the only admissible array is the one whose single nonzero entry lies at position m. Intuitively, we want the EMD of these four distributions to be 2, not 1, since we must first move a unit of earth by 1 bin, and then move another unit by 1 bin.

Having built up the necessary intuition and formulas, we are finally ready to make our main definition:

Definition.
Let µ be a d-tuple of probability distributions, as above. Then the generalized earth mover's distance is defined as

(4)   \mathrm{EMD}_d(\mu) := \min_{J \in \mathcal{J}_\mu} \sum_{m \in [n]^d} C(m)\, J(m).
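Definition (4) is a linear program over the arrays in J_µ, so for very small n and d it can be evaluated directly with an off-the-shelf LP solver. The sketch below is ours, not the paper's; it assumes NumPy and SciPy are available, sets up the marginal constraints (1) and the cost (3), and minimizes. It is intended only as a brute-force reference point (the paper's own methods are far more efficient).

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def cost(m):
    """Taxicab distance from position m to the main diagonal (Proposition 1)."""
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def emd_lp(mu):
    """EMD_d(mu) for a d-tuple of distributions on [n], via definition (4)."""
    d, n = len(mu), len(mu[0])
    positions = list(product(range(n), repeat=d))       # 0-indexed positions of J
    c = [cost(tuple(k + 1 for k in m)) for m in positions]
    A_eq, b_eq = [], []
    for i in range(d):                                   # marginal conditions (1)
        for k in range(n):
            A_eq.append([1.0 if m[i] == k else 0.0 for m in positions])
            b_eq.append(mu[i][k])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# The two-distribution example of Section 2, and a tiny three-distribution test:
print(emd_lp([(0.3, 0.3, 0.4), (0.1, 0.0, 0.9)]))   # 0.7
print(emd_lp([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]))  # 1.0
```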
3.2. Existence of a greedy algorithm.
As mentioned in the first two sections, finding the right-hand side of (4) is equivalent to finding the optimal solution to a d-dimensional transport problem. It is shown in [3] that there exists a greedy algorithm to find this solution in O(dn) time, precisely when the cost array C has the Monge property mentioned in the introduction:
Definition. A d-dimensional array A has the Monge property if, for all x = (x_1, . . . , x_d) and y = (y_1, . . . , y_d), we have

A(\min\{x_1, y_1\}, \ldots, \min\{x_d, y_d\}) + A(\max\{x_1, y_1\}, \ldots, \max\{x_d, y_d\}) \leq A(x) + A(y).

We now state the crucial proposition, whose proof we will give in Section 8.
Proposition 2.
The cost array C defined in (3) has the Monge property.

This proposition, then, guarantees the existence of a greedy algorithm to compute EMD_d. (This justifies our writing "min" instead of "inf" in our definition; we also could have used a compactness argument as in [4].) The greedy algorithm described in [3] is a generalization of the two-dimensional "northwest corner rule." Just as in the d = 2 case (see Section 2), for generic d this algorithm arrives at its solution in the form of an array J ∈ J_µ whose support is a chain, i.e., pairwise comparable under the product order on [n]^d. (In [4], Proposition 4, the "straightening" procedure that converts the support of any J into a chain, without increasing the total cost, is valid precisely because the cost array C(i, j) = |i − j| has the Monge property.) Rather than describe this greedy algorithm, which is already well-known (see [3] or [17]), our goal is instead to find the expected value of EMD_d. To this end, the importance of the algorithm is the following:

Corollary 3.
The minimum in (4) occurs for some J ∈ J_µ whose support is a chain in [n]^d.

Since there is nothing special about the condition |µ_i| = 1 from the perspective of transport problems, Corollary 3 also holds in a discrete setting using integer compositions in place of probability distributions. We will take this discrete approach in the next section, where we use a highly efficient combinatorial method to find the optimal array J for any µ.

4. A discrete approach
We follow the method from [4], with a view toward constructing a generating function in the next section. In place of P_n, we temporarily turn our attention to C(s, n), the set of (weak) compositions of some positive integer s into n parts. That is, elements of C(s, n) are n-tuples of nonnegative integers whose sum is s (whereas before, the elements of P_n were n-tuples of nonnegative real numbers whose sum was 1); we can also think of compositions as histograms. The cost function C, however, remains the same as before, since it still describes distances among the n bins in each of the d compositions.

In this section, µ = (µ_1, . . . , µ_d) denotes a sequence of compositions µ_i ∈ C(s, n). It is tempting simply to adjust the definition of J_µ so that arrays in the set must have nonnegative integer entries summing to s; then we could just re-use the definition (4) to obtain a definition for the discrete EMD_d. Although this is one viable approach, nevertheless, in light of Corollary 3, we need only consider those arrays whose support is a chain; therefore we will work with the following set of arrays from this point forward:

\mathcal{J}^s_{(n^d)} := \left\{ J \in (\mathbb{Z}_{\geq 0})^{n \times \cdots \times n} \;\middle|\; \sum_{m} J(m) = s, \text{ and the support of } J \text{ is a chain in } [n]^d \right\}.

We will now show that each d-tuple of compositions µ ∈ C(s, n) × · · · × C(s, n) corresponds to a unique array J_µ ∈ J^s_{(n^d)}; therefore, by the end of the next subsection, we will have a direct computational definition for the discrete version of the EMD, which avoids taking the minimum over a set of arrays as we must in the definition (4) of the continuous EMD. Once we have this definition for the discrete EMD, we will be able to recover the continuous version by scaling all the µ_i by 1/s. Near the end of the paper, we will do exactly this, and then let s → ∞, in order to translate discrete results back into the continuous setting.

4.1. Generalized RSK correspondence.
The authors of [4] use the Robinson–Schensted–Knuth correspondence to great effect in order to determine a unique optimal matrix J_{(µ_1,µ_2)} for an ordered pair of compositions (µ_1, µ_2). We now apply this same idea to d compositions in order to uniquely determine an optimal d-dimensional array. This will allow us to calculate the discrete EMD_d directly (and even more efficiently, in many cases, than by using the greedy algorithm mentioned above).

For non-experts, we summarize here a special case of the correspondence. (For full details, see Chapter 4 of [11].) The Robinson–Schensted–Knuth (RSK) correspondence furnishes a bijection:

{ ordered pairs of semistandard Young tableaux of the same shape, with entries in [n] }  ⟷  { n × n matrices with nonnegative integer entries }.

For our purposes, we will restrict our attention to the special case of one-row tableaux, since any composition in C(s, n) corresponds uniquely to a one-row tableau containing s boxes with entries from [n]. As an example, consider the two compositions

µ_1 = (1, 2, 3, 4),   µ_2 = (5, 0, 2, 3)

in C(10, 4). We associate to each µ_i a one-row tableau T(µ_i), which we fill so that the entry k appears µ_i(k) times:

T(µ_1) = 1 2 2 3 3 3 4 4 4 4
T(µ_2) = 1 1 1 1 1 3 3 4 4 4

Considering these tableaux as the two rows of a 2 × s matrix, we then have

M_{(µ_1,µ_2)} = [ 1 2 2 3 3 3 4 4 4 4 ]
                [ 1 1 1 1 1 3 3 4 4 4 ].

Finally, we fill in an n × n array J_{(µ_1,µ_2)} whose (i, j) entry equals the number of times the column [i; j] appears in M_{(µ_1,µ_2)}. For example, [4; 4] appears three times, so we write a 3 in position (4, 4). Altogether,

(µ_1, µ_2) ⟷ (T(µ_1), T(µ_2)) ⟷ M_{(µ_1,µ_2)} ⟷ J_{(µ_1,µ_2)} = [ 1 0 0 0 ]
                                                                  [ 2 0 0 0 ]
                                                                  [ 2 0 1 0 ]
                                                                  [ 0 0 1 3 ].

Note that we can also reverse the procedure, starting with the array J_{(µ_1,µ_2)}, translating its entries into a two-row matrix, and finally recovering the original pair of tableaux (and hence the pair of compositions). Therefore this is indeed a bijection. In the context of the EMD, the array J_{(µ_1,µ_2)} has two significant properties:

• The row and column sums coincide with the original compositions µ_1 and µ_2, so J_{(µ_1,µ_2)} is a solution to the discrete earth mover's problem for µ_1 and µ_2.
• Since both rows of M_{(µ_1,µ_2)} are nondecreasing, the support of J_{(µ_1,µ_2)} is a chain in [n] × [n].

In summary, we have the following bijective correspondence in the case d = 2:

C(s, n) × C(s, n) ⟷ J^s_{(n^2)}.

This RSK correspondence extends naturally to d-tuples of compositions in C(s, n). (For experts, details about the existence of this multivariate RSK generalization can be found in [6].) Given µ = (µ_1, . . . , µ_d), the tableaux corresponding to the µ_i uniquely determine a d × s matrix M_µ, which in turn determines a unique n × · · · × n array J_µ ∈ J^s_{(n^d)}. This correspondence is again bijective, establishing the following special case of the generalized RSK correspondence:

(5)   C(s, n) × · · · × C(s, n) ⟷ J^s_{(n^d)},    µ ⟼ J_µ.

This correspondence leads us to the following definition of the discrete EMD:
Definition.
For positive integers d, n, and s, let µ = (µ_1, . . . , µ_d) with each µ_i ∈ C(s, n). Let J_µ be the unique array corresponding to µ, as in (5). Let C be the cost function on [n]^d as in (3). Then we define the discrete generalized earth mover's distance to be

(6)   \mathrm{EMD}^s_d(\mu) := \sum_{m \in [n]^d} C(m)\, J_\mu(m),

where we write the superscript s to distinguish this discrete version from the continuous version.

This definition in (6) is largely conceptual; in practice, we can calculate EMD^s_d(µ) directly from the matrix M_µ, since the support of J_µ is determined by the columns of M_µ:

Theorem 4.
Let µ = (µ_1, . . . , µ_d) with each µ_i ∈ C(s, n), and let C be the cost function in (3). Let M_µ be the unique d × s array corresponding to µ via the generalized RSK correspondence, as described above, and let M_µ(•, j) denote the jth column vector of M_µ. Then

\mathrm{EMD}^s_d(\mu) = \sum_{j=1}^{s} C\big(M_\mu(\bullet, j)\big).

Proof.
Consider the definition of EMD^s_d in (6). By definition, J_µ(m) equals the number of occurrences of m as a column vector of the matrix M_µ. Therefore J_µ(m) = 0 unless m is one of those column vectors, and so we can simply sum over the s column vectors to obtain the result. □

Remark.
This construction via RSK is more efficient than the aforementioned greedy algorithm for computing EMD^s_d when d or n is sufficiently large: rather than filling a d-dimensional array in O(dn) time, we need consider only the s column vectors of a d × s matrix.
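The computation in Theorem 4 is easy to carry out by hand or by machine. The sketch below is our code, not the author's: it builds each row of M_µ by listing the bin index k exactly µ_i(k) times in weakly increasing order (the one-row tableau T(µ_i)), and then sums the cost C over the s columns. The test compositions are the two from the d = 2 example of Section 4.1.

```python
def cost(m):
    """Cost from Proposition 1: sorted, outside-in pairwise differences."""
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def tableau_row(comp):
    """One-row tableau of a composition: bin k (1-indexed) repeated comp[k-1] times."""
    return [k + 1 for k, mult in enumerate(comp) for _ in range(mult)]

def emd_discrete(compositions):
    """EMD^s_d of a d-tuple of compositions of s into n parts, via Theorem 4."""
    rows = [tableau_row(c) for c in compositions]   # the rows of the matrix M_mu
    return sum(cost(col) for col in zip(*rows))     # sum the cost over the s columns

# The d = 2 example from Section 4.1:
mu1, mu2 = (1, 2, 3, 4), (5, 0, 2, 3)
print(emd_discrete([mu1, mu2]))   # 7, the cost summed over the ten columns
```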
Example. As an example for d = 3, consider three compositions µ_1, µ_2, µ_3 ∈ C(5, 4), and form the 3 × 5 matrix M_µ whose rows are the one-row tableaux T(µ_1), T(µ_2), T(µ_3). Using Theorem 4 on the five columns of M_µ, we compute

EMD^5_3(µ) = C(M_µ(•, 1)) + C(M_µ(•, 2)) + C(M_µ(•, 3)) + C(M_µ(•, 4)) + C(M_µ(•, 5)).

Before advancing to the main problem of the paper, we show that EMD_3 can actually be expressed in terms of the classical EMD_2. (In the following proposition, we suppress the superscript s because the result holds for both the discrete and the continuous version of EMD: the equality is independent of s, and therefore still holds after dividing both sides by s and letting s → ∞.)

Proposition 5.
The value of EMD_3 is half the sum of the three pairwise EMD_2 values:

\mathrm{EMD}_3(\mu_1, \mu_2, \mu_3) = \frac{1}{2}\Big( \mathrm{EMD}_2(\mu_1, \mu_2) + \mathrm{EMD}_2(\mu_1, \mu_3) + \mathrm{EMD}_2(\mu_2, \mu_3) \Big).

Proof.
Let µ = (µ_1, µ_2, µ_3) as usual. In each column j of the matrix M_µ, call the three entries a_j, b_j, c_j, labeled so that a_j ≤ b_j ≤ c_j. Each of the three pairs (a_j, b_j), (a_j, c_j), and (b_j, c_j) corresponds naturally to one of the pairs (µ_1, µ_2), (µ_1, µ_3), and (µ_2, µ_3), where row i corresponds to µ_i. Therefore by Theorem 4, we have

\mathrm{EMD}_2(\mu_1, \mu_2) + \mathrm{EMD}_2(\mu_1, \mu_3) + \mathrm{EMD}_2(\mu_2, \mu_3) = \sum_{j=1}^{s} C(a_j, b_j) + C(a_j, c_j) + C(b_j, c_j)
    = \sum_j (b_j - a_j) + (c_j - a_j) + (c_j - b_j)
    = \sum_j (2c_j - 2a_j)
    = 2 \sum_j (c_j - a_j)
    = 2 \sum_j C(a_j, b_j, c_j)
    = 2 \cdot \mathrm{EMD}_3(\mu_1, \mu_2, \mu_3). □
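A short, self-contained numerical check of Proposition 5 is easy to write using the column method of Theorem 4. The sketch below is ours; the three test compositions are chosen arbitrarily (they are not the paper's example), and the assertion holds for any choice by the proposition.

```python
def cost(m):
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def emd_discrete(comps):
    rows = [[k + 1 for k, mult in enumerate(c) for _ in range(mult)] for c in comps]
    return sum(cost(col) for col in zip(*rows))

# Three arbitrary compositions of s = 5 into n = 4 parts (our test data).
m1, m2, m3 = (4, 0, 1, 0), (1, 2, 2, 0), (0, 4, 1, 0)
lhs = emd_discrete([m1, m2, m3])
rhs = emd_discrete([m1, m2]) + emd_discrete([m1, m3]) + emd_discrete([m2, m3])
assert 2 * lhs == rhs   # Proposition 5: twice EMD_3 equals the sum of pairwise EMD_2's
```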
3, because in general, the telescoping summand inthe proof does not reduce in terms of a higher-dimensional cost function. For example, when d = 4,the analog of the third line above is (cid:80) j d j + c j − b j − a j , or (cid:80) j C ( a j , b j , c j , d j ) + 2( d j − a j ).5. Expected value of EMD d Again we will follow and extend the methods used in [4] to arbitrary values of d . First we willdefine a generating function of two variables to record the values of EMD sd , which we will thendifferentiate in order to sum up all of these values. This will allow us to compute expected value forEMD sd simply by reading off coefficients from a generating function of a single variable.Because we are about to make a recursive definition, we will now need to consider d -tuples µ consisting of compositions with different numbers of bins — i.e., different values n i such that each µ i ∈ C ( s, n i ). Therefore, n will denote this vector of bin numbers ( n , . . . , n d ), and we will write( n d ) for the special vector ( n, . . . , n ), which arises most frequently in applications.In order to encode an inclusion-exclusion argument, we will also need to define an indicator vector e ( A ) for a subset A ⊆ [ d ]. Namely, e ( A ) := (cid:88) i ∈ A e i is the vector whose i th component is 1 if i ∈ A and 0 otherwise. For example, if d = 5, and A = { , , } , then e ( A ) = (0 , , , , Generating function for the discrete case.
For fixed n = (n_1, . . . , n_d), we first define a generating function in two indeterminates z and t:

(7)   H_n(z, t) := \sum_{s=0}^{\infty} \sum_{\mu \in \mathcal{C}(s, n_1) \times \cdots \times \mathcal{C}(s, n_d)} z^{\mathrm{EMD}^s_d(\mu)}\, t^s.

We observe that the coefficient of z^r t^s is the number of elements µ ∈ C(s, n_1) × · · · × C(s, n_d) such that EMD^s_d(µ) = r.

A recursive definition of this generating function, for the d = 2 case, is derived in [4], Theorem 3. Our generalization for d > 2 is the following:
Proposition 6.
Fix n = (n_1, . . . , n_d). The generating function H_n := H_n(z, t) has the following recursive definition, where the sum is over all nonempty subsets A ⊆ [d]:

H_n = \frac{\sum_{A} (-1)^{|A|-1} \, H_{n - e(A)}}{1 - z^{C(n)}\, t},

where H_{(1^d)} = \frac{1}{1-t}, and H_{n - e(A)} = 0 if n − e(A) contains a 0.

Proof. Each µ corresponds to a unique monomial

(8)   \mu \longleftrightarrow \prod_{m} w_m^{J_\mu(m)},

where µ ↔ J_µ is the RSK correspondence in (5). The variables w_m are indexed by multi-indices m ∈ [n_1] × · · · × [n_d]. Note that the degree of this monomial equals s (the sum of the entries of J_µ). Now making the substitution

(9)   w_m \longmapsto z^{C(m)}\, t,

the above correspondence gives us the map

\mu \longleftrightarrow \prod_{m} w_m^{J_\mu(m)} \longmapsto z^{\mathrm{EMD}^s_d(\mu)}\, t^s.

Therefore the generating function H_n is just the image of the formal sum H*_n of all monomials of the form (8), under the substitution (9); as s ranges over all nonnegative integers, there is one monomial in H*_n for each possible µ ∈ C(s, n_1) × · · · × C(s, n_d).

Since m ⪯ n for all m, every array J_µ is allowed to contain n in its support without violating the chain condition. This means that every monomial in H*_n is allowed to contain the variable w_n, and so we may factor out the sum of all possible powers of w_n, rewriting as

H^*_n = \sum_{\mu} \left( \prod_m w_m^{J_\mu(m)} \right) = \left( \sum_{r=0}^{\infty} w_n^r \right) \cdot f(w_{m \neq n}) = \frac{f(w_{m \neq n})}{1 - w_n},

where f is an infinite formal sum of monomials in the variables w_m with m ≠ n. Now we focus on rewriting this numerator f. Suppose we subtract 1 from exactly one of the coordinates of n; the possible results are n − e(i) for i = 1, . . . , d. Now, on one hand, any monomial in f containing the variable w_{n−e(i)} appears in H*_{n−e(i)}. But on the other hand, note that all of these n − e(i) are mutually incomparable under the product order, and so at most one of them can be in the support of some J_µ, because of the chain condition. Therefore any monomial in f contains at most one of the variables w_{n−e(i)}. But the sum \sum_{i=1}^{d} H^*_{n−e(i)} still overcounts the monomials appearing in f, since the same monomial may appear in several distinct summands.

In other words, we want f to be the formal sum of the union (without multiplicity) of the monomials which appear in the summands H*_{n−e(i)}. We can achieve this by using the inclusion–exclusion principle: subtract those monomials which appear in at least 2 of the summands, then add back the monomials which appear in at least 3 of the summands, then subtract those appearing in at least 4 summands, and so on, until we arrive at those monomials appearing in all d of the summands. We can write this inclusion–exclusion as an alternating sum over nonempty subsets A ⊆ [d], adding when |A| is odd and subtracting when |A| is even:

f = \sum_{A} (-1)^{|A|-1} \, H^*_{n - e(A)}.

Finally, applying the substitution (9), we obtain

H_n = H^*_n \Big|_{w_m = z^{C(m)} t} = \frac{\sum_{A} (-1)^{|A|-1} H^*_{n - e(A)}}{1 - w_n} \bigg|_{w_m = z^{C(m)} t} = \frac{\sum_{A} (-1)^{|A|-1} H_{n - e(A)}}{1 - z^{C(n)}\, t},

proving the recursion.

As for the base case H_{(1,\ldots,1)} = \frac{1}{1-t}: when every n_i = 1, there is only one element in C(s, 1), so the inside sum in (7) has only one term; moreover, this unique µ is just d copies of the same trivial composition of s into 1 part, meaning that EMD^s_d(µ) = 0. Hence H_{(1,\ldots,1)}(z, t) = \sum_s z^0 t^s = \sum_s t^s, whose closed form is \frac{1}{1-t}.

Likewise, since C(s, 0) is empty, we must have H_n = 0 if any of the n_i become 0. □
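For small parameters, one can tabulate the coefficients of H_n(z, t) directly from the definition (7) by enumerating all d-tuples of compositions and computing EMD^s_d via Theorem 4; this gives an independent brute-force check of the recursion in Proposition 6. The following minimal sketch is ours (the function names and the sample parameters are our own choices).

```python
from collections import Counter
from itertools import product

def compositions(s, n):
    """Generate all weak compositions of s into n parts."""
    if n == 1:
        yield (s,)
        return
    for first in range(s + 1):
        for rest in compositions(s - first, n - 1):
            yield (first,) + rest

def cost(m):
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def emd_discrete(comps):
    rows = [[k + 1 for k, mult in enumerate(c) for _ in range(mult)] for c in comps]
    return sum(cost(col) for col in zip(*rows))

def coefficients(nvec, s):
    """Counter {r: number of tuples mu with EMD^s_d(mu) = r}; these are the
    coefficients of z^r t^s in H_n(z, t)."""
    pools = [list(compositions(s, ni)) for ni in nvec]
    return Counter(emd_discrete(mu) for mu in product(*pools))

print(coefficients((2, 2), 3))   # coefficients of t^3 in H_(2,2)(z, t)
```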
Example. We write out this recursive definition in a concrete case, where d = 3, so that the numerator involves the 2^3 − 1 = 7 nonempty subsets A ⊆ [3]. First, for |A| = 1, we add together all possible H_{n′}, where n′ equals n with exactly 1 coordinate decreased; then for |A| = 2, we subtract all possible H_{n′′} where n′′ equals n with exactly 2 coordinates decreased; finally, for |A| = 3, we add the one possible H_{n′′′} where n′′′ equals n with all 3 coordinates decreased. As for the denominator, C is the same cost function we defined in (3). Writing n = (n_1, n_2, n_3), the recursion for H_n looks like this:

H_{(n_1,n_2,n_3)} = \frac{ H_{(n_1-1,n_2,n_3)} + H_{(n_1,n_2-1,n_3)} + H_{(n_1,n_2,n_3-1)} - H_{(n_1-1,n_2-1,n_3)} - H_{(n_1-1,n_2,n_3-1)} - H_{(n_1,n_2-1,n_3-1)} + H_{(n_1-1,n_2-1,n_3-1)} }{ 1 - z^{C(n)}\, t }.

Having seen an example, we now mention the important (and very well-studied) specialization that results from setting z = 1. In this case, the coefficient of t^s in H_n(1, t) is simply the total number of d-tuples µ, which is \prod_{i=1}^{d} |\mathcal{C}(s, n_i)| = \prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}:

(10)   H_n(1, t) = \sum_{s=0}^{\infty} \prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}\, t^s.

It is shown in [8] that the closed form of (10) is, after adjusting the index to match our setup, and writing |n| := n_1 + · · · + n_d,

(11)   H_n(1, t) = \frac{W_n(t)}{(1 - t)^{|n| - d + 1}},

where the numerator W_n(t) is a polynomial whose coefficients are the "Simon Newcomb" numbers. (For more on this natural generalization of Eulerian numbers to multisets, see [1], [8], and [20].) Specifically, denoting the coefficient of t^i in W_n by the symbol [t^i] W_n, and adopting the A-notation originally used in [8], we have

[t^i]\, W_n = A\big(n - (1, \ldots, 1),\, i\big) := \#\{\text{permutations of the multiset } \{1^{n_1 - 1}, \ldots, d^{n_d - 1}\} \text{ containing } i \text{ descents}\} = \sum_{j=0}^{i} (-1)^j \binom{|n| - d + 1}{j} \prod_{k=1}^{d} \binom{i - j + n_k - 1}{n_k - 1}.

The degree of the polynomial W_n is shown in [8] to be

\sum_{i=1}^{d} (n_i - 1) - \max\{n_1, \ldots, n_d\} + 1.

The combinatorial interpretation implies that the coefficients of W_n are positive; in the special case where n = (n^d), the polynomial W_n is also unimodal and palindromic. (This can be shown from a combinatorial or ring-theoretic approach; for the latter, see [20], or Chapter 5 of [5] on Stanley–Reisner and Gorenstein rings.)

From the combinatorial description above of [t^i] W_n, it follows that the evaluation W_n(1) equals the total number of permutations of the multiset {1^{n_1 − 1}, . . . , d^{n_d − 1}}:

(12)   W_n(1) = \frac{\left( \sum_{i=1}^{d} (n_i - 1) \right)!}{\prod_{i=1}^{d} (n_i - 1)!} = \frac{(|n| - d)!}{\prod_{i} (n_i - 1)!}.

Note that each permutation of the multiset {1^{n_1 − 1}, . . . , d^{n_d − 1}} corresponds to a unique increasing lattice path in N^d, beginning at (1, . . . , 1) and ending at (n_1, . . . , n_d): reading left to right, each occurrence of i in the permutation signifies adding the standard basis vector e_i to the current position in the path. Therefore, W_n(1) can be interpreted as the total number of increasing paths connecting opposite corners of an n_1 × · · · × n_d array — in other words, the number of chains in [n_1] × · · · × [n_d] with maximal length.

5.2. A partial derivative.
Next, in order to transfer the EMD values from the exponents of z into coefficients, we take the partial derivative of H_n with respect to z. Applying the quotient rule to our definition of H_n in Proposition 6, we obtain the following, where the sum still ranges over nonempty subsets A ⊆ [d]:

\frac{\partial H_n}{\partial z} = \frac{ \left(1 - z^{C(n)} t\right) \left( \sum_A (-1)^{|A|-1} \frac{\partial H_{n-e(A)}}{\partial z} \right) + C(n)\, z^{C(n)-1}\, t \left( \sum_A (-1)^{|A|-1} H_{n-e(A)} \right) }{ \left(1 - z^{C(n)} t\right)^2 }.

Now that the exponents have been changed into coefficients of z, we can set z = 1:

(13)   H'_n := \frac{\partial H_n}{\partial z} \bigg|_{z=1} = \sum_{s=0}^{\infty} \sum_{\mu \in \mathcal{C}(s,n_1) \times \cdots \times \mathcal{C}(s,n_d)} \mathrm{EMD}^s_d(\mu)\, t^s = \frac{ (1-t) \left( \sum_A (-1)^{|A|-1} H'_{n-e(A)} \right) + t\, C(n) \left( \sum_A (-1)^{|A|-1} H_{n-e(A)} \right) }{ (1-t)^2 }.

At this point, z has played out its role, and so from now on we will write H_n in place of H_n(1, t). Note that the coefficient of t^s in H'_n is the sum of the values EMD^s_d(µ) over all valid d-tuples µ. This means that our goal is now in sight: to find the expected value of EMD^s_d for fixed n, we need to divide the sum of all possible EMD^s_d values (i.e., the coefficient of t^s in H'_n) by the total number of possible inputs µ (i.e., the coefficient of t^s in H_n). Therefore, once we find a way to simplify (13), we will be able to compute the result

(14)   E(\mathrm{EMD}^s_d) = \frac{[t^s]\, H'_n}{[t^s]\, H_n} = \frac{[t^s]\, H'_n}{\prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}},

where [t^s] denotes the coefficient of t^s in a series.

In order to make the expression (13) for H'_n more tractable to program, we will now focus only on the numerators of H_n and H'_n. We have already determined W_n(t), the numerator for H_n, in the previous subsection. We will let N_n(t) denote the numerator of H'_n. By using software and observing patterns for small n, we anticipate that the denominator of H'_n has exponent |n| − d + 2, and so we now set both

(15)   W_n(t) := (1-t)^{|n|-d+1}\, H_n \quad \text{and} \quad N_n(t) := (1-t)^{|n|-d+2}\, H'_n.

Therefore, we can clear denominators in (13) by multiplying both sides by (1 − t)^{|n|−d+2}. Proceeding carefully and clearing the remaining denominators using (15), the pattern becomes clear:

(16)   N_n = \sum_A (-1)^{|A|-1} (1-t)^{|A|-1}\, N_{n-e(A)} + t \cdot C(n) \cdot (1-t)^{|n|-d} \cdot (1-t)\, H_n = \sum_A (t-1)^{|A|-1}\, N_{n-e(A)} + t \cdot C(n) \cdot W_n.

This provides us with a quick recursive code to obtain N_n, after which we need only divide by (1 − t)^{|n|−d+2} to recover H'_n. The rest is just a matter of extracting coefficients in order to apply the result (14).

5.3. Expected value for the continuous version of EMD_d.

Now that we have a way to determine the expected value for the discrete EMD, we aim to find a formula for the expected value in the continuous setting. Starting with the expected value from (14), we scale by 1/s to normalize, and then let s grow asymptotically:

E_n := E(\mathrm{EMD}_d) = \lim_{s \to \infty} \frac{1}{s} \cdot E(\mathrm{EMD}^s_d) = \lim_{s \to \infty} \frac{1}{s} \cdot \frac{[t^s]\, H'_n}{\prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}}.

First we focus on the [t^s] H'_n part, namely the coefficient of t^s in H'_n = \frac{N_n(t)}{(1-t)^{|n|-d+2}}. Now, the coefficient of t^s in the series \frac{1}{(1-t)^{|n|-d+2}} is just

\binom{s + |n| - d + 1}{|n| - d + 1} = \frac{s^{|n|-d+1}}{(|n| - d + 1)!} + \text{lower-order terms in } s.

Meanwhile, N_n(t) is a polynomial, with some finite degree b. Now, as s → ∞, we have s − b ∼ s, and so the coefficient of t^s in H'_n is asymptotic to \frac{s^{|n|-d+1}}{(|n|-d+1)!} multiplied by the sum of the coefficients of N_n(t). But this sum is just N_n(1), and so we have

[t^s]\, H'_n \sim N_n(1) \cdot \frac{s^{|n|-d+1}}{(|n| - d + 1)!}.

Accounting for the 1/s, we currently have the following:

E_n = \lim_{s \to \infty} \frac{ N_n(1) \cdot \frac{s^{|n|-d}}{(|n| - d + 1)!} }{ \prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1} }.

Now, since \prod_i \binom{s + n_i - 1}{n_i - 1} \sim \prod_i \frac{s^{n_i - 1}}{(n_i - 1)!} = \frac{s^{|n| - d}}{\prod_i (n_i - 1)!}, this becomes

(17)   E_n = \frac{ N_n(1) \cdot \prod_{i=1}^{d} (n_i - 1)! }{ (|n| - d + 1)! }.

But when we evaluate N_n(1) from equation (16), the terms with (t − 1) all disappear; hence we need only consider subsets A ⊆ [d] with one element, meaning we are now summing from 1 to d:

N_n(1) = \sum_{i=1}^{d} N_{n - e(i)}(1) + C(n)\, W_n(1).

Substituting for W_n(1) using (12), we have

N_n(1) = \sum_{i=1}^{d} N_{n - e(i)}(1) + C(n) \cdot \frac{(|n| - d)!}{\prod_{i=1}^{d} (n_i - 1)!}.

Finally, returning to (17) and plugging this all in for N_n(1), we conclude with the recursive definition

(18)   E_n = \left[ \sum_{i=1}^{d} N_{n-e(i)}(1) + C(n) \cdot \frac{(|n| - d)!}{\prod_{i=1}^{d} (n_i - 1)!} \right] \cdot \frac{\prod_{i=1}^{d} (n_i - 1)!}{(|n| - d + 1)!} = \frac{ \sum_{i=1}^{d} N_{n-e(i)}(1) \cdot \frac{\prod_{i=1}^{d}(n_i - 1)!}{(|n| - d)!} }{ |n| - d + 1 } + C(n) \cdot \frac{(|n| - d)!}{(|n| - d + 1)!} = \frac{ \sum_{i=1}^{d} (n_i - 1)\, E_{n - e(i)} + C(n) }{ |n| - d + 1 },

where in the last equality we used (17) applied to each n − e(i), and where E_{n−e(i)} = 0 if n − e(i) contains a 0.

We record this as the main theorem of this paper, in the most useful case where n = (n^d):

Theorem 7.
The expected value of EMD_d on P_n × · · · × P_n is E_{(n^d)} as defined in (18).

Remark.
Recall from Proposition 5 the special relationship between EMD_3 and EMD_2, namely, EMD_3 equals half the sum of the three pairwise EMD_2 values. This leads us to anticipate that

E_{(n^3)} = E(\mathrm{EMD}_3) = E\!\left( \tfrac{1}{2}\left( \mathrm{EMD}_2 + \mathrm{EMD}_2 + \mathrm{EMD}_2 \right) \right) = E\!\left( \tfrac{3}{2}\, \mathrm{EMD}_2 \right) = \tfrac{3}{2}\, E(\mathrm{EMD}_2) = \tfrac{3}{2}\, E_{(n^2)}.

We confirm this in Mathematica, computing E_{(n^2)} and E_{(n^3)} from (18) for small values of n and checking that the ratio E_{(n^3)} / E_{(n^2)} is always 3/2.
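The recursion (18) is straightforward to implement. The Python sketch below is ours (not the Mathematica code used in the paper); it memoizes E_n over vectors n, returns exact rational values, and can be used, for instance, to confirm the 3/2 ratio discussed in the remark above.

```python
from fractions import Fraction
from functools import lru_cache

def cost(m):
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

@lru_cache(maxsize=None)
def expected_emd(n):
    """E_n from (18), as an exact rational; n is a tuple (n_1, ..., n_d)."""
    d = len(n)
    if any(ni == 0 for ni in n):
        return Fraction(0)
    total = Fraction(cost(n))
    for i in range(d):
        smaller = n[:i] + (n[i] - 1,) + n[i + 1:]
        total += (n[i] - 1) * expected_emd(smaller)
    return total / (sum(n) - d + 1)

for n in range(2, 7):
    e2, e3 = expected_emd((n, n)), expected_emd((n, n, n))
    print(n, e2, e3, e3 / e2)   # the last ratio is always 3/2, as in the remark above
```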
5.4. Unit normalized EMD.

It is often convenient to unit normalize the EMD_d so that its value falls between 0 and 1. To this end, we claim that for a given n, the maximum value of EMD_d is ⌊d/2⌋(n − 1). Indeed, the maximum value of EMD^s_d(µ) occurs when in every column of the matrix M_µ (given by the RSK correspondence), half the entries are 1's and the other half are n's; if d is odd, then "half" means ⌊d/2⌋, with the leftover entry being irrelevant by Proposition 1. For such a µ, then, EMD^s_d(µ) equals the cost ⌊d/2⌋(n − 1) multiplied by s (the number of columns). After dividing by s to pass to the continuous setting, we see that the maximum value of EMD_d is ⌊d/2⌋(n − 1). Hence we define the unit normalized EMD_d and its expected value:

(19)   \widehat{\mathrm{EMD}}_d(\mu) := \frac{\mathrm{EMD}_d(\mu)}{\lfloor d/2 \rfloor (n-1)} \quad \text{and} \quad \widehat{E}_{(n^d)} := \frac{E_{(n^d)}}{\lfloor d/2 \rfloor (n-1)}.

Figure 2.
Histograms of discrete EMD^s_d values for d = 2, 3, 4, 5, fixing s = 5 and n = 3. Note the more severe skew to the right when d is even, compared to nearly zero skew when d is odd.

We observe a curious phenomenon when we fix n and let d increase: the unit normalized expected value \widehat{E}_{(n^d)} alternately increases (d changing from even to odd) and decreases (d changing from odd to even), as one sees by computing \widehat{E}_{(3^d)} for successive values of d. The histograms of the discrete EMD^s_d values confirm this impression (see Figure 2): when d is even, the histograms are clearly right-skewed, whereas when d is odd, the histograms have nearly zero skew (although still slightly right-skewed). It seems that this alternating phenomenon is a consequence of taxicab geometry in even vs. odd dimensions: specifically, when d is odd, the median of a vector's coordinates contributes nothing to its distance from the main diagonal (i.e., the cost function C). Intuitively, when the other coordinates have a wide range — that is, when C is relatively large — this "free" coordinate can assume more values without affecting C, than when the range of the coordinates is smaller. This increases the proportion of high-cost array positions compared to the case when d is even.

6. Real-world data
As a basic example, we now apply this generalized discrete EMD in order to compare the grade distributions in the spring vs. fall semester of 2019, for the course MATH 232 (Calculus II) at the University of Wisconsin–Milwaukee. This data is contained in the Section Attrition and Grade
Report, published by the Office of Assessment and Institutional Research at UWM; final letter grades are A, B, C, D, and F, so n = 5. We consider all the sections of MATH 232 with more than 20 students enrolled; there were seven such sections in each semester in 2019, and so we put d = 7 in both cases. We list the grade distributions below, with µ corresponding to the spring sections and ν the fall sections. (We will continue to use the term "distribution" rather than "composition" in the context of this application.)

The value of s is problematic, since different classes in the real world do not have the same number of students. One solution is to scale each distribution so that all distributions share a common s-value, namely, the least common multiple of their individual s-values. On one hand, this method retains whole-number distributions whose grade proportions remain unchanged; but on the other hand, with seven classes of roughly 30 students each, that least common multiple could easily be in the billions, so this approach is computationally impractical. Instead, we have chosen to scale each distribution so that the sum of its components equals the maximum of the seven original s-values; this of course produces non-integer distributions, and so in order to use Theorem 4, we then round each component up or down, while respecting the original proportions as much as possible, until all distributions share the same s-value. In our case, the common s-value for the µ_i is 31, and the common s-value for the ν_i is 33.

We list the resulting distributions µ_1, . . . , µ_7 (Spring 2019) and ν_1, . . . , ν_7 (Fall 2019), with the letter grades in descending order from A to F. [Table of the fourteen rescaled grade distributions.] Computing the discrete EMD of each semester's seven sections via Theorem 4, we obtain

EMD^{31}_7(µ) = 49   and   EMD^{33}_7(ν) = 56.

Dividing by s in each case to normalize (so that µ and ν now denote (1/s)µ and (1/s)ν), we have

EMD_7(µ) ≈ 1.58   and   EMD_7(ν) ≈ 1.70,

and finally dividing by ⌊d/2⌋(n − 1) = 3 · 4 = 12, we obtain

\widehat{\mathrm{EMD}}_7(µ) ≈ 0.13   and   \widehat{\mathrm{EMD}}_7(ν) ≈ 0.14.

To compare these results to the expected value, we code the recursive definition in Theorem 7; then upon unit normalizing as in (19), we compute \widehat{E}_{(5^7)}. From this very limited data set, at least, it is clear that not only are the EMD values extremely consistent between the two semesters (within 0.01 of each other), but also they are significantly less than the expected value — less than half the expected value, in fact. We should not be too surprised by this, of course, since college grades are (hopefully) not assigned at random. It will be interesting to track the EMD for multiple courses in multiple semesters, with the goal of performing some informative cluster analysis on the results.
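The rescaling-and-rounding step described above can be carried out with a largest-remainder rule. The sketch below is our own illustration of one standard way to do it; the paper does not specify the exact rounding scheme it used, and the sample class in the final line is hypothetical.

```python
def rescale(comp, target):
    """Scale a composition to sum to `target`, then round to integers while
    keeping the original proportions as close as possible (largest-remainder rounding)."""
    s = sum(comp)
    exact = [c * target / s for c in comp]
    floors = [int(x) for x in exact]
    shortfall = target - sum(floors)
    # Hand out the remaining units to the bins with the largest fractional parts.
    order = sorted(range(len(comp)), key=lambda i: exact[i] - floors[i], reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return tuple(floors)

# A hypothetical section of 28 students rescaled to the common value s = 31.
print(rescale((7, 10, 6, 3, 2), 31))   # (8, 11, 7, 3, 2)
```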
7. Connections to algebraic geometry and representation theory

In this section, we show that our generating function H_n from above is the Hilbert series of the Segre embedding from algebraic geometry. In the d = 2 case, this embedding is a determinantal variety, whose structure as an infinite-dimensional su(p, q)-module we can describe in the context of the EMD.

7.1. A determinantal variety.
In this subsection we let d = 2, and we will write p, q in place of the usual n = (n_1, n_2). In this case, as indicated in [4], the series H_{p,q} := H_{p,q}(1, t), from (10), is in fact the Hilbert series of the determinantal variety

D^{\leq 1}_{p,q} := \{ M \in M_{p,q}(\mathbb{C}) \mid \operatorname{rank} M \leq 1 \}

consisting of complex p × q matrices with rank at most 1. To see this, it suffices to show that given a nonnegative integer s, the number of elements in C(s, p) × C(s, q) equals the dimension of C[D^{≤1}_{p,q}]_s, the space of homogeneous degree-s polynomial functions on D^{≤1}_{p,q}. To this end, let w_{ij} be the coordinate functions on a generic p × q matrix. Since all 2 × 2 minors vanish on D^{≤1}_{p,q}, we observe that C[D^{≤1}_{p,q}] ≅ C[w_{ij}]/I, where I is the determinantal ideal generated by the quadratics of the form w_{ij} w_{i′j′} − w_{ij′} w_{i′j} for i < i′ and j < j′. It follows that a basis for C[D^{≤1}_{p,q}]_s is given by the set of monomials

B = \left\{ \prod_{k=1}^{s} w_{i_k, j_k} \;\middle|\; \{(i_k, j_k)\} \text{ is a chain in } [p] \times [q] \right\}.

Now we have an obvious bijective correspondence between B and the set J^s_{p,q} (borrowing notation from Section 4 to denote p × q arrays with support in a chain):

(20)   \prod_{i,j} w_{ij}^{J_{ij}} \longleftrightarrow J \in \mathcal{J}^s_{p,q}.

But J^s_{p,q} is in bijective correspondence with C(s, p) × C(s, q), as is clear from a slight generalization of our RSK correspondence in (5). This proves our claim that H_{p,q} is the Hilbert series of D^{≤1}_{p,q}.

7.2. The Segre embedding.
We extend the previous result to d > 2, and so as before we fix n = (n_1, . . . , n_d). The specialization H_n := H_n(1, t) of our generating function from (10) happens also to be the Hilbert series of the Segre embedding:

P(\mathbb{C}^{n_1}) \times \cdots \times P(\mathbb{C}^{n_d}) \hookrightarrow P(\mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}), \qquad \left( \langle v^{(1)} \rangle, \ldots, \langle v^{(d)} \rangle \right) \mapsto \langle v^{(1)} \otimes \cdots \otimes v^{(d)} \rangle.

(See [13] and [20].) That is, H_n is the Hilbert series of the simple tensors. (In the case d = 2, the set of simple tensors in C^p ⊗ C^q can be identified with the determinantal variety D^{≤1}_{p,q}, coinciding with the previous subsection.)

To sketch this generalization of the d = 2 case, which we presented in detail above, we let m range over all multi-indices (m_1, . . . , m_d) ∈ [n_1] × · · · × [n_d], as in the proof of Proposition 6. Now consider a simple tensor v^{(1)} ⊗ · · · ⊗ v^{(d)}. We can expand this tensor in the standard basis as

\sum_{m} \underbrace{\left( v^{(1)}_{m_1} \cdots v^{(d)}_{m_d} \right)}_{w_m}\; e_{m_1} \otimes \cdots \otimes e_{m_d},

where v^{(k)}_ℓ is the ℓth coordinate of the vector v^{(k)}; the w_m are coordinate functions. For any two multi-indices m and m′, we see that the quadratic w_m w_{m′} is invariant under the exchange of indices component-wise between m and m′. Intuitively, then, we can again mod out by the determinantal ideal generated by all such 2 × 2 minors, just as in the d = 2 case above.

The upshot is that a basis for the coordinate ring of the simple tensors is given by those monomials w_{m_1} · · · w_{m_s} such that the set of multi-indices {m_i} forms a chain. We therefore have the generalization of (20), and so we conclude that H_n is the Hilbert series of the simple tensors. (Technically, of course, the Segre embedding is contained in the projectivization of the simple tensors, so we are sweeping a fair amount under the rug here.)

7.3. Representation theory.
Returning to the d = 2 case, the coordinate ring C[D^{≤1}_{p,q}] is an infinite-dimensional representation of the Lie algebra su(p, q), known as the first Wallach representation. (See [9].) We will show how the action on this representation corresponds to manipulating the two compositions in our EMD^s_2 setting.

Consider the polynomial ring C[x, y] := C[x_1, . . . , x_p, y_1, . . . , y_q]. On one hand, C[x, y] admits an action of GL_1(C) (which is just the multiplicative group of nonzero complex numbers), via

(g · f)(x, y) = f(g^{-1} x,\; g y)

for g ∈ GL_1(C) and f ∈ C[x, y]. From now on let G = GL_1(C). Note that the invariants under the G-action are those polynomials in which the degree of each term is the same with respect to x as it is with respect to y; in other words, C[x, y]^G is generated by the monomials x_i y_j. (This is a special case of the First Fundamental Theorem of Invariant Theory; see [12], Section 5.2.1.) But since the kernel of the ring homomorphism w_{ij} ↦ x_i y_j is precisely the determinantal ideal I, we have C[x, y]^G ≅ C[D^{≤1}_{p,q}]. (This is a special case of the Second Fundamental Theorem of Invariant Theory; see [12], Lemma 5.2.4.)

As a result of Howe duality in Type A — the delicate details of which are expounded in [15] and [16] — the space C[x, y] is also a module under the action of the Lie algebra su(p, q) by differential operators. Upon complexification, this gives rise to an action by gl_{p+q}(C), which as a set is just the (p + q) × (p + q) complex matrices. In particular, the invariant subring C[x, y]^G ≅ C[D^{≤1}_{p,q}] is the irreducible, infinite-dimensional gl_{p+q}-module with highest weight (\underbrace{-1, \ldots, -1}_{p}, 0, \ldots, 0). The action of gl_{p+q} is given by differential operators on C[x, y], of the following four forms (see [12], Section 5.6):

(1) x_i ∂/∂x_j (Euler operators; technically the action includes the extra term +δ_{ij});
(2) y_i ∂/∂y_j (Euler operators);
(3) ∂²/∂x_i ∂y_j ("raising operators");
(4) x_i y_j ("lowering operators").

Note that all these operators preserve the difference between the degree with respect to x and the degree with respect to y. Therefore the gl_{p+q}-action preserves C[x, y]^G, which we observed is generated by the elements x_i y_j.

This gl_{p+q}-action can be described in terms of our EMD setting in this paper. First, observe that any degree-s monic monomial in C[x, y]^G corresponds uniquely to an ordered pair of compositions (µ, ν) ∈ C(s, p) × C(s, q), via

(µ, ν) \longleftrightarrow x^\mu y^\nu := x_1^{\mu(1)} \cdots x_p^{\mu(p)}\; y_1^{\nu(1)} \cdots y_q^{\nu(q)}.

This is no surprise, of course, since this is just the RSK correspondence we used earlier; written out in all of its guises, we have

(µ, ν) \longleftrightarrow x^\mu y^\nu \longleftrightarrow J_{(\mu,\nu)} \in \mathcal{J}^s_{p,q} \longleftrightarrow \prod_{i,j} w_{ij}^{J_{ij}}.

Now we can see how each type (1)–(4) of differential operator has an interpretation in the EMD context. Consider the monomial x^µ y^ν as defined above.
Then, up to scaling by coefficients, we observe the following:

(1) The Euler operator x_i ∂/∂x_j corresponds to moving 1 unit in µ, from bin j to bin i, since the exponent of x_j decreases by 1 and the exponent of x_i increases by 1.
(2) The Euler operator y_i ∂/∂y_j corresponds to moving 1 unit in ν, from bin j to bin i.
(3) The raising operator ∂²/∂x_i ∂y_j corresponds to removing 1 unit from each composition: from bin i in µ and from bin j in ν.
(4) The lowering operator x_i y_j corresponds to adding 1 unit to each composition: to bin i in µ and to bin j in ν.

It will be interesting to study further whether this connection to representation theory might be exploited in existing applications of EMD_2.

8. Proof of Proposition 2
The methods in this paper depended heavily upon the fact that we need consider only thosearrays J whose support is a chain. This followed from the statement in Proposition 2 — yet to beproved — that our cost array C has the Monge property. Before proving this here, we state threeuseful lemmas, the first of which is proved in [2] and [21]: Lemma 8. An n × · · · × n array A has the Monge property if and only if every two-dimensionalplane of A has the Monge property. To make this explicit, we choose any two distinct indices i, j from { , . . . , d } , and then fix theremaining d − m , . . . , m i − , m i +1 , . . . , m j − , m j +1 , . . . , m d ∈ [ n ]. Thenwe will write m i,jk,(cid:96) := ( m , . . . , m i − , k, m i +1 , . . . , m j − , (cid:96), m j +1 , . . . , m d ). In other words, m i,jk,(cid:96) isthe vector in which the i th coordinate is k , the j th coordinate is (cid:96) , and the remaining coordinatesare the fixed values m , . . . , m d . Now we can naturally define the two-dimensional subarray A i,j inwhich(21) A i,j ( k, (cid:96) ) := A (cid:16) m i,jk,(cid:96) (cid:17) . Then Lemma 8 states that A has the Monge property if and only if A i,j has the Monge property forevery choice of distinct i and j .This reduction to the two-dimensional case is extremely useful because of the following charac-terization of two-dimensional Monge arrays, proved in [21]: Lemma 9.
This reduction to the two-dimensional case is extremely useful because of the following characterization of two-dimensional Monge arrays, proved in [21]:

Lemma 9. Let A be an n × n array. Then A has the Monge property if and only if

A(k, ℓ) + A(k + 1, ℓ + 1) ≤ A(k + 1, ℓ) + A(k, ℓ + 1)

for all k, ℓ ∈ [n − 1].

In other words, choose a position (k, ℓ) and then consider the 2 × 2 subarray consisting of A(k, ℓ) and its three neighbors to the east, south, and southeast. The condition displayed in the lemma means that the sum of the upper-left and lower-right entries must never be greater than the sum of the lower-left and upper-right entries.
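This condition is easy to test computationally. The following minimal sketch (ours, purely for illustration) checks every adjacent 2 × 2 block of a two-dimensional array, and runs the check on the classical d = 2 cost array C(k, ℓ) = |k − ℓ|:

```python
def is_monge_2d(A):
    """Lemma 9: A[k][l] + A[k+1][l+1] <= A[k+1][l] + A[k][l+1] for every adjacent 2 x 2 block."""
    rows, cols = len(A), len(A[0])
    return all(A[k][l] + A[k + 1][l + 1] <= A[k + 1][l] + A[k][l + 1]
               for k in range(rows - 1) for l in range(cols - 1))

# The d = 2 cost array C(k, l) = |k - l| on [n] x [n] is Monge:
n = 6
C2 = [[abs(k - l) for l in range(1, n + 1)] for k in range(1, n + 1)]
print(is_monge_2d(C2))   # True
```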
We will need one final lemma, specific to the cost function C in this paper. Recall from Proposition 1 that if we let m̃ denote the vector m with its coordinates rearranged in ascending order, then

C(m) = −m̃_1 − · · · − m̃_{⌊(d+1)/2⌋} + m̃_{⌊(d+1)/2⌋+1} + · · · + m̃_d   (d even)

or

C(m) = −m̃_1 − · · · − m̃_{⌊(d+1)/2⌋−1} + m̃_{⌊(d+1)/2⌋+1} + · · · + m̃_d   (d odd).

The index ⌊(d+1)/2⌋ gave a kind of "median" of the coordinates in m; from now on, however, we will work instead with M := ⌊(d+1)/2⌋ + 1 = ⌈(d+2)/2⌉. Intuitively, this index M gives the next-greatest coordinate after the "median." The picture is the following, where the vertical lines divide the coordinates into two equal sets (with one leftover coordinate in the middle if d is odd):

d even: m̃ = (m̃_1, . . . , m̃_{M−1} | m̃_M, . . . , m̃_d)
d odd:  m̃ = (m̃_1, . . . , m̃_{M−2} | m̃_{M−1} | m̃_M, . . . , m̃_d)

With this indexing in mind, we state our final lemma, which records the effect on C(m) of adding 1 to a single coordinate m_i. Recall from earlier that e(i) denotes the vector whose coordinates are all 0 except for a 1 in the ith component.

Lemma 10. Adding 1 to a single coordinate m_i of m has one of three effects on C(m): it either increases by 1, decreases by 1, or remains the same. The effect depends on the value of m_i relative to the other coordinates of m:

(1) C(m + e(i)) = C(m) + 1 if m_i ≥ m̃_M.
(2) C(m + e(i)) = C(m) − 1 if either
    (a) d is even and m_i < m̃_M; or
    (b) d is odd and m_i < m̃_{M−1}.
(3) C(m + e(i)) = C(m) if d is odd and m_i = m̃_{M−1} < m̃_M.
Proof. We prove each of the three cases; the reader may find it helpful to keep an eye on the two possible "pictures" of m̃ displayed before this lemma, along with the two possible sums for C(m) displayed just before that.

(1) Assume m_i ≥ m̃_M. Then m_i + 1 > m̃_M, and so in the sum defining C(m), we must have positive m_i replaced by positive (m_i + 1). Hence C(m) has increased by 1.

(2) (a) Assume d is even and m_i < m̃_M. Then m_i + 1 ≤ m̃_M, and so in the sum defining C(m), we must have negative m_i replaced by negative (m_i + 1). Hence C(m) has decreased by 1.
    (b) Assume d is odd and m_i < m̃_{M−1}. Then m_i + 1 ≤ m̃_{M−1}, and so we must have negative m_i replaced by negative (m_i + 1). Hence C(m) has decreased by 1.

(3) Assume d is odd and m_i = m̃_{M−1} < m̃_M; note that m̃_{M−1} does not appear in the sum defining C(m). Then m̃_{M−1} < m_i + 1 ≤ m̃_M, and so m_i + 1 still does not appear in the sum defining C(m + e(i)). Hence C(m) remains unchanged. □
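Lemma 10 is also easy to confirm by brute force for small d and n. The following sketch (ours, purely as a sanity check) transcribes the formula for C(m) displayed above and tests all three cases:

```python
import itertools

def cost(m):
    """C(m): the sum of the largest floor(d/2) coordinates minus the sum of the
    smallest floor(d/2) coordinates (the median is omitted when d is odd)."""
    s, d = sorted(m), len(m)
    half = d // 2
    return sum(s[d - half:]) - sum(s[:half])

n = 4
for d in (2, 3, 4, 5):
    M = (d + 1) // 2 + 1                               # the index M (1-based)
    for m in itertools.product(range(1, n + 1), repeat=d):
        s = sorted(m)                                   # s[M - 1] is m~_M, s[M - 2] is m~_{M-1}
        for i in range(d):
            bumped = list(m)
            bumped[i] += 1
            delta = cost(bumped) - cost(m)
            if m[i] >= s[M - 1]:                        # case (1)
                assert delta == 1
            elif d % 2 == 0 or m[i] < s[M - 2]:         # cases (2a) and (2b)
                assert delta == -1
            else:                                       # case (3): d odd, m[i] = m~_{M-1} < m~_M
                assert delta == 0
print("Lemma 10 verified for d = 2, ..., 5 and n =", n)
```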
We are now ready for the proof, in which we show that an arbitrary two-dimensional subarray of C has the Monge property.

Proof of Proposition 2. Let i, j be two distinct indices in {1, . . . , d}. Fix the remaining coordinates m_1, . . . , m_d as above, and let C^{i,j} be the corresponding two-dimensional subarray of C defined in (21). Now let m_i, m_j ∈ [n − 1]. By Lemma 9, it suffices to show that

C^{i,j}(m_i, m_j) + C^{i,j}(m_i + 1, m_j + 1) ≤ C^{i,j}(m_i + 1, m_j) + C^{i,j}(m_i, m_j + 1).

But this condition can be rewritten as the following, where we simply write m for m^{i,j}_{m_i, m_j}:

(22) C(m) + C(m + e(i) + e(j)) ≤ C(m + e(i)) + C(m + e(j)).

To show that this condition holds true, we need to examine six possible cases, depending on whether adding 1 to m_i and m_j (independently) causes C to increase, decrease, or remain the same. Since i and j play symmetric roles in (22), the six cases are:

Case 1: both additions increase C by 1.
Case 2: the addition to m_i increases C by 1, and the addition to m_j decreases C by 1.
Case 3: the addition to m_i increases C by 1, and the addition to m_j leaves C unchanged.
Case 4: both additions decrease C by 1.
Case 5: the addition to m_i decreases C by 1, and the addition to m_j leaves C unchanged.
Case 6: both additions leave C unchanged.

In each case below, all simplifications are directly justified by the results in Lemma 10.
• Case 1: In this case, the right-hand side of (22) is 2 · C(m) + 2. For the left-hand side, we know in general that C(m + e(i) + e(j)) = C((m + e(i)) + e(j)), which by Lemma 10 can be no greater than C(m) + 2. Hence the inequality in (22) must hold.
• Case 2: In this case, the right-hand side of (22) is 2 · C(m). As for the second term on the left-hand side, by Lemma 10 we must have m_i ≥ m̃_M; meanwhile, m_j is strictly less than either m̃_M (if d is even) or m̃_{M−1} (if d is odd), and so neither inequality is affected by adding 1 to m_i. Therefore we have

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) − 1 = C(m) + 1 − 1 = C(m).

Hence we have an equality in (22).
• Case 3: Similar to Case 2, the two additions are independent of each other. The right-hand side of (22) is 2 · C(m) + 1. In this case, we must have d odd; also, m_i ≥ m̃_M, along with m_j = m̃_{M−1} < m̃_M. Then

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) = C(m) + 1.

Again we obtain an equality in (22).
• Case 4: The right-hand side of (22) is 2 · C(m) − 2. If d is even, then both m_i and m_j are strictly less than m̃_M, and if d is odd, then both are strictly less than m̃_{M−1}. Either way, after adding 1 to m_i, the same inequality still holds for m_j, and so again we have

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) − 1 = C(m) − 1 − 1 = C(m) − 2,

and we get an equality in (22).
• Case 5: The right-hand side of (22) is 2 · C(m) − 1. In this case, d must be odd, with m_i < m̃_{M−1} = m_j < m̃_M. After adding 1 to m_j, we still have m_i less than the (M − 1)th component in the new rearranged vector, and so the effects of the two additions are independent. We obtain

C(m + e(i) + e(j)) = C((m + e(j)) + e(i)) = C(m + e(j)) − 1 = C(m) − 1,

and so once again we have an equality in (22).
• Case 6: This is the slightly surprising case, in which the two additions are not independent of each other. The right-hand side of (22) is 2 · C(m), and we know that d must be odd, with m_i = m_j = m̃_{M−1} < m̃_M. After adding 1 to m_i, we obtain a vector m′ in which m′_j = m_j is now strictly less than m̃′_{M−1}, and so now adding 1 to m_j results in an overall decrease by 1. Hence we have

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) − 1 = C(m) − 1.

Hence the left-hand side of (22) is less than the right-hand side, and the condition is still satisfied.

We have exhausted all possible cases, and so since (22) holds in each of them, the two-dimensional array C^{i,j} has the Monge property. Since i and j were arbitrary, every two-dimensional subarray of C has the Monge property, and so by Lemma 8, we conclude that C itself has the Monge property. □
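As a computational cross-check of Proposition 2 (ours, not part of the proof), the following sketch verifies inequality (22) on every two-dimensional plane of the cost array, exactly as prescribed by Lemmas 8 and 9, for small values of d and n:

```python
import itertools

def cost(m):
    """C(m) from Proposition 1: the largest floor(d/2) coordinates counted positively,
    the smallest floor(d/2) counted negatively, the median omitted when d is odd."""
    s, d = sorted(m), len(m)
    half = d // 2
    return sum(s[d - half:]) - sum(s[:half])

def has_monge_property(d, n):
    """Check (22) on every plane C^{i,j} of the d-dimensional cost array (Lemmas 8 and 9)."""
    for i, j in itertools.combinations(range(d), 2):
        for rest in itertools.product(range(1, n + 1), repeat=d - 2):
            def entry(k, l):
                m = list(rest)
                m.insert(i, k)      # put k in coordinate i and l in coordinate j, as in (21)
                m.insert(j, l)
                return cost(m)
            for k, l in itertools.product(range(1, n), repeat=2):
                if entry(k, l) + entry(k + 1, l + 1) > entry(k + 1, l) + entry(k, l + 1):
                    return False
    return True

print(all(has_monge_property(d, 4) for d in (2, 3, 4, 5)))   # True
```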
References

1. M. Abramson, A simple solution of Simon Newcomb's problem, J. Combin. Theory Ser. A (1975), 223–225.
2. A. Aggarwal and J.K. Park, Sequential searching in multidimensional monotone arrays, Research Report RC 15128, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1989.
3. W. Bein, P. Brucker, J. Park, and P. Pathak, A Monge property for the d-dimensional transport problem, Discrete Appl. Math. (1995), no. 2, 97–109.
4. R. Bourn and J. Willenbring, Expected value of the one-dimensional earth mover's distance, Algebr. Stat. (2020), no. 1, 53–78.
5. W. Bruns and J. Herzog, Cohen-Macaulay rings, Cambridge University Press, 1993.
6. F. Caselli, On the multivariate Robinson–Schensted correspondence, Bollettino dell'Unione Matematica Italiana (2009), no. 1, 591–602.
7. H. Çolakoğlu, On the distance formulae in the generalized taxicab geometry, Turkish J. Math. (2019), no. 3, 1578–1594.
8. J. Dillon and D. Roselle, Simon Newcomb's problem, SIAM J. Appl. Math. (1969), no. 6, 1086–1093.
9. T. Enright and J. Willenbring, Hilbert series, Howe duality, and branching for classical groups, Ann. of Math. (2004), no. 1, 337–375.
10. A. Frohmader and H. Volkmer, arXiv:1912.04945, 2019.
11. W. Fulton, Young tableaux, Cambridge University Press, 1997.
12. R. Goodman and N. Wallach, Symmetry, representations, and invariants, Springer, 2009.
13. J. Harris, Algebraic geometry: a first course, Springer-Verlag, 1995.
14. A. Hoffman, On simple linear programming problems, Convexity: Proceedings of the Seventh Symposium in Pure Mathematics of the AMS (V. Klee, ed.), American Mathematical Society, Providence, RI, 1963, pp. 317–327.
15. R. Howe, Remarks on classical invariant theory, Trans. Amer. Math. Soc. (1989), no. 2, 539–570.
16. R. Howe, E.-C. Tan, and J.F. Willenbring, Stable branching rules for classical symmetric pairs, Trans. Amer. Math. Soc. (2005), no. 4, 1601–1626.
17. J. Kline, Properties of the d-dimensional earth mover's problem, Discrete Appl. Math. (2019), 128–141.
18. J. Kretschmann, Earth mover's distance between grade distribution data with fixed mean, Master's thesis, University of Wisconsin–Milwaukee, 2020.
19. G. Monge, Mémoire sur la théorie des déblais et des remblais, Histoire de l'Académie Royale des Sciences de Paris, 1781, pp. 666–704.
20. M. Morales, Segre embeddings, Hilbert series, and Newcomb's problem, HAL ID: hal-00839652, 2013.
21. J. Park, The Monge array: an abstraction and its applications, Ph.D. thesis, Massachusetts Institute of Technology, 1991.
22. C. Villani, Optimal transport, old and new, Springer, 2008.