A generalization for the expected value of the earth mover's distance
WILLIAM Q. ERICKSON
Abstract.
The earth mover's distance (EMD), also called the first Wasserstein distance, can be naturally extended to compare arbitrarily many probability distributions, rather than only two, on the set [n] = {1, . . . , n}. We present the details for this generalization, along with a highly efficient algorithm inspired by combinatorics; it turns out that in the special case of three distributions, the EMD is half the sum of the pairwise EMD's. Extending the methods of Bourn and Willenbring (2020), we compute the expected value of this generalized EMD on random d-tuples of distributions, using a generating function which coincides with the Hilbert series of the Segre embedding. We then use the EMD to analyze a real-world data set of grade distributions.

1. Introduction
We generalize a result appearing in [4], in which the authors compute the expected value of the earth mover's distance (EMD) between two probability distributions, by means of a generating function. The EMD can be viewed as the solution to a problem in transport theory, first considered in [19] by French geometer Gaspard Monge in 1781. (Although the term "earth mover" seems to have been coined only in the 1990s, it is pointed out in Villani's monumental reference [22] that the title of Monge's original treatise translates, more or less, as "On the theory of material extracted from the earth and input to a new construction." Monge, then, truly was the original earth mover.) Nearly 200 years later, in [14], Monge's name was given to a critical property of certain cost arrays for which his problem can be solved by a greedy algorithm. Throughout the late 1980s and 1990s, in [2] and [3], this Monge property was generalized to higher-dimensional arrays. (See also [17] for a more recent treatment.) It is an essential fact in this paper (whose proof is reserved for Section 8 at the end) that the d-dimensional cost array associated with our EMD_d has this Monge property.

Section 2, written for those readers unfamiliar with the EMD, presents a simple example and points out all the relevant details which will reappear in our generalization. We begin in Section 3 by defining an earth mover's "distance" EMD_d between d distributions; the classical EMD treated in [4] coincides with EMD_2. We actually find that on three distributions, EMD_3 equals half the sum of the three pairwise EMD_2 values, although no such relationship holds for d > 3. In Section 4 we pass to a discrete version of EMD_d which compares histograms instead of probability distributions, and we describe an efficient computational method using a generalization of the RSK correspondence from combinatorics. In Section 5, we encode the values of the discrete EMD_d in a generating function, which we manipulate in order to extract the expected value. Translating this discrete result back into the continuous setting, we prove the main theorem of this paper (Theorem 7), which is a recursive formula to compute the expected value of EMD_d. We then apply our theory in Section 6 to analyze a real-world data set of grade distributions. Finally, in Section 7, we mention a connection between our generating function and the Segre embedding in algebraic geometry. We also exhibit a certain infinite-dimensional representation of the Lie algebra su(p, q), whose action corresponds to manipulating the distributions compared by our EMD.

Mathematics Subject Classification.
Primary 13P25, 05E14; Secondary 05E40.
Since the appearance of [4], the problem of finding the expected value of EMD_2 has been solved from an analytical approach in [10]. The setup has also been specialized in [18] to a data set of distributions with a fixed average value.

We believe that the result in this paper — a method to evaluate the "closeness" of arbitrarily many distributions — has great potential as a tool in data analysis. In evaluating teaching and assessment practices at the university level, for instance, we can now assign a single value to an entire course by evaluating the EMD between the individual sections, and then track the behavior of that course's EMD for different groups of instructors, different course coordinators, fall vs. spring semesters, and other variables. We can even assign EMD values to individual exams and other assessments using the grade distributions in various sections; or in the other direction, we can compare different courses to each other, both within and outside a given department. In all of these settings, we believe that the generalized EMD can contribute to an interesting cluster analysis of the kind proposed in [4].

Acknowledgements:
The author would like to thank Rebecca Bourn and Jeb Willenbring, the authors of [4], for the conversations about their original paper. Jeb's observations about the connections to representation theory were especially vital to Section 7.3.
2. EMD between two distributions: summary and an example
For readers unfamiliar with the classical EMD, we summarize the idea here. Consider two probability distributions on the finite set of integers [n] = {1, . . . , n}. (More vividly, in place of a "probability distribution," imagine n bins of earth whose combined mass is one unit, located at 1, . . . , n on the number line.) Intuitively, the EMD between the two distributions measures the "cheapest" cost of moving earth between the bins so as to equalize the distributions, where the "cost" of moving one unit of earth is the distance of the move. For example, the cost of moving 0.25 units of earth from bin 2 to bin 5 is 0.25 · (5 − 2) = 0.75. To make this precise, we define the cost function C : [n] × [n] → Z_{≥0}, where C(i, j) is the cost of moving one unit of earth from bin i to bin j. In this case, clearly C(i, j) = |i − j|.

Any solution which equalizes the two distributions — whether or not it is the optimal solution — can be encoded in an n × n matrix J. Necessarily, the row sums of J will correspond to the first distribution, and the column sums to the second, so the entries of J must sum to 1.

We present a brief example to show how the entries of J give (possibly ambiguous, but equivalent) step-by-step instructions to equalize the two distributions. The procedure we give here is not the most direct (see Section 2 of [4]), but it will provide the best intuition when we generalize to d distributions in the next section. The less-than-rigorous descriptions below will be formalized in the next section in terms of the taxicab metric.

Example.
Consider the two distributions µ_1 = (0.3, 0.3, 0.4) and µ_2 = (0.1, 0, 0.9), so that n = 3. Then one matrix (among infinitely many) with the prescribed row and column sums is

J = [ 0.10  0  0.20 ]
    [ 0     0  0.30 ]
    [ 0     0  0.40 ].

The nonzero entries of J correspond to moving earth as follows:

• J_{1,1} = 0.1. Note that the coordinates (1, 1) are already equal to each other, so we do not have to move the 0.1 at all.
• J_{1,3} = 0.2. Now the coordinates (1, 3) are not equal; in order to make them equal with as little cost as possible, we have three valid options, all of which have cost 2:
  – In the first coordinate, we could add 2 to make the change 1 → 3. This corresponds to moving the 0.2, in µ_1, from bin 1 to bin 3.
  – In the second coordinate, we could subtract 2 to make the change 3 → 1. This corresponds to moving the 0.2, in µ_2, from bin 3 to bin 1.
  – We could add 1 to the first coordinate (1 → 2) and subtract 1 from the second coordinate (3 → 2). This corresponds to moving 0.2 in µ_1 from bin 1 to bin 2, and then moving 0.2 in µ_2 from bin 3 to bin 2.
• J_{2,3} = 0.3. The cheapest ways to equalize the coordinates (2, 3) are the following two options, each with cost 1:
  – In the first coordinate, we could add 1 to make the change 2 → 3. This corresponds to moving the 0.3, in µ_1, from bin 2 to bin 3.
  – In the second coordinate, we could subtract 1 to make the change 3 → 2. This corresponds to moving the 0.3, in µ_2, from bin 3 to bin 2.
• J_{3,3} = 0.4. Since the coordinates (3, 3) are already equal, we do not have to move the 0.4 at all.

Depending on which options we choose above, we may pass through different intermediate distributions µ′_1 and µ′_2. But within each possible pair, as the reader can check, we always finish with µ′_1 = µ′_2, as desired. Furthermore, the total cost of all the earth moved is independent of our choices, since all options above minimized the cost at each step. (Also note that the cost at each step was always equal to |i − j|, coinciding with the cost function C we defined earlier.) In this case, the total cost of the earth moved was

0.1(0) + 0.2(2) + 0.3(1) + 0.4(0) = 0.7.
The EMD between µ_1 and µ_2 is, by definition, the infimum (actually the minimum) of the set of total costs, taken over all possible matrices J with the prescribed row and column sums. In this example, although not obvious at first glance, 0.7 is indeed the minimum possible total cost, and so EMD(µ_1, µ_2) = 0.7. This turns out to be a consequence of the fact that the support of J lies in a chain: in other words, if we put the product order ⪯ on [n] × [n], we see that (1, 1) ⪯ (1, 3) ⪯ (2, 3) ⪯ (3, 3) is a chain in [n] × [n]. This fact — that support in a chain implies minimality — is equivalent, on a deeper level (see [14]), to the fact that our cost function C, if considered as an n × n array, has the "Monge property" alluded to in the introduction; in this case, the greedy algorithm to solve the earth mover's problem (known as the "northwest corner rule"; see [3]) eliminates one row or column at each step, meaning the support of the solution matrix J is always a chain.

There is one phenomenon here in the d = 2 case which will not generalize to d > 2: in the above example, we could have removed any ambiguity by deciding that we would move earth within µ_1 exclusively, so that both final distributions would equal µ_2. Therefore, we could interpret the problem as finding the cheapest way to transport material from a "source" or "supply vector" (µ_1) to a "sink" or "demand vector" (µ_2). For d > 2, however, the optimal solution at each step may require moving earth in any or all of the distributions, and so we lose the binary supply-demand interpretation of the problem.

Having presented the big picture, without details, in the d = 2 case, we now proceed to build up the general case for arbitrary d. Throughout the next section, the reader can verify that the definitions and results coincide with those found in this simple example where d = 2.

3. Extending EMD to d distributions

3.1. Definitions and notation.
Let P_n denote the set of probability distributions on [n]. Assume the uniform probability measure on the d-fold product P_n × · · · × P_n, defined by its embedding into R^{dn}. Our goal is to compare an arbitrary number of elements of P_n, written as the d-tuple µ := (µ_1, . . . , µ_d). We should keep in mind that each µ_i is itself an n-tuple whose components sum to 1. Throughout this paper, we write the sum of a vector's components using absolute value bars, so in this case, |µ_i| = 1. We will denote the kth component of µ_i by µ_i(k), which is just the value of the distribution µ_i at k ∈ [n]. To each µ there corresponds the set J_µ of joint distribution arrays, defined as follows.

For an array J, we will write J(m_1, . . . , m_d) for the entry at position (m_1, . . . , m_d). Now, we define J_µ as the set containing all those arrays J ∈ R^{n×···×n}_{≥0} whose sums within the coordinate hyperplanes coincide with µ. Specifically, fixing m_i = k, we must have

(1)   \sum_{m_1, \ldots, \widehat{m_i}, \ldots, m_d = 1}^{n} J(m_1, \ldots, \underbrace{k}_{m_i}, \ldots, m_d) = \mu_i(k).

In other words, summing all the entries whose positions in the array have k as their ith coordinate, we obtain the kth component of µ_i. In the familiar d = 2 case, i = 1 gives us the row sums, and i = 2 the column sums. For d = 3, see Figure 1 for an illustration.

Figure 1.
An illustration of the conditions in equation (1), in the case where d = 3 and n = 4. Given some µ = (µ_1, µ_2, µ_3), every array in J_µ satisfies the above relations, where each arrow represents the sum of the entries in the designated plane.

Any array J ∈ J_µ can be thought of as a solution to the earth mover's problem for n bins, determined by the distributions in µ. This means we need a d-dimensional analog of the "cost" function from Section 2, and the natural candidate arises from the taxicab metric on [n]^d. Specifically, for each position in a d-dimensional array, we want the associated cost to equal the taxicab distance to the main diagonal, i.e., to the nearest position in the array whose coordinates are all equal. (This "equality of coordinates" property of the main diagonal, as we recall from Section 2, corresponded to zero earth being moved.) Roughly speaking, this cost is the fewest number of ±1 changes to individual coordinates required to make all the coordinates equal. For example, the cheapest way to equalize the coordinates of the position (5, 4, 5, 5, 5, 7, 5) is to add 1 to the 4, and then to subtract 2 from the 7, for a total cost of 3. This is precisely the taxicab distance to the main diagonal, specifically to the position (5, 5, 5, 5, 5, 5, 5). In terms of moving earth:

• When we added 1 to the 2nd coordinate to make the change 4 → 5, we moved a unit of earth in the 2nd distribution µ_2 from bin 4 to bin 5.
• When we subtracted 2 from the 6th coordinate to make the change 7 → 5, we moved a unit of earth in the 6th distribution µ_6 from bin 7 to bin 5.

Example.
Consider three distributions µ_1, µ_2, µ_3 ∈ P_3, so that n = 3. Then one array in J_µ is, for instance, an array of the form

(2)   J = [ a, 0, 0 ] [ 0, 0, 0 ] [ 0, 0, 0 ]
          [ 0, 0, 0 ] [ b, 0, 0 ] [ 0, 0, 0 ]
          [ 0, 0, 0 ] [ c, 0, 0 ] [ 0, e, f ],

flattened so that the first coordinate specifies the row, the second coordinate specifies one of the three main columns, and the third coordinate specifies the position inside the triple at that position. The nonzero entries are

J(1, 1, 1) = a,   J(2, 2, 1) = b,   J(3, 2, 1) = c,   J(3, 3, 2) = e,   J(3, 3, 3) = f,

and the marginal conditions (1) force µ_1 = (a, b, c + e + f), µ_2 = (a, b + c, e + f), and µ_3 = (a + b + c, e, f). This information tells us how to arrive at the solution corresponding to J:

• The cost of (1, 1, 1) is 0 since it is already on the main diagonal, so we do not move the amount a at all.
• The cost of (2, 2, 1) is 1, since in the 3rd coordinate we must make the change 1 → 2. In the 3rd distribution µ_3, we move the amount b from bin 1 to bin 2, obtaining an updated distribution µ′_3.
• The cost of (3, 2, 1) is 2, since we equalize the coordinates most efficiently by subtracting 1 from the 1st coordinate (3 → 2) and adding 1 to the 3rd coordinate (1 → 2). This corresponds to moving the amount c from bin 3 to bin 2 in µ_1, and from bin 1 to bin 2 in µ_3, updating µ′_1 and µ′_3.
• The cost of (3, 3, 2) is 1, by adding 1 to the 3rd coordinate. This corresponds to moving the amount e from bin 2 to bin 3 in µ_3, again updating µ′_3.
• The cost of (3, 3, 3) is 0, so we do not move the amount f at all. At this point µ′_1 = µ′_2 = µ′_3.

(Just as in Section 2, some steps above admit other equally cheap options, and therefore other sequences of intermediate distributions, all arriving at the solution prescribed by J. But of course in each case the total cost is the same.) The natural computation now is to find that total cost, by multiplying the amount of earth moved at each step by the number of bins it was moved; in other words, multiply each entry in J by the cost of its position, then add these products together:

a · 0 + b · 1 + c · 2 + e · 1 + f · 0.

This completes the example.

Of course, there is no guarantee that this is the least costly way to equalize the three distributions; this is simply the solution corresponding to one particular array J, and a different array in J_µ might give a different total cost. When we finally define our generalized EMD, it will be defined as the least possible cost for any J ∈ J_µ. First, however, we should record a formula for the cost of an array position, to improve upon the somewhat sloppy method by inspection we have used so far.

The formula for the d-dimensional taxicab distance from a point to a line is derived in [7]. In our case, the line of interest is the main diagonal, which passes through (1, . . . , 1) in the direction ⟨1, . . . , 1⟩. This distance, and therefore our cost function C, turns out to be

(3)   C(m_1, \ldots, m_d) = \min_{i \in [d]} \sum_{j \neq i} |m_i - m_j|.
This cost function C can also naturally be thought of as an n × · · · × n array, so we will occasionally refer to the "cost array" in this paper.

There is also a more direct way to compute C, which will be convenient later. Let m := (m_1, . . . , m_d), and let m̃ denote the vector whose components are those of m rearranged in ascending order; e.g., if m = (7, 3, 5, 5, 1), then m̃ = (1, 3, 5, 5, 7).

Proposition 1. Equation (3) can be computed as

C(m) = \sum_{i=1}^{\lfloor d/2 \rfloor} \left( \widetilde{m}_{d-i+1} - \widetilde{m}_i \right).

As an example before the proof, take m = (7, 3, 5, 5, 1) as above. Then by the proposition, to compute C(m), we instead look at m̃ = (1, 3, 5, 5, 7) and sum up the pairwise differences working outside-in: the largest minus the smallest gives 7 − 1 = 6, the second-largest minus the second-smallest gives 5 − 3 = 2 (the middle component is unused), and therefore C(m) = 6 + 2 = 8.

Proof.
For fixed i ∈ [d], we may assume that the coordinates of m are already arranged in ascending order (the sum below depends only on the multiset of coordinates), so that m_i = m̃_i. Then

\sum_{j \neq i} |m_i - m_j| = (\widetilde{m}_2 - \widetilde{m}_1) + 2(\widetilde{m}_3 - \widetilde{m}_2) + 3(\widetilde{m}_4 - \widetilde{m}_3) + \cdots + (i-1)(\widetilde{m}_i - \widetilde{m}_{i-1})
      + (\widetilde{m}_d - \widetilde{m}_{d-1}) + 2(\widetilde{m}_{d-1} - \widetilde{m}_{d-2}) + 3(\widetilde{m}_{d-2} - \widetilde{m}_{d-3}) + \cdots + (d-i)(\widetilde{m}_{i+1} - \widetilde{m}_i),

which is minimized when i = ⌊(d+1)/2⌋. Making this evaluation in the displayed sum, we find that the sum telescopes; when d is even, we obtain

-\widetilde{m}_1 - \widetilde{m}_2 - \cdots - \widetilde{m}_{\lfloor (d+1)/2 \rfloor} + \widetilde{m}_{\lfloor (d+1)/2 \rfloor + 1} + \cdots + \widetilde{m}_d,

and when d is odd, we obtain

-\widetilde{m}_1 - \widetilde{m}_2 - \cdots - \widetilde{m}_{\lfloor (d+1)/2 \rfloor - 1} + \widetilde{m}_{\lfloor (d+1)/2 \rfloor + 1} + \cdots + \widetilde{m}_d.

In either case, this simplifies as C(m) = \sum_{i=1}^{\lfloor d/2 \rfloor} (\widetilde{m}_{d-i+1} - \widetilde{m}_i). □
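To make Proposition 1 concrete, here is a minimal Python sketch (ours, not from the paper) that evaluates C(m) both directly from definition (3) and via the sorted, outside-in formula of the proposition, and checks that the two agree on a small grid. The function names and the grid size are our own choices.

```python
from itertools import product

def cost_def(m):
    """C(m) as in (3): minimize over i the sum of |m_i - m_j| over j != i."""
    return min(sum(abs(m[i] - m[j]) for j in range(len(m)) if j != i)
               for i in range(len(m)))

def cost_sorted(m):
    """C(m) as in Proposition 1: sort m, then add (largest - smallest),
    (second largest - second smallest), ..., for floor(d/2) pairs."""
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

# Check agreement on every point of the grid [1, n]^d for small n and d.
n, d = 4, 3
assert all(cost_def(m) == cost_sorted(m)
           for m in product(range(1, n + 1), repeat=d))
```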
Remark. A recent paper [17] defines a different cost function than ours for the earth mover's problem, namely C′(m) := max{m_i} − min{m_i}. We can see from Proposition 1 that C′ agrees with our C for d = 2 and d = 3, but not for d > 3. For example, letting m = (1, 1, 2, 2), we have C(m) = 2 but C′(m) = 1. For our purposes, we have chosen our C because it counts every earth-movement required to equalize the distributions. For example, keeping m = (1, 1, 2, 2), suppose that µ_1 = µ_2 = (1, 0) and µ_3 = µ_4 = (0, 1), so that the only admissible array is the one whose single nonzero entry lies at position m. Intuitively, we want the EMD of these four distributions to be 2, not 1, since we must first move a unit of earth by 1 bin, and then move another unit by 1 bin.

Having built up the necessary intuition and formulas, we are finally ready to make our main definition:

Definition.
Let µ be a d-tuple of probability distributions, as above. Then the generalized earth mover's distance is defined as

(4)   \mathrm{EMD}_d(\mu) := \min_{J \in \mathcal{J}_\mu} \sum_{m \in [n]^d} C(m)\, J(m).
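Definition (4) is a linear program over the arrays in J_µ, so for very small n and d it can be evaluated directly with an off-the-shelf LP solver. The sketch below is ours, not the paper's; it assumes NumPy and SciPy are available, sets up the marginal constraints (1) and the cost (3), and minimizes. It is intended only as a brute-force reference point (the paper's own methods are far more efficient).

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def cost(m):
    """Taxicab distance from position m to the main diagonal (Proposition 1)."""
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def emd_lp(mu):
    """EMD_d(mu) for a d-tuple of distributions on [n], via definition (4)."""
    d, n = len(mu), len(mu[0])
    positions = list(product(range(n), repeat=d))       # 0-indexed positions of J
    c = [cost(tuple(k + 1 for k in m)) for m in positions]
    A_eq, b_eq = [], []
    for i in range(d):                                   # marginal conditions (1)
        for k in range(n):
            A_eq.append([1.0 if m[i] == k else 0.0 for m in positions])
            b_eq.append(mu[i][k])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# The two-distribution example of Section 2, and a tiny three-distribution test:
print(emd_lp([(0.3, 0.3, 0.4), (0.1, 0.0, 0.9)]))   # 0.7
print(emd_lp([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]))  # 1.0
```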
3.2. Existence of a greedy algorithm.
As mentioned in the first two sections, finding the right-hand side of (4) is equivalent to finding the optimal solution to a d-dimensional transport problem. It is shown in [3] that there exists a greedy algorithm to find this solution in O(dn) time, precisely when the cost array C has the Monge property mentioned in the introduction:
Definition. A d-dimensional array A has the Monge property if, for all x = (x_1, . . . , x_d) and y = (y_1, . . . , y_d), we have

A(\min\{x_1, y_1\}, \ldots, \min\{x_d, y_d\}) + A(\max\{x_1, y_1\}, \ldots, \max\{x_d, y_d\}) \leq A(x) + A(y).

We now state the crucial proposition, whose proof we will give in Section 8.
Proposition 2.
The cost array C defined in (3) has the Monge property.

This proposition, then, guarantees the existence of a greedy algorithm to compute EMD_d. (This justifies our writing "min" instead of "inf" in our definition; we also could have used a compactness argument as in [4].) The greedy algorithm described in [3] is a generalization of the two-dimensional "northwest corner rule." Just as in the d = 2 case (see Section 2), for generic d this algorithm arrives at its solution in the form of an array J ∈ J_µ whose support is a chain, i.e., pairwise comparable under the product order on [n]^d. (In [4], Proposition 4, the "straightening" procedure that converts the support of any J into a chain, without increasing the total cost, is valid precisely because the cost array C(i, j) = |i − j| has the Monge property.) Rather than describe this greedy algorithm, which is already well-known (see [3] or [17]), our goal is instead to find the expected value of EMD_d. To this end, the importance of the algorithm is the following:

Corollary 3.
The minimum in (4) occurs for some J ∈ J_µ whose support is a chain in [n]^d.

Since there is nothing special about the condition |µ_i| = 1 from the perspective of transport problems, Corollary 3 also holds in a discrete setting using integer compositions in place of probability distributions. We will take this discrete approach in the next section, where we use a highly efficient combinatorial method to find the optimal array J for any µ.

4. A discrete approach
We follow the method from [4], with a view toward constructing a generating function in the next section. In place of P_n, we temporarily turn our attention to C(s, n), the set of (weak) compositions of some positive integer s into n parts. That is, elements of C(s, n) are n-tuples of nonnegative integers whose sum is s (whereas before, the elements of P_n were n-tuples of nonnegative real numbers whose sum was 1); we can also think of compositions as histograms. The cost function C, however, remains the same as before, since it still describes distances among the n bins in each of the d compositions.

In this section, µ = (µ_1, . . . , µ_d) denotes a sequence of compositions µ_i ∈ C(s, n). It is tempting simply to adjust the definition of J_µ so that arrays in the set must have nonnegative integer entries summing to s; then we could just re-use the definition (4) to obtain a definition for the discrete EMD_d. Although this is one viable approach, nevertheless, in light of Corollary 3, we need only consider those arrays whose support is a chain; therefore we will work with the following set of arrays from this point forward:

\mathcal{J}^s_{(n^d)} := \left\{ J \in (\mathbb{Z}_{\geq 0})^{n \times \cdots \times n} \;\middle|\; \sum_{m} J(m) = s, \text{ and the support of } J \text{ is a chain in } [n]^d \right\}.

We will now show that each d-tuple of compositions µ ∈ C(s, n) × · · · × C(s, n) corresponds to a unique array J_µ ∈ J^s_{(n^d)}; therefore, by the end of the next subsection, we will have a direct computational definition for the discrete version of the EMD, which avoids taking the minimum over a set of arrays as we must in the definition (4) of the continuous EMD. Once we have this definition for the discrete EMD, we will be able to recover the continuous version by scaling all the µ_i by 1/s. Near the end of the paper, we will do exactly this, and then let s → ∞, in order to translate discrete results back into the continuous setting.

4.1. Generalized RSK correspondence.
The authors of [4] use the Robinson–Schensted–Knuth correspondence to great effect in order to determine a unique optimal matrix J_{(µ_1,µ_2)} for an ordered pair of compositions (µ_1, µ_2). We now apply this same idea to d compositions in order to uniquely determine an optimal d-dimensional array. This will allow us to calculate the discrete EMD_d directly (and even more efficiently, in many cases, than by using the greedy algorithm mentioned above).

For non-experts, we summarize here a special case of the correspondence. (For full details, see Chapter 4 of [11].) The Robinson–Schensted–Knuth (RSK) correspondence furnishes a bijection:

{ ordered pairs of semistandard Young tableaux of the same shape, with entries in [n] }  ⟷  { n × n matrices with nonnegative integer entries }.

For our purposes, we will restrict our attention to the special case of one-row tableaux, since any composition in C(s, n) corresponds uniquely to a one-row tableau containing s boxes with entries from [n]. As an example, consider the two compositions

µ_1 = (1, 2, 3, 4),   µ_2 = (5, 0, 2, 3)

in C(10, 4). We associate to each µ_i a one-row tableau T(µ_i), which we fill so that the entry k appears µ_i(k) times:

T(µ_1) = 1 2 2 3 3 3 4 4 4 4
T(µ_2) = 1 1 1 1 1 3 3 4 4 4

Considering these tableaux as the two rows of a 2 × s matrix, we then have

M_{(µ_1,µ_2)} = [ 1 2 2 3 3 3 4 4 4 4 ]
                [ 1 1 1 1 1 3 3 4 4 4 ].

Finally, we fill in an n × n array J_{(µ_1,µ_2)} whose (i, j) entry equals the number of times the column [i; j] appears in M_{(µ_1,µ_2)}. For example, [4; 4] appears three times, so we write a 3 in position (4, 4). Altogether,

(µ_1, µ_2) ⟷ (T(µ_1), T(µ_2)) ⟷ M_{(µ_1,µ_2)} ⟷ J_{(µ_1,µ_2)} = [ 1 0 0 0 ]
                                                                  [ 2 0 0 0 ]
                                                                  [ 2 0 1 0 ]
                                                                  [ 0 0 1 3 ].

Note that we can also reverse the procedure, starting with the array J_{(µ_1,µ_2)}, translating its entries into a two-row matrix, and finally recovering the original pair of tableaux (and hence the pair of compositions). Therefore this is indeed a bijection. In the context of the EMD, the array J_{(µ_1,µ_2)} has two significant properties:

• The row and column sums coincide with the original compositions µ_1 and µ_2, so J_{(µ_1,µ_2)} is a solution to the discrete earth mover's problem for µ_1 and µ_2.
• Since both rows of M_{(µ_1,µ_2)} are nondecreasing, the support of J_{(µ_1,µ_2)} is a chain in [n] × [n].

In summary, we have the following bijective correspondence in the case d = 2:

C(s, n) × C(s, n) ⟷ J^s_{(n^2)}.

This RSK correspondence extends naturally to d-tuples of compositions in C(s, n). (For experts, details about the existence of this multivariate RSK generalization can be found in [6].) Given µ = (µ_1, . . . , µ_d), the tableaux corresponding to the µ_i uniquely determine a d × s matrix M_µ, which in turn determines a unique n × · · · × n array J_µ ∈ J^s_{(n^d)}. This correspondence is again bijective, establishing the following special case of the generalized RSK correspondence:

(5)   C(s, n) × · · · × C(s, n) ⟷ J^s_{(n^d)},    µ ⟼ J_µ.

This correspondence leads us to the following definition of the discrete EMD:
Definition.
For positive integers d, n, and s, let µ = (µ_1, . . . , µ_d) with each µ_i ∈ C(s, n). Let J_µ be the unique array corresponding to µ, as in (5). Let C be the cost function on [n]^d as in (3). Then we define the discrete generalized earth mover's distance to be

(6)   \mathrm{EMD}^s_d(\mu) := \sum_{m \in [n]^d} C(m)\, J_\mu(m),

where we write the superscript s to distinguish this discrete version from the continuous version.

This definition in (6) is largely conceptual; in practice, we can calculate EMD^s_d(µ) directly from the matrix M_µ, since the support of J_µ is determined by the columns of M_µ:

Theorem 4.
Let µ = (µ_1, . . . , µ_d) with each µ_i ∈ C(s, n), and let C be the cost function in (3). Let M_µ be the unique d × s array corresponding to µ via the generalized RSK correspondence, as described above, and let M_µ(•, j) denote the jth column vector of M_µ. Then

\mathrm{EMD}^s_d(\mu) = \sum_{j=1}^{s} C\big(M_\mu(\bullet, j)\big).

Proof.
Consider the definition of EMD^s_d in (6). By definition, J_µ(m) equals the number of occurrences of m as a column vector of the matrix M_µ. Therefore J_µ(m) = 0 unless m is one of those column vectors, and so we can simply sum over the s column vectors to obtain the result. □

Remark.
This construction via RSK is more efficient than the aforementioned greedy algorithm for computing EMD^s_d when d or n is sufficiently large: rather than filling a d-dimensional array in O(dn) time, we need consider only the s column vectors of a d × s matrix.
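The computation in Theorem 4 is easy to carry out by hand or by machine. The sketch below is our code, not the author's: it builds each row of M_µ by listing the bin index k exactly µ_i(k) times in weakly increasing order (the one-row tableau T(µ_i)), and then sums the cost C over the s columns. The test compositions are the two from the d = 2 example of Section 4.1.

```python
def cost(m):
    """Cost from Proposition 1: sorted, outside-in pairwise differences."""
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def tableau_row(comp):
    """One-row tableau of a composition: bin k (1-indexed) repeated comp[k-1] times."""
    return [k + 1 for k, mult in enumerate(comp) for _ in range(mult)]

def emd_discrete(compositions):
    """EMD^s_d of a d-tuple of compositions of s into n parts, via Theorem 4."""
    rows = [tableau_row(c) for c in compositions]   # the rows of the matrix M_mu
    return sum(cost(col) for col in zip(*rows))     # sum the cost over the s columns

# The d = 2 example from Section 4.1:
mu1, mu2 = (1, 2, 3, 4), (5, 0, 2, 3)
print(emd_discrete([mu1, mu2]))   # 7, the cost summed over the ten columns
```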
Example. As an example for d = 3, consider three compositions µ_1, µ_2, µ_3 ∈ C(5, 4), and form the 3 × 5 matrix M_µ whose rows are the one-row tableaux T(µ_1), T(µ_2), T(µ_3). Using Theorem 4 on the five columns of M_µ, we compute

EMD^5_3(µ) = C(M_µ(•, 1)) + C(M_µ(•, 2)) + C(M_µ(•, 3)) + C(M_µ(•, 4)) + C(M_µ(•, 5)).

Before advancing to the main problem of the paper, we show that EMD_3 can actually be expressed in terms of the classical EMD_2. (In the following proposition, we suppress the superscript s because the result holds for both the discrete and the continuous version of EMD: the equality is independent of s, and therefore still holds after dividing both sides by s and letting s → ∞.)

Proposition 5.
The value of EMD_3 is half the sum of the three pairwise EMD_2 values:

\mathrm{EMD}_3(\mu_1, \mu_2, \mu_3) = \frac{1}{2}\Big( \mathrm{EMD}_2(\mu_1, \mu_2) + \mathrm{EMD}_2(\mu_1, \mu_3) + \mathrm{EMD}_2(\mu_2, \mu_3) \Big).

Proof.
Let µ = (µ_1, µ_2, µ_3) as usual. In each column j of the matrix M_µ, call the three entries a_j, b_j, c_j, labeled so that a_j ≤ b_j ≤ c_j. Each of the three pairs (a_j, b_j), (a_j, c_j), and (b_j, c_j) corresponds naturally to one of the pairs (µ_1, µ_2), (µ_1, µ_3), and (µ_2, µ_3), where row i corresponds to µ_i. Therefore by Theorem 4, we have

\mathrm{EMD}_2(\mu_1, \mu_2) + \mathrm{EMD}_2(\mu_1, \mu_3) + \mathrm{EMD}_2(\mu_2, \mu_3) = \sum_{j=1}^{s} C(a_j, b_j) + C(a_j, c_j) + C(b_j, c_j)
    = \sum_j (b_j - a_j) + (c_j - a_j) + (c_j - b_j)
    = \sum_j (2c_j - 2a_j)
    = 2 \sum_j (c_j - a_j)
    = 2 \sum_j C(a_j, b_j, c_j)
    = 2 \cdot \mathrm{EMD}_3(\mu_1, \mu_2, \mu_3). □
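A short, self-contained numerical check of Proposition 5 is easy to write using the column method of Theorem 4. The sketch below is ours; the three test compositions are chosen arbitrarily (they are not the paper's example), and the assertion holds for any choice by the proposition.

```python
def cost(m):
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def emd_discrete(comps):
    rows = [[k + 1 for k, mult in enumerate(c) for _ in range(mult)] for c in comps]
    return sum(cost(col) for col in zip(*rows))

# Three arbitrary compositions of s = 5 into n = 4 parts (our test data).
m1, m2, m3 = (4, 0, 1, 0), (1, 2, 2, 0), (0, 4, 1, 0)
lhs = emd_discrete([m1, m2, m3])
rhs = emd_discrete([m1, m2]) + emd_discrete([m1, m3]) + emd_discrete([m2, m3])
assert 2 * lhs == rhs   # Proposition 5: twice EMD_3 equals the sum of pairwise EMD_2's
```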
3, because in general, the telescoping summand inthe proof does not reduce in terms of a higher-dimensional cost function. For example, when d = 4,the analog of the third line above is (cid:80) j d j + c j − b j − a j , or (cid:80) j C ( a j , b j , c j , d j ) + 2( d j − a j ).5. Expected value of EMD d Again we will follow and extend the methods used in [4] to arbitrary values of d . First we willdefine a generating function of two variables to record the values of EMD sd , which we will thendifferentiate in order to sum up all of these values. This will allow us to compute expected value forEMD sd simply by reading off coefficients from a generating function of a single variable.Because we are about to make a recursive definition, we will now need to consider d -tuples µ consisting of compositions with different numbers of bins — i.e., different values n i such that each µ i ∈ C ( s, n i ). Therefore, n will denote this vector of bin numbers ( n , . . . , n d ), and we will write( n d ) for the special vector ( n, . . . , n ), which arises most frequently in applications.In order to encode an inclusion-exclusion argument, we will also need to define an indicator vector e ( A ) for a subset A ⊆ [ d ]. Namely, e ( A ) := (cid:88) i ∈ A e i is the vector whose i th component is 1 if i ∈ A and 0 otherwise. For example, if d = 5, and A = { , , } , then e ( A ) = (0 , , , , Generating function for the discrete case.
For fixed n = (n_1, . . . , n_d), we first define a generating function in two indeterminates z and t:

(7)   H_n(z, t) := \sum_{s=0}^{\infty} \sum_{\mu \in \mathcal{C}(s, n_1) \times \cdots \times \mathcal{C}(s, n_d)} z^{\mathrm{EMD}^s_d(\mu)}\, t^s.

We observe that the coefficient of z^r t^s is the number of elements µ ∈ C(s, n_1) × · · · × C(s, n_d) such that EMD^s_d(µ) = r.

A recursive definition of this generating function, for the d = 2 case, is derived in [4], Theorem 3. Our generalization for d > 2 is the following:
Proposition 6.
Fix n = (n_1, . . . , n_d). The generating function H_n := H_n(z, t) has the following recursive definition, where the sum is over all nonempty subsets A ⊆ [d]:

H_n = \frac{\sum_{A} (-1)^{|A|-1} \, H_{n - e(A)}}{1 - z^{C(n)}\, t},

where H_{(1^d)} = \frac{1}{1-t}, and H_{n - e(A)} = 0 if n − e(A) contains a 0.

Proof. Each µ corresponds to a unique monomial

(8)   \mu \longleftrightarrow \prod_{m} w_m^{J_\mu(m)},

where µ ↔ J_µ is the RSK correspondence in (5). The variables w_m are indexed by multi-indices m ∈ [n_1] × · · · × [n_d]. Note that the degree of this monomial equals s (the sum of the entries of J_µ). Now making the substitution

(9)   w_m \longmapsto z^{C(m)}\, t,

the above correspondence gives us the map

\mu \longleftrightarrow \prod_{m} w_m^{J_\mu(m)} \longmapsto z^{\mathrm{EMD}^s_d(\mu)}\, t^s.

Therefore the generating function H_n is just the image of the formal sum H*_n of all monomials of the form (8), under the substitution (9); as s ranges over all nonnegative integers, there is one monomial in H*_n for each possible µ ∈ C(s, n_1) × · · · × C(s, n_d).

Since m ⪯ n for all m, every array J_µ is allowed to contain n in its support without violating the chain condition. This means that every monomial in H*_n is allowed to contain the variable w_n, and so we may factor out the sum of all possible powers of w_n, rewriting as

H^*_n = \sum_{\mu} \left( \prod_m w_m^{J_\mu(m)} \right) = \left( \sum_{r=0}^{\infty} w_n^r \right) \cdot f(w_{m \neq n}) = \frac{f(w_{m \neq n})}{1 - w_n},

where f is an infinite formal sum of monomials in the variables w_m with m ≠ n. Now we focus on rewriting this numerator f. Suppose we subtract 1 from exactly one of the coordinates of n; the possible results are n − e(i) for i = 1, . . . , d. Now, on one hand, any monomial in f containing the variable w_{n−e(i)} appears in H*_{n−e(i)}. But on the other hand, note that all of these n − e(i) are mutually incomparable under the product order, and so at most one of them can be in the support of some J_µ, because of the chain condition. Therefore any monomial in f contains at most one of the variables w_{n−e(i)}. But the sum \sum_{i=1}^{d} H^*_{n−e(i)} still overcounts the monomials appearing in f, since the same monomial may appear in several distinct summands.

In other words, we want f to be the formal sum of the union (without multiplicity) of the monomials which appear in the summands H*_{n−e(i)}. We can achieve this by using the inclusion–exclusion principle: subtract those monomials which appear in at least 2 of the summands, then add back the monomials which appear in at least 3 of the summands, then subtract those appearing in at least 4 summands, and so on, until we arrive at those monomials appearing in all d of the summands. We can write this inclusion–exclusion as an alternating sum over nonempty subsets A ⊆ [d], adding when |A| is odd and subtracting when |A| is even:

f = \sum_{A} (-1)^{|A|-1} \, H^*_{n - e(A)}.

Finally, applying the substitution (9), we obtain

H_n = H^*_n \Big|_{w_m = z^{C(m)} t} = \frac{\sum_{A} (-1)^{|A|-1} H^*_{n - e(A)}}{1 - w_n} \bigg|_{w_m = z^{C(m)} t} = \frac{\sum_{A} (-1)^{|A|-1} H_{n - e(A)}}{1 - z^{C(n)}\, t},

proving the recursion.

As for the base case H_{(1,\ldots,1)} = \frac{1}{1-t}: when every n_i = 1, there is only one element in C(s, 1), so the inside sum in (7) has only one term; moreover, this unique µ is just d copies of the same trivial composition of s into 1 part, meaning that EMD^s_d(µ) = 0. Hence H_{(1,\ldots,1)}(z, t) = \sum_s z^0 t^s = \sum_s t^s, whose closed form is \frac{1}{1-t}.

Likewise, since C(s, 0) is empty, we must have H_n = 0 if any of the n_i become 0. □
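For small parameters, one can tabulate the coefficients of H_n(z, t) directly from the definition (7) by enumerating all d-tuples of compositions and computing EMD^s_d via Theorem 4; this gives an independent brute-force check of the recursion in Proposition 6. The following minimal sketch is ours (the function names and the sample parameters are our own choices).

```python
from collections import Counter
from itertools import product

def compositions(s, n):
    """Generate all weak compositions of s into n parts."""
    if n == 1:
        yield (s,)
        return
    for first in range(s + 1):
        for rest in compositions(s - first, n - 1):
            yield (first,) + rest

def cost(m):
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

def emd_discrete(comps):
    rows = [[k + 1 for k, mult in enumerate(c) for _ in range(mult)] for c in comps]
    return sum(cost(col) for col in zip(*rows))

def coefficients(nvec, s):
    """Counter {r: number of tuples mu with EMD^s_d(mu) = r}; these are the
    coefficients of z^r t^s in H_n(z, t)."""
    pools = [list(compositions(s, ni)) for ni in nvec]
    return Counter(emd_discrete(mu) for mu in product(*pools))

print(coefficients((2, 2), 3))   # coefficients of t^3 in H_(2,2)(z, t)
```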
Example. We write out this recursive definition in a concrete case, where d = 3, so that the numerator involves the 2^3 − 1 = 7 nonempty subsets A ⊆ [3]. First, for |A| = 1, we add together all possible H_{n′}, where n′ equals n with exactly 1 coordinate decreased; then for |A| = 2, we subtract all possible H_{n′′} where n′′ equals n with exactly 2 coordinates decreased; finally, for |A| = 3, we add the one possible H_{n′′′} where n′′′ equals n with all 3 coordinates decreased. As for the denominator, C is the same cost function we defined in (3). Writing n = (n_1, n_2, n_3), the recursion for H_n looks like this:

H_{(n_1,n_2,n_3)} = \frac{ H_{(n_1-1,n_2,n_3)} + H_{(n_1,n_2-1,n_3)} + H_{(n_1,n_2,n_3-1)} - H_{(n_1-1,n_2-1,n_3)} - H_{(n_1-1,n_2,n_3-1)} - H_{(n_1,n_2-1,n_3-1)} + H_{(n_1-1,n_2-1,n_3-1)} }{ 1 - z^{C(n)}\, t }.

Having seen an example, we now mention the important (and very well-studied) specialization that results from setting z = 1. In this case, the coefficient of t^s in H_n(1, t) is simply the total number of d-tuples µ, which is \prod_{i=1}^{d} |\mathcal{C}(s, n_i)| = \prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}:

(10)   H_n(1, t) = \sum_{s=0}^{\infty} \prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}\, t^s.

It is shown in [8] that the closed form of (10) is, after adjusting the index to match our setup, and writing |n| := n_1 + · · · + n_d,

(11)   H_n(1, t) = \frac{W_n(t)}{(1 - t)^{|n| - d + 1}},

where the numerator W_n(t) is a polynomial whose coefficients are the "Simon Newcomb" numbers. (For more on this natural generalization of Eulerian numbers to multisets, see [1], [8], and [20].) Specifically, denoting the coefficient of t^i in W_n by the symbol [t^i] W_n, and adopting the A-notation originally used in [8], we have

[t^i]\, W_n = A\big(n - (1, \ldots, 1),\, i\big) := \#\{\text{permutations of the multiset } \{1^{n_1 - 1}, \ldots, d^{n_d - 1}\} \text{ containing } i \text{ descents}\} = \sum_{j=0}^{i} (-1)^j \binom{|n| - d + 1}{j} \prod_{k=1}^{d} \binom{i - j + n_k - 1}{n_k - 1}.

The degree of the polynomial W_n is shown in [8] to be

\sum_{i=1}^{d} (n_i - 1) - \max\{n_1, \ldots, n_d\} + 1.

The combinatorial interpretation implies that the coefficients of W_n are positive; in the special case where n = (n^d), the polynomial W_n is also unimodal and palindromic. (This can be shown from a combinatorial or ring-theoretic approach; for the latter, see [20], or Chapter 5 of [5] on Stanley–Reisner and Gorenstein rings.)

From the combinatorial description above of [t^i] W_n, it follows that the evaluation W_n(1) equals the total number of permutations of the multiset {1^{n_1 − 1}, . . . , d^{n_d − 1}}:

(12)   W_n(1) = \frac{\left( \sum_{i=1}^{d} (n_i - 1) \right)!}{\prod_{i=1}^{d} (n_i - 1)!} = \frac{(|n| - d)!}{\prod_{i} (n_i - 1)!}.

Note that each permutation of the multiset {1^{n_1 − 1}, . . . , d^{n_d − 1}} corresponds to a unique increasing lattice path in N^d, beginning at (1, . . . , 1) and ending at (n_1, . . . , n_d): reading left to right, each occurrence of i in the permutation signifies adding the standard basis vector e_i to the current position in the path. Therefore, W_n(1) can be interpreted as the total number of increasing paths connecting opposite corners of an n_1 × · · · × n_d array — in other words, the number of chains in [n_1] × · · · × [n_d] with maximal length.

5.2. A partial derivative.
Next, in order to transfer the EMD values from the exponents of z into coefficients, we take the partial derivative of H_n with respect to z. Applying the quotient rule to our definition of H_n in Proposition 6, we obtain the following, where the sum still ranges over nonempty subsets A ⊆ [d]:

\frac{\partial H_n}{\partial z} = \frac{ \left(1 - z^{C(n)} t\right) \left( \sum_A (-1)^{|A|-1} \frac{\partial H_{n-e(A)}}{\partial z} \right) + C(n)\, z^{C(n)-1}\, t \left( \sum_A (-1)^{|A|-1} H_{n-e(A)} \right) }{ \left(1 - z^{C(n)} t\right)^2 }.

Now that the exponents have been changed into coefficients of z, we can set z = 1:

(13)   H'_n := \frac{\partial H_n}{\partial z} \bigg|_{z=1} = \sum_{s=0}^{\infty} \sum_{\mu \in \mathcal{C}(s,n_1) \times \cdots \times \mathcal{C}(s,n_d)} \mathrm{EMD}^s_d(\mu)\, t^s = \frac{ (1-t) \left( \sum_A (-1)^{|A|-1} H'_{n-e(A)} \right) + t\, C(n) \left( \sum_A (-1)^{|A|-1} H_{n-e(A)} \right) }{ (1-t)^2 }.

At this point, z has played out its role, and so from now on we will write H_n in place of H_n(1, t). Note that the coefficient of t^s in H'_n is the sum of the values EMD^s_d(µ) over all valid d-tuples µ. This means that our goal is now in sight: to find the expected value of EMD^s_d for fixed n, we need to divide the sum of all possible EMD^s_d values (i.e., the coefficient of t^s in H'_n) by the total number of possible inputs µ (i.e., the coefficient of t^s in H_n). Therefore, once we find a way to simplify (13), we will be able to compute the result

(14)   E(\mathrm{EMD}^s_d) = \frac{[t^s]\, H'_n}{[t^s]\, H_n} = \frac{[t^s]\, H'_n}{\prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}},

where [t^s] denotes the coefficient of t^s in a series.

In order to make the expression (13) for H'_n more tractable to program, we will now focus only on the numerators of H_n and H'_n. We have already determined W_n(t), the numerator for H_n, in the previous subsection. We will let N_n(t) denote the numerator of H'_n. By using software and observing patterns for small n, we anticipate that the denominator of H'_n has exponent |n| − d + 2, and so we now set both

(15)   W_n(t) := (1-t)^{|n|-d+1}\, H_n \quad \text{and} \quad N_n(t) := (1-t)^{|n|-d+2}\, H'_n.

Therefore, we can clear denominators in (13) by multiplying both sides by (1 − t)^{|n|−d+2}. Proceeding carefully and clearing the remaining denominators using (15), the pattern becomes clear:

(16)   N_n = \sum_A (-1)^{|A|-1} (1-t)^{|A|-1}\, N_{n-e(A)} + t \cdot C(n) \cdot (1-t)^{|n|-d} \cdot (1-t)\, H_n = \sum_A (t-1)^{|A|-1}\, N_{n-e(A)} + t \cdot C(n) \cdot W_n.

This provides us with a quick recursive code to obtain N_n, after which we need only divide by (1 − t)^{|n|−d+2} to recover H'_n. The rest is just a matter of extracting coefficients in order to apply the result (14).

5.3. Expected value for the continuous version of EMD_d.

Now that we have a way to determine the expected value for the discrete EMD, we aim to find a formula for the expected value in the continuous setting. Starting with the expected value from (14), we scale by 1/s to normalize, and then let s grow asymptotically:

E_n := E(\mathrm{EMD}_d) = \lim_{s \to \infty} \frac{1}{s} \cdot E(\mathrm{EMD}^s_d) = \lim_{s \to \infty} \frac{1}{s} \cdot \frac{[t^s]\, H'_n}{\prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1}}.

First we focus on the [t^s] H'_n part, namely the coefficient of t^s in H'_n = \frac{N_n(t)}{(1-t)^{|n|-d+2}}. Now, the coefficient of t^s in the series \frac{1}{(1-t)^{|n|-d+2}} is just

\binom{s + |n| - d + 1}{|n| - d + 1} = \frac{s^{|n|-d+1}}{(|n| - d + 1)!} + \text{lower-order terms in } s.

Meanwhile, N_n(t) is a polynomial, with some finite degree b. Now, as s → ∞, we have s − b ∼ s, and so the coefficient of t^s in H'_n is asymptotic to \frac{s^{|n|-d+1}}{(|n|-d+1)!} multiplied by the sum of the coefficients of N_n(t). But this sum is just N_n(1), and so we have

[t^s]\, H'_n \sim N_n(1) \cdot \frac{s^{|n|-d+1}}{(|n| - d + 1)!}.

Accounting for the 1/s, we currently have the following:

E_n = \lim_{s \to \infty} \frac{ N_n(1) \cdot \frac{s^{|n|-d}}{(|n| - d + 1)!} }{ \prod_{i=1}^{d} \binom{s + n_i - 1}{n_i - 1} }.

Now, since \prod_i \binom{s + n_i - 1}{n_i - 1} \sim \prod_i \frac{s^{n_i - 1}}{(n_i - 1)!} = \frac{s^{|n| - d}}{\prod_i (n_i - 1)!}, this becomes

(17)   E_n = \frac{ N_n(1) \cdot \prod_{i=1}^{d} (n_i - 1)! }{ (|n| - d + 1)! }.

But when we evaluate N_n(1) from equation (16), the terms with (t − 1) all disappear; hence we need only consider subsets A ⊆ [d] with one element, meaning we are now summing from 1 to d:

N_n(1) = \sum_{i=1}^{d} N_{n - e(i)}(1) + C(n)\, W_n(1).

Substituting for W_n(1) using (12), we have

N_n(1) = \sum_{i=1}^{d} N_{n - e(i)}(1) + C(n) \cdot \frac{(|n| - d)!}{\prod_{i=1}^{d} (n_i - 1)!}.

Finally, returning to (17) and plugging this all in for N_n(1), we conclude with the recursive definition

(18)   E_n = \left[ \sum_{i=1}^{d} N_{n-e(i)}(1) + C(n) \cdot \frac{(|n| - d)!}{\prod_{i=1}^{d} (n_i - 1)!} \right] \cdot \frac{\prod_{i=1}^{d} (n_i - 1)!}{(|n| - d + 1)!} = \frac{ \sum_{i=1}^{d} N_{n-e(i)}(1) \cdot \frac{\prod_{i=1}^{d}(n_i - 1)!}{(|n| - d)!} }{ |n| - d + 1 } + C(n) \cdot \frac{(|n| - d)!}{(|n| - d + 1)!} = \frac{ \sum_{i=1}^{d} (n_i - 1)\, E_{n - e(i)} + C(n) }{ |n| - d + 1 },

where in the last equality we used (17) applied to each n − e(i), and where E_{n−e(i)} = 0 if n − e(i) contains a 0.

We record this as the main theorem of this paper, in the most useful case where n = (n^d):

Theorem 7.
The expected value of EMD_d on P_n × · · · × P_n is E_{(n^d)} as defined in (18).

Remark.
Recall from Proposition 5 the special relationship between EMD_3 and EMD_2, namely, EMD_3 equals half the sum of the three pairwise EMD_2 values. This leads us to anticipate that

E_{(n^3)} = E(\mathrm{EMD}_3) = E\!\left( \tfrac{1}{2}\left( \mathrm{EMD}_2 + \mathrm{EMD}_2 + \mathrm{EMD}_2 \right) \right) = E\!\left( \tfrac{3}{2}\, \mathrm{EMD}_2 \right) = \tfrac{3}{2}\, E(\mathrm{EMD}_2) = \tfrac{3}{2}\, E_{(n^2)}.

We confirm this in Mathematica, computing E_{(n^2)} and E_{(n^3)} from (18) for small values of n and checking that the ratio E_{(n^3)} / E_{(n^2)} is always 3/2.
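The recursion (18) is straightforward to implement. The Python sketch below is ours (not the Mathematica code used in the paper); it memoizes E_n over vectors n, returns exact rational values, and can be used, for instance, to confirm the 3/2 ratio discussed in the remark above.

```python
from fractions import Fraction
from functools import lru_cache

def cost(m):
    ms = sorted(m)
    return sum(ms[-(i + 1)] - ms[i] for i in range(len(ms) // 2))

@lru_cache(maxsize=None)
def expected_emd(n):
    """E_n from (18), as an exact rational; n is a tuple (n_1, ..., n_d)."""
    d = len(n)
    if any(ni == 0 for ni in n):
        return Fraction(0)
    total = Fraction(cost(n))
    for i in range(d):
        smaller = n[:i] + (n[i] - 1,) + n[i + 1:]
        total += (n[i] - 1) * expected_emd(smaller)
    return total / (sum(n) - d + 1)

for n in range(2, 7):
    e2, e3 = expected_emd((n, n)), expected_emd((n, n, n))
    print(n, e2, e3, e3 / e2)   # the last ratio is always 3/2, as in the remark above
```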
5.4. Unit normalized EMD.

It is often convenient to unit normalize the EMD_d so that its value falls between 0 and 1. To this end, we claim that for a given n, the maximum value of EMD_d is ⌊d/2⌋(n − 1). Indeed, the maximum value of EMD^s_d(µ) occurs when in every column of the matrix M_µ (given by the RSK correspondence), half the entries are 1's and the other half are n's; if d is odd, then "half" means ⌊d/2⌋, with the leftover entry being irrelevant by Proposition 1. For such a µ, then, EMD^s_d(µ) equals the cost ⌊d/2⌋(n − 1) multiplied by s (the number of columns). After dividing by s to pass to the continuous setting, we see that the maximum value of EMD_d is ⌊d/2⌋(n − 1). Hence we define the unit normalized EMD_d and its expected value:

(19)   \widehat{\mathrm{EMD}}_d(\mu) := \frac{\mathrm{EMD}_d(\mu)}{\lfloor d/2 \rfloor (n-1)} \quad \text{and} \quad \widehat{E}_{(n^d)} := \frac{E_{(n^d)}}{\lfloor d/2 \rfloor (n-1)}.

Figure 2.
Histograms of discrete EMD^s_d values for d = 2, 3, 4, 5, fixing s = 5 and n = 3. Note the more severe skew to the right when d is even, compared to nearly zero skew when d is odd.

We observe a curious phenomenon when we fix n and let d increase: the unit normalized expected value \widehat{E}_{(n^d)} alternately increases (d changing from even to odd) and decreases (d changing from odd to even), as one sees by computing \widehat{E}_{(3^d)} for successive values of d. The histograms of the discrete EMD^s_d values confirm this impression (see Figure 2): when d is even, the histograms are clearly right-skewed, whereas when d is odd, the histograms have nearly zero skew (although still slightly right-skewed). It seems that this alternating phenomenon is a consequence of taxicab geometry in even vs. odd dimensions: specifically, when d is odd, the median of a vector's coordinates contributes nothing to its distance from the main diagonal (i.e., the cost function C). Intuitively, when the other coordinates have a wide range — that is, when C is relatively large — this "free" coordinate can assume more values without affecting C, than when the range of the coordinates is smaller. This increases the proportion of high-cost array positions compared to the case when d is even.

6. Real-world data
As a basic example, we now apply this generalized discrete EMD in order to compare the grade distributions in the spring vs. fall semester of 2019, for the course MATH 232 (Calculus II) at the University of Wisconsin–Milwaukee. This data is contained in the Section Attrition and Grade
Report, published by the Office of Assessment and Institutional Research at UWM; final letter grades are A, B, C, D, and F, so n = 5. We consider all the sections of MATH 232 with more than 20 students enrolled; there were seven such sections in each semester in 2019, and so we put d = 7 in both cases. We list the grade distributions below, with µ corresponding to the spring sections and ν the fall sections. (We will continue to use the term "distribution" rather than "composition" in the context of this application.)

The value of s is problematic, since different classes in the real world do not have the same number of students. One solution is to scale each distribution so that all distributions share a common s-value, namely, the least common multiple of their individual s-values. On one hand, this method retains whole-number distributions whose grade proportions remain unchanged; but on the other hand, with seven classes of roughly 30 students each, that least common multiple could easily be in the billions, so this approach is computationally impractical. Instead, we have chosen to scale each distribution so that the sum of its components equals the maximum of the seven original s-values; this of course produces non-integer distributions, and so in order to use Theorem 4, we then round each component up or down, while respecting the original proportions as much as possible, until all distributions share the same s-value. In our case, the common s-value for the µ_i is 31, and the common s-value for the ν_i is 33.

We list the resulting distributions µ_1, . . . , µ_7 (Spring 2019) and ν_1, . . . , ν_7 (Fall 2019), with the letter grades in descending order from A to F. [Table of the fourteen rescaled grade distributions.] Computing the discrete EMD of each semester's seven sections via Theorem 4, we obtain

EMD^{31}_7(µ) = 49   and   EMD^{33}_7(ν) = 56.

Dividing by s in each case to normalize (so that µ and ν now denote (1/s)µ and (1/s)ν), we have

EMD_7(µ) ≈ 1.58   and   EMD_7(ν) ≈ 1.70,

and finally dividing by ⌊d/2⌋(n − 1) = 3 · 4 = 12, we obtain

\widehat{\mathrm{EMD}}_7(µ) ≈ 0.13   and   \widehat{\mathrm{EMD}}_7(ν) ≈ 0.14.

To compare these results to the expected value, we code the recursive definition in Theorem 7; then upon unit normalizing as in (19), we compute \widehat{E}_{(5^7)}. From this very limited data set, at least, it is clear that not only are the EMD values extremely consistent between the two semesters (within 0.01 of each other), but also they are significantly less than the expected value — less than half the expected value, in fact. We should not be too surprised by this, of course, since college grades are (hopefully) not assigned at random. It will be interesting to track the EMD for multiple courses in multiple semesters, with the goal of performing some informative cluster analysis on the results.
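The rescaling-and-rounding step described above can be carried out with a largest-remainder rule. The sketch below is our own illustration of one standard way to do it; the paper does not specify the exact rounding scheme it used, and the sample class in the final line is hypothetical.

```python
def rescale(comp, target):
    """Scale a composition to sum to `target`, then round to integers while
    keeping the original proportions as close as possible (largest-remainder rounding)."""
    s = sum(comp)
    exact = [c * target / s for c in comp]
    floors = [int(x) for x in exact]
    shortfall = target - sum(floors)
    # Hand out the remaining units to the bins with the largest fractional parts.
    order = sorted(range(len(comp)), key=lambda i: exact[i] - floors[i], reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return tuple(floors)

# A hypothetical section of 28 students rescaled to the common value s = 31.
print(rescale((7, 10, 6, 3, 2), 31))   # (8, 11, 7, 3, 2)
```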
7. Connections to algebraic geometry and representation theory

In this section, we show that our generating function H_n from above is the Hilbert series of the Segre embedding from algebraic geometry. In the d = 2 case, this embedding is a determinantal variety, whose structure as an infinite-dimensional su(p, q)-module we can describe in the context of the EMD.

7.1. A determinantal variety.
In this subsection we let d = 2, and we will write p, q in place of the usual n = (n_1, n_2). In this case, as indicated in [4], the series H_{p,q} := H_{p,q}(1, t), from (10), is in fact the Hilbert series of the determinantal variety

D^{\leq 1}_{p,q} := \{ M \in M_{p,q}(\mathbb{C}) \mid \operatorname{rank} M \leq 1 \}

consisting of complex p × q matrices with rank at most 1. To see this, it suffices to show that given a nonnegative integer s, the number of elements in C(s, p) × C(s, q) equals the dimension of C[D^{≤1}_{p,q}]_s, the space of homogeneous degree-s polynomial functions on D^{≤1}_{p,q}. To this end, let w_{ij} be the coordinate functions on a generic p × q matrix. Since all 2 × 2 minors vanish on D^{≤1}_{p,q}, we observe that C[D^{≤1}_{p,q}] ≅ C[w_{ij}]/I, where I is the determinantal ideal generated by the quadratics of the form w_{ij} w_{i′j′} − w_{ij′} w_{i′j} for i < i′ and j < j′. It follows that a basis for C[D^{≤1}_{p,q}]_s is given by the set of monomials

B = \left\{ \prod_{k=1}^{s} w_{i_k, j_k} \;\middle|\; \{(i_k, j_k)\} \text{ is a chain in } [p] \times [q] \right\}.

Now we have an obvious bijective correspondence between B and the set J^s_{p,q} (borrowing notation from Section 4 to denote p × q arrays with support in a chain):

(20)   \prod_{i,j} w_{ij}^{J_{ij}} \longleftrightarrow J \in \mathcal{J}^s_{p,q}.

But J^s_{p,q} is in bijective correspondence with C(s, p) × C(s, q), as is clear from a slight generalization of our RSK correspondence in (5). This proves our claim that H_{p,q} is the Hilbert series of D^{≤1}_{p,q}.

7.2. The Segre embedding.
We extend the previous result to d > 2, and so as before we fix n = (n_1, . . . , n_d). The specialization H_n := H_n(1, t) of our generating function from (10) happens also to be the Hilbert series of the Segre embedding:

P(\mathbb{C}^{n_1}) \times \cdots \times P(\mathbb{C}^{n_d}) \hookrightarrow P(\mathbb{C}^{n_1} \otimes \cdots \otimes \mathbb{C}^{n_d}), \qquad \left( \langle v^{(1)} \rangle, \ldots, \langle v^{(d)} \rangle \right) \mapsto \langle v^{(1)} \otimes \cdots \otimes v^{(d)} \rangle.

(See [13] and [20].) That is, H_n is the Hilbert series of the simple tensors. (In the case d = 2, the set of simple tensors in C^p ⊗ C^q can be identified with the determinantal variety D^{≤1}_{p,q}, coinciding with the previous subsection.)

To sketch this generalization of the d = 2 case, which we presented in detail above, we let m range over all multi-indices (m_1, . . . , m_d) ∈ [n_1] × · · · × [n_d], as in the proof of Proposition 6. Now consider a simple tensor v^{(1)} ⊗ · · · ⊗ v^{(d)}. We can expand this tensor in the standard basis as

\sum_{m} \underbrace{\left( v^{(1)}_{m_1} \cdots v^{(d)}_{m_d} \right)}_{w_m}\; e_{m_1} \otimes \cdots \otimes e_{m_d},

where v^{(k)}_ℓ is the ℓth coordinate of the vector v^{(k)}; the w_m are coordinate functions. For any two multi-indices m and m′, we see that the quadratic w_m w_{m′} is invariant under the exchange of indices component-wise between m and m′. Intuitively, then, we can again mod out by the determinantal ideal generated by all such 2 × 2 minors, just as in the d = 2 case above.

The upshot is that a basis for the coordinate ring of the simple tensors is given by those monomials w_{m_1} · · · w_{m_s} such that the set of multi-indices {m_i} forms a chain. We therefore have the generalization of (20), and so we conclude that H_n is the Hilbert series of the simple tensors. (Technically, of course, the Segre embedding is contained in the projectivization of the simple tensors, so we are sweeping a fair amount under the rug here.)

7.3. Representation theory.
Returning to the d = 2 case, the coordinate ring C[D^{≤1}_{p,q}] is an infinite-dimensional representation of the Lie algebra su(p, q), known as the first Wallach representation. (See [9].) We will show how the action on this representation corresponds to manipulating the two compositions in our EMD^s_2 setting.

Consider the polynomial ring C[x, y] := C[x_1, . . . , x_p, y_1, . . . , y_q]. On one hand, C[x, y] admits an action of GL_1(C) (which is just the multiplicative group of nonzero complex numbers), via

(g · f)(x, y) = f(g^{-1} x,\; g y)

for g ∈ GL_1(C) and f ∈ C[x, y]. From now on let G = GL_1(C). Note that the invariants under the G-action are those polynomials in which the degree of each term is the same with respect to x as it is with respect to y; in other words, C[x, y]^G is generated by the monomials x_i y_j. (This is a special case of the First Fundamental Theorem of Invariant Theory; see [12], Section 5.2.1.) But since the kernel of the ring homomorphism w_{ij} ↦ x_i y_j is precisely the determinantal ideal I, we have C[x, y]^G ≅ C[D^{≤1}_{p,q}]. (This is a special case of the Second Fundamental Theorem of Invariant Theory; see [12], Lemma 5.2.4.)

As a result of Howe duality in Type A — the delicate details of which are expounded in [15] and [16] — the space C[x, y] is also a module under the action of the Lie algebra su(p, q) by differential operators. Upon complexification, this gives rise to an action by gl_{p+q}(C), which as a set is just the (p + q) × (p + q) complex matrices. In particular, the invariant subring C[x, y]^G ≅ C[D^{≤1}_{p,q}] is the irreducible, infinite-dimensional gl_{p+q}-module with highest weight (\underbrace{-1, \ldots, -1}_{p}, 0, \ldots, 0). The action of gl_{p+q} is given by differential operators on C[x, y], of the following four forms (see [12], Section 5.6):

(1) x_i ∂/∂x_j (Euler operators; technically the action includes the extra term +δ_{ij});
(2) y_i ∂/∂y_j (Euler operators);
(3) ∂²/∂x_i ∂y_j ("raising operators");
(4) x_i y_j ("lowering operators").

Note that all these operators preserve the difference between the degree with respect to x and the degree with respect to y. Therefore the gl_{p+q}-action preserves C[x, y]^G, which we observed is generated by the elements x_i y_j.

This gl_{p+q}-action can be described in terms of our EMD setting in this paper. First, observe that any degree-s monic monomial in C[x, y]^G corresponds uniquely to an ordered pair of compositions (µ, ν) ∈ C(s, p) × C(s, q), via

(µ, ν) \longleftrightarrow x^\mu y^\nu := x_1^{\mu(1)} \cdots x_p^{\mu(p)}\; y_1^{\nu(1)} \cdots y_q^{\nu(q)}.

This is no surprise, of course, since this is just the RSK correspondence we used earlier; written out in all of its guises, we have

(µ, ν) \longleftrightarrow x^\mu y^\nu \longleftrightarrow J_{(\mu,\nu)} \in \mathcal{J}^s_{p,q} \longleftrightarrow \prod_{i,j} w_{ij}^{J_{ij}}.

Now we can see how each type (1)–(4) of differential operator has an interpretation in the EMD context. Consider the monomial x^µ y^ν as defined above.
Then, up to scaling by coefficients, we observe the following:

(1) The Euler operator x_i ∂/∂x_j corresponds to moving 1 unit in µ, from bin j to bin i, since the exponent of x_j decreases by 1 and the exponent of x_i increases by 1.
(2) The Euler operator y_i ∂/∂y_j corresponds to moving 1 unit in ν, from bin j to bin i.
(3) The raising operator ∂²/∂x_i ∂y_j corresponds to removing 1 unit from each composition: from bin i in µ and from bin j in ν.
(4) The lowering operator x_i y_j corresponds to adding 1 unit to each composition: to bin i in µ and to bin j in ν.

It will be interesting to study further whether this connection to representation theory might be exploited in existing applications of EMD_2.

8. Proof of Proposition 2
The methods in this paper depended heavily upon the fact that we need consider only thosearrays J whose support is a chain. This followed from the statement in Proposition 2 — yet to beproved — that our cost array C has the Monge property. Before proving this here, we state threeuseful lemmas, the first of which is proved in [2] and [21]: Lemma 8. An n × · · · × n array A has the Monge property if and only if every two-dimensionalplane of A has the Monge property. To make this explicit, we choose any two distinct indices i, j from { , . . . , d } , and then fix theremaining d − m , . . . , m i − , m i +1 , . . . , m j − , m j +1 , . . . , m d ∈ [ n ]. Thenwe will write m i,jk,(cid:96) := ( m , . . . , m i − , k, m i +1 , . . . , m j − , (cid:96), m j +1 , . . . , m d ). In other words, m i,jk,(cid:96) isthe vector in which the i th coordinate is k , the j th coordinate is (cid:96) , and the remaining coordinatesare the fixed values m , . . . , m d . Now we can naturally define the two-dimensional subarray A i,j inwhich(21) A i,j ( k, (cid:96) ) := A (cid:16) m i,jk,(cid:96) (cid:17) . Then Lemma 8 states that A has the Monge property if and only if A i,j has the Monge property forevery choice of distinct i and j .This reduction to the two-dimensional case is extremely useful because of the following charac-terization of two-dimensional Monge arrays, proved in [21]: Lemma 9.
This reduction to the two-dimensional case is extremely useful because of the following characterization of two-dimensional Monge arrays, proved in [21]:

Lemma 9. Let A be an n × n array. Then A has the Monge property if and only if

A(k, ℓ) + A(k + 1, ℓ + 1) ≤ A(k + 1, ℓ) + A(k, ℓ + 1)

for all k, ℓ ∈ [n − 1].

In other words, choose a position (k, ℓ) and then consider the 2 × 2 subarray consisting of A(k, ℓ) and its three neighbors to the east, south, and southeast. The condition displayed in the lemma means that the sum of the upper-left and lower-right entries must never be greater than the sum of the lower-left and upper-right entries.
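This condition is easy to test computationally. The following minimal sketch (ours, purely for illustration) checks every adjacent 2 × 2 block of a two-dimensional array, and runs the check on the classical d = 2 cost array C(k, ℓ) = |k − ℓ|:

```python
def is_monge_2d(A):
    """Lemma 9: A[k][l] + A[k+1][l+1] <= A[k+1][l] + A[k][l+1] for every adjacent 2 x 2 block."""
    rows, cols = len(A), len(A[0])
    return all(A[k][l] + A[k + 1][l + 1] <= A[k + 1][l] + A[k][l + 1]
               for k in range(rows - 1) for l in range(cols - 1))

# The d = 2 cost array C(k, l) = |k - l| on [n] x [n] is Monge:
n = 6
C2 = [[abs(k - l) for l in range(1, n + 1)] for k in range(1, n + 1)]
print(is_monge_2d(C2))   # True
```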
We will need one final lemma, specific to the cost function C in this paper. Recall from Proposition 1 that if we let m̃ denote the vector m with its coordinates rearranged in ascending order, then

C(m) = −m̃_1 − · · · − m̃_{⌊(d+1)/2⌋} + m̃_{⌊(d+1)/2⌋+1} + · · · + m̃_d   (d even)

or

C(m) = −m̃_1 − · · · − m̃_{⌊(d+1)/2⌋−1} + m̃_{⌊(d+1)/2⌋+1} + · · · + m̃_d   (d odd).

The index ⌊(d+1)/2⌋ gave a kind of "median" of the coordinates in m; from now on, however, we will work instead with M := ⌊(d+1)/2⌋ + 1 = ⌈(d+2)/2⌉. Intuitively, this index M gives the next-greatest coordinate after the "median." The picture is the following, where the vertical lines divide the coordinates into two equal sets (with one leftover coordinate in the middle if d is odd):

d even: m̃ = (m̃_1, . . . , m̃_{M−1} | m̃_M, . . . , m̃_d)
d odd:  m̃ = (m̃_1, . . . , m̃_{M−2} | m̃_{M−1} | m̃_M, . . . , m̃_d)

With this indexing in mind, we state our final lemma, which records the effect on C(m) of adding 1 to a single coordinate m_i. Recall from earlier that e(i) denotes the vector whose coordinates are all 0 except for a 1 in the ith component.

Lemma 10. Adding 1 to a single coordinate m_i of m has one of three effects on C(m): it either increases by 1, decreases by 1, or remains the same. The effect depends on the value of m_i relative to the other coordinates of m:

(1) C(m + e(i)) = C(m) + 1 if m_i ≥ m̃_M.
(2) C(m + e(i)) = C(m) − 1 if either
    (a) d is even and m_i < m̃_M; or
    (b) d is odd and m_i < m̃_{M−1}.
(3) C(m + e(i)) = C(m) if d is odd and m_i = m̃_{M−1} < m̃_M.
Proof. We prove each of the three cases; the reader may find it helpful to keep an eye on the two possible "pictures" of m̃ displayed before this lemma, along with the two possible sums for C(m) displayed just before that.

(1) Assume m_i ≥ m̃_M. Then m_i + 1 > m̃_M, and so in the sum defining C(m), we must have positive m_i replaced by positive (m_i + 1). Hence C(m) has increased by 1.

(2) (a) Assume d is even and m_i < m̃_M. Then m_i + 1 ≤ m̃_M, and so in the sum defining C(m), we must have negative m_i replaced by negative (m_i + 1). Hence C(m) has decreased by 1.
    (b) Assume d is odd and m_i < m̃_{M−1}. Then m_i + 1 ≤ m̃_{M−1}, and so we must have negative m_i replaced by negative (m_i + 1). Hence C(m) has decreased by 1.

(3) Assume d is odd and m_i = m̃_{M−1} < m̃_M; note that m̃_{M−1} does not appear in the sum defining C(m). Then m̃_{M−1} < m_i + 1 ≤ m̃_M, and so m_i + 1 still does not appear in the sum defining C(m + e(i)). Hence C(m) remains unchanged. □
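Lemma 10 is also easy to confirm by brute force for small d and n. The following sketch (ours, purely as a sanity check) transcribes the formula for C(m) displayed above and tests all three cases:

```python
import itertools

def cost(m):
    """C(m): the sum of the largest floor(d/2) coordinates minus the sum of the
    smallest floor(d/2) coordinates (the median is omitted when d is odd)."""
    s, d = sorted(m), len(m)
    half = d // 2
    return sum(s[d - half:]) - sum(s[:half])

n = 4
for d in (2, 3, 4, 5):
    M = (d + 1) // 2 + 1                               # the index M (1-based)
    for m in itertools.product(range(1, n + 1), repeat=d):
        s = sorted(m)                                   # s[M - 1] is m~_M, s[M - 2] is m~_{M-1}
        for i in range(d):
            bumped = list(m)
            bumped[i] += 1
            delta = cost(bumped) - cost(m)
            if m[i] >= s[M - 1]:                        # case (1)
                assert delta == 1
            elif d % 2 == 0 or m[i] < s[M - 2]:         # cases (2a) and (2b)
                assert delta == -1
            else:                                       # case (3): d odd, m[i] = m~_{M-1} < m~_M
                assert delta == 0
print("Lemma 10 verified for d = 2, ..., 5 and n =", n)
```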
We are now ready for the proof, in which we show that an arbitrary two-dimensional subarray of C has the Monge property.

Proof of Proposition 2. Let i, j be two distinct indices in {1, . . . , d}. Fix the remaining coordinates m_1, . . . , m_d as above, and let C^{i,j} be the corresponding two-dimensional subarray of C defined in (21). Now let m_i, m_j ∈ [n − 1]. By Lemma 9, it suffices to show that

C^{i,j}(m_i, m_j) + C^{i,j}(m_i + 1, m_j + 1) ≤ C^{i,j}(m_i + 1, m_j) + C^{i,j}(m_i, m_j + 1).

But this condition can be rewritten as the following, where we simply write m for m^{i,j}_{m_i, m_j}:

(22) C(m) + C(m + e(i) + e(j)) ≤ C(m + e(i)) + C(m + e(j)).

To show that this condition holds true, we need to examine six possible cases, depending on whether adding 1 to m_i and m_j (independently) causes C to increase, decrease, or remain the same. Since i and j play symmetric roles in (22), the six cases are:

Case 1: both additions increase C by 1.
Case 2: the addition to m_i increases C by 1, and the addition to m_j decreases C by 1.
Case 3: the addition to m_i increases C by 1, and the addition to m_j leaves C unchanged.
Case 4: both additions decrease C by 1.
Case 5: the addition to m_i decreases C by 1, and the addition to m_j leaves C unchanged.
Case 6: both additions leave C unchanged.

In each case below, all simplifications are directly justified by the results in Lemma 10.
• Case 1: In this case, the right-hand side of (22) is 2 · C(m) + 2. For the left-hand side, we know in general that C(m + e(i) + e(j)) = C((m + e(i)) + e(j)), which by Lemma 10 can be no greater than C(m) + 2. Hence the inequality in (22) must hold.
• Case 2: In this case, the right-hand side of (22) is 2 · C(m). As for the second term on the left-hand side, by Lemma 10 we must have m_i ≥ m̃_M; meanwhile, m_j is strictly less than either m̃_M (if d is even) or m̃_{M−1} (if d is odd), and so neither inequality is affected by adding 1 to m_i. Therefore we have

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) − 1 = C(m) + 1 − 1 = C(m).

Hence we have an equality in (22).
• Case 3: Similar to Case 2, the two additions are independent of each other. The right-hand side of (22) is 2 · C(m) + 1. In this case, we must have d odd; also, m_i ≥ m̃_M, along with m_j = m̃_{M−1} < m̃_M. Then

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) = C(m) + 1.

Again we obtain an equality in (22).
• Case 4: The right-hand side of (22) is 2 · C(m) − 2. If d is even, then both m_i and m_j are strictly less than m̃_M, and if d is odd, then both are strictly less than m̃_{M−1}. Either way, after adding 1 to m_i, the same inequality still holds for m_j, and so again we have

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) − 1 = C(m) − 1 − 1 = C(m) − 2,

and we get an equality in (22).
• Case 5: The right-hand side of (22) is 2 · C(m) − 1. In this case, d must be odd, with m_i < m̃_{M−1} = m_j < m̃_M. After adding 1 to m_j, we still have m_i less than the (M − 1)th component in the new rearranged vector, and so the effects of the two additions are independent. We obtain

C(m + e(i) + e(j)) = C((m + e(j)) + e(i)) = C(m + e(j)) − 1 = C(m) − 1,

and so once again we have an equality in (22).
• Case 6: This is the slightly surprising case, in which the two additions are not independent of each other. The right-hand side of (22) is 2 · C(m), and we know that d must be odd, with m_i = m_j = m̃_{M−1} < m̃_M. After adding 1 to m_i, we obtain a vector m′ in which m′_j = m_j is now strictly less than m̃′_{M−1}, and so now adding 1 to m_j results in an overall decrease by 1. Hence we have

C(m + e(i) + e(j)) = C((m + e(i)) + e(j)) = C(m + e(i)) − 1 = C(m) − 1.

Hence the left-hand side of (22) is less than the right-hand side, and the condition is still satisfied.

We have exhausted all possible cases, and so since (22) holds in each of them, the two-dimensional array C^{i,j} has the Monge property. Since i and j were arbitrary, every two-dimensional subarray of C has the Monge property, and so by Lemma 8, we conclude that C itself has the Monge property. □
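As a computational cross-check of Proposition 2 (ours, not part of the proof), the following sketch verifies inequality (22) on every two-dimensional plane of the cost array, exactly as prescribed by Lemmas 8 and 9, for small values of d and n:

```python
import itertools

def cost(m):
    """C(m) from Proposition 1: the largest floor(d/2) coordinates counted positively,
    the smallest floor(d/2) counted negatively, the median omitted when d is odd."""
    s, d = sorted(m), len(m)
    half = d // 2
    return sum(s[d - half:]) - sum(s[:half])

def has_monge_property(d, n):
    """Check (22) on every plane C^{i,j} of the d-dimensional cost array (Lemmas 8 and 9)."""
    for i, j in itertools.combinations(range(d), 2):
        for rest in itertools.product(range(1, n + 1), repeat=d - 2):
            def entry(k, l):
                m = list(rest)
                m.insert(i, k)      # put k in coordinate i and l in coordinate j, as in (21)
                m.insert(j, l)
                return cost(m)
            for k, l in itertools.product(range(1, n), repeat=2):
                if entry(k, l) + entry(k + 1, l + 1) > entry(k + 1, l) + entry(k, l + 1):
                    return False
    return True

print(all(has_monge_property(d, 4) for d in (2, 3, 4, 5)))   # True
```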
References

1. M. Abramson, A simple solution of Simon Newcomb's problem, J. Combin. Theory Ser. A (1975), 223–225.
2. A. Aggarwal and J.K. Park, Sequential searching in multidimensional monotone arrays, Research Report RC 15128, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1989.
3. W. Bein, P. Brucker, J. Park, and P. Pathak, A Monge property for the d-dimensional transport problem, Discrete Appl. Math. (1995), no. 2, 97–109.
4. R. Bourn and J. Willenbring, Expected value of the one-dimensional earth mover's distance, Algebr. Stat. (2020), no. 1, 53–78.
5. W. Bruns and J. Herzog, Cohen-Macaulay rings, Cambridge University Press, 1993.
6. F. Caselli, On the multivariate Robinson–Schensted correspondence, Bollettino dell'Unione Matematica Italiana (2009), no. 1, 591–602.
7. H. Çolakoğlu, On the distance formulae in the generalized taxicab geometry, Turkish J. Math. (2019), no. 3, 1578–1594.
8. J. Dillon and D. Roselle, Simon Newcomb's problem, SIAM J. Appl. Math. (1969), no. 6, 1086–1093.
9. T. Enright and J. Willenbring, Hilbert series, Howe duality, and branching for classical groups, Ann. of Math. (2004), no. 1, 337–375.
10. A. Frohmader and H. Volkmer, arXiv:1912.04945, 2019.
11. W. Fulton, Young tableaux, Cambridge University Press, 1997.
12. R. Goodman and N. Wallach, Symmetry, representations, and invariants, Springer, 2009.
13. J. Harris, Algebraic geometry: a first course, Springer-Verlag, 1995.
14. A. Hoffman, On simple linear programming problems, Convexity: Proceedings of the Seventh Symposium in Pure Mathematics of the AMS (V. Klee, ed.), American Mathematical Society, Providence, RI, 1963, pp. 317–327.
15. R. Howe, Remarks on classical invariant theory, Trans. Amer. Math. Soc. (1989), no. 2, 539–570.
16. R. Howe, E.-C. Tan, and J.F. Willenbring, Stable branching rules for classical symmetric pairs, Trans. Amer. Math. Soc. (2005), no. 4, 1601–1626.
17. J. Kline, Properties of the d-dimensional earth mover's problem, Discrete Appl. Math. (2019), 128–141.
18. J. Kretschmann, Earth mover's distance between grade distribution data with fixed mean, Master's thesis, University of Wisconsin–Milwaukee, 2020.
19. G. Monge, Mémoire sur la théorie des déblais et des remblais, Histoire de l'Académie Royale des Sciences de Paris, 1781, pp. 666–704.
20. M. Morales, Segre embeddings, Hilbert series, and Newcomb's problem, HAL ID: hal-00839652, 2013.
21. J. Park, The Monge array: an abstraction and its applications, Ph.D. thesis, Massachusetts Institute of Technology, 1991.
22. C. Villani, Optimal transport, old and new, Springer, 2008.