# Matching with Trade-offs: Revealed Preferences over Competing Characteristics

arXiv preprint [econ.GN].

Alfred Galichon (Economics Department, École polytechnique; e-mail: [email protected]) and Bernard Salanié (Department of Economics, Columbia University; e-mail: [email protected]).

First version dated December 6, 2008. The present version is of October 14, 2009. The authors are grateful to Guillaume Carlier, Pierre-André Chiappori, Piet Gauthier, Jim Heckman, Guy Laroque, Rob Shimer, as well as seminar participants at Crest, École polytechnique, séminaire Roy, University of Chicago, and University of Alicante for useful comments and discussions. This paper is now superseded by "Cupid's Invisible Hand" by the same authors.

## Abstract

We investigate in this paper the theory and econometrics of optimal matchings with competing criteria. The surplus from a marriage match, for instance, may depend both on the incomes and on the educations of the partners, as well as on characteristics that the analyst does not observe. Even if the surplus is complementary in incomes, and complementary in educations, imperfect correlation between income and education at the individual level implies that the social optimum must trade off matching on incomes and matching on educations. Given a flexible specification of the surplus function, we characterize under mild assumptions the properties of the set of feasible matchings and of the socially optimal matching. Then we show how data on the covariation of the types of the partners in observed matches can be used to test that the observed matches are socially optimal for this specification, and to estimate the parameters that define social preferences over matches.

Keywords: matching, logit, generalized linear models, revealed preferences, contingency tables.

JEL codes: C78, D61, C13.

## Introduction

> Louisa was naturally ill-tempered and cunning; but she had been taught to disguise her real disposition, under the appearance of insinuating sweetness, by a father who but too well knew that to be married would be the only chance she would have of not being starved, and who flattered himself that with such an extraordinary share of personal beauty, joined to a gentleness of manners, and an engaging address, she might stand a good chance of pleasing some young man who might afford to marry a girl without a shilling.
>
> Jane Austen,

*Lesley Castle* (1792).

Starting with Becker (1973), most of the economic theory of one-to-one matching has focused on the case when the surplus created by a match is a function of just two numbers: the one-dimensional types of the two partners. As is well-known, if the types of the partners are one-dimensional and are complementary in producing surplus, then the socially optimal matches exhibit positive assortative matching. Moreover, the resulting configuration is stable, it is in the core of the corresponding matching game, and it can be implemented by the celebrated Gale and Shapley (1962) deferred acceptance algorithm.

While this result is both simple and powerful, its implications are also quite unrealistic. If we focus on marriage and type is education for instance, then positive assortative matching has the most educated woman marrying the most educated man, then the second most educated woman marrying the second most educated man, and so on. In practice the most educated woman would weigh several criteria in deciding upon a match; even in the frictionless world studied by theory, the social surplus her match creates may be higher if she marries a man with less education but, say, a similar income. Since income and education are only imperfectly correlated, the optimal match must trade off assortative matching along these two dimensions. This point is quite general: with multiple types, the stark predictions of the one-dimensional case break down.

Empirical analysts of matching have long felt the need to accommodate the imperfect assortative matching observed in the data, of course. This can be done by introducing noise, in the form of heterogeneity in the creation of surplus that is unobserved by the analyst (see Choo and Siow (2006)). Models with multidimensional types can also be estimated from the data, as in Chiappori et al. (2008). But as far as we know, there has been little theoretical work exploring the properties of optimal or equilibrium matches in such models.
We show in this paper that these properties can be summed up in simple measures of covariation of types across partners; we analyze the set of values of such measures that can be rationalized by a matching model; and we show how to estimate this set from data and to test that the observed matching is socially optimal.

While we use the language of the economic theory of marriage in our illustrations, nothing we do actually depends on it. The methods proposed in this paper apply just as well to any one-to-one matching problem (or bipartite matchings, to use the terminology of applied mathematics). In fact, we can even extend them to problems in which the sets of partners are determined endogenously, as with same-sex unions. This is investigated in Section 7, where we consider possible extensions of our setting.

We do require, however, that utility be transferable across partners. Our primitive function is indeed the surplus created by a match. We posit that it is an unknown function of the types of the partners only, plus preference shocks that are observed by all participants but not by the analyst, in the nature of unobserved heterogeneity. When utility is transferable, all optimal matchings must maximize the joint surplus; and so does the equilibrium of the assignment game.

As is well-known, this model is too general to be empirically testable: even without unobserved heterogeneity, any observed assignment can be rationalized by a well-chosen surplus function. This is a consequence of a more general theorem by Blair (1984). Echenique (2008) shows that on the other hand, some collections of matchings are not rationalizable: if the analyst can observe identical populations on several assignments, then these assignments cannot always be rationalized by a single surplus function.

A word on terminology: like most of the literature, we call a "match" the pairing of two partners, and a "matching" the list of all realized matches.

**Summary of the notation used in the paper.**

For the reader's convenience, we regroup here the notation introduced in the text. We consider matches between $N$ men and $N$ women. $\mathcal{S}_N$ is the set of permutations of $\{1, \ldots, N\}$. A man has a full type $\tilde{x} = (x, \varepsilon)$, where the econometrician observes $x$ but not $\varepsilon$; we use $\tilde{y} = (y, \eta)$ for a woman. $x$ is a random vector with distribution $P$, and $\tilde{x}$ is distributed according to $\tilde{P}$; we use $Q$ and $\tilde{Q}$ for a woman. We denote $\mathcal{M}(P, Q)$ the set of probability distributions with margins $P$ and $Q$; we use $\mathcal{M}(\tilde{P}, \tilde{Q})$ for the full types. We denote $P \otimes Q$ the product measure which matches men and women randomly. A feasible matching generates a probability $\tilde{\Pi} \in \mathcal{M}(\tilde{P}, \tilde{Q})$, which assesses the odds that a man with full type $\tilde{x}$ is married to a woman with full type $\tilde{y}$. A man with full type $\tilde{x}$ and a woman with full type $\tilde{y}$ generate together a full surplus $\tilde{\Phi}(\tilde{x}, \tilde{y})$. We call $\Phi(x, y) = E[\tilde{\Phi}(\tilde{X}, \tilde{Y}) \mid X = x, Y = y]$ the observable surplus; in some of the paper we take it to be the structural quadratic form $\Phi(x, y) = x' \Lambda y$.

Throughout the paper, we assume that two subpopulations $\mathcal{M}$ and $\mathcal{W}$ of equal size must be matched: each man (as we will call the members of $\mathcal{M}$) must be matched with one and only one member of $\mathcal{W}$ (we will call them women). Thus we do not model the determination of the unmatched population (the singles) in this paper; we take it as data. We elaborate on this point in our concluding remarks. Note also that we assumed bipartite matching: the two subpopulations which define admissible partners are exogenously given. This assumption can also be relaxed; see Section 7.

Throughout the paper, we illustrate results on the education/income example sketched in the Introduction, which we denote (ER).

Each man $m$ has an $r$-dimensional type $x_m$ of observable characteristics, and a vector of unobserved characteristics $\varepsilon_m$. Denote $\tilde{x}_m = (x_m, \varepsilon_m)$ the full description of man $m$'s characteristics, which we call his full type.
Each woman $w$ similarly has an $s$-dimensional type $y_w$ of observed characteristics, and a full type $\tilde{y}_w = (y_w, \eta_w)$.

We denote $\tilde{P}$ (resp. $\tilde{Q}$) the distribution of full types $\tilde{x}$ (resp. $\tilde{y}$) in the subpopulation $\mathcal{M}$ (resp. $\mathcal{W}$), and $P$ (resp. $Q$) the distribution of observable types $x$ (resp. $y$). Thus $P$ is a probability distribution on $\mathbb{R}^r$ and $Q$ is a distribution on $\mathbb{R}^s$. In observed datasets we will have a finite number $N$ of men and women, so that $P$ and $Q$ are the empirical distributions over the characteristics samples of the men $\{x_1, \ldots, x_N\}$ and the women $\{y_1, \ldots, y_N\}$, respectively.

Take the education/income example: there $r = s = 2$, the first dimension of types is education $E \in \{D, G\}$ (dropout or graduate), and the second dimension is income class $R$, which takes values 1 to $n_r$. $P$ describes both the number of graduates among men and the distributions of income among graduate men and among dropout men.

The intuitive definition of a matching is the specification of "who marries whom": given a man of index $m \in \{1, \ldots, N\}$, it is simply the index of the woman he marries, $w = \sigma(m) \in \{1, \ldots, N\}$. Imposing that each man be married to one and only one woman at a given time translates into the requirement that $\sigma$ be a permutation of $\{1, \ldots, N\}$, which we denote $\sigma \in \mathcal{S}_N$. This definition is too restrictive in so far as we would like to allow for some randomization. This could arise because a given type is indifferent between several partner types; or because the analyst only observes a subset of relevant characteristics, and the unobserved heterogeneity induces apparent randomness.

A feasible matching (or assignment) is therefore defined in all generality as a joint distribution $\tilde{\Pi}$ over types of partners $\tilde{X}$ and $\tilde{Y}$, such that the marginal distribution of $\tilde{X}$ is $\tilde{P}$ and the marginal distribution of $\tilde{Y}$ is $\tilde{Q}$. We denote $\mathcal{M}(\tilde{P}, \tilde{Q})$ the set of such joint distributions.
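As a numerical illustration (hypothetical types and numbers, not from the paper), the sketch below encodes a permutation matching of $N$ individuals as a joint distribution over observable types, and checks the defining feasibility property: the margins of the induced $\Pi$ are the empirical type distributions $P$ and $Q$.

```python
import numpy as np

# Hypothetical example: N = 6 men and 6 women, each with one of 2 observable types.
N = 6
x = np.array([0, 0, 0, 1, 1, 1])      # men's observable types
y = np.array([0, 0, 1, 1, 1, 1])      # women's observable types
sigma = np.array([2, 0, 1, 5, 3, 4])  # a permutation: man m marries woman sigma[m]

# Empirical margins P and Q over observable types.
P = np.bincount(x, minlength=2) / N
Q = np.bincount(y, minlength=2) / N

# The matching induces a joint distribution Pi over (x, y): each matched pair
# contributes mass 1/N to the cell (x_m, y_{sigma(m)}).
Pi = np.zeros((2, 2))
for m in range(N):
    Pi[x[m], y[sigma[m]]] += 1 / N

# Feasibility: the margins of Pi must be P and Q.
assert np.allclose(Pi.sum(axis=1), P)
assert np.allclose(Pi.sum(axis=0), Q)
```

Any permutation matching passes this check; conversely, a joint distribution failing it cannot be generated by any matching of these two populations.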
Note that when $x$ and $y$ are univariate, a feasible matching can be equivalently specified through a copula.

A matching is said to be pure if all conditional distributions $\tilde{\Pi}(\cdot \mid \tilde{x})$ and $\tilde{\Pi}(\cdot \mid \tilde{y})$ are point mass distributions. In a pure matching $\tilde{\Pi}$, there exists an invertible map $T(\tilde{x})$ such that a man with type $\tilde{x}$ almost surely marries a woman of type $\tilde{y} = T(\tilde{x})$, and conversely, a woman with type $\tilde{y}$ almost surely marries a man of type $\tilde{x} = T^{-1}(\tilde{y})$. (Of course, in the discrete case this map can be represented as a permutation on indices.)

In the education/income example (ER), a pure matching is described by $(2n_r)^2$ zero-one numbers subject to $(4n_r - 1)$ independent margin constraints, so that $(2n_r - 1)^2$ numbers are to be determined.

The basic assumption of the model is that matching man $m$ of type $\tilde{x}_m$ and woman $w$ of type $\tilde{y}_w$ generates a joint surplus $\tilde{\Phi}(\tilde{x}_m, \tilde{y}_w)$, where $\tilde{\Phi}$ is a deterministic function. Along with most of the matching literature, we assume that

**Assumption (O): Observability.**

Each agent observes the full characteristics $\tilde{x}$ and $\tilde{y}$ of all men and all women, but the econometrician only observes the subvectors $x$ and $y$.

Assumption (O) rules out asymmetric information between participants in the market, as the economics of matching with incomplete information is a subject of its own. On the other hand, we do not need to assume full information as the notation seems to imply: $\tilde{\Phi}$ could for instance be reinterpreted as the expectation of a random variable conditional on $(\tilde{x}, \tilde{y})$, as long as all participants evaluate it in the same way.

Given Assumption (O), we need to define the observable surplus as the best predictor of $\tilde{\Phi}(\tilde{x}, \tilde{y})$ conditional on $x$ and $y$, that is

$$\Phi(x, y) = E\left[\tilde{\Phi}(\tilde{X}, \tilde{Y}) \mid X = x, Y = y\right]$$

and we can write the decomposition

$$\tilde{\Phi}(\tilde{x}, \tilde{y}) = \Phi(x, y) + k(\tilde{x}, \tilde{y})$$

where $k(\tilde{x}, \tilde{y})$ is the idiosyncratic surplus.

Following the insight of Choo and Siow (2006), formalized by Chiappori et al. (2008), we now assume:

**Assumption (S): Separability.** Let $\tilde{x}$ and $\tilde{x}'$ have the same observable type: $x = x'$. Similarly, let $\tilde{y}$ and $\tilde{y}'$ be such that $y = y'$. Then

$$\tilde{\Phi}(\tilde{x}, \tilde{y}) + \tilde{\Phi}(\tilde{x}', \tilde{y}') = \tilde{\Phi}(\tilde{x}, \tilde{y}') + \tilde{\Phi}(\tilde{x}', \tilde{y}).$$

While much of the literature on matching emphasizes complementarity, assumption (S) in fact requires that conditional on observable types, the surplus exhibit no complementarity across unobservable types.

It is easy to see that imposing assumption (S) is equivalent to requiring that the idiosyncratic surplus from a match be additively separable, in the following sense:

$$k(\tilde{x}, \tilde{y}) = \chi(\tilde{x}, y) + \xi(\tilde{y}, x),$$

where $\chi$ and $\xi$ are two deterministic functions and

$$E(\chi(\tilde{X}, Y) \mid X = x, Y = y) = E(\xi(\tilde{Y}, X) \mid X = x, Y = y) = 0.$$

Then the surplus function $\tilde{\Phi}$ can be rewritten as

$$\tilde{\Phi}(\tilde{x}, \tilde{y}) = \Phi(x, y) + \chi(\tilde{x}, y) + \xi(\tilde{y}, x).$$
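Assumption (S) is a cross-difference restriction that can be checked numerically on any candidate surplus. The minimal sketch below (all functions and numbers are made up for illustration) builds a full surplus of the separable form $\Phi(x, y) + \chi(\tilde{x}, y) + \xi(\tilde{y}, x)$ and verifies that the identity in (S) holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up components on a small discrete grid of observable types.
n_x, n_y = 3, 4
Phi = rng.normal(size=(n_x, n_y))   # observable surplus Phi(x, y)

def chi(eps, y):    # hypothetical idiosyncratic taste of a man (x, eps) for type y
    return 0.5 * eps * (y + 1)

def xi(eta, x):     # hypothetical idiosyncratic taste of a woman (y, eta) for type x
    return -0.3 * eta * (x + 1)

def full_surplus(x, eps, y, eta):
    # Separable form: observable surplus plus additively separable shocks.
    return Phi[x, y] + chi(eps, y) + xi(eta, x)

# Two men sharing observable type x0, two women sharing observable type y0.
x0, y0 = 1, 2
eps1, eps2 = rng.normal(size=2)
eta1, eta2 = rng.normal(size=2)
lhs = full_surplus(x0, eps1, y0, eta1) + full_surplus(x0, eps2, y0, eta2)
rhs = full_surplus(x0, eps1, y0, eta2) + full_surplus(x0, eps2, y0, eta1)
assert np.isclose(lhs, rhs)   # assumption (S) holds for any separable surplus
```

Swapping partners within an observable cell leaves total surplus unchanged, which is exactly the "no complementarity across unobservable types" content of (S).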
Under this decomposition, a man $\tilde{x} = (x, \varepsilon)$ has a preference $\chi(\tilde{x}, y)$ for a particular class of observable characteristics (say $y$), but he is indifferent between all partners which have the same $y$ but a different $\eta$.

In fact, the optimal matching is characterized by two functions of observable characteristics $U(x, y)$ and $V(x, y)$ that sum up to $\Phi(x, y)$, such that if a man $\tilde{x} = (x, \varepsilon)$ is matched with a woman of characteristics $\tilde{y} = (y, \eta)$, he will get utility

$$U(x, y) + \chi(\tilde{x}, y)$$

while his match gets utility

$$V(x, y) + \xi(\tilde{y}, x).$$

Chiappori et al. (2008) showed that given assumption (S), the matching problem boils down to a set of discrete choice models for each type of man and of woman: for instance, man $\tilde{x}$ is matched in equilibrium to a woman $\tilde{y}$ whose observable type $y$ maximizes $U(x, y) + \chi(\tilde{x}, y)$ over all values in the support of $Q$.

While this is already quite useful, we need to add more restrictions on the specification of the components of the idiosyncratic surplus $\chi(\tilde{x}, y)$ and $\xi(\tilde{y}, x)$. Following Choo and Siow (2006) and Chiappori et al. (2008), we introduce the following assumption:

**Assumption (GUI): Gumbel Unobserved Interactions.**

It is assumed that:

- there are an infinite number of individuals with a given observable type in the population;
- fix the observable characteristics $x$ of a man, and let $(y^*_1, \ldots, y^*_{T_y})$ be the possible values of the observable characteristics of women; then the vector of preference shocks $(\chi(x, \varepsilon, y^*_1), \ldots, \chi(x, \varepsilon, y^*_{T_y}))$ is distributed as $T_y$ independent and centered Gumbel random variables with scale factor $\sigma_1$;
- similarly, fix the observable characteristics $y$ of a woman, and let $(x^*_1, \ldots, x^*_{T_x})$ be the possible values of the observable characteristics of men; then the vector of preference shocks $(\xi(y, \eta, x^*_1), \ldots, \xi(y, \eta, x^*_{T_x}))$ is distributed as $T_x$ independent and centered Gumbel random variables with scale factor $\sigma_2$.

(We define the scale factor to be 1 for the standard Gumbel, which has variance $\pi^2/6$; thus e.g. $\chi$ has variance $\sigma_1^2 \pi^2/6$.)

In short: men of a given observable type have conditionally Gumbel iid draws of the $\chi$'s for different individuals; and conversely for women of a given observable type.

We use (GUI) for the Independence of Irrelevant Alternatives property: without it, the odds ratio of the probability that a man with observable type $x$ ends up in a match with a woman of observable type $y$ rather than with $z$ would also depend on the types of other women, and the model would become unmanageable.

(GUI) underlies the standard multinomial logit model of discrete choice. It has well-known limitations, one of which is that it does not extend directly to continuous choice. We are currently exploring alternative specifications that would allow us to deal with continuous characteristics; but at this stage, we assume

**Assumption (DD):**

The distributions of observed types $P$ and $Q$ are discrete, with probability mass functions $p(x)$ and $q(y)$.

In the (ER) example for instance, $p(D, 3)$ is the proportion of men who are dropouts and whose income lies in class 3. For simplicity, we now denote $i_P = 1, \ldots, n_P$ the possible values of types of men, and $i_Q = 1, \ldots, n_Q$ for women.

### 1.5 Specifying the observable surplus

We now introduce sets of assumptions on the observable surplus, ranging from non-restrictive (NPOI below, suited for nonparametric identification) to more restrictive (SLOI below, convenient for a more concise analysis).

Let us first impose a normalization convention on the observable surplus. Notice that the optimal matching (but not the value of the social surplus) is left unchanged if we add an additively decomposable function $f(x) + g(y)$ to $\Phi(x, y)$. Therefore, without any loss of generality, we impose some identifying restriction on $\Phi$, using the two-way ANOVA decomposition, according to which any vector $\Phi(x, y)$ admits the following orthogonal decomposition in $L^2(\pi)$:

$$\Phi(x, y) = \bar{\Phi}(x, y) + f(x) + g(y) + c$$

where $E_p[f(X)] = E_q[g(Y)] = 0$ and $E[\bar{\Phi}(X, Y) \mid X] = E[\bar{\Phi}(X, Y) \mid Y] = 0$. We shall therefore often take the following convention when using a nonparametric approach:

**Convention (ZMOI): Zero-Mean Observable Interactions.**

The observable surplus satisfies

$$E[\Phi(X, Y) \mid X] = E[\Phi(X, Y) \mid Y] = 0.$$

It will sometimes be useful to assume more structure on the function $\Phi$ (the observable joint surplus). To do this, we consider $K$ given basis assorting functions $\varphi_1(x, y), \ldots, \varphi_K(x, y)$ whose values are interpreted as the utility benefit of interaction between type $x$ and type $y$. Given assorting weights $\Lambda \in \mathbb{R}^K$, we focus on observable surplus functions $\Phi_\Lambda(x, y)$ which are linear combinations of the basis assorting functions with weights $\Lambda$. That is,

**Assumption (SLOI): Semilinear Observable Interactions.**

The observable surplus function can be written

$$\Phi_\Lambda(x, y) = \sum_{k=1}^{K} \Lambda_k \varphi_k(x, y) \tag{1.1}$$

where the sign of each $\Lambda_k$ is unrestricted.

In itself, assumption (SLOI) is not restrictive. Indeed, one can choose $K = T_x \times T_y$ and choose $\varphi_{ij}(x, y) = 1\{x = x_i, y = y_j\}$, so that $\varphi_{ij}(x, y)$ captures the interaction between observable man type $x_i$ and observable woman type $y_j$. We shall refer to this specification as the:

**Specification (NPOI): Nonparametric Observable Interactions.**

The observable surplus function is expanded in all generality as

$$\Phi_\Lambda(x, y) = \sum_{i=1}^{T_x} \sum_{j=1}^{T_y} \Lambda_{ij}\, 1\{x = x_i, y = y_j\}, \tag{1.2}$$

in which case the social weight $\Lambda_{ij}$ coincides with $\Phi(x_i, y_j)$.

However we favor parsimonious models for the sake of analysis, so in general we shall only assume (SLOI), unless explicitly stated.

To return to the education/income example (ER): we could for instance assume that a match between man $m$ and woman $w$ creates a surplus that depends on the similarity of the partners in both education and income dimensions. The corresponding specification would be (with education levels $E = (D, G)$ coded as $(0, 1)$):

$$\Phi(x_m, y_w) = \sum_{e_m = 0,1;\; e_w = 0,1} \Lambda^E_{e_m, e_w}\, 1(E_m = e_m, E_w = e_w) + \sum_{i = 1, \ldots, n_r;\; j = 1, \ldots, n_r} \Lambda^R_{ij}\, 1(R_m = i, R_w = j).$$

This specification only has $(n_r^2 + 4)$ parameters, while an unrestricted specification would have $4 n_r^2$. Such an unrestricted specification would for instance allow the effect of matching partners in income class 3 to depend on both of their education levels.

An even more restrictive, "diagonal" specification would be

$$\Phi(x_m, y_w) = \sum_{e = 0,1} \Lambda^E_e\, 1(E_m = E_w = e) + \sum_{i = 1, \ldots, n_r} \Lambda^R_i\, 1(R_m = R_w = i).$$

In this last form, it is clear that the relative importance of the $\Lambda$'s reflects the relative importance of the criteria. Thus $\Lambda^R_i$ measures the preference for matching partners who are both in income class $i$, while $\Lambda^E_0$ measures the preference for matching dropouts. The relative values of these numbers indicate how much social preferences value complementarity in incomes of partners, relative to complementarity in educations. We will not need to assume such a diagonal structure in the following, although our results easily specialize to this case.

Under assumptions (O), (S), (SLOI) and (GUI), the model is fully parametrized; its parameters can be collected in a vector

$$\theta = (\Lambda, \sigma_1, \sigma_2),$$

where $\Lambda$ is the assorting weight matrix, and $\sigma_1$ (resp. $\sigma_2$) is the scale factor of the unobservable characteristics of the men (resp.
of women). Without loss of generality, all components of $\theta$ can be multiplied by any positive number; hence we shall need to impose some normalization on $\theta$.

Most of the results in the next section in fact only require assumptions (O), (S) and (GUI), with a general function $\Phi(x, y)$. In this case $\theta$ is just $(\Phi, \sigma_1, \sigma_2)$, and again it is defined up to a scale factor.

As we will see, the total heterogeneity $(\sigma_1 + \sigma_2)$ plays a key role in our results; thus we introduce a specific notation for it:

$$\sigma = \sigma_1 + \sigma_2.$$

## 2 The optimal matching

In this section we only assume (O), (S), and (GUI), and we consider the problem of optimal matching:

$$W(\theta) = \sup_{\tilde{\Pi} \in \mathcal{M}(\tilde{P}, \tilde{Q})} E_{\tilde{\Pi}}\left[\tilde{\Phi}(\tilde{X}, \tilde{Y})\right]. \tag{2.1}$$

Our modeling strategy in this section and the next is to assume that the number of men and women in the population is large enough that averages can be replaced with expectations. When we describe our estimators in Section 5, we of course take into account the fact that we only have a finite sample.

Let us provide some intuition before we state a formal theorem. Under (O), (S) and (GUI), standard formulæ of the multinomial logit model give the expected utility of a man of observable type $x$ at the optimal matching:

$$E\left[\max_y \left(U(x, y) + \chi(\tilde{X}, y)\right) \mid X = x\right] = \sigma_1 \log \sum_y \exp(U(x, y)/\sigma_1).$$

Therefore the expected social surplus from the optimal matching is simply (adding the equivalent formula for women of observable type $y$):

$$\sigma_1 E_P \log \sum_y \exp(U(X, y)/\sigma_1) + \sigma_2 E_Q \log \sum_x \exp(V(x, Y)/\sigma_2).$$

Now recall that $U(x, y)$ is the mean utility of a man with observable type $x$ who ends up being matched to a woman with observable type $y$ at the optimum. As in the general development of the theory of matching, $U$ is the value of the multiplier of the population constraints; and as such, it (along with $V$) is the unknown function in the dual program in which the expression for the social surplus above is minimized over all $U, V$ such that $U + V \geq \Phi$.
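Both logit formulas above are easy to verify by simulation. In the sketch below (hypothetical utilities $U(x, \cdot)$ and scale $\sigma_1$, made up for illustration), choice frequencies converge to the multinomial logit probabilities, and the average realized maximum converges to the log-sum-exp expression $\sigma_1 \log \sum_y \exp(U(x, y)/\sigma_1)$.

```python
import numpy as np

rng = np.random.default_rng(42)

U = np.array([1.0, 0.5, -0.2])   # hypothetical U(x, y) over 3 women types, fixed x
sigma1 = 0.7                     # scale factor of the men's Gumbel shocks
n_draws = 200_000

# Centered Gumbel with scale sigma1 (the standard Gumbel has mean equal to the
# Euler-Mascheroni constant, which we subtract before scaling).
euler_gamma = 0.5772156649015329
shocks = sigma1 * (rng.gumbel(size=(n_draws, 3)) - euler_gamma)

# Each simulated man picks the woman type maximizing U + shock.
choices = np.argmax(U + shocks, axis=1)
freq = np.bincount(choices, minlength=3) / n_draws
logit = np.exp(U / sigma1) / np.exp(U / sigma1).sum()
assert np.allclose(freq, logit, atol=0.01)

# Expected utility of the chosen match matches the log-sum-exp formula.
emax = (U + shocks).max(axis=1).mean()
assert np.isclose(emax, sigma1 * np.log(np.exp(U / sigma1).sum()), atol=0.01)
```

Note that centering the shocks is what makes the expected maximum exactly the log-sum-exp term, without the usual additive Euler constant.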
We now state this as a theorem (proved in the Appendix):

**Theorem 1 (Social welfare: primal version)**

Assume (O), (S), (GUI) and (DD). Then

$$W(\theta) = \inf_{(U, V) \in \mathcal{A}} \left( \sigma_1 E_P \log \sum_y \exp(U(X, y)/\sigma_1) + \sigma_2 E_Q \log \sum_x \exp(V(x, Y)/\sigma_2) \right) \tag{2.2}$$

where the constraint set $\mathcal{A}$ is defined by the inequalities

$$\forall x, y, \quad U(x, y) + V(x, y) \geq \Phi(x, y).$$

Since this formula may not be entirely transparent, we develop one term below:

$$E_P \log \sum_y \exp(U(X, y)/\sigma_1) = \sum_x p(x) \log \sum_y \exp(U(x, y)/\sigma_1).$$

At an optimal matching, men with observable type $x$ will be found in matches with women with observable types $y$ such that $U(x, y) + V(x, y) = \Phi(x, y)$. The expected utility of men with observable type $x$ matched with women of observable type $y$ is $U(x, y)$.

This theorem also has a dual version, of course. While deriving it takes a bit more work (again, see the Appendix), the intuition is simple. First, if there were no unobserved heterogeneity (with $\sigma$ close to zero) the optimal matching would coincide with the optimal observable matching $\Pi$, which solves

$$W(\theta) = \sup_{\Pi \in \mathcal{M}(P, Q)} E_\Pi \Phi(X, Y).$$

Going to the polar opposite, in the limit when $\sigma$ goes to infinity only unobserved heterogeneity would count; and since it is just noise, the optimal matching would simply assign partners randomly, yielding the product measure $P \otimes Q$.

As it turns out, when $\sigma$ takes any intermediate value the optimal matching maximizes a weighted sum of these two extreme cases:

**Theorem 2 (Social welfare: dual version)**

Under the assumptions of Theorem 1,

$$W(\theta) = \sup_{\Pi \in \mathcal{M}(P, Q)} \left( \sum_{x, y} \pi(x, y) \Phi(x, y) - \sigma I(\Pi) \right) + \sigma_1 S(Q) + \sigma_2 S(P), \tag{2.3}$$

where $S(P)$ and $S(Q)$ are the entropies of $P$ and $Q$ given by

$$S(P) = -\sum_x p(x) \log p(x) \quad \text{and} \quad S(Q) = -\sum_y q(y) \log q(y),$$

and $I(\Pi)$ is the mutual information of the joint distribution $\Pi$, given by

$$I(\Pi) = \sum_{x, y} \pi(x, y) \log \frac{\pi(x, y)}{p(x) q(y)}.$$

The mutual information $I(\Pi)$ is nothing else than the Kullback-Leibler divergence of $\Pi$ with respect to the independent product $P \otimes Q$. Recall two important information-theoretic properties of $I$:

1. The map $\pi \to I(\pi)$ is strictly convex.
2. One has $S(P) + S(Q) \geq I(\Pi) \geq 0$ for all $\Pi \in \mathcal{M}(P, Q)$, with $I(\Pi) = 0$ if and only if $\Pi = P \otimes Q$, as we shall see below.

Mutual information is a measure of the covariation of types $x$ and $y$. Now $P \otimes Q$ is the independent product of $P$ and $Q$, which corresponds to a completely random matching $\Pi = P \otimes Q$. Thus a large positive $I(\Pi)$ indicates that the matching $\Pi$ induces strong correlation across types; $I(\Pi) = 0$ if and only if $\Pi = P \otimes Q$. If $\sigma$ is very large then the Theorem suggests that $I(\Pi)$ should be minimized, which can only occur for the independent matching $\Pi = P \otimes Q$; whereas if $\sigma$ is negligible then $\Pi$ should be chosen so as to maximize the expected observable surplus $E_\Pi \Phi(X, Y)$. This corroborates the intuition given earlier.

Now the optimal matchings coincide with the solutions to this maximization problem. Since we only observe the realized $\Pi$ over observable variables, Theorem 2 defines the empirical content of the model: a combination of the parameters $\theta = (\Phi, \sigma_1, \sigma_2)$ is identified if and only if the solution $\Pi$ depends non-trivially on it.

We already knew that $\theta$ can be rescaled by any positive constant without altering the solution. We can now go one step further: while all components of $\theta$ figure in this theorem, $\sigma_1$ and $\sigma_2$ only enter through their sum $\sigma$.
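The inner supremum in (2.3) is an entropy-regularized linear program; in the discrete case its first-order conditions give $\pi(x, y) \propto p(x) q(y) \exp(\Phi(x, y)/\sigma)$ up to margin-fitting factors, so it can be computed by iterative proportional fitting (Sinkhorn scaling). The sketch below uses made-up margins and surplus; the solver choice is ours, not a procedure from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete setting: 3 men types, 4 women types (made-up numbers).
p = np.array([0.5, 0.3, 0.2])            # margin P
q = np.array([0.25, 0.25, 0.25, 0.25])   # margin Q
Phi = rng.normal(size=(3, 4))            # observable surplus Phi(x, y)
sigma = 0.5                              # total heterogeneity sigma_1 + sigma_2

# Sinkhorn scaling of the kernel p(x) q(y) exp(Phi(x, y) / sigma) to margins p, q.
K = np.outer(p, q) * np.exp(Phi / sigma)
a, b = np.ones(3), np.ones(4)
for _ in range(1000):
    a = p / (K @ b)
    b = q / (K.T @ a)
a = p / (K @ b)
pi = a[:, None] * K * b[None, :]

# Feasibility: the margins of pi are p and q.
assert np.allclose(pi.sum(axis=1), p)
assert np.allclose(pi.sum(axis=0), q)

# Value of the inner objective; random matching P (x) Q is feasible with I = 0,
# so the optimum weakly improves on it.
I_pi = np.sum(pi * np.log(pi / np.outer(p, q)))
value = np.sum(pi * Phi) - sigma * I_pi
assert value >= np.sum(np.outer(p, q) * Phi) - 1e-9
assert I_pi >= -1e-12
```

As $\sigma$ grows the computed $\pi$ approaches $P \otimes Q$, and as $\sigma$ shrinks it concentrates on the surplus-maximizing cells, matching the two limiting cases discussed above.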
Thus, as announced, $\sigma_1$ and $\sigma_2$ are not separately identified. Accordingly, we redefine the parameter vector $\theta$ as

$$\theta = (\Phi, \sigma), \quad \text{or} \quad \theta = (\Lambda, \sigma) \text{ under (SLOI).}$$

### 2.2 The homogeneous limit

In this section we consider the limit behavior of our model when $\sigma$ goes to zero, so that unobservable heterogeneity vanishes. We denote $W(\Phi) \equiv W(\Phi, 0)$. By taking the limit in Theorem 1, we obtain:

**Theorem 3 (Homogeneous social welfare)**

Assume (O) and (DD). Then:

a) The value of the social optimum when $\theta = (\Phi, 0)$ is given both by

$$W(\Phi) = \max_{\Pi \in \mathcal{M}(P, Q)} \sum_{x, y} \pi(x, y) \Phi(x, y), \tag{2.4}$$

and by

$$W(\Phi) = \inf_{(u, v) \in \mathcal{A}} \left( \sum_x p(x) u(x) + \sum_y q(y) v(y) \right) \tag{2.5}$$

where the constraint set $\mathcal{A}$ is given by

$$\forall x, y, \quad u(x) + v(y) \geq \Phi(x, y);$$

b) A matching $(X, Y) \sim \Pi$ is optimal for $\Phi$ if and only if the equality $u(X) + v(Y) = \Phi(X, Y)$ holds $\Pi$-almost surely, where $u$ and $v$ solve the optimization problem (2.5).

Thus in the limit we recover the standard primal and dual formulations of the matching problem; since all men with observable characteristics $x$ have the same tastes, they all obtain the same utility at the optimum and $U(x, y)$ becomes a function of $x$ only, which we denoted $u(x)$ above; and this is just the Lagrange multiplier on the population constraint

$$\sum_y \pi(x, y) = p(x)$$

which is implicit in the notation $\Pi \in \mathcal{M}(P, Q)$.
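In the homogeneous limit, (2.4) is a linear program over $\mathcal{M}(P, Q)$; with equal numbers on both sides it reduces to a linear assignment problem, which off-the-shelf solvers handle. The sketch below (made-up surplus matrix, solver choice ours) checks a standard solver against brute-force enumeration of permutations.

```python
import numpy as np
from itertools import permutations
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)

# Hypothetical homogeneous case: N = 5 men and 5 women, surplus between individuals.
N = 5
Phi = rng.normal(size=(N, N))

# With sigma = 0 the optimum is a pure (permutation) matching maximizing total surplus.
rows, cols = linear_sum_assignment(Phi, maximize=True)
W0 = Phi[rows, cols].sum()

# Sanity check against brute force over all N! permutations.
best = max(sum(Phi[i, s[i]] for i in range(N)) for s in permutations(range(N)))
assert np.isclose(W0, best)
```

The LP relaxation over doubly stochastic matrices has a permutation matrix among its solutions (Birkhoff), which is why the assignment solver recovers the value of (2.4).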

## 3 Qualitative properties of the optimum

In this section we first introduce the various statistics on which our analysis shall rest. We then provide comparative statics which help understand the influence of the model parameters on these statistics; last, we study qualitative properties of the equilibria, such as uniqueness and purity.

**Feasible summaries.**

Recall that under (SLOI), there exists an unknown vector $\Lambda$ such that the observable surplus function takes the form

$$\Phi(x, y) = \sum_{k=1}^{K} \Lambda_k \varphi_k(x, y)$$

with known basis functions $\varphi_k$. Now consider a hypothetical matching $\Pi$; under this matching, the basis functions have expected values

$$C^k(\Pi) = \sum_{x, y} \pi(x, y) \varphi_k(x, y).$$

We call each $C^k$ a covariation. Take the (ER) example; then

- $C^1$ is the proportion of matches among graduate partners under $\Pi$;
- $C^2$ is the expected income of a graduate man's wife multiplied by the proportion of graduate men; $C^3$ is defined similarly;
- and $C^4$ is the expected product of the partners' incomes.

Random matching, as represented by $\Pi_\infty = P \otimes Q$, plays a special role in our analysis, as it obtains in the limit when heterogeneity becomes very large. We denote the corresponding covariations as $C^k_\infty$. At the polar opposite is the matching $\Pi_0$ that obtains in the homogeneous limit $\sigma = 0$; we denote the implied covariations $C^k_0(\Lambda)$. Note that $C_\infty$ does not depend on $\Lambda$, but $C_0$ does.

We know from Theorem 2 that under (SLOI), the observable optimal matching $\Pi$ maximizes

$$\Lambda \cdot C(\Pi) - \sigma I(\Pi).$$

Thus the vector $(C(\Pi), I(\Pi))$ summarizes all the relevant information about matching $\Pi$. We call each such vector a matching summary; matching summary vectors are set in summary space, which is a subset of $\mathbb{R}^{K+1}$.

Given an observed matching, it is of course very easy to estimate the associated covariation vector and mutual information. Again, the model is scale-invariant and we may impose an arbitrary normalization on $\theta = (\Lambda, \sigma)$. For that purpose we choose a vector $C^*$ and we impose $\Lambda \cdot C^* = 1$.
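Given any observed matching table, the covariations and the mutual information are directly computable. The sketch below uses a made-up matching over (education, income class) types with two simple assorting functions (indicator of equal education, indicator of equal income class), computes $C^1$, $C^2$ and $I(\Pi)$, and checks the entropy bounds stated after Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical types: (education, income class) with 2 educations and 2 income
# classes, flattened to 4 observable types on each side of the market.
types = [(e, r) for e in (0, 1) for r in (1, 2)]
n = len(types)

# A made-up observed matching pi over pairs of types, with margins p and q.
pi = rng.random((n, n))
pi /= pi.sum()
p = pi.sum(axis=1)
q = pi.sum(axis=0)

# Two basis assorting functions: same education, same income class.
phi1 = np.array([[float(tx[0] == ty[0]) for ty in types] for tx in types])
phi2 = np.array([[float(tx[1] == ty[1]) for ty in types] for tx in types])

# Covariations C^k(pi) and mutual information I(pi).
C1 = np.sum(pi * phi1)
C2 = np.sum(pi * phi2)
I = np.sum(pi * np.log(pi / np.outer(p, q)))

# I(pi) lies between 0 (independent matching) and S(P) + S(Q).
S_p = -np.sum(p * np.log(p))
S_q = -np.sum(q * np.log(q))
assert -1e-12 <= I <= S_p + S_q
```

The pair $(C^1, C^2, I)$ is exactly a matching summary in the sense defined above, which is all the model lets the analyst observe about $\Pi$.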
Later we make the choice of $C^*$ more specific.

Given population distributions $P$ and $Q$, we define the set of feasible summaries $\mathcal{F}$ as the set of summary vectors $(C, I)$ that are generated by some feasible matching $\Pi \in \mathcal{M}(P, Q)$, that is

$$\mathcal{F} = \left\{ (C, I) \in \mathbb{R}^K \times [0, S(P) + S(Q)] : \exists \Pi \in \mathcal{M}(P, Q), \; C^k = C^k(\Pi), \; I = I(\Pi) \right\}.$$

Similarly, define the covariogram $\mathcal{F}_c$ as the set of covariations $C$ that are implied by some feasible matching; that is,

$$\mathcal{F}_c = \left\{ C : \exists \Pi \in \mathcal{M}(P, Q), \; C^k = C^k(\Pi) \right\}.$$

Covariograms provide us with a nice graphical representation of the properties of a matching. Figure 1 illustrates their relevant properties, and the reader should refer to it as we go along. To fit it within two dimensions, we assume that there are only two basis functions; e.g. in the (ER) example we could have

$$\Phi(E_m, E_w, R_m, R_w) = \Lambda_1 1(E_m = E_w) + \Lambda_2 1(R_m = R_w),$$

so that $\Lambda_1$ (resp. $\Lambda_2$) measures the preference for assortative matching on educations (resp. income classes). Remember that $I(\Pi) \geq 0$.

Figure 1: The covariogram and related objects.

**Proposition 1**

Under (O), (S), (GUI) and (SLOI), the sets $\mathcal{F}$ and $\mathcal{F}_c$ are nonempty closed convex sets, and their support functions are $W(\Lambda, \sigma)$ and $W(\Lambda, 0)$, respectively.

As will soon become clear, the boundaries of the convex sets $\mathcal{F}$ and $\mathcal{F}_c$ have special significance in our analysis. For now, let us simply note that the boundary of $\mathcal{F}_c$ exhibits kinks when the distributions of characteristics are discrete, which is always the case in our setting. The reason for these kinks is that in the discrete case, the optimal matching for homogeneous types is generically stable under a small perturbation of the assorting weights $\Lambda$: starting from almost every $\Lambda$, a small change in $\Lambda$ leaves covariations unchanged. Any such value of $\Lambda$ generates a covariation vector on a vertex of the polytope. On the other hand, there exists a finite number of values of $\Lambda$ where the optimal matching problem has multiple solutions, with corresponding multiple covariations; each such value of $\Lambda$ generates a facet of the polytope. This is shown on Figure 1, with all $\lambda = \Lambda_2/\Lambda_1$ in an interval $[\lambda_i, \lambda_{i+1}]$ generating the same covariations in the homogeneous case. Remarkably, such kinks disappear as soon as there is a positive amount of heterogeneity; we will come back to this in Section 3.3.

The previous discussion suggests an intimate connection between the boundaries of the sets described above and optimal matchings. To make this clear, we now define the set of rationalizable summaries $\mathcal{R}$ as the set of $(K+1)$-uples $(C, I)$ such that $C$ and $I$ are covariations and mutual information corresponding to an optimal matching $\Pi \in \mathcal{M}(P, Q)$ for some parameter values $(\Lambda, \sigma)$:

$$\mathcal{R} = \left\{ (C, I) \in \mathcal{F} : \exists (\Lambda, \sigma) \in \mathbb{R}^K \times [0, +\infty), \; \Lambda \cdot C - \sigma I = W(\Lambda, \sigma) \right\}.$$

Obviously, rationalizable summaries are feasible and

R ⊂ F . This deﬁnitions allow us tostate that rationalizable summaries and extreme feasible summaries coincide. Or, to put itmore formally:

Proposition 2 $\mathcal{R}$ is the frontier of $\mathcal{F}$ in $\mathbb{R}^K \times [0, S(P)+S(Q)]$.

Mutual information level sets. Now consider a covariation vector $C$ in the covariogram $\mathcal{F}_c$, and define the rationalizing mutual information
$$I_r(C) := \sup \left\{ I \in [0, S(P)+S(Q)] : (C, I) \in \mathcal{R} \right\};$$
from the definition of $\mathcal{R}$ and the positive homogeneity of $\mathcal{W}$, we see that the implicit mutual information function satisfies
$$I_r(C) = \sup_{\lambda} \left\{ \lambda \cdot C - \mathcal{W}(\lambda, 1) \right\},$$ (3.1)
so $I_r(C)$ is the Legendre-Fenchel transform of $\mathcal{W}(\cdot, 1)$, which is strictly convex; in particular $I_r(C)$ is a $C^1$ function. Conversely, for any mutual information $I \geq 0$, define the set of rationalizable covariations by
$$\mathcal{R}_c(I) = \left\{ C \in \mathbb{R}^K : (C, I) \in \mathcal{R} \right\} = I_r^{-1}(\{I\}).$$
It follows directly from the convexity of $I_r$ that $\mathcal{R}_c(I)$ is the boundary of the set $I_r^{-1}([0, I])$, which is convex and increasing (for inclusion) with respect to $I \in [0, S(P)+S(Q)]$.

Note the two limiting cases: when mutual information $I$ is zero (corresponding to random matching), $\mathcal{R}_c(0) = \{C_\infty\}$, where $C_\infty^k = E_{p \otimes q}\left[\phi^k(X, Y)\right]$. When $I = S(P)+S(Q)$, $\mathcal{R}_c(S(P)+S(Q))$ consists of the extreme points of the covariogram $\mathcal{F}_c$.

The following result combines the linearity embodied in (SLOI) with the convex structure of the problem:

Proposition 3

Under (O), (S), (GUI) and (SLOI),
a) The social welfare function $\mathcal{W}$ is positive homogeneous of degree one in $\theta = (\Lambda, \sigma)$. It is convex on $\mathbb{R}^K \times [0, +\infty)$ and strictly convex on its interior.
b) The subdifferential of $\mathcal{W}$ at $(\Lambda, \sigma)$ is given by the set of $(K+1)$-tuples
$$\partial \mathcal{W} = \left\{ (C(\Pi), -I(\Pi)) \right\}$$
generated by the optimal matchings $\Pi$ at parameter values $\theta = (\Lambda, \sigma)$. In particular, when the optimal matching $\Pi$ is unique for some $\theta$, then $\mathcal{W}$ is differentiable at $\theta$ and
$$C_k(\Pi) = \frac{\partial \mathcal{W}}{\partial \Lambda_k}(\theta), \qquad I(\Pi) = -\frac{\partial \mathcal{W}}{\partial \sigma}(\theta),$$
in which case we define $C_k(\theta) := \partial \mathcal{W}/\partial \Lambda_k(\theta)$ and $I(\theta) := -\partial \mathcal{W}/\partial \sigma(\theta)$.
c) The function $I_r(C)$ is $C^1$ on the interior of $\mathcal{F}_c$, and one has
$$\frac{\partial I_r}{\partial C_k} = \frac{\Lambda_k}{\sigma}.$$

As a corollary, the limiting homogeneous case also has interesting comparative statics, which closely parallel the results above. When $\sigma = 0$ the mutual information no longer plays a role in Theorem 2; so we focus on the covariogram $\mathcal{F}_c$, and we define $\mathcal{W}(\Lambda) = \mathcal{W}(\Lambda, 0)$ and study $C(\Lambda)$ as $\Lambda$ varies.

Corollary 1 (Homogeneous comparative statics)

Under (O), (S), (GUI) and (SLOI),
a) The function $\mathcal{W}(\Lambda)$ is convex and positive homogeneous of degree one in $\Lambda$.
b) The subdifferential of $\mathcal{W}$ at $\Lambda$ is given by the set of $K$-tuples $\partial \mathcal{W} = \{C(\Pi)\}$ generated by the optimal matchings $\Pi$ as $\Lambda$ varies with $\sigma = 0$. In particular, when the solution $\Pi$ is unique for some $\Lambda$, then $\mathcal{W}$ is differentiable at $\Lambda$ and
$$\frac{\partial \mathcal{W}}{\partial \Lambda_k}(\Lambda) = C_k(\Pi).$$

Our basic result is that any vector of covariations $C$ that is feasible (that belongs to $\mathcal{F}_c$) can be rationalized for a well-chosen value of total heterogeneity. This is a byproduct of the following result, which sums up the relationships between the sets we introduced:

Proposition 4 Under (O), (S), (GUI) and (SLOI),
a) The sets $\mathcal{R}_c(I)$ are the sets of extreme points of nested closed convex sets that expand from $\{C_\infty\}$ to $\mathcal{F}_c$ as mutual information $I$ goes from $0$ to $S(P)+S(Q)$.
b) Any point $\hat{C} \in \mathcal{F}_c$ belongs to exactly one frontier $\mathcal{R}_c(I)$, associated to the mutual information $I = I_r(\hat{C})$.
c) For a point $C$ such that $I_r(C)$ is smooth, let $\Lambda_k = \partial I_r(C)/\partial C_k$; then along $\mathcal{R}_c(I)$,
$$\frac{dC_i}{dC_j} = -\frac{\Lambda_j}{\Lambda_i}.$$ (3.2)

Proposition 4 is illustrated on Figure 1. Note that when we fix $\Lambda$ and increase $\sigma$ from $0$ to $+\infty$, the summary vector $(C, I)$ of the optimal matching moves continuously from $(C(\Lambda), I(\Lambda))$ to $(C_\infty, 0)$: increasing $\sigma$ for given $\Lambda$ moves us from a point on the boundary of $\mathcal{F}_c$ to $C_\infty$.

Proposition 4 may come as a surprise to the reader: we have imposed quite a few assumptions along the way, and yet it seems that our model still cannot rule out any feasible covariation of types across partners! (Observing a $\hat{C}$ outside of $\mathcal{F}_c$ is impossible by construction.) Proposition 4 tells us that observing $\hat{C}$ in the interior of $\mathcal{F}_c$ rejects the homogeneous model; but any such $\hat{C}$ can be rationalized by adding the right amount of unobserved heterogeneity.

The interpretation of part c) is simplest when the matrix $\Lambda$ is diagonal.
With several dimensions for types, the optimal matching must sacrifice some covariation in one dimension to the benefit of some covariation in another. The implied sacrifice ratio, quite naturally, is exactly the ratio of the assorting weights along these dimensions. Take for instance the homogeneous case with only two characteristics, and set $\Lambda_1 = 1$ and $\Lambda_2 = \varepsilon$. Then the function $\varepsilon \mapsto C_1(1, \varepsilon)$ is decreasing, and the function $\varepsilon \mapsto C_2(1, \varepsilon)$ is increasing. Therefore, when one puts more weight on the second dimension, the covariation of the characteristics in the second dimension increases, while the covariation in the first dimension decreases. Quite intuitively, in the limit where all the weight is put on one dimension, the classical Beckerian theory of positive assortative matching obtains.

More precisely, Carlier et al. (2008) have shown in an $r$-type homogeneous model that when $\Lambda_{11} = 1$ and $\Lambda_{jj} \to 0$ for $j \geq 2$, if $\Pi^*(\Lambda)$ is the $\Lambda$-optimal matching and $(X, Y) \sim \Pi^*(\Lambda)$, then the joint distribution of the first characteristics $(X_1, Y_1)$ converges towards the maximally correlated distribution. Equivalently, $X_1$ and $Y_1$ become comonotonic in the limit, just as in classical positive assortative matching.

Uniqueness.

As mentioned earlier, the boundary of $\mathcal{F}_c$ has kinks when types are discrete. In the homogeneous model ($\sigma = 0$), the optimal matching is pure for almost all values of $\Lambda$. Start from such a value $\Lambda_0$. A small change in the value of $\Lambda$ will not change the optimal matching $\Pi_0$, or the covariations it generates. Pick some direction in $\Lambda$-space and move further away from $\Lambda_0$. At some point $\Lambda_1$, the optimal matching will change to a different pure matching, say $\Pi_1$; and this new pure matching will vary with the direction we used to move away from $\Lambda_0$. This is what generates kinks. Note also that at $\Lambda_1$, any matching that is a convex combination of $\Pi_0$ and $\Pi_1$ is also optimal. So kinks are related to non-uniqueness of the optimal matching. More formally:

Proposition 5

Assume (DD): the distributions $P$ and $Q$ are discrete. Then the feasible set $\mathcal{F}_c$ is a polytope with a finite number of vertices that correspond to pure matchings.

When there is enough unobserved heterogeneity, the optimal matching is unique. Indeed, Decker et al. (2009) have shown that when the total heterogeneity $\sigma$ is large enough (so that $I$ is small enough), the solution to Eq. (4.1) is unique.

Purity.

A matching is pure if a given type of man cannot be matched to more than one type of woman, and conversely. Intuition suggests that given sufficient heterogeneity, the optimal matching will not be pure, and its probability weights will react to even small changes in $\Lambda$. In fact, we have an even stronger result: even tiny levels of heterogeneity make the optimal matching impure. To see this, reason by contradiction: take $\sigma > 0$ and suppose that some pure matching $\Pi$ is optimal. The social surplus it achieves is
$$\sum_{x,y} \left( \pi(x,y)\, \Lambda \cdot \phi(x,y) - \sigma\, \pi(x,y) \log \frac{\pi(x,y)}{p(x)q(y)} \right).$$
Note that the derivative of this expression with respect to any $\pi(x,y)$ is infinite at $\pi(x,y) = 0$ and finite anywhere else. Since $\Pi$ is pure, for any $x$ there is only one $y$ for which $\pi(x,y)$ is nonzero. Subtract a positive $\varepsilon$ from each such $\pi(x,y)$ and spread it over all zero elements. The new joint distribution is still a feasible matching, and for small enough $\varepsilon$ the gain in social surplus from the formerly zero probabilities outweighs the loss from the other matches. Therefore a pure matching cannot be optimal.

The results in the previous sections give a very useful description of the optimal matchings, and they show that $\sigma_1$ and $\sigma_2$ cannot be identified separately. On the other hand, we have not yet provided a proof of identification of the remaining parameters. We now set out to do so, making use for identification purposes of the geometrical interpretation of the matching problem when the observable surplus is a linear combination of known basis functions; this is assumption (SLOI), which we impose throughout this section.

Remember that given assumptions (O) and (S), there exist two functions $U$ and $V$ with $U(x,y) + V(x,y) = \Phi(x,y)$ such that the optimal matching obtains when man $\tilde{x}$ maximizes $U(x,y) + \chi(\tilde{x}, y)$ over $y$ and woman $\tilde{y}$ maximizes $V(x,y) + \xi(\tilde{y}, x)$ over $x$.
Now if $\pi$ is the observable component of an optimal matching, it was shown in Section 2 that given assumption (GUI),
$$U(x,y) = \sigma_1 \log \pi(x,y) + \sigma_1 \log \frac{n_1(x)}{p(x)},$$
and similarly,
$$V(x,y) = \sigma_2 \log \pi(x,y) + \sigma_2 \log \frac{n_2(y)}{q(y)}.$$
$U$ and $V$ depend on $\theta$ and are not easy to characterize, as we will see; but we know that they sum up to $\Phi$, so that
$$\Phi(x,y) = \sigma \log \pi(x,y) + \sigma_1 \left( \log n_1(x) - \log p(x) \right) + \sigma_2 \left( \log n_2(y) - \log q(y) \right).$$
In this formula $n_1$ and $n_2$ still depend on $\theta$ in a complex way; but they only appear in terms that depend on the characteristics of one partner only. This means that the surplus function $\Phi$ is identified up to an additive function of the form $a(x) + b(y)$.

To state this more formally, define the cross-difference operator as
$$\Delta F(x, y; x', y') = \left( F(x', y') - F(x', y) \right) - \left( F(x, y') - F(x, y) \right),$$
for any function $F$ of $(x, y)$. Then we have:

Theorem 4 (Cross-differences are identified up to scale)

Assume (O), (S), (GUI) and (DD). For $\theta = (\Lambda, \sigma_1, \sigma_2)$ with $\sigma = \sigma_1 + \sigma_2 > 0$, one has:
(i) There exists a unique optimal observable matching $\pi$, which maximizes the social welfare (2.3).
(ii) There exist three vectors $\pi(x,y)$, $u(x)$ and $v(y)$, and a constant $c$ normalized by $E_p[u(X)] = E_q[v(Y)] = 0$, which are the unique solutions to the following system:
$$\pi(x,y) = p(x)\, q(y) \exp\left( \frac{\Phi(x,y) - u(x) - v(y) - c}{\sigma} \right), \qquad \pi \in \mathcal{M}(P, Q).$$ (4.1)
Further, the constant $c$ so defined coincides with the value of the social welfare, $c = \mathcal{W}$.
(iii) The probability $\pi$ defined in (ii) coincides with the optimal matching solution of (2.3).

This result expresses that by adjusting the functions $u$ and $v$ to the right levels, one manages to satisfy the "budget constraint" that the matching has the right marginal distributions, $\pi \in \mathcal{M}(P,Q)$; hence these functions $u$ and $v$ can be interpreted as "shadow prices" of men's and women's observable characteristics.

Theorem 4 has another consequence: the complementarity of dimension $i$ of the observable types of the partners at $(x,y)$ can be tested directly on $\log \pi$, since $\Delta \log \pi$ and $\Delta \Phi$ have the same sign. Moreover, the relative strengths of complementarities along dimensions $i$ and $j$ at a point $(x,y)$ can be estimated by evaluating $\Delta \log \pi$ for values of $(x', y')$ that differ from $(x,y)$ along these dimensions.

Theorem 4 immediately gives us an estimator of the observable joint surplus function $\Phi$, up to additive functions of $x$ and of $y$. But adding any combination $a(x) + b(y)$ to the joint surplus does not change the optimal matching, as long as we are determined not to have singles, as we assume throughout the paper; and the positive scale factor $\sigma$ is irrelevant. So for instance $\log \hat{\pi}$ is a perfectly good estimator of $\Phi$ if $\hat{\pi}$ consistently estimates $\pi$. When we add a parametric structure under (SLOI), Theorem 4 also gives us an estimator of the assorting weights $\Lambda$ and the total heterogeneity $\sigma$.
In fact, the cross-difference operator is linear, so that under (SLOI),
$$\Delta \log \pi = \frac{\Delta \Phi}{\sigma} = \sum_{k=1}^{K} \frac{\Lambda_k}{\sigma}\, \Delta \phi^k;$$
if the cross-differences of the $\phi^k$ are linearly independent, then observing $\pi$ gives us the $\Lambda$'s (along with overidentifying restrictions). This is a very weak requirement; having linearly dependent basis functions would indeed be a modelling mistake.

This can be very simple in practice; to illustrate, take the diagonal version of the (ER) example. Then if in $(x_0, y_0)$ man and woman are both dropouts, keeping their income classes unchanged and moving them both to graduate level in $(x', y')$ generates
$$\Delta \Phi(x_0, y_0; x', y') = \Lambda^E_1 + \Lambda^E_2.$$
On the other hand, taking man and woman to have different education levels in $(x_1, y_1)$ and swapping their educations to create $(x'', y'')$ (again keeping income classes fixed) generates
$$\Delta \Phi(x_1, y_1; x'', y'') = -\Lambda^E_1 - \Lambda^E_2.$$
Recall that $\sigma_1$ and $\sigma_2$ are not separately identified; but since $\Delta \log \pi = \Delta \Phi / \sigma$, the normalized weight
$$\frac{\Lambda^E_1 + \Lambda^E_2}{\sigma} = \frac{\Delta \log \pi(x_0, y_0; x', y') - \Delta \log \pi(x_1, y_1; x'', y'')}{2}$$
is readily estimated from the observed matching.

These results are reminiscent of those in Fox (2009), although we obtained them under quite a different set of assumptions: we do not use variation across subpopulations, nor does his rank-order condition apply to our model. Note also that when specialized to one-dimensional types, our result yields that of Siow (2009) on testing complementarity of the surplus function by examining log-supermodularity of the match distribution.

Our parametric identification strategy will be based either on knowledge of the matching summaries $(\hat{C}, \hat{I})$, which are the sufficient statistics of our model, or on the covariation $\hat{C}$ alone, together with the assumption that $(\hat{C}, \hat{I})$ lies on the efficient frontier, that is $\hat{I} = I_r(\hat{C})$. In either case, positive homogeneity imposes the need for a normalization of the parameters $(\hat{\Lambda}, \hat{\sigma})$.
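The key feature of cross-differences is that the additive "shadow price" terms, and hence the unknown $u$, $v$ and the marginals, cancel out of them. The short sketch below checks this on a $2 \times 2$ education table of the logit form above; the weights, shadow prices, and marginals are all toy values of our own choosing, not numbers from the paper.

```python
import numpy as np

# toy diagonal (ER)-style specification on two education levels (assumed values)
L1, L2, sigma = 0.9, 1.4, 0.5            # Lambda^E_1, Lambda^E_2, sigma
Phi = np.array([[L1, 0.0],
                [0.0, L2]])              # surplus from matching on education only
u = np.array([0.3, -0.1])                # arbitrary "shadow prices" of men
v = np.array([-0.2, 0.4])                # and of women
p = np.array([0.6, 0.4])                 # marginal distributions of types
q = np.array([0.5, 0.5])

# matching of the logit form pi = p q exp((Phi - u - v)/sigma), left
# unnormalized: the normalizing constant only shifts log pi and cancels below
pi = p[:, None] * q[None, :] * np.exp((Phi - u[:, None] - v[None, :]) / sigma)

def cross_diff(M, x, y, xp, yp):
    """Cross-difference of a matrix-valued function of (x, y)."""
    return (M[xp, yp] - M[xp, y]) - (M[x, yp] - M[x, y])

lp = np.log(pi)
d1 = cross_diff(lp, 0, 0, 1, 1)          # both spouses move dropout -> graduate
d2 = cross_diff(lp, 0, 1, 1, 0)          # spouses with mixed educations swap
estimate = (d1 - d2) / 2                 # recovers (L1 + L2) / sigma
```

The estimate equals $(\Lambda^E_1 + \Lambda^E_2)/\sigma$ exactly, whatever $u$, $v$, $p$ and $q$ are; with an observed match table, one would apply the same arithmetic to empirical counts.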
Once again, $\theta$ is only identified up to a positive scale factor. Take (SLOI) for instance: $\Lambda$ was only used to specify the objective function, and so it can be multiplied by any positive constant without any side effect. In particular, $\mathcal{W}(t\Lambda, t\sigma) = t\mathcal{W}(\Lambda, \sigma)$ for $t \geq 0$. Therefore it is quite clear that $(\Lambda, \sigma)$ cannot be identified without fixing some normalization. So we normalize $(\Lambda, \sigma)$ by the choice

Normalization convention: $\sigma\, I(\Lambda, \sigma) = 1$, (4.2)

where, as we recall, $I(\Lambda, \sigma) = -\partial \mathcal{W}(\Lambda, \sigma)/\partial \sigma$.

Our general approach will be to identify the parameter value $(\hat{\lambda}, 1)$, and then rescale:
$$\hat{\Lambda} = \frac{\hat{\lambda}}{I(\hat{\lambda}, 1)}, \qquad \hat{\sigma} = \frac{1}{I(\hat{\lambda}, 1)}.$$
Note that $I_r(C(\lambda, 1)) = I(\lambda, 1)$, so that $\sigma$ is identified by
$$\hat{\sigma} = \frac{1}{I_r(\hat{C})}.$$ (4.3)
As described above, we look for $\hat{\lambda}$ among the parameters of the form $(\lambda, 1)$. Since
$$I_r(C) = \sup_{\lambda} \left\{ \lambda \cdot C - \mathcal{W}(\lambda, 1) \right\},$$
the envelope theorem shows that $\hat{\lambda} = \partial I_r(\hat{C})/\partial C$ is such that $C(\hat{\lambda}, 1) = \partial \mathcal{W}(\hat{\lambda}, 1)/\partial \lambda = \hat{C}$. Hence $\Lambda$ is identified by
$$\hat{\Lambda} = \frac{1}{I_r(\hat{C})}\, \frac{\partial I_r(\hat{C})}{\partial C}.$$ (4.4)

We define the best additive projector $P$ of a function $h(x, y)$ by $Ph(x, y) = f(x) + g(y)$, where $f$ and $g$ minimize
$$E_\pi\left[ \left( h(X, Y) - E[h(X, Y)] - f(X) - g(Y) \right)^2 \right].$$
We have immediately that $E_P[f(X)] = 0$ and $E_Q[g(Y)] = 0$; introducing the residual
$$\varepsilon(X, Y) = h(X, Y) - E[h(X, Y)] - f(X) - g(Y),$$
we get $E[\varepsilon(X, Y) \mid X] = 0$ and $E[\varepsilon(X, Y) \mid Y] = 0$. The decomposition
$$h(X, Y) = E[h(X, Y)] + f(X) + g(Y) + \varepsilon(X, Y)$$
is the two-way ANOVA decomposition of $h(X, Y)$. The following proposition will be the fundamental tool for inference. It expresses that the projection residual in the two-way ANOVA decomposition of $\phi^k$ is the score function $\sigma\, \partial \log \pi / \partial \Lambda_k$.

Proposition 6 (Score function)

Under (O), (S), (GUI), and (SLOI), the score function is given by
$$\frac{\partial \log \pi}{\partial \Lambda_k}(x, y) = \frac{\phi^k(x, y) - P\phi^k(x, y) - E\left[\phi^k(X, Y)\right]}{\sigma};$$
that is,
$$\frac{\partial u(x)}{\partial \Lambda_k} + \frac{\partial v(y)}{\partial \Lambda_k} = P\phi^k(x, y),$$
where $u$ and $v$ are the solutions to Equation (4.1).

As a result, we get an expression for the Hessian of the social welfare function at fixed $\sigma$.

Proposition 7 (Fisher information matrix)

Under (O), (S), (GUI), and (SLOI), wherever $\mathcal{W}(\Lambda, \sigma)$ is twice differentiable, we get
$$\frac{\partial^2 \mathcal{W}(\Lambda, \sigma)}{\partial \Lambda_k \partial \Lambda_l} = \sigma\, \mathcal{I}_{kl}(\theta),$$
where
$$\mathcal{I}_{kl}(\theta) := E\left[ \frac{\partial \log \pi}{\partial \Lambda_k}(X, Y)\, \frac{\partial \log \pi}{\partial \Lambda_l}(X, Y) \right]$$
is the Fisher information matrix. Further,
$$\mathcal{I}_{kl}(\theta) = \frac{\operatorname{cov}\left(\phi^k(X, Y), \phi^l(X, Y)\right) - \operatorname{cov}\left(P\phi^k(X, Y), P\phi^l(X, Y)\right)}{\sigma^2}.$$ (4.5)

We now turn to the problem of inference. Our data will consist of the matched characteristics of $N$ pairs $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, and our null hypothesis is that they were generated by an optimal matching consistent with assumptions (O), (S), (GUI), and (SLOI). Given a proposed specification for the basis functions $\phi^k$, and our estimates $\hat{P}_N$ and $\hat{Q}_N$ of the marginal distributions of types, we would therefore like to infer the values of $\Lambda$ and $\sigma$ which come closest to rationalizing the observed matching. We use our theory to answer two questions:
1. is the observed matching optimal?
2. which parameter vector $(\Lambda, \sigma)$ best rationalizes the observed matching (exactly if the observed matching is optimal, approximately if it is not)?

The primary object of our investigation will be the empirical moments of the $\phi^k$:
$$\hat{C}^k_N = \frac{1}{N} \sum_{n=1}^{N} \phi^k(x_n, y_n).$$
Let $C^k$ denote the expectation of $\phi^k(X, Y)$ under the joint distribution $\Pi$ of $(X, Y)$. Standard asymptotic theory of the empirical process (van der Vaart (1998)) implies the convergence in distribution
$$\sqrt{N}\left( \hat{C}^k_N - C^k \right) \Longrightarrow \xi^k,$$
where $\xi^k = \int \phi^k(x, y)\, dG(x, y)$, $G$ being a $\Pi$-Brownian bridge. In particular,
$$\operatorname{cov}\left( \sqrt{N}(\hat{C}^k_N - C^k), \sqrt{N}(\hat{C}^l_N - C^l) \right) \to \operatorname{cov}_\Pi\left( \phi^k(X, Y), \phi^l(X, Y) \right)$$
for all $1 \leq k, l \leq K$. We shall call $\mathcal{W}_N(\theta)$ the value of the social surplus at parameter $\theta$ obtained with the empirical distributions of observable types $P_N$ and $Q_N$.

Normalization.

Recall that because of positive homogeneity, the models $\theta = (\Lambda, \sigma)$ and $t\theta = (t\Lambda, t\sigma)$ are observationally indistinguishable. Just as in the previous section, we impose the normalization convention $\sigma I(\Lambda, \sigma) = 1$. When we describe estimators below, we first compute an estimator of the assorting weights $\Lambda$ for total heterogeneity $\sigma = 1$; we denote it $\hat{\lambda}_N$. We then get an estimator $\hat{I}_N$ of the mutual information. To obtain the normalized estimator in each case, the reader should divide the vector $(\hat{\lambda}_N, 1)$ by the scalar $\hat{I}_N$.

The results we obtained in sections 2 and 4 suggest two estimation strategies, which we will now define and compare.

Theorem 4 and its corollary immediately suggest a very simple nonparametric approach. In this discrete case, a nonparametric estimator $\hat{\pi}_N(x, y)$ is readily obtained by counting the proportion of matches between a man of characteristics $x$ and a woman of characteristics $y$. We could pick arbitrary functions $a(x)$ and $b(y)$ and define
$$\hat{\Phi}_N(x, y) = \log \hat{\pi}_N(x, y) + a(x) + b(y),$$
without any reference to basis functions, imposing $\sigma = 1$ on the way. Then if we further assume (SLOI) with basis functions $\phi^k$, we can apply minimum-distance techniques to recover an estimator $\hat{\lambda}^{SP}_N$, which minimizes some norm
$$\left\| \hat{\Phi}_N(x, y) - \lambda \cdot \phi(x, y) \right\|.$$
Note that as usual, the minimum value of the norm allows us to construct a test statistic for the hypothesis that $\Phi$ is a linear combination of the $\phi^k$. More generally, we know that under (O), (S) and (GUI) only, $\log \pi$ coincides with $\Phi/\sigma$ up to additive functions of $x$ and of $y$; thus a nonparametric estimate $\hat{\pi}_N$ can be used as a heuristic device to decide on a set of basis functions, and/or to test the adequacy of such a set.

We now turn to parametric estimators.

5.2 Parametric inference: The Moment Matching Estimator

Our second estimator is based solely on the statistics of the matching covariations $\hat{C}$. It rests on the identification of $\Lambda$ provided by Eq. (4.4). Therefore $\hat{\lambda}$ is taken as a maximizer of
$$\Lambda \cdot \hat{C} - \mathcal{W}_N(\Lambda, 1)$$ (5.1)
over all possible $\Lambda$. This being a strictly concave function, its maximizer is unique; furthermore, efficient computation is available. Letting $\hat{I}$ be the value of expression (5.1) at the optimal value $\hat{\lambda}$, we obtain the Moment Matching (MM) estimator, denoted $\hat{\Lambda}^{MM}$ and $\hat{\sigma}^{MM}$, by setting
$$\hat{\Lambda}^{MM} = \frac{\hat{\lambda}}{\hat{I}}, \qquad \hat{\sigma}^{MM} = \frac{1}{\hat{I}}.$$
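Since the gradient of (5.1) in $\Lambda$ is $\hat{C} - C(\Lambda, 1)$, plain gradient ascent drives the predicted covariations to the observed ones. The sketch below runs this logic on simulated data; the marginals, basis functions, step size, and iteration counts are illustrative assumptions of ours, and the inner loop anticipates the IPFP computation of Section 6.

```python
import numpy as np

n = 5
p = np.full(n, 1 / n); q = np.full(n, 1 / n)        # toy uniform marginals
x = np.linspace(0.0, 1.0, n); y = np.linspace(0.0, 1.0, n)
# two assumed basis functions phi^k(x, y), chosen only for illustration
phi = np.stack([np.outer(x, y), -np.abs(x[:, None] - y[None, :])])

def C_of(lam, iters=80):
    """Covariations C(lam, 1) of the optimal matching with sigma = 1,
    computed through the IPFP/Sinkhorn fixed point."""
    K = p[:, None] * q[None, :] * np.exp(np.tensordot(lam, phi, axes=1))
    b = np.ones(n)
    for _ in range(iters):
        a = p / (K @ b)             # rescale men's side to marginal p
        b = q / (K.T @ a)           # then women's side to marginal q
    pi = a[:, None] * K * b[None, :]
    return np.tensordot(phi, pi, axes=([1, 2], [0, 1]))

lam_true = np.array([1.5, 0.8])
C_hat = C_of(lam_true, iters=300)   # stand-in for observed covariations

# moment matching: ascend lam . C_hat - W_N(lam, 1); the gradient is
# C_hat - C(lam, 1), so the fixed point equates predicted and observed moments
lam = np.zeros(2)
for _ in range(1500):
    lam = lam + 8.0 * (C_hat - C_of(lam))
```

With covariations generated by the model itself, the ascent recovers the assorting weights; on real data, the residual $\hat{C} - C(\hat{\lambda}, 1)$ at convergence measures the failure of the overidentifying restrictions.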

Now if our data were generated by an optimal matching $\Pi$ for parameters $(\hat{\Lambda}^{MM}, \hat{\sigma}^{MM})$, the empirical covariations $\hat{C}_N$ would coincide with the optimal covariations $C(\hat{\Lambda}^{MM}, \hat{\sigma}^{MM})$. By construction, the MM estimator is the value of the assorting weights such that the predicted covariations coincide with the observed covariations. The Moment Matching estimator is consistent and asymptotically Gaussian:

Theorem 5 Under (O), (S), (GUI) and (SLOI),
$$\sqrt{N}\left( \hat{\lambda}_N - \lambda \right) \Longrightarrow \mathcal{I}^{-1} \xi,$$
where $\xi$ is the Brownian bridge characterized at the beginning of this section and the matrix $\mathcal{I}_{kl}$ is the Fisher information matrix expressed above in (4.5). In particular, the MM estimator is asymptotically efficient.

With the exception of the semiparametric estimator (SP), our inferential methods require solving for the optimal matching for potentially large populations, and for a large number of parameter vectors during optimization. This may seem a forbidding task: there exist well-known algorithms to find an optimal matching, and they are reasonably fast; but with large populations the required computer resources may still be large.

Fortunately, it turns out that introducing (our type of) heterogeneity actually makes computing optimal matchings much simpler; this is a boon for the ML and MM estimators. To see this, choose a parameter vector $\theta = (\Lambda, \sigma)$ and return to the characterization of optimal matchings in equation (2.3), in the discrete case (DD) for simplicity. Dividing by $\sigma$ and taking the logarithm, optimal matchings can also be obtained by solving the following minimization program:
$$\min_{\Pi \in \mathcal{M}(P,Q)} \sum_{x,y} \pi(x,y) \log \frac{\pi(x,y)}{p(x)\, q(y) \exp(\Phi(x,y)/\sigma)}.$$
Now define a probability $r$ by
$$r(x,y) = \frac{p(x)\, q(y) \exp(\Phi(x,y)/\sigma)}{\sum_{x,y} p(x)\, q(y) \exp(\Phi(x,y)/\sigma)};$$
and note that given any choice of parameters $\theta$ and known marginals $(p, q)$, the probability $r$ itself is known. Determining the optimal matching therefore boils down to finding the joint probabilities $\pi$ with known marginals $p$ and $q$ which minimize the Kullback-Leibler distance to $r$:
$$\sum_{x,y} \pi(x,y) \log \frac{\pi(x,y)}{r(x,y)}.$$ (6.1)
Equivalently, we are looking for the Kullback-Leibler projection of $r$ on $\mathcal{M}(P,Q)$. This is a well-known problem in various fields, and algorithms to solve it have been around for a long time.
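The equivalence between maximizing the entropic surplus and minimizing (6.1) can be verified directly: for any feasible $\pi$, $\sigma$ times the Kullback-Leibler distance to $r$ differs from $\sigma I(\pi) - \sum \pi \Phi$ by the constant $\sigma \log Z$, where $Z$ is the normalizer of $r$. The check below uses toy marginals and a simulated surplus of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
p = np.full(n, 1 / n); q = np.full(n, 1 / n)   # toy uniform marginals
Phi = rng.normal(size=(n, n))                  # assumed observable surplus
sigma = 0.7

r = p[:, None] * q[None, :] * np.exp(Phi / sigma)
Z = r.sum()
r = r / Z                                      # the known probability r

def feasible(t, shift):
    """A strictly positive matching in M(P, Q): a mixture of the independent
    coupling with a cyclically shifted pure matching."""
    pure = np.roll(np.eye(n), shift, axis=1) / n   # marginals (1/n, ..., 1/n)
    return t * p[:, None] * q[None, :] + (1 - t) * pure

gaps = []
for t, s in [(0.3, 0), (0.5, 1), (0.8, 2)]:
    pi = feasible(t, s)
    surplus = (pi * Phi).sum()                               # E_pi[Phi]
    I = (pi * np.log(pi / (p[:, None] * q[None, :]))).sum()  # mutual information
    kl = (pi * np.log(pi / r)).sum()                         # objective (6.1)
    gaps.append(sigma * kl - (sigma * I - surplus))          # = sigma * log Z
```

Because the gap is the same constant for every feasible matching, minimizing (6.1) and maximizing the entropic social surplus select the same $\pi$.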
National accountants, for instance, use RAS algorithms to fill the cells of a two-dimensional table whose margins are known; there the choice of $r$ reflects prior notions of the correlation of the two dimensions of the table. These RAS algorithms belong to a family called Iterative Projection Fitting Procedures (IPFP). They are very fast, and are guaranteed to converge under weak conditions. We only describe the application of IPFP to our model here; we direct the reader to Rüschendorf (1995) for more information. (The BP estimator is designed for the homogeneous case, so the following does not apply to it.)

The probability $r$ has a natural interpretation: when $\sigma$ is very large, $\pi(x,y) \simeq p(x)q(y)$. For smaller $\sigma$'s the probability of a match between $x$ and $y$ must increase with the surplus it creates, $\Phi(x,y)$; and given our assumption (GUI) on the distribution of unobserved heterogeneity, it should not come as a surprise that the corresponding factor is multiplicative and exponential.

To describe the algorithm, we split $\pi$ into
$$\pi(x,y) = r(x,y) \exp\left( -\frac{u(x) + v(y)}{\sigma} \right).$$
The functions $u$ and $v$ of course will only be determined up to a common constant. The algorithm iterates over values $(u_k, v_k)$. We start from $u_0 \equiv -\sigma \log p$ and $v_0 \equiv 0$. Then at step $(k+1)$ we compute
$$\exp(-v_{k+1}(y)/\sigma) = \frac{q(y)}{\sum_x r(x,y) \exp(-u_k(x)/\sigma)}$$
and
$$\exp(-u_{k+1}(x)/\sigma) = \frac{p(x)}{\sum_y r(x,y) \exp(-v_{k+1}(y)/\sigma)}.$$
Two remarks are in order here. First, we could just as well start from $v_0 \equiv -\sigma \log q$ and $u_0 \equiv 0$, and modify the iteration formulæ accordingly. Second, just as in other Gauss-Seidel algorithms, it is important to update one component on the basis of the already updated other component: the right-hand sides have $u_k$ and $v_{k+1}$.

If $(u, v)$ is a fixed point of the algorithm, then
$$\frac{\pi(x,y)}{p(x)q(y)} = \exp\left( \frac{\Phi(x,y) - u(x) - v(y)}{\sigma} \right).$$
Comparing this formula to Theorem 4 shows the benefit of this reparameterization, since $u(x)$ and $v(y)$ have a simple interpretation: they represent (up to a common additive constant) the expected utilities of a man of observable characteristics $x$ and of a woman of observable characteristics $y$. This can be seen by checking, for instance, that $E(\max U(X, Y) \mid X = x) = \sigma_1 \log n_1(x)$. (It can be shown that at the optimum, $\pi(x,y) = 0$ wherever $r(x,y) = 0$.)

In the case of a sample of $N$ couples, the marginal $p$ assigns probability $1/N$ to each of $(x_1, \ldots, x_N)$, and similarly for women. Define a matrix $\Psi$ by $\Psi_{ij} = \exp(\Phi(x_i, y_j)/\sigma)$, and vectors $a^k_i = \exp(-u_k(x_i)/\sigma)$ and $b^k_j = \exp(-v_k(y_j)/\sigma)$. Then we end up with the shockingly simple and inexpensive formulæ
$$b^{k+1} = \frac{1}{N\, \Psi' a^k} \quad \text{and} \quad a^{k+1} = \frac{1}{N\, \Psi\, b^{k+1}},$$
where the reciprocals are taken entrywise.

Our theory so far relies on several strong assumptions. Some of them are easy to relax; we discuss three of them, before turning to potential extensions.
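In code, each step of the $a/b$ recursion above is a pair of matrix-vector products. The sketch below (the surplus matrix and scale are simulated toy values of ours) iterates the updates and checks that the resulting $\pi$ has the required uniform marginals.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
sigma = 0.5
Phi = rng.normal(size=(N, N))            # assumed observable surplus Phi(x_i, y_j)
Psi = np.exp(Phi / sigma)                # Psi_ij = exp(Phi(x_i, y_j) / sigma)

a = np.ones(N)
b = np.ones(N)
for _ in range(300):
    b = 1.0 / (N * (Psi.T @ a))          # update women's scalings given a
    a = 1.0 / (N * (Psi @ b))            # then men's, given the updated b
pi = a[:, None] * Psi * b[None, :]       # optimal matching probabilities
```

At convergence every row and column of `pi` sums to $1/N$, and the shadow prices can be read off as $u_i = -\sigma \log a_i$ and $v_j = -\sigma \log b_j$, up to the common constant.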

Single households.

So far we have not allowed for unmatched individuals. In an optimal matching, some men and/or women may remain single, as of course some must if there are more individuals on one side of the market. The choice of the socially optimal matching can be broken down into the choice of the set of individuals who participate in matches and the choice of actual matches between the selected men and women. Our theory applies without any change to the second subproblem; that is, all of our results extend to the sets $M$ and $W$ as selected in the first subproblem.

From the point of view of statistical inference, we may lose some efficiency in doing so; we note here that when the unobserved heterogeneity in preferences over partners is separable from the utility of marriage itself, our method does not incur any efficiency loss.

Non-bipartite matching.

Bipartite matching refers to the fact that each individual is exogenously assigned to one category, in our terminology husband or wife. Our analysis is in fact very easy to extend so as to incorporate same-sex unions, and thus to rationalize endogamy in the gender dimension.

To do so, we just need to add one (observed) characteristic, in the form of gender. If for instance gender becomes the first dimension of the characteristics vector, then the observed surplus has an assorting weight $\Lambda_1 < 0$, and the heterogeneity terms $\chi$ and $\eta$ will automatically take into account the dispersion of individual preferences for same-sex unions.

Continuous distributions.

While we have assumed discrete characteristics, we expectthe main thrust of our arguments to carry over to the case where the distributions of thecharacteristics are continuous. We are working on such an extension; this will requireadapting the (GUI) assumption to one that is better-suited to continuous choice.

Revealed Preferences.

As mentioned in the section on the Boundary Projectionestimator, the Lagrange multiplier e is known in the theory of revealed preferences asAfriat’s eﬃciency index. The analogy in fact goes deeper. Recall the basic theorem onrevealed preferences: Proposition 8 (Afriat)

The following conditions are equivalent:
(i) the observed quantity-price vectors $(x_k, p_k)_{k=1}^N$ are consistent with maximization of a single utility function;
(ii) there exist scalars $\lambda_k > 0$, $k = 1, \ldots, N$, such that
$$\sum_{k=1}^{N} \lambda_k\, p_k \cdot x_{\sigma(k)}$$
is maximized over $\sigma \in S_N$ when $\sigma(j) = j$ for all $j$.

This is reminiscent of a multidimensional matching problem in which the prices $p$ correspond to the characteristics of men, the quantity bundles $x$ to those of women, and there is no unobserved heterogeneity. We are currently exploring this analogy.

Screening.

In the theory of screening, a "type" $\theta$ refers to a set of individual characteristics that are privately observed. Assume that utilities are additively separable in transfers, with $u(q, \theta) - t$ for an agent of type $\theta$ and $W(q) + t$ for the principal. Then given quantity-transfer pairs $(q_k, t_k)_{k=1}^N$ that presumably correspond to different types, it can be shown that
$$\sum_{k=1}^{N} u(q_k, \theta_{\sigma(k)})$$
is maximized over $\sigma \in S_N$ when $\sigma(j) = j$ for all $j$. This again suggests that our methods may help in estimating screening models.

A Facts from Convex Analysis

A.1 Basic results

We only sum up here the concepts we actually use in the paper; we refer the reader to Hiriart-Urruty and Lemaréchal (2001) for a thorough exposition of the topic.

Take any set $Y \subset \mathbb{R}^d$; then the convex hull of $Y$ is the set of points in $\mathbb{R}^d$ that are convex combinations of points in $Y$. We usually focus on its closure, the closed convex hull, denoted $\mathrm{cch}(Y)$. The support function $S_Y$ of $Y$ is defined as
$$S_Y(x) = \sup_{y \in Y} x \cdot y$$
for any $x$ in $\mathbb{R}^d$. It is a convex function, and it is homogeneous of degree one. Moreover, $S_Y = S_{\mathrm{cch}(Y)}$, and $\partial S_Y(0) = \mathrm{cch}(Y)$. A point of $Y$ is an extreme point if it does not belong to any open line segment joining two points of $Y$.

Now let $u$ be a convex, continuous function defined on $\mathbb{R}^d$. Then the gradient $\nabla u$ of $u$ is well-defined almost everywhere and locally bounded. If $u$ is differentiable at $x$, then
$$u(x') \geq u(x) + \nabla u(x) \cdot (x' - x)$$
for all $x' \in \mathbb{R}^d$. Moreover, if $u$ is also differentiable at $x'$, then
$$\left( \nabla u(x) - \nabla u(x') \right) \cdot (x - x') \geq 0.$$
When $u$ is not differentiable at $x$, it is still subdifferentiable in the following sense. We define $\partial u(x)$ as
$$\partial u(x) = \left\{ y \in \mathbb{R}^d : \forall x' \in \mathbb{R}^d, \; u(x') \geq u(x) + y \cdot (x' - x) \right\}.$$
Then $\partial u(x)$ is not empty, and it reduces to a single element if and only if $u$ is differentiable at $x$; in that case $\partial u(x) = \{\nabla u(x)\}$.

A.2 Generalized Convexity

In order to make the paper self-contained, we present basic results on the theory of generalized convexity, sometimes called the theory of $c$-convex functions. This theory extends many results from convex analysis, in particular duality results, to a much more general setting. We refer to Villani (2009), pp. 54-57 (or Villani (2003), pp. 86-87) for a detailed account.

Let $\omega$ be a function from the product of two sets $\mathcal{X} \times \mathcal{Y}$ to $[-\infty, +\infty)$.

Definition 1

Consider any function $\psi : \mathcal{X} \to (-\infty, +\infty]$. Its generalized Legendre transform $\psi^\perp : \mathcal{Y} \to [-\infty, +\infty)$ is defined by
$$\psi^\perp(y) = \inf_{x \in \mathcal{X}} \left\{ \psi(x) - \omega(x, y) \right\}.$$
Conversely, take any function $\zeta : \mathcal{Y} \to [-\infty, +\infty)$; then its generalized Legendre transform $\zeta^\top : \mathcal{X} \to (-\infty, +\infty]$ is defined by
$$\zeta^\top(x) = \sup_{y \in \mathcal{Y}} \left\{ \zeta(y) + \omega(x, y) \right\}.$$
A function $\psi$ is called $\omega$-convex if it is not identically $+\infty$ and if there exists $\zeta : \mathcal{Y} \to [-\infty, +\infty]$ such that $\psi = \zeta^\top$.

Recall that the usual Legendre transform is defined as $\psi^*(y) = \inf_{x \in \mathcal{X}} \{\psi(x) - x \cdot y\}$; thus it coincides with the generalized Legendre transform when $\omega$ is bilinear, and then $\omega$-convexity boils down to standard convexity.

Our analysis rests on the following fundamental result, which generalizes standard convex analysis. A cautionary remark is in order here: sign conventions vary in the literature, so our own choices may differ from those of any given author.

Proposition 9 For every function $\psi : \mathcal{X} \to (-\infty, +\infty]$, $\psi^{\perp\top} \leq \psi$, with equality if and only if $\psi$ is $\omega$-convex.

Proof

Take any $x \in \mathcal{X}$; then
$$\psi^{\perp\top}(x) = \sup_{y \in \mathcal{Y}} \inf_{x' \in \mathcal{X}} \left\{ \psi(x') - \omega(x', y) + \omega(x, y) \right\};$$
taking $x' = x$ shows that $\psi^{\perp\top}(x) \leq \psi(x)$.

Conversely, if $\psi^{\perp\top} = \psi$ then $\psi(x) = \zeta^\top(x)$, with $\zeta = \psi^\perp$. But for any function $\zeta$, the triple transform $\zeta^{\top\perp\top}$ coincides with $\zeta^\top$. To see this, write
$$\zeta^{\top\perp\top}(x) = \sup_{y \in \mathcal{Y}} \inf_{x' \in \mathcal{X}} \sup_{y' \in \mathcal{Y}} \left\{ \zeta(y') + \omega(x', y') - \omega(x', y) + \omega(x, y) \right\}.$$
Now for all $x$ and $y$,
$$\inf_{x' \in \mathcal{X}} \sup_{y' \in \mathcal{Y}} \left\{ \zeta(y') + \omega(x', y') - \omega(x', y) \right\} \geq \zeta(y),$$
as is easily seen by taking $y' = y$; therefore
$$\zeta^{\top\perp\top}(x) \geq \sup_{y \in \mathcal{Y}} \left\{ \zeta(y) + \omega(x, y) \right\} = \zeta^\top(x).$$
Applying this to the $\zeta$ such that $\psi = \zeta^\top$ concludes the proof. QED.

B Proofs

B.1 Proof of Theorem 1

In order to prove Theorem 1, some preparation is needed. Remember our shorthand notation $\tilde{x} = (x, \varepsilon)$ and $\tilde{y} = (y, \eta)$. For any function $\tilde{u}(x, \varepsilon)$, fix $x$ and use the theory of generalized convexity briefly recalled in Appendix A.2 to define

$$\tilde{u}^{\perp}(x, y) = \inf_{\varepsilon} \left\{ \tilde{u}(x, \varepsilon) - \chi((x, \varepsilon), y) \right\},$$

the generalized Legendre transform of $\tilde{u}(x, \cdot)$ with respect to the partial surplus function $\chi((x, \cdot), \cdot)$. We define in the same manner

$$\tilde{v}^{\perp}(x, y) = \inf_{\eta} \left\{ \tilde{v}(y, \eta) - \xi(x, (y, \eta)) \right\}.$$

Similarly, for two functions $U(x, y)$ and $V(x, y)$, we define

$$U^{\top}(x, \varepsilon) := \sup_{y} \left\{ U(x, y) + \chi((x, \varepsilon), y) \right\}$$

$$V^{\top}(y, \eta) := \sup_{x} \left\{ V(x, y) + \xi(x, (y, \eta)) \right\}.$$

Lemma 1

Let $\mathcal{A}$ be the set of pairs of functions $(U, V)$ such that

$$\forall x, y, \quad U(x, y) + V(x, y) \geq \Phi(x, y).$$

Then

$$\mathcal{W} = \inf_{(U,V) \in \mathcal{A}} \left\{ \int U^{\top}(\tilde{x}) \, d\tilde{P}(\tilde{x}) + \int V^{\top}(\tilde{y}) \, d\tilde{Q}(\tilde{y}) \right\}.$$

Proof of Lemma 1

By the Kantorovich duality theorem (Villani (2009), Theorem 5.10),

$$\mathcal{W} = \sup_{\tilde{\pi} \in \mathcal{M}(\tilde{P}, \tilde{Q})} \int \tilde{\Phi}(\tilde{x}, \tilde{y}) \, d\tilde{\pi}(\tilde{x}, \tilde{y}) = \inf_{(\tilde{u}, \tilde{v}) \in \tilde{\mathcal{A}}} \left\{ \int \tilde{u}(\tilde{x}) \, d\tilde{P}(\tilde{x}) + \int \tilde{v}(\tilde{y}) \, d\tilde{Q}(\tilde{y}) \right\}, \quad \text{(B.1)}$$

where $\tilde{\mathcal{A}}$ is the set of pairs of functions $(\tilde{u}, \tilde{v})$ such that

$$\forall \tilde{x}, \tilde{y}, \quad \tilde{u}(\tilde{x}) + \tilde{v}(\tilde{y}) \geq \tilde{\Phi}(\tilde{x}, \tilde{y}).$$

Note the following two facts about the right-hand side of this equality:

1. Since

$$\tilde{\Phi}(\tilde{x}, \tilde{y}) = \Phi(x, y) + \chi((x, \varepsilon), y) + \xi(x, (y, \eta)),$$

the infimum in (B.1) can be taken over the pairs of functions $(\tilde{u}, \tilde{v})$ that satisfy

$$\tilde{u}(x, \varepsilon) \geq \sup_{y} \left\{ \Phi(x, y) + \chi((x, \varepsilon), y) + \sup_{\eta} \left[ \xi(x, (y, \eta)) - \tilde{v}(y, \eta) \right] \right\},$$

or

$$\tilde{u}(\tilde{x}) \geq \sup_{y} \left\{ \Phi(x, y) + \chi((x, \varepsilon), y) - \tilde{v}^{\perp}(x, y) \right\}.$$

At the optimum this must hold with equality. Going back to Definition 1, it follows that $\tilde{u}(x, \cdot)$ is $\chi((x, \cdot), \cdot)$-convex for every $x$; and using Proposition 9, we can substitute $\tilde{u}^{\perp\top}$ for $\tilde{u}$, that is:

$$\tilde{u}(x, \varepsilon) = \sup_{y} \left\{ \tilde{u}^{\perp}(x, y) + \chi((x, \varepsilon), y) \right\}.$$

Given a similar argument on $\tilde{v}$, the objective function can be rewritten as

$$\int \sup_{y} \left\{ \tilde{u}^{\perp}(x, y) + \chi((x, \varepsilon), y) \right\} d\tilde{P}(\tilde{x}) + \int \sup_{x} \left\{ \tilde{v}^{\perp}(x, y) + \xi(x, (y, \eta)) \right\} d\tilde{Q}(\tilde{y}).$$

2. Also note that the constraint of the minimization problem in (B.1) implies

$$\forall x, y, \quad \tilde{u}^{\perp}(x, y) + \tilde{v}^{\perp}(x, y) \geq \Phi(x, y),$$

which follows directly from the fact that

$$\forall x, \varepsilon, y, \eta, \quad \tilde{u}(x, \varepsilon) - \chi((x, \varepsilon), y) + \tilde{v}(y, \eta) - \xi(x, (y, \eta)) \geq \Phi(x, y).$$

Now define

$$U(x, y) = \tilde{u}^{\perp}(x, y) \quad \text{and} \quad V(x, y) = \tilde{v}^{\perp}(x, y).$$

Given points 1 and 2 above, we can rewrite the value $\mathcal{W}$ as

$$\mathcal{W} = \inf_{(U,V) \in \mathcal{A}} \left\{ \int U^{\top}(\tilde{x}) \, d\tilde{P}(\tilde{x}) + \int V^{\top}(\tilde{y}) \, d\tilde{Q}(\tilde{y}) \right\}.$$

QED.

We are now in a position to prove the theorem.
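Before turning to the theorem, note that the conjugacy operations $\perp$ and $\top$ are straightforward to compute when the underlying sets are finite, and Proposition 9 can then be checked by brute force. The sketch below is illustrative only: the sets, the coupling $\omega$, and the function $\psi$ are all invented for the example.

```python
# Generalized Legendre transforms on finite sets X, Y with coupling omega.
# All numbers here are made up for illustration.
X = [0.0, 1.0, 2.0]
Y = [0.0, 0.5, 1.0]
omega = lambda x, y: x * y          # bilinear coupling: reduces to ordinary convexity

def lower(psi):
    """psi^perp(y) = min_x { psi(x) - omega(x, y) }."""
    return {y: min(psi[x] - omega(x, y) for x in X) for y in Y}

def upper(zeta):
    """zeta^top(x) = max_y { zeta(y) + omega(x, y) }."""
    return {x: max(zeta[y] + omega(x, y) for y in Y) for x in X}

psi = {0.0: 1.0, 1.0: 0.0, 2.0: 2.0}     # an arbitrary (non-convex) function on X
biconj = upper(lower(psi))               # psi^{perp top}

# Proposition 9: psi^{perp top} <= psi everywhere.
assert all(biconj[x] <= psi[x] + 1e-12 for x in X)

# If psi is omega-convex (psi = zeta^top for some zeta), the biconjugate is exact.
omega_convex_psi = upper({y: -y * y for y in Y})
assert upper(lower(omega_convex_psi)) == omega_convex_psi
print(biconj)
```

With $\omega$ bilinear, as here, $\omega$-convexity reduces to ordinary convexity restricted to the slopes available in $Y$, which is why the biconjugate of the non-convex $\psi$ lies strictly below it.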

Proof of Theorem 1

Start by drawing two samples of size $N$ of men and women from their population distributions $P$ and $Q$; we denote the corresponding values of the observed characteristics $\{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_N\}$. Call $P_N$ and $Q_N$ the corresponding sample distributions; e.g. $P_N$ assigns a mass

$$p_{i,N} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}(x_j = x^*_i)$$

to the value $x^*_i$ of observable characteristics of men. The Law of Large Numbers implies that $P_N$ and $Q_N$ converge in distribution to $P$ and $Q$, the population distributions of the observable types. Now we have, for any possible $x$,

$$\int U^{\top}(\tilde{x}) \, d\tilde{P}_N(\varepsilon \mid X = x) = \frac{1}{N_x} \sum_{i : \, x_i = x} \sup_{j = 1, \ldots, N} \left\{ U(x_i, y_j) + \chi((x_i, \varepsilon_i), y_j) \right\} + o(1),$$

where $N_x$ denotes the number of sampled men with $x_i = x$. As $N$ gets large enough, each of the possible values of observable characteristics of women $y^*_t$ is included in the sample $\{y_1, \ldots, y_N\}$; then the sup in the above expression runs over all such possible values $\{y^*_1, \ldots, y^*_{T_y}\}$. But under (GUI), conditional on $X$ the random variables $\chi((x, \varepsilon), y^*_t)$ are independent Gumbel random variables with scaling factor $\sigma_1$, so we get

$$\frac{1}{\sigma_1} \int U^{\top}(\tilde{x}) \, d\tilde{P}_N(\varepsilon \mid X = x) = \log \sum_{t=1}^{T_y} \exp\left( U(x, y^*_t)/\sigma_1 \right) + o_P(1);$$

hence, taking the limit and integrating over $x$,

$$\int U^{\top}(\tilde{x}) \, d\tilde{P}(\tilde{x}) = \sigma_1 \, E_P \log \sum_{y} \exp\left( U(X, y)/\sigma_1 \right),$$

and similarly

$$\int V^{\top}(\tilde{y}) \, d\tilde{Q}(\tilde{y}) = \sigma_2 \, E_Q \log \sum_{x} \exp\left( V(x, Y)/\sigma_2 \right).$$

QED.
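The log-sum-exp step in this proof is the classical expected-maximum property of Gumbel variables, and it is easy to verify by simulation. The sketch below is illustrative only: the payoff vector $U$ and the scale $\sigma$ are invented. It checks that $E[\max_t \{U_t + \sigma \varepsilon_t\}]$, with $\varepsilon_t$ i.i.d. standard Gumbel draws, equals $\sigma \log \sum_t \exp(U_t/\sigma)$ up to the additive constant $\sigma\gamma$ ($\gamma$ the Euler-Mascheroni constant), which disappears once the Gumbel draws are centered.

```python
import math, random

random.seed(0)
sigma = 0.7                      # scale of the heterogeneity (made up)
U = [0.0, 0.3, 1.1]              # systematic payoffs (made up)
gamma = 0.5772156649015329       # Euler-Mascheroni constant

def gumbel():
    # Standard Gumbel draw by inverse transform.
    return -math.log(-math.log(random.random()))

n = 100_000
mc = sum(max(u + sigma * gumbel() for u in U) for _ in range(n)) / n
closed_form = sigma * math.log(sum(math.exp(u / sigma) for u in U)) + sigma * gamma
assert abs(mc - closed_form) < 0.03
print(mc, closed_form)
```

The same identity, applied type by type, is what turns the empirical conditional expectation in the proof into the log-sum-exp expression.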

B.2 Proof of Theorem 2

Proof

By Theorem 1, we have

$$W_N = \inf_{U(x,y) + V(x,y) \geq \Phi(x,y) \; \forall x,y} \; \sigma_1 \sum_x p(x) \log\left( \sum_y \exp(U(x,y)/\sigma_1) \right) + \sigma_2 \sum_y q(y) \log\left( \sum_x \exp(V(x,y)/\sigma_2) \right).$$

Introducing multipliers $\pi(x,y) \geq 0$ for the constraints,

$$W_N = \inf_{U, V} \sup_{\pi(x,y) \geq 0} \; \sigma_1 \sum_x p(x) \log\left( \sum_y \exp(U(x,y)/\sigma_1) \right) + \sigma_2 \sum_y q(y) \log\left( \sum_x \exp(V(x,y)/\sigma_2) \right) + \sum_{x,y} \pi(x,y) \left( \Phi(x,y) - U(x,y) - V(x,y) \right)$$

$$= \sup_{\pi(x,y) \geq 0} \left\{ \sum_{x,y} \pi(x,y) \Phi(x,y) + \inf_{U(\cdot,\cdot)} F(U) + \inf_{V(\cdot,\cdot)} G(V) \right\},$$

where

$$F(U) = \sigma_1 \sum_x p(x) \log\left( \sum_y \exp(U(x,y)/\sigma_1) \right) - \sum_{x,y} \pi(x,y) U(x,y)$$

$$G(V) = \sigma_2 \sum_y q(y) \log\left( \sum_x \exp(V(x,y)/\sigma_2) \right) - \sum_{x,y} \pi(x,y) V(x,y).$$

Clearly, the optimal $U(\cdot,\cdot)$ and $V(\cdot,\cdot)$ in the inner minimization problems satisfy

$$\pi(x,y) = p(x) \frac{\exp(U(x,y)/\sigma_1)}{\sum_y \exp(U(x,y)/\sigma_1)} = q(y) \frac{\exp(V(x,y)/\sigma_2)}{\sum_x \exp(V(x,y)/\sigma_2)}; \quad \text{(B.2)}$$

note that these equations imply that $\sum_y \pi(x,y) = p(x)$ and $\sum_x \pi(x,y) = q(y)$, so that $\pi \in \mathcal{M}(P, Q)$. Rearranging terms,

$$W_N = \sup_{\pi \in \mathcal{M}(P,Q)} \sum_{x,y} \pi(x,y) \Phi(x,y) - (\sigma_1 + \sigma_2) \sum_{x,y} \pi(x,y) \log \pi(x,y) + \sigma_1 \sum_x p(x) \log p(x) + \sigma_2 \sum_y q(y) \log q(y),$$

and noticing that $\sum_{x,y} \pi(x,y) \log \pi(x,y) = D(\pi) - S(P) - S(Q)$ gives the desired result.

B.3 Proof of Theorem 3

Proof

The result follows directly from the Kantorovich duality (cf. Villani (2009), Ch. 2); it can also be obtained by letting $\sigma_1$ and $\sigma_2$ tend to zero in Theorem 1, and noting that, as $\sigma_1, \sigma_2 \to 0$,

$$\sigma_1 \, E_P\left[ \log \sum_y \exp(U(X,y)/\sigma_1) \right] \to E_P\left[ \max_y U(X,y) \right],$$

$$\sigma_2 \, E_Q\left[ \log \sum_x \exp(V(x,Y)/\sigma_2) \right] \to E_Q\left[ \max_x V(x,Y) \right].$$

B.4 Proof of Theorem 4

Proof

(i) For $\sigma > 0$, the map $\pi \mapsto \sum_{x,y} \pi(x,y) \Phi(x,y) - \sigma I(\pi)$ is strictly concave and finite on the convex domain $\mathcal{M}(P,Q)$; thus there exists a unique $\pi \in \mathcal{M}(P,Q)$ maximizing (2.3).

(ii) Let $\mathcal{B}$ be the set of pairs of functions $(u(x), v(y))$ such that $\sum_x u(x) p(x) = \sum_y v(y) q(y) = 0$, and for $(u,v) \in \mathcal{B}$, let $Z$ be the partition function

$$Z(u,v) := \sum_{x,y} p(x) q(y) \exp\left( \frac{\Phi(x,y) - u(x) - v(y)}{\sigma} \right).$$

Introduce

$$p_{u,v}(x) := -\sigma \frac{\partial \log Z(u,v)}{\partial u(x)} = \frac{\sum_y p(x) q(y) \exp\left( \frac{\Phi(x,y) - u(x) - v(y)}{\sigma} \right)}{\sum_{x,y} p(x) q(y) \exp\left( \frac{\Phi(x,y) - u(x) - v(y)}{\sigma} \right)}$$

$$q_{u,v}(y) := -\sigma \frac{\partial \log Z(u,v)}{\partial v(y)} = \frac{\sum_x p(x) q(y) \exp\left( \frac{\Phi(x,y) - u(x) - v(y)}{\sigma} \right)}{\sum_{x,y} p(x) q(y) \exp\left( \frac{\Phi(x,y) - u(x) - v(y)}{\sigma} \right)};$$

as a result, $p_{u,v}$ and $q_{u,v}$ are probability vectors. By the strict convexity of $\log Z$ on $\mathcal{B}$, there exists a unique vector $(u,v) \in \mathcal{B}$ such that $p = p_{u,v}$, $q = q_{u,v}$, and then

$$\pi(x,y) = \frac{p(x) q(y)}{Z(u,v)} \exp\left( \frac{\Phi(x,y) - u(x) - v(y)}{\sigma} \right) \in \mathcal{M}(P,Q).$$

(iii) Let $\pi \in \mathcal{M}(P,Q)$ be the solution of (2.3). From Expression (B.2) in the proof of Theorem 2, we have

$$\sigma_1 \log \pi(x,y) = U(x,y) + \sigma_1 \log p(x) - \sigma_1 \log\left( \sum_y \exp(U(x,y)/\sigma_1) \right)$$

$$\sigma_2 \log \pi(x,y) = V(x,y) + \sigma_2 \log q(y) - \sigma_2 \log\left( \sum_x \exp(V(x,y)/\sigma_2) \right);$$

thus, summing up,

$$\sigma \log \frac{\pi(x,y)}{p(x) q(y)} = \Phi(x,y) - u(x) - v(y) - c$$

with

$$u(x) = \sigma_2 \log p(x) + \sigma_1 \log\left( \sum_y \exp(U(x,y)/\sigma_1) \right) + c_1$$

$$v(y) = \sigma_1 \log q(y) + \sigma_2 \log\left( \sum_x \exp(V(x,y)/\sigma_2) \right) + c_2,$$

where $c = c_1 + c_2$, and $c_1$ and $c_2$ are constants adjusted so that $(u,v) \in \mathcal{B}$. Hence $\pi$ is a solution of equation (4.1). It follows immediately that $c = W$.

B.5 Proof of Proposition 3

Proof

a) The convexity of $\mathcal{W}$ follows from the fact that it is the supremum of expressions that are linear with respect to $\theta$.

b) As a result, by the envelope theorem, the subdifferential of $\mathcal{W}$ at $\theta$ is the set of pairs $(C(\Pi), -I(\Pi))$ such that $\Lambda \cdot C(\Pi) - \sigma I(\Pi) = \mathcal{W}(\Lambda, \sigma)$. When this set consists of a single point, $\mathcal{W}$ is differentiable at $\theta$ and

$$\frac{\partial \mathcal{W}}{\partial \Lambda_k}(\theta) = C_k(\Pi), \quad \frac{\partial \mathcal{W}}{\partial \sigma}(\theta) = -I(\Pi).$$

B.6 Proof of Proposition 1

Proof

Non-emptiness is obvious. We first show that $\mathcal{F}_c$ is convex. Let $\hat{C}$ and $\tilde{C}$ be two feasible cross-product matrices in $\mathcal{F}_c$; we show that for any $\alpha \in [0,1]$, $\alpha \hat{C} + (1-\alpha) \tilde{C}$ is in $\mathcal{F}_c$. By definition of $\mathcal{F}_c$, there exist $\hat{\pi}$ and $\tilde{\pi}$ in $\mathcal{M}(P,Q)$ such that $\hat{C}_{ij} = E_{\hat{\pi}}[X_i Y_j]$ and $\tilde{C}_{ij} = E_{\tilde{\pi}}[X_i Y_j]$. Let $\bar{\pi} = \alpha \hat{\pi} + (1-\alpha) \tilde{\pi}$. Then $\alpha \hat{C}_{ij} + (1-\alpha) \tilde{C}_{ij} = E_{\bar{\pi}}[X_i Y_j]$, and $\bar{\pi} \in \mathcal{M}(P,Q)$; thus $\alpha \hat{C} + (1-\alpha) \tilde{C} \in \mathcal{F}_c$.

Now we prove that $\mathcal{F}_c$ is closed. Let $C^n$ be a sequence in $\mathcal{F}_c$ converging to $C \in \mathbb{R}^{rs}$, and let $\pi^n$ be the associated matchings. By Theorem 11.5.4 in Dudley (2002), as $\mathcal{M}(P,Q)$ is uniformly tight, $\pi^n$ has a weakly converging subsequence in $\mathcal{M}(P,Q)$; call $\pi$ its limit. Then $C$ is the cross-product associated to $\pi$, so that $C \in \mathcal{F}_c$.

Finally, $\mathcal{F}$ is a closed convex set as it is the upper graph of the function $I_r(C)$ defined in Eq. (3.1).

B.7 Proof of Proposition 2

Proof

$\mathcal{R}$ is the union of the subgradients of $\mathcal{W}$, which was seen in Prop. 1 to be the support function of $\mathcal{F}$: hence $\mathcal{R}$ is the frontier of $\mathcal{F}$.

B.8 Proof of Proposition 3

Proof

a) Positive homogeneity of degree one and convexity follow from the fact that $\mathcal{W}$ is the support function of $\mathcal{F}$; strict convexity for $\sigma > 0$ follows from the strict concavity of $\pi \mapsto -I(\pi)$. Part b) follows directly from the envelope theorem. Part c) results from $I_r(C)$ being the Legendre transform of $\mathcal{W}(\lambda, 1)$, which is strictly convex; hence $I_r$ is convex on $\mathcal{F}_c$, differentiable on its interior, and by the envelope theorem,

$$\frac{\partial I_r}{\partial C_k} = \frac{\Lambda_k}{\sigma}.$$

B.9 Proof of Proposition 4

Proof

a) The sets $\mathcal{R}_c(I)$ are the extreme points of the sets $I_r^{-1}([0, I])$, which are closed convex sets. One has $I_r^{-1}(\{0\}) = \{C^{\infty}\}$, which corresponds to $\Pi = P \otimes Q$, and $I_r^{-1}([0, S(P) + S(Q)]) = \mathcal{F}_c$. Clearly, one has $\hat{C} \in \mathcal{R}_c(I_r(\hat{C}))$. Finally, the form $\sum_k \Lambda_k \, dC_k$ vanishes along $\mathcal{R}_c(I)$, so one has $\sum_k \Lambda_k \, dC_k = 0$; hence the result.

C Proof of Proposition 6

Proof

By equation (4.1), we have

$$\log \frac{\pi(x,y)}{p(x) q(y)} = \frac{\Phi(x,y) - u(x) - v(y) - c}{\sigma},$$

hence

$$\sigma \frac{\partial \log \pi}{\partial \Lambda_k}(x,y) = \phi^k(x,y) - \frac{\partial u(x)}{\partial \Lambda_k} - \frac{\partial v(y)}{\partial \Lambda_k} - \frac{\partial c}{\partial \Lambda_k}.$$

But we have

$$\sum_x \frac{\partial \log \pi^{\Lambda}(x,y)}{\partial \Lambda_k} \pi^{\Lambda}(x,y) = \sum_x \frac{\partial \pi^{\Lambda}(x,y)}{\partial \Lambda_k} = \frac{\partial}{\partial \Lambda_k} \sum_x \pi^{\Lambda}(x,y) = \frac{\partial q(y)}{\partial \Lambda_k} = 0,$$

thus, for all $x$ and $y$,

$$E\left[ \frac{\partial \log \pi^{\Lambda}(X,Y)}{\partial \Lambda_k} \,\Big|\, X = x \right] = 0 \quad \text{and} \quad E\left[ \frac{\partial \log \pi^{\Lambda}(X,Y)}{\partial \Lambda_k} \,\Big|\, Y = y \right] = 0.$$

Hence $\frac{\partial \log \pi}{\partial \Lambda_k}(x,y) \in V^{\circ}$, while $\frac{\partial u(x)}{\partial \Lambda_k} + \frac{\partial v(y)}{\partial \Lambda_k} \in V^{+}$; therefore

$$\phi^k(x,y) = \sigma \frac{\partial \log \pi}{\partial \Lambda_k}(x,y) + \frac{\partial u(x)}{\partial \Lambda_k} + \frac{\partial v(y)}{\partial \Lambda_k} + E\left[ \phi^k(X,Y) \right]$$

is the orthogonal decomposition of $\phi^k(x,y)$ on $V^{\circ} \oplus V^{+} \oplus \mathbb{R}$; hence

$$\frac{\partial u(x)}{\partial \Lambda_k} + \frac{\partial v(y)}{\partial \Lambda_k} = P \phi^k(x,y).$$

D Proof of Proposition 7

Proof

We have

$$\frac{\partial \mathcal{W}(\Lambda, \sigma)}{\partial \Lambda_l} = E^{\pi}\left[ \phi^l(X,Y) \right],$$

hence

$$\frac{\partial^2 \mathcal{W}(\Lambda, \sigma)}{\partial \Lambda_k \partial \Lambda_l} = E^{\pi}\left[ \phi^l(X,Y) \frac{\partial \log \pi}{\partial \Lambda_k}(X,Y) \right] = \sigma E^{\pi}\left[ \frac{\partial \log \pi}{\partial \Lambda_k}(X,Y) \frac{\partial \log \pi}{\partial \Lambda_l}(X,Y) \right].$$

Further, by the orthogonality of $V^{\circ}$ and $V^{+}$,

$$\operatorname{cov}\left( \phi^k(X,Y), \phi^l(X,Y) \right) = \sigma^2 \operatorname{cov}\left( \frac{\partial \log \pi}{\partial \Lambda_k}(X,Y), \frac{\partial \log \pi}{\partial \Lambda_l}(X,Y) \right) + \operatorname{cov}\left( P\phi^k(X,Y), P\phi^l(X,Y) \right).$$

QED.
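The first identity, $\partial \mathcal{W} / \partial \Lambda_l = E^{\pi}[\phi^l]$, is easy to check numerically on a toy model: solve the entropic problem by iterative proportional fitting (the procedure whose convergence is studied in Rüschendorf (1995)) and compare a finite difference of $\mathcal{W}$ in $\Lambda$ with the matched moment of $\phi$. A minimal sketch; the margins, the basis function $\phi$, and $\sigma$ below are all invented for the illustration.

```python
import math

# Toy check of dW/dLambda = E_pi[phi] for the entropic matching problem.
# All data invented for illustration.
p = [0.5, 0.3, 0.2]                          # margin over men's types
q = [0.4, 0.6]                               # margin over women's types
phi = [[1.0, 0.0], [0.2, 0.7], [0.5, 0.1]]   # a single basis surplus function
sigma = 0.4

def solve_W(lam, iters=2000):
    """Return (W, pi) for surplus lam*phi via iterative proportional fitting."""
    K = [[p[i] * q[j] * math.exp(lam * phi[i][j] / sigma)
          for j in range(len(q))] for i in range(len(p))]
    for _ in range(iters):
        for i in range(len(p)):                      # fit rows to p
            s = sum(K[i])
            K[i] = [k * p[i] / s for k in K[i]]
        for j in range(len(q)):                      # fit columns to q
            s = sum(K[i][j] for i in range(len(p)))
            for i in range(len(p)):
                K[i][j] *= q[j] / s
    pi = K
    surplus = sum(pi[i][j] * lam * phi[i][j]
                  for i in range(len(p)) for j in range(len(q)))
    mutual_info = sum(pi[i][j] * math.log(pi[i][j] / (p[i] * q[j]))
                      for i in range(len(p)) for j in range(len(q)))
    return surplus - sigma * mutual_info, pi

lam, h = 1.0, 1e-4
W_plus, _ = solve_W(lam + h)
W_minus, _ = solve_W(lam - h)
_, pi = solve_W(lam)
moment = sum(pi[i][j] * phi[i][j] for i in range(len(p)) for j in range(len(q)))
# Envelope theorem: the derivative of W in Lambda is the matched moment of phi.
assert abs((W_plus - W_minus) / (2 * h) - moment) < 1e-5
print(moment)
```

The same scaffolding, with several basis functions, would let one check the second-derivative and covariance identities above by finite differences as well.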

E Proof of Theorem 5

Proof

We have $\hat{\lambda}_N = \frac{\partial I_r}{\partial C}(\hat{C}_N)$; hence, to first order,

$$\hat{\lambda}_N - \lambda = D^2 I_r \cdot \left( \hat{C}_N - C \right) + o_P\left( 1/\sqrt{N} \right).$$

But as $I_r$ is the Legendre transform of $\mathcal{W}(\cdot, 1)$,

$$D^2 I_r = \left( D^2 \mathcal{W}(\cdot, 1) \right)^{-1},$$

which is given by Proposition 7.

F Connections to Statistical Physics

There is, in fact, a very close parallel between our theory and statistical physics and thermodynamics. We refer to Parisi (1988) for more on statistical physics, and to Mézard and Montanari (2009) for the connection with information theory. To hint at the parallel, let us just mention that the social welfare $\mathcal{W}$ is the analog of a total energy; the term $\sum_k \lambda_k C_k$ is the analog of an internal energy; $I(\pi)$ is the analog of an entropy; the parameter $\sigma$ is the analog of a temperature. A pure matching is the equivalent of a solid state; the points of nondifferentiability of $\mathcal{W}$ are analogs of critical points.

Note that equation (4.1) is known in the mathematical physics literature as the Schrödinger-Bernstein equation, cf. Rüschendorf and Thomsen (1998) and references therein. It was first studied by Erwin Schrödinger as part of his research program on time irreversibility in statistical physics. Interestingly, it also bears some connections with the better-known "Schrödinger equation" in quantum mechanics, due to the same inventor. In fact, as discovered by Zambrini, a dynamic formulation of this equation is the Euclidean Schrödinger equation which arises in Ed Nelson's formulation of "Stochastic Mechanics," a Euclidean analog of quantum mechanics. For more on this topic, see Parisi (1988), Chap. 19.

References

Becker, G. (1973). A theory of marriage: Part I. Journal of Political Economy, 813–846.

Blair, C. (1984). Every finite distributive lattice is a set of stable matchings. Journal of Combinatorial Theory, Series A, 353–356.

Carlier, G., Galichon, A., & Santambrogio, F. (2008). From Knothe's transport to Brenier's map and a continuation method for optimal transport. Preprint available at http://arxiv.org/abs/0810.4153.

Chiappori, P.-A., Salanié, B., Tillman, A., & Weiss, Y. (2008). Assortative matching on the marriage market: A structural investigation. Mimeo, Columbia University.

Choo, E., & Siow, A. (2006). Who marries whom and why. Journal of Political Economy, 175–201.

Decker, C., Stephens, B., & McCann, R. (2009). When do systematic gains uniquely determine the number of marriages between different types in the Choo-Siow matching model? Sufficient conditions for a unique equilibrium. Mimeo, University of Toronto.

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press.

Echenique, F. (2008). What matchings can be stable? The testable implications of matching theory. Mathematics of Operations Research, 757–768.

Fox, J. (2009). Identification in matching games. Technical report, NBER.

Gale, D., & Shapley, L. (1962). College admissions and the stability of marriage. American Mathematical Monthly, 9–14.

Hiriart-Urruty, J.-B., & Lemaréchal, C. (2001). Fundamentals of Convex Analysis. Springer.

Mézard, M., & Montanari, A. (2009). Information, Physics, and Computation. Oxford University Press.

Parisi, G. (1988). Statistical Field Theory. Perseus Books.

Rüschendorf, L. (1995). Convergence of the iterative proportional fitting procedure. Annals of Statistics, 1160–1174.

Rüschendorf, L., & Thomsen, W. (1998). Closedness of sum spaces and the generalized Schrödinger problem. Theory of Probability and its Applications, 483–494.

Siow, A. (2009). Testing Becker's theory of positive assortative matching. Technical report, University of Toronto.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.

Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society.

Villani, C. (2009). Optimal Transport, Old and New. Springer.