Vector quantile regression and optimal transport, from theory to numerics
Guillaume Carlier, Victor Chernozhukov, Gwendoline De Bie, Alfred Galichon
Abstract.
In this paper, we first revisit the Koenker and Bassett variational approach to (univariate) quantile regression, emphasizing its link with latent factor representations and correlation maximization problems. We then review the multivariate extension due to Carlier et al. (2016, 2017), which relates vector quantile regression to an optimal transport problem with mean independence constraints. We introduce an entropic regularization of this problem, implement a gradient descent numerical method and illustrate its feasibility on univariate and bivariate examples.
Keywords: vector quantile regression, optimal transport with mean independence constraints, latent factors, entropic regularization
JEL Classification:
C51, C60
Date: 7/31/2019 (First draft). This draft: 7/25/2020. This is a pre-print of an article published in Empirical Economics (2020). The final authenticated version is available online at: https://doi.org/10.1007/s00181-020-01919-y.

♭ CEREMADE, UMR CNRS 7534, PSL, Université Paris IX Dauphine, Pl. de Lattre de Tassigny, 75775 Paris Cedex 16, FRANCE, and MOKAPLAN Inria Paris; [email protected].
♦ Department of Economics, MIT, 50 Memorial Drive, E52-361B, Cambridge, MA 02142, USA; [email protected].
§ DMA, ENS Paris; [email protected]. Funding from a Région Ile-de-France grant is acknowledged.
† Economics and Mathematics Departments, New York University, 70 Washington Square South, New York, NY 10013, USA; [email protected]. Funding from NSF grant DMS-1716489 is acknowledged.

1. Introduction
Quantile regression, introduced by Koenker and Bassett (1978), has become a very popular tool for analyzing the response of the whole distribution of a dependent variable to a set of predictors. It is a far-reaching generalization of median regression, allowing for a prediction of any quantile of the distribution. We briefly recall classical quantile regression. For t ∈ [0,1], the t-quantile q_t(x) of Y given X = x minimizes the loss function E[t ε_+ + (1−t) ε_− | X], where ε = Y − q_t(x), or equivalently E[ε_+ + (t−1) ε | X]. As a result, if q_t(x) is specified under the parametric form q_t(x) = β_t^⊤ x + α_t, it is natural to estimate α_t and β_t by minimizing the loss

min_{α,β} E[(Y − β^⊤ X − α)_+ + (1−t)(β^⊤ X + α)].

While the previous optimization problem estimates α_t and β_t for pointwise values of t, if one would like to estimate the whole curve t ↦ (α_t, β_t), one simply should construct the loss function by integrating the previous loss functions over t ∈ [0,1], so that t ↦ (α_t, β_t) minimizes

min_{(α_t,β_t), t ∈ [0,1]} ∫_0^1 E[(Y − β_t^⊤ X − α_t)_+ + (1−t)(β_t^⊤ X + α_t)] dt.

As has been known since the original work by Koenker and Bassett, this problem has an (infinite-dimensional) linear programming formulation. Defining P_t = (Y − β_t^⊤ X − α_t)_+ as the positive deviations of Y with respect to the predicted quantiles β_t^⊤ X + α_t, we have P_t ≥ 0 and (Y − β_t^⊤ X − α_t)_− = P_t − Y + β_t^⊤ X + α_t ≥
0, so the problem reformulates as

min_{P_t ≥ 0, β_t, α_t} ∫_0^1 E[P_t + (1−t)(β_t^⊤ X + α_t)] dt  (1.1)
s.t. P_t − Y + β_t^⊤ X + α_t ≥ 0  [V_t]

[Footnote: Whenever we write a variable in brackets after a constraint, as [V_t] in (1.1), we mean that this variable plays the role of a multiplier.]

which we will call the “dual formulation” of the classical quantile regression problem. To the dual formulation corresponds a primal one (dual to the dual), which is formally obtained by a minimax formulation

min_{P_t ≥ 0, β_t, α_t} max_{V_t ≥ 0} ∫_0^1 E[P_t + (1−t)(β_t^⊤ X + α_t) + V_t Y − V_t P_t − V_t β_t^⊤ X − V_t α_t] dt,

thus

max_{V_t ≥ 0} ∫_0^1 E[V_t Y] dt + min_{P_t ≥ 0, β_t, α_t} ∫_0^1 E[(1 − V_t) P_t + β_t^⊤((1 − t − V_t) X) + (1 − t − V_t) α_t] dt,

hence we arrive at the primal formulation

max_{V_t ≥ 0} ∫_0^1 E[Y V_t] dt  (1.2)
s.t. V_t ≤ 1  [P_t ≥ 0]
E[V_t X] = (1−t) E[X]  [β_t]
E[V_t] = 1−t  [α_t]

If V_t and (α_t, β_t) are solutions to the above primal and dual programs, complementary slackness yields 1_{Y > β_t^⊤ X + α_t} ≤ V_t ≤ 1_{Y ≥ β_t^⊤ X + α_t}; hence if (X, Y) has a continuous distribution, then for any (α, β), P(Y − β^⊤ X − α = 0) = 0, and therefore one has almost surely V_t = 1_{Y ≥ β_t^⊤ X + α_t}.

Koenker and Ng (2005) impose a monotonicity constraint on the estimated quantile curves. Indeed, if β_t^⊤ x + α_t is the t-quantile of the conditional distribution of Y given X = x, the curve t ↦ β_t^⊤ x + α_t should be nondecreasing.
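For a fixed t, the finite-sample version of (1.1) is a small linear program. The sketch below (with simulated data and variable names of our own choosing, not taken from the paper) solves it with SciPy and checks that roughly a fraction t of the observations fall below the fitted line:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, t = 500, 0.25
X = rng.normal(size=N)
Y = 1.0 + 2.0 * X + rng.normal(size=N)        # toy location model

# Decision variables: [alpha, beta, P_1, ..., P_N]
# minimize (1/N) sum_i P_i + (1 - t) * (beta * xbar + alpha)
xbar = X.mean()
c = np.concatenate(([1.0 - t, (1.0 - t) * xbar], np.full(N, 1.0 / N)))

# P_i >= Y_i - beta * X_i - alpha   <=>   -alpha - X_i * beta - P_i <= -Y_i
A_ub = np.hstack([-np.ones((N, 1)), -X[:, None], -np.eye(N)])
b_ub = -Y
bounds = [(None, None), (None, None)] + [(0, None)] * N   # alpha, beta free; P >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
alpha, beta = res.x[0], res.x[1]
share_below = np.mean(Y <= alpha + beta * X)
print(round(alpha, 2), round(beta, 2), round(share_below, 3))
```

With Gaussian noise the population coefficients at t = 0.25 are β = 2 and α = 1 + Φ⁻¹(0.25) ≈ 0.33, and, by complementary slackness, the share of observations below the fitted line is approximately t.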
Hence, these authors impose a natural constraint on the dual, that is β_t^⊤ X + α_t ≥ β_{t′}^⊤ X + α_{t′} for t ≥ t′, and they incorporate this

[Footnote: It may seem awkward to start with the “dual” formulation before giving out the “primal” one, and, the “primal” being the dual to the “dual”, this choice of labeling is pretty arbitrary. However, our choice is motivated by consistency with optimal transport theory, introduced below.]
constraint into (1.1), yielding

min_{P_t ≥ 0, N_t ≥ 0, β_t, α_t} ∫_0^1 E[P_t + (1−t)(β_t^⊤ X + α_t)] dt
s.t. P_t − N_t = Y − β_t^⊤ X − α_t  [V_t]
t ≥ t′ ⇒ β_t^⊤ X + α_t ≥ β_{t′}^⊤ X + α_{t′}.

Note that if t ↦ β_t^⊤ x + α_t is nondecreasing, then t ↦ 1_{y ≥ β_t^⊤ x + α_t} should be nonincreasing. Therefore, in that case, V_t should be nonincreasing in t, which allows us to impose a monotonicity constraint on the primal variable V_t instead of a monotonicity constraint on the dual variables β_t and α_t. This is precisely the problem we look at. Consider

max_{V_t} ∫_0^1 E[Y V_t] dt  (1.3)
s.t. V_t ≥ 0  [N_t ≥ 0]
V_t ≤ 1  [P_t ≥ 0]
E[V_t X] = (1−t) E[X]  [β_t]
E[V_t] = 1−t  [α_t]
t ≥ t′ ⇒ V_t ≤ V_{t′}.

Let us now take a look at a sample version of this problem. Here, we observe a sample (X_i, Y_i) for i ∈ {1, ..., N}. We shall discretize the probability space [0,1] into T points, t_1 = 0 < t_2 < ... < t_T ≤ 1. Let x̄ be the 1 × K row vector whose k-th entry is Σ_{1≤i≤N} X_ik / N.
The sample analog of (1.3) is

max_{V_{τi} ≥ 0} (1/N) Σ_{1≤i≤N, 1≤τ≤T} V_{τi} Y_i
s.t. V_{τi} ≤ 1
(1/N) Σ_{1≤i≤N} V_{τi} X_ik = (1 − t_τ) x̄_k
(1/N) Σ_{1≤i≤N} V_{τi} = 1 − t_τ
1 ≥ V_{1i} ≥ V_{2i} ≥ ... ≥ V_{T−1,i} ≥ V_{Ti} ≥ 0.

Denoting by t the T × 1 vector of entries t_τ, and by D the T × T matrix with ones on the main diagonal and −1 just below it, the monotonicity constraints 1 ≥ V_{1i} ≥ ... ≥ V_{Ti} ≥ 0 rewrite as V^⊤ D ≥ 0 together with V^⊤ D 1_T = 1_N, and the program rewrites

max_V (1/N) Σ_{τ,i} V_{τi} Y_i
s.t. (1/N) V X = (1_T − t) x̄
(1/N) V 1_N = 1_T − t
V^⊤ D 1_T = 1_N
V^⊤ D ≥ 0.
Setting π = D^⊤ V / N, µ = D^⊤ (1_T − t) = (1/T, ..., 1/T)^⊤ (for the uniform grid t_τ = (τ−1)/T), U = D^{−1} µ = (1/T, 2/T, ..., 1)^⊤, and ν = 1_N / N, one can reformulate the problem as

max_{π ≥ 0} Σ_{1≤τ≤T, 1≤i≤N} π_τi U_τ Y_i
s.t. Σ_{1≤τ≤T} π_τi = ν_i
Σ_{1≤i≤N} π_τi = µ_τ
Σ_{1≤i≤N} π_τi X_ik = µ_τ x̄_k

which rewrites in the population as

max_{(U,X,Y) ∼ π} E_π[U Y]  (1.4)
s.t. U ∼ U([0,1]), (X, Y) ∼ ν, E[X | U] = E[X].

Note that this is a direct extension of the Monge-Kantorovich problem of optimal transport; in fact it boils down to it when the last constraint is absent. This should not be surprising, given the connection between quantiles and optimal transport recalled below. In the present paper, we introduce the Regularized Vector Quantile Regression (RVQR) problem, which consists of adding an entropic regularization term to the expression (1.4), which yields, for a given data distribution ν,

max_{(U,X,Y) ∼ π} E_π[U Y] − ε E_π[ln π(U, X, Y)]  (1.5)
s.t. U ∼ U([0,1]), (X, Y) ∼ ν, E[X | U] = E[X].

Due to smoothness and regularity, the regularized problem (1.5) enjoys computational and analytical properties that are missing from the original problem (1.4). In particular, the
dual to (1.5) is a smooth, unconstrained problem that can be solved by computational methods. While here, unlike in the context of standard optimal transport, the Kullback-Leibler divergence projection onto the mean-independence constraint is not in closed form, we can use Nesterov's gradient descent acceleration, which gives optimal convergence rates for first-order methods.

The present paper in part provides a survey of previous results, and in part conveys new results. In the vein of the previous papers on the topic (Carlier et al., 2016, 2017), this paper seeks to apply the optimal transport toolbox to quantile regression. In contrast with these papers, a particular focus of the present paper is to propose a regularized version of the problem as well as new computational methods. The two main new contributions of the paper are (1) a connection with shape-constrained classical quantile regression (section 4), and (2) the introduction of the regularized vector quantile regression problem (RVQR) along with a duality theorem for that problem (section 6).

The paper is organized as follows. Section 2 will offer reminders on the notion of quantile; section 3 will review the previous results of Carlier et al. (2016, 2017) on the “specified” case; section 4 offers a new result (theorem 4.1) on the comparison with the shape-constrained classical quantile regression; and section 5 will review results on the multivariate case. Section 6 introduces RVQR and presents results relevant for that problem, in particular a duality result in that case (theorem 6.1).
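To give a flavor of the computations involved, the sketch below sets up a small discrete analogue of the regularized dual. It is a simplified stand-in for the method described above, not the paper's implementation: the two marginal potentials are updated in closed form, Sinkhorn-style, while the mean-independence multiplier is updated by plain gradient steps (the Nesterov acceleration is omitted for brevity); all names and tuning constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, eps = 60, 8, 0.1
X = rng.normal(size=N); X -= X.mean()        # sample analogue of the normalization E[X] = 0
Y = 1.0 + X + 0.5 * rng.normal(size=N)
U = np.arange(1, T + 1) / T                  # grid of quantile levels
mu, nu = np.full(T, 1.0 / T), np.full(N, 1.0 / N)

phi, psi, b = np.zeros(T), np.zeros(N), np.zeros(T)

def plan(phi, psi, b):
    # pi_{tau,i} = exp((U_tau Y_i - phi_tau - psi_i - b_tau X_i) / eps)
    return np.exp((np.outer(U, Y) - phi[:, None] - psi[None, :] - np.outer(b, X)) / eps)

for _ in range(1000):
    # closed-form coordinate updates enforcing the two marginal constraints
    psi = eps * np.log(plan(phi, np.zeros(N), b).sum(axis=0) / nu)
    phi = eps * np.log(plan(np.zeros(T), psi, b).sum(axis=1) / mu)
    # gradient step on the mean-independence multiplier b
    b += 0.5 * (plan(phi, psi, b) @ X)

P = plan(phi, psi, b)
violation = np.abs(P @ X).max()
print(violation)   # conditional-mean constraint violation, driven toward zero
```

At convergence the plan P matches both marginals and (approximately) satisfies the mean-independence constraint, which is the discrete counterpart of the feasibility set of (1.5).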
2. Several characterizations of quantiles
Throughout the paper, (Ω, F, P) will be some fixed nonatomic probability space. Given a random vector Z with values in R^k defined on this space, we will denote by L(Z) the law of Z; given a probability measure θ on R^k, we shall often write Z ∼ θ to express that L(Z) = θ. Independence of two random variables Z_1 and Z_2 will be denoted as Z_1 ⊥⊥ Z_2.

[Footnote: One way to define the nonatomicity of (Ω, F, P) is by the existence of a uniformly distributed random variable on this space; this somehow ensures that the space is rich enough so that there exist random variables with prescribed law. If, on the contrary, the space is finite for instance, only finitely supported probability measures can be realized as the law of such random variables.]
2.1. Quantiles. Let Y be some univariate random variable defined on (Ω, F, P). Denoting by F_Y the distribution function of Y:

F_Y(α) := P(Y ≤ α), ∀α ∈ R,

the quantile function of Y, Q_Y = F_Y^{−1}, is the generalized inverse of F_Y given by the formula:

Q_Y(t) := inf{α ∈ R : F_Y(α) > t} for all t ∈ (0,1). (2.1)

Let us now recall two well-known facts about quantiles:
• α = Q_Y(t) is a solution of the convex minimization problem

min_α {E((Y − α)_+) + α(1−t)}, (2.2)

• there exists a uniformly distributed random variable U such that Y = Q_Y(U). Moreover, among uniformly distributed random variables, U is maximally correlated to Y in the sense that it solves

max {E(VY), V ∼ µ} (2.3)

where µ := U([0,1]). When L(Y) has no atom, i.e. when F_Y is continuous, U is unique and given by U = F_Y(Y). Problem (2.3) is the easiest example of optimal transport problem one can think of. The decomposition of a random variable Y as the composition of a monotone nondecreasing function and a uniformly distributed random variable is called a polar factorization of Y. The existence of such decompositions goes back to Ryff (1970) and the extension to the multivariate case (by optimal transport) is due to Brenier (1991).

We therefore see that there are basically two different approaches to study or estimate quantiles:
• the local or “t by t” approach, which consists, for a fixed probability level t, in using directly formula (2.1) or the minimization problem (2.2) (or some approximation of it); this can be done very efficiently in practice but has the disadvantage of forgetting

[Footnote: In fact, for (2.3) to make sense one needs some integrability of Y, i.e. E(|Y|) < +∞.]
the fundamental global property of the quantile function: it should be monotone in t,
• the global approach (or polar factorization approach), where quantiles of Y are defined as all nondecreasing functions Q for which one can write Y = Q(U) with U uniformly distributed. In this approach, one rather tries to recover directly the whole monotone function Q (or the uniform variable U that is maximally correlated to Y). Therefore this is a global approach for which one should rather use the optimal transport problem (2.3).

2.2. Conditional quantiles.
Let us assume now that, in addition to the random variable Y, we are also given a random vector X ∈ R^N which we may think of as being a list of explanatory variables for Y. We are primarily interested in the dependence between Y and X and in particular the conditional quantiles of Y given X = x. Let us denote by ν the joint law of (X, Y), by m the law of X, and by ν(·|x) the conditional law of Y given X = x:

ν := L(X, Y), m := L(X), ν(·|x) := L(Y | X = x), (2.4)

which in particular yields dν(x, y) = dν(y|x) dm(x). We then denote by F(x, y) = F_{Y|X=x}(y) the conditional cdf:

F(x, y) := P(Y ≤ y | X = x)

and by Q(x, t) the conditional quantile:

Q(x, t) := inf{α ∈ R : F(x, α) > t}, ∀t ∈ (0,1).

For the sake of simplicity, we shall assume that for m = L(X)-almost every x ∈ R^N (m-a.e. x for short),

t ↦ Q(x, t) is continuous and increasing, (2.5)

so that for m-a.e. x, F(x, Q(x, t)) = t for every t ∈ (0,
1) and Q(x, F(x, y)) = y for every y in the support of ν(·|x). Let us now define the random variable

U := F(X, Y), (2.6)

then by construction:

P(U < t | X = x) = P(F(x, Y) < t | X = x) = P(Y < Q(x, t) | X = x) = F(x, Q(x, t)) = t.

We deduce that U is uniformly distributed and independent from X (since its conditional cdf does not depend on x). Moreover, since U = F(X, Y) = F(X, Q(X, U)), it follows from (2.5) that one has the representation

Y = Q(X, U)

in which U can naturally be interpreted as a latent factor. This easy remark leads to a conditional polar factorization of Y through the pointwise relation Y = Q(X, U) with Q(X, ·) nondecreasing and U ∼ µ, U ⊥⊥ X. We would like to emphasize now that there is a variational principle behind this conditional decomposition. Let us indeed consider the variant of the optimal transport problem (2.3) where one further requires U to be independent from the vector of regressors X:

max {E(VY), L(V) = µ, V ⊥⊥ X}. (2.7)

Then we have

Proposition 2.1. If E(|Y|) < +∞ and (2.5) holds, the random variable U defined in (2.6) solves (2.7).

Proof. Let V be admissible for (2.7). Let us define for x ∈ spt(m) and t ∈ [0,1]:

ϕ(x, t) := ∫_0^t Q(x, s) ds.

We first claim that ϕ(X, U) is integrable; indeed we obviously have

|ϕ(X, U)| ≤ ∫_0^1 |Q(X, s)| ds,

hence

E(|ϕ(X, U)|) ≤ ∫_{R^N} ∫_0^1 |Q(x, s)| dµ(s) dm(x) = ∫_{R^N} ∫_R |y| dν(y|x) dm(x) = E(|Y|) < +∞,

where we have used in the second line the fact that the image of µ by Q(x, ·) is ν(·|x). Since ϕ(x, ·) is convex and Y = ∂ϕ/∂u(X, U), the pointwise inequality

ϕ(X, V) − ϕ(X, U) ≥ Y(V − U)

holds almost surely. But since L(X, V) = L(X, U), integrating the previous inequality yields

E(ϕ(X, V) − ϕ(X, U)) = 0 ≥ E(Y(V − U)).
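The construction (2.6) is easy to check by simulation. In the sketch below (a hypothetical heteroskedastic model chosen purely for illustration), U = F(X, Y) is available in closed form, and one verifies the representation Y = Q(X, U), the uniformity of U, and its lack of correlation with X:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=n)
sigma = 1.0 + 0.5 * np.abs(X)              # conditional scale: Y | X = x is N(x, sigma(x)^2)
Y = X + sigma * rng.normal(size=n)         # so Q(x, t) = x + sigma(x) * Phi^{-1}(t)

U = norm.cdf((Y - X) / sigma)              # U = F(X, Y) as in (2.6)

# the representation Y = Q(X, U) holds exactly
assert np.allclose(X + sigma * norm.ppf(U), Y, atol=1e-6)

# U is uniform (Kolmogorov-Smirnov-type distance to the uniform cdf) ...
ks = np.abs(np.sort(U) - (np.arange(1, n + 1) - 0.5) / n).max()
# ... and uncorrelated with X (a consequence of independence)
corr = np.corrcoef(U, X)[0, 1]
print(round(ks, 3), round(corr, 3))
```

Both diagnostics shrink at the usual 1/√n rate as the sample grows, consistent with U ∼ U([0,1]) and U ⊥⊥ X.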
3. Specified and quasi-specified quantile regression
3.1. Specified quantile regression.
Since the seminal work of Koenker and Bassett (1978), it has been widely accepted that a convenient way to estimate conditional quantiles is to stipulate an affine form with respect to x for the conditional quantile. Since a quantile function should be monotone in its second argument, this leads to the following definition:

Definition 3.1.
Quantile regression is specified if there exist (α, β) ∈ C([0,1], R) × C([0,1], R^N) such that for m-a.e. x,

t ↦ α(t) + β(t)^⊤ x is increasing on [0,1] (3.1)

and

Q(x, t) = α(t) + β(t)^⊤ x, (3.2)

for m-a.e. x and every t ∈ [0,1]. If (3.1)-(3.2) hold, quantile regression is specified with regression coefficients (α, β).

Specification of quantile regression can be characterized by the validity of an affine-in-X representation of Y with a latent factor:
Let (α, β) be continuous and satisfy (3.1). Quantile regression is specified with regression coefficients (α, β) if and only if there exists U such that

Y = α(U) + β(U)^⊤ X almost surely, L(U) = µ, U ⊥⊥ X. (3.3)
The fact that specification of quantile regression implies decomposition (3.3) has already been explained in paragraph 2.2. Let us assume (3.3), and compute

F(x, α(t) + β(t)^⊤ x) = P(Y ≤ α(t) + β(t)^⊤ x | X = x)
= P(α(U) + β(U)^⊤ x ≤ α(t) + β(t)^⊤ x | X = x)
= P(U ≤ t | X = x) = P(U ≤ t) = t,

so that Q(x, t) = α(t) + β(t)^⊤ x.

3.2. Quasi-specified quantile regression.
Let us now assume that both X and Y are integrable,

E(‖X‖ + |Y|) < +∞, (3.4)

and normalize, without loss of generality, X in such a way that

E(X) = 0. (3.5)

Koenker and Bassett showed that, for a fixed probability level t, the regression coefficients (α, β) can be estimated by quantile regression, i.e. the minimization problem

inf_{(α,β) ∈ R × R^N} E(ρ_t(Y − α − β^⊤ X)), (3.6)

where the penalty ρ_t is given by ρ_t(z) := t z_+ + (1−t) z_−, with z_+ and z_− denoting the positive and negative parts of z. For further use, note that (3.6) can conveniently be rewritten as

inf_{(α,β) ∈ R × R^N} {E((Y − α − β^⊤ X)_+) + (1−t) α}. (3.7)

As noticed by Koenker and Bassett, this convex program admits as dual formulation

sup {E(V_t Y) : V_t ∈ [0,1], E(V_t) = (1−t), E(V_t X) = 0}. (3.8)
An optimal (α, β) for (3.7) and an optimal V_t in (3.8) are related by the complementary slackness condition:

Y > α + β^⊤ X ⇒ V_t = 1, and Y < α + β^⊤ X ⇒ V_t = 0. (3.9)

Note that α appears naturally as a Lagrange multiplier associated to the constraint E(V_t) = (1−t) and β as a Lagrange multiplier associated to E(V_t X) = 0.

To avoid mixing, i.e. the possibility that V_t takes values in (0,1), we assume that ν = L(X, Y) gives zero mass to nonvertical hyperplanes, i.e.

P(Y = α + β^⊤ X) = 0, ∀(α, β) ∈ R × R^N. (3.10)

We shall also consider a nondegeneracy condition on the (centered) random vector X which says that its law is not supported by any hyperplane:

P(β^⊤ X = 0) < 1, ∀β ∈ R^N \ {0}. (3.11)

Thanks to (3.10), we may simply write

V_t = 1_{Y > α + β^⊤ X} (3.12)

and thus the constraints E(V_t) = (1−t), E(X V_t) = 0 read

E(1_{Y > α + β^⊤ X}) = P(Y > α + β^⊤ X) = (1−t), E(X 1_{Y > α + β^⊤ X}) = 0, (3.13)

which simply are the first-order conditions for (3.7). Any pair (α, β) which solves the optimality conditions (3.13) for the Koenker and Bassett approach will be denoted

α = α_QR(t), β = β_QR(t),

and the variable V_t solving (3.8) given by (3.12) will similarly be denoted V_t^QR:

V_t^QR := 1_{Y > α_QR(t) + β_QR(t)^⊤ X}. (3.14)

[Footnote: If E(‖X‖²) < +∞ then (3.11) amounts to the standard requirement that E(XX^⊤) is nonsingular.]
[Footnote: Uniqueness will be discussed later on.]
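The first-order conditions (3.13) can be verified directly in a simulated model where the regression coefficients are known. In the toy model below (our own choice, not from the paper), Y = U(1 + X) with X ⊥⊥ U, so that α_QR(t) = β_QR(t) = t and V_t^QR reduces to 1_{U > t}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 5000, 0.3
X = rng.uniform(-0.5, 0.5, size=n)          # centered regressor, E(X) = 0
U = rng.uniform(size=n)                     # latent uniform factor, independent of X
Y = U * (1.0 + X)                           # conditional t-quantile: t + t*x, monotone in t

alpha_t, beta_t = t, t                      # known coefficients in this toy model
V_t = (Y > alpha_t + beta_t * X).astype(float)   # V_t^QR of (3.14)

print(round(V_t.mean(), 3), round((X * V_t).mean(), 4))
# (3.13): E(V_t) should be close to 1 - t, and E(X V_t) close to 0
```

Since Y > t(1 + X) is equivalent to U > t here, the two sample moments match (1 − t) and 0 up to sampling noise of order 1/√n.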
Note that in the previous considerations the probability level t is fixed; this is what we called the “t by t” approach. For this approach to be consistent with conditional quantile estimation, if we allow t to vary, we should add an additional monotonicity requirement:
Quantile regression is quasi-specified if there exists, for each t, a solution (α_QR(t), β_QR(t)) of (3.13) (equivalently, of the minimization problem (3.6)) such that t ∈ [0,1] ↦ (α_QR(t), β_QR(t)) is continuous and, for m-a.e. x,

t ↦ α_QR(t) + β_QR(t)^⊤ x is increasing on [0,1]. (3.15)

A first consequence of quasi-specification is given by
Assume (2.5)-(3.4)-(3.5) and (3.10). If quantile regression is quasi-specified and if we define U^QR := ∫_0^1 V_t^QR dt (recall that V_t^QR is given by (3.14)), then:
• U^QR is uniformly distributed,
• X is mean-independent from U^QR, i.e. E(X | U^QR) = E(X) = 0,
• Y = α_QR(U^QR) + β_QR(U^QR)^⊤ X almost surely.

Moreover U^QR solves the correlation maximization problem with a mean-independence constraint:

max {E(VY), L(V) = µ, E(X | V) = 0}. (3.16)

Proof.
Obviously

V_t^QR = 1 ⇒ U^QR ≥ t, and U^QR > t ⇒ V_t^QR = 1,

hence P(U^QR ≥ t) ≥ P(V_t^QR = 1) = P(Y > α_QR(t) + β_QR(t)^⊤ X) = (1−t) and P(U^QR > t) ≤ P(V_t^QR = 1) = (1−t), which proves that U^QR is uniformly distributed and that {U^QR > t} coincides with {V_t^QR = 1} up to a set of null probability. We thus have E(X 1_{U^QR > t}) = E(X V_t^QR) = 0; by a standard approximation argument we deduce that E(X f(U^QR)) = 0 for every f ∈ C([0,1], R), which means that X is mean-independent from U^QR.

As already observed, U^QR > t implies that Y > α_QR(t) + β_QR(t)^⊤ X; in particular Y ≥ α_QR(U^QR − δ) + β_QR(U^QR − δ)^⊤ X for δ > 0. Letting δ → 0^+ and using the continuity of (α_QR, β_QR), we get Y ≥ α_QR(U^QR) + β_QR(U^QR)^⊤ X. The converse inequality is obtained similarly by remarking that U^QR < t implies that Y ≤ α_QR(t) + β_QR(t)^⊤ X.

Let us now prove that U^QR solves (3.16). Take V uniformly distributed, such that X is mean-independent from V, and set V_t := 1_{V > t}; we then have E(X V_t) = 0, E(V_t) = (1−t), but since V_t^QR solves (3.8) we have E(V_t Y) ≤ E(V_t^QR Y). Observing that V = ∫_0^1 V_t dt and integrating the previous inequality with respect to t gives E(VY) ≤ E(U^QR Y), so that U^QR solves (3.16).

[Footnote: If quantile regression is specified and the pair of functions (α, β) is as in definition 3.1, then for every t, (α(t), β(t)) solves the conditions (3.13). This shows that specification implies quasi-specification.]

Let us continue with a uniqueness argument for the mean-independent decomposition given in proposition 3.4:
Assume (2.5)-(3.4)-(3.5)-(3.10) and (3.11). Let us assume that

Y = α(U) + β(U)^⊤ X = ᾱ(Ū) + β̄(Ū)^⊤ X

with:
• both U and Ū uniformly distributed,
• X mean-independent from U and Ū: E(X | U) = E(X | Ū) = 0,
• α, β, ᾱ, β̄ continuous on [0,1],
• (α, β) and (ᾱ, β̄) satisfying the monotonicity condition (3.1),

then α = ᾱ, β = β̄, U = Ū.
Proof.
Let us define for every t ∈ [0,1]:

ϕ(t) := ∫_0^t α(s) ds, b(t) := ∫_0^t β(s) ds.

Let us also define for (x, y) in R^{N+1}:

ψ(x, y) := max_{t ∈ [0,1]} {ty − ϕ(t) − b(t)^⊤ x};

thanks to the monotonicity condition (3.1), the maximization program above is strictly concave in t for every y and m-a.e. x. We then remark that Y = α(U) + β(U)^⊤ X = ϕ′(U) + b′(U)^⊤ X exactly is the first-order condition for the above maximization problem when (x, y) = (X, Y). In other words, we have

ψ(x, y) + b(t)^⊤ x + ϕ(t) ≥ ty, ∀(t, x, y) ∈ [0,1] × R^N × R, (3.17)

with an equality for (x, y, t) = (X, Y, U), i.e.

ψ(X, Y) + b(U)^⊤ X + ϕ(U) = UY, almost surely. (3.18)

Using the fact that L(U) = L(Ū) and the fact that mean-independence gives E(b(U)^⊤ X) = E(b(Ū)^⊤ X) = 0, we have

E(UY) = E(ψ(X, Y) + b(U)^⊤ X + ϕ(U)) = E(ψ(X, Y) + b(Ū)^⊤ X + ϕ(Ū)) ≥ E(ŪY),

but, reversing the roles of U and Ū, we also have E(UY) ≤ E(ŪY), and then

E(ŪY) = E(ψ(X, Y) + b(Ū)^⊤ X + ϕ(Ū)),

so that, thanks to inequality (3.17),

ψ(X, Y) + b(Ū)^⊤ X + ϕ(Ū) = ŪY, almost surely,

which means that Ū solves max_{t ∈ [0,1]} {tY − ϕ(t) − b(t)^⊤ X}, which by strict concavity admits U as unique solution. This proves that U = Ū and thus

α(U) − ᾱ(U) = (β̄(U) − β(U))^⊤ X;

taking the conditional expectation with respect to U on both sides, we then obtain α = ᾱ and thus β(U)^⊤ X = β̄(U)^⊤ X almost surely. We then compute

F(x, α(t) + β(t)^⊤ x) = P(α(U) + β(U)^⊤ X ≤ α(t) + β(t)^⊤ x | X = x)
= P(α(U) + β(U)^⊤ x ≤ α(t) + β(t)^⊤ x | X = x)
= P(U ≤ t | X = x)

and similarly F(x, ᾱ(t) + β̄(t)^⊤ x) = P(U ≤ t | X = x) = F(x, α(t) + β(t)^⊤ x). Thanks to (2.5), we deduce that β(t)^⊤ x = β̄(t)^⊤ x for m-a.e. x and every t ∈ [0,1], and (3.11) then yields β = β̄.
Corollary 3.6.
Assume (2.5)-(3.4)-(3.5)-(3.10) and (3.11). If quantile regression is quasi-specified, the regression coefficients (α_QR, β_QR) are uniquely defined, and if Y can be written as

Y = α(U) + β(U)^⊤ X

for U uniformly distributed, X being mean-independent from U, and (α, β) continuous such that the monotonicity condition (3.1) holds, then necessarily α = α_QR, β = β_QR.

To sum up, we have shown that quasi-specification is equivalent to the validity of the factor linear model:

Y = α(U) + β(U)^⊤ X

for (α, β) continuous and satisfying the monotonicity condition (3.1), and U uniformly distributed and such that X is mean-independent from U. This has to be compared with the decomposition of paragraph 2.2, where U is required to be independent from X but the dependence of Y with respect to U, given X, is given by a nondecreasing function of U which is not necessarily affine in X.

4. Quantile regression without specification
Now we wish to address quantile regression in the case where neither specification nor quasi-specification can be taken for granted. In such a general situation, keeping in mind the remarks from the previous paragraphs, we can think of two natural approaches. The first one consists in studying directly the correlation maximization with a mean-independence constraint (3.16). The second one consists in getting back to the Koenker and Bassett t by t problem (3.8) but adding as an additional global consistency constraint that V_t should be nonincreasing (which we abbreviate as V_t ↓) with respect to t:

sup {E(∫_0^1 V_t Y dt) : V_t ↓, V_t ∈ [0,1], E(V_t) = (1−t), E(V_t X) = 0}. (4.1)

Our aim is to compare these two approaches (and in particular to show that the maximization problems (3.16) and (4.1) have the same value) as well as their dual formulations. Before going further, let us remark that (3.16) can directly be considered in the multivariate case, whereas the monotonicity-constrained problem (4.1) makes sense only in the univariate case.

As proven in Carlier et al. (2016), (3.16) is dual to

inf_{(ψ,ϕ,b)} {E(ψ(X, Y)) + E(ϕ(U)) : ψ(x, y) + ϕ(u) ≥ uy − b(u)^⊤ x}, (4.2)

which can be reformulated as:

inf_{(ϕ,b)} ∫ max_{t ∈ [0,1]} (ty − ϕ(t) − b(t)^⊤ x) ν(dx, dy) + ∫_0^1 ϕ(t) dt, (4.3)

in the sense that

sup(3.16) = inf(4.2) = inf(4.3). (4.4)

The existence of a solution to (4.2) is not straightforward and is established under appropriate assumptions in Carlier et al. (2017) in the multivariate case. The following result shows that there is a t-dependent reformulation of (3.16):

Lemma 4.1.
The value of (3.16) coincides with

sup {E(∫_0^1 V_t Y dt) : V_t ↓, V_t ∈ {0,1}, E(V_t) = (1−t), E(V_t X) = 0}. (4.5)

Proof. Let U be admissible for (3.16) and define V_t := 1_{U > t}; then U = ∫_0^1 V_t dt and obviously (V_t)_t is admissible for (4.5), so we have sup(3.16) ≤ sup(4.5). Conversely, let (V_t)_t be admissible for (4.5) and let V := ∫_0^1 V_t dt; we then have V > t ⇒ V_t = 1 ⇒ V ≥ t. Since E(V_t) = (1−t), this implies that V is uniformly distributed and V_t = 1_{V > t} almost surely, so that E(X 1_{V > t}) = 0, which implies that X is mean-independent from V and thus E(∫_0^1 V_t Y dt) ≤ sup(3.16), so sup(3.16) = sup(4.5).

[Footnote: With a little abuse of notations, when a reference number (A) refers to a maximization (minimization) problem, we will simply write sup(A) (inf(A)) to denote the value of this optimization problem.]

Let us define

C := {v : [0,1] → [0,1], v ↓}.
Let (V_t)_t be admissible for (4.1) and set

v_t(x, y) := E(V_t | X = x, Y = y), V̄_t := v_t(X, Y);

it is obvious that (V̄_t)_t is admissible for (4.1) and by construction E(V̄_t Y) = E(V_t Y). Moreover, the deterministic function (t, x, y) ↦ v_t(x, y) satisfies the following conditions:

for fixed (x, y), t ↦ v_t(x, y) belongs to C, (4.6)

and for a.e. t ∈ [0,1],

∫ v_t(x, y) ν(dx, dy) = (1−t), ∫ v_t(x, y) x ν(dx, dy) = 0. (4.7)

Conversely, if (t, x, y) ↦ v_t(x, y) satisfies (4.6)-(4.7), V_t := v_t(X, Y) is admissible for (4.1) and E(V_t Y) = ∫ v_t(x, y) y ν(dx, dy). All this proves that sup(4.1) coincides with

sup_{(t,x,y) ↦ v_t(x,y)} ∫∫ v_t(x, y) y ν(dx, dy) dt subject to (4.6)-(4.7). (4.8)
Theorem 4.2.
The shape-constrained quantile regression problem (4.1) is related to the correlation maximization with a mean-independence constraint (3.16) by:

sup(3.16) = sup(4.1).

Proof. We know from lemma 4.1 and the remarks above that

sup(3.16) = sup(4.5) ≤ sup(4.1) = sup(4.8).

We now get rid of constraints (4.7) by rewriting (4.8) in sup-inf form as

sup_{v_t satisfies (4.6)} inf_{(α,β)} ∫∫ v_t(x, y)(y − α(t) − β(t)^⊤ x) ν(dx, dy) dt + ∫_0^1 (1−t) α(t) dt.

Recall that one always has sup inf ≤ inf sup, so that sup(4.8) is less than

inf_{(α,β)} sup_{v_t satisf. (4.6)} ∫∫ v_t(x, y)(y − α(t) − β(t)^⊤ x) ν(dx, dy) dt + ∫_0^1 (1−t) α(t) dt
≤ inf_{(α,β)} ∫ (sup_{v ∈ C} ∫_0^1 v(t)(y − α(t) − β(t)^⊤ x) dt) ν(dx, dy) + ∫_0^1 (1−t) α(t) dt.

It follows from Lemma 4.3 below that, for q ∈ L^1(0,1), defining Q(t) := ∫_0^t q(s) ds, one has

sup_{v ∈ C} ∫_0^1 v(t) q(t) dt = max_{t ∈ [0,1]} Q(t).

So setting ϕ(t) := ∫_0^t α(s) ds, b(t) := ∫_0^t β(s) ds and remarking that integrating by parts immediately gives

∫_0^1 (1−t) α(t) dt = ∫_0^1 ϕ(t) dt,

we have

sup_{v ∈ C} ∫_0^1 v(t)(y − α(t) − β(t)^⊤ x) dt + ∫_0^1 (1−t) α(t) dt = max_{t ∈ [0,1]} {ty − ϕ(t) − b(t)^⊤ x} + ∫_0^1 ϕ(t) dt.

This yields

sup(4.8) ≤ inf_{(ϕ,b)} ∫ max_{t ∈ [0,1]} (ty − ϕ(t) − b(t)^⊤ x) ν(dx, dy) + ∫_0^1 ϕ(t) dt = inf(4.3) = sup(3.16),

which ends the proof.

In the previous proof, we have used the elementary result (proven in the appendix):
Lemma 4.3.
Let q ∈ L^1(0,1) and define Q(t) := ∫_0^t q(s) ds for every t ∈ [0,1]; one has

sup_{v ∈ C} ∫_0^1 v(t) q(t) dt = max_{t ∈ [0,1]} Q(t).

5. Vector quantiles, vector quantile regression and optimal transport
We now consider the case where $Y$ is a random vector with values in $\mathbb{R}^d$, $d \ge 2$. Quantiles can still be defined in dimension $d$ (provided $Y$ is integrable, say), and they are related to optimal transport theory. The aim of this section is to briefly summarize the optimal transport approach to quantile regression introduced in Carlier et al. (2016) and further analyzed in their follow-up 2017 paper.

5.1. Brenier's map as a vector quantile.
From now on, we fix as a reference measure the uniform measure on the unit cube $[0,1]^d$, i.e.
\[ \mu_d := \mathcal{U}([0,1]^d). \tag{5.1} \]
Given $Y$, an integrable $\mathbb{R}^d$-valued random variable on $(\Omega, \mathcal{F}, \mathbb{P})$, a remarkable theorem due to Brenier (1991), extended by McCann (1995), implies that there exist a unique $U \sim \mu_d$ and a unique (up to the addition of a constant) convex function $\varphi$ defined on $[0,1]^d$ such that
\[ Y = \nabla \varphi(U). \tag{5.2} \]
The map $\nabla \varphi$ is called the Brenier map between $\mu_d$ and $\mathcal{L}(Y)$.

The convex function $\varphi$ is not necessarily differentiable, but, being convex, it is differentiable at Lebesgue-a.e. point of $[0,1]^d$, so that $\nabla \varphi(U)$ is well defined almost surely. It is worth recalling at this point that the Legendre transform of $\varphi$ is the convex function
\[ \varphi^*(y) := \sup_{u \in [0,1]^d} \{ u^\top y - \varphi(u) \} \tag{5.3} \]
and that the subdifferentials of $\varphi$ and $\varphi^*$ are defined respectively by
\[ \partial \varphi(u) := \{ y \in \mathbb{R}^d : \varphi(u) + \varphi^*(y) = u^\top y \} \quad \text{and} \quad \partial \varphi^*(y) := \{ u \in [0,1]^d : \varphi(u) + \varphi^*(y) = u^\top y \}, \]
so that $\partial \varphi$ and $\partial \varphi^*$ are inverse to each other in the sense that $y \in \partial \varphi(u) \Leftrightarrow u \in \partial \varphi^*(y)$, which is often referred to in convex analysis as the Fenchel reciprocity formula. Note then that (5.2) implies that $U \in \partial \varphi^*(Y)$ almost surely. (Note the analogy with the fact that, in the univariate case, the cdf and the quantile function of $Y$ are generalized inverses of each other.)

If both $\varphi$ and $\varphi^*$ are differentiable, their subdifferentials reduce to the singleton formed by their gradient, and the Fenchel reciprocity formula simply gives $(\nabla \varphi)^{-1} = \nabla \varphi^*$. Recalling that the subdifferential of the convex function $\varphi$ is monotone, in the sense that whenever $y_1 \in \partial \varphi(u_1)$ and $y_2 \in \partial \varphi(u_2)$ one has
\[ (y_1 - y_2)^\top (u_1 - u_2) \ge 0, \]
we see that gradients of convex functions are a generalization to the multivariate case of monotone univariate maps. It is therefore natural, in view of (5.2), to define the vector quantile of $Y$ as:

Definition 5.1.
The vector quantile of $Y$ is the Brenier map between $\mu_d$ and $\mathcal{L}(Y)$.

Now, it is worth noting that the Brenier map (and the uniformly distributed random vector $U$ in (5.2)) are not abstract objects: they have a variational characterization related to optimal transport. Consider indeed
\[ \sup \{ \mathbb{E}(V^\top Y) : V \sim \mu_d \} \tag{5.4} \]
and its dual
\[ \inf_{f,g} \Big\{ \int_{[0,1]^d} f \, d\mu_d + \mathbb{E}(g(Y)) \ : \ f(u) + g(y) \ge u^\top y, \ \forall (u,y) \in [0,1]^d \times \mathbb{R}^d \Big\}; \tag{5.5} \]
then $U$ in (5.2) is the unique solution of (5.4), and any solution $(f,g)$ of the dual (5.5) satisfies $\nabla f = \nabla \varphi$ $\mu_d$-a.e.

5.2. Conditional vector quantiles.
Assume now, as in paragraph 2.2, that we are also given a random vector $X \in \mathbb{R}^N$. As in (2.4), we denote by $\nu$ the law of $(X,Y)$, by $m$ the law of $X$, and by $\nu(\cdot \mid x)$ the conditional law of $Y$ given $X = x$ (the only difference with (2.4) being that $Y$ is $\mathbb{R}^d$-valued). Conditional vector quantiles are then defined as:

Definition 5.2. For $m = \mathcal{L}(X)$-a.e. $x \in \mathbb{R}^N$, the vector conditional quantile of $Y$ given $X = x$ is the Brenier map between $\mu_d := \mathcal{U}([0,1]^d)$ and $\nu(\cdot \mid x) := \mathcal{L}(Y \mid X = x)$. We denote this well-defined map as $\nabla \varphi_x$, where $\varphi_x$ is a convex function on $[0,1]^d$.

(Recall also that, in the case where $\mathbb{E}(\|Y\|^2) < +\infty$, (5.4) is equivalent to minimizing $\mathbb{E}(\|V - Y\|^2)$ among uniformly distributed $V$'s.)

If both $\varphi_x$ and its Legendre transform
\[ \varphi_x^*(y) := \sup_{u \in [0,1]^d} \{ u^\top y - \varphi_x(u) \} \]
are differentiable, one can define the random vector
\[ U := \nabla \varphi_X^*(Y), \]
which is equivalent to
\[ Y = \nabla \varphi_X(U). \tag{5.6} \]
One can check, exactly as in the proof of Proposition 2.1 for the univariate case, that if $Y$ is integrable then $U \sim \mu_d$, $U \perp\!\!\!\perp X$, and $U$ solves
\[ \max \{ \mathbb{E}(V^\top Y) : V \sim \mu_d, \ V \perp\!\!\!\perp X \}. \tag{5.7} \]

5.3. Vector quantile regression.
When one assumes that the convex function $\varphi_x$ is affine with respect to the explanatory variables $x$ (specification):
\[ \varphi_x(u) = \varphi(u) + b(u)^\top x, \]
with $\varphi : [0,1]^d \to \mathbb{R}$ and $b : [0,1]^d \to \mathbb{R}^N$ smooth, the conditional quantile is itself affine and the relation (5.6) takes the form
\[ Y = \nabla \varphi_X(U) = \alpha(U) + \beta(U) X, \quad \text{for } \alpha = \nabla \varphi, \ \beta := Db^\top. \tag{5.8} \]
(A deep regularity theory, initiated by Caffarelli (1992) in the 1990s, gives conditions on $\nu(\cdot \mid x)$ under which the optimal transport map is in fact smooth and/or invertible; we refer the interested reader to the textbook of Figalli (2017) for a detailed and recent account of this regularity theory.)

This affine form moreover implies that $U$ maximizes the correlation with $Y$ not only among uniformly distributed random vectors independent from $X$, but in the larger class of uniformly distributed random vectors for which
\[ \mathbb{E}(X \mid U) = \mathbb{E}(X) = 0 \]
(here we assume that both $X$ and $Y$ are integrable, and that $X$ is centered). This is the reason why the study of
\[ \max \{ \mathbb{E}(V^\top Y) : V \sim \mu_d, \ \mathbb{E}(X \mid V) = 0 \} \tag{5.9} \]
is the main tool in the approach of Carlier et al. (2016, 2017) to vector quantile regression. Let us now briefly summarize the main findings of these two papers.
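Before turning to the linear-programming reformulation, the unconstrained problem (5.4) can be checked numerically in dimension $d = 1$: on a discrete grid, the correlation-maximizing coupling between a uniform grid and a sample is the monotone (quantile) coupling. A minimal sketch using scipy's LP solver (the grid size and the sample are illustrative choices, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

n = 6
u = (np.arange(n) + 0.5) / n                 # grid approximating U([0,1])
rng = np.random.default_rng(0)
y = rng.normal(size=n)                       # atoms of the law of Y

# maximize <pi, u y^T> over couplings with uniform marginals 1/n
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0         # row marginal of pi
    A_eq[n + i, i::n] = 1.0                  # column marginal of pi
b_eq = np.full(2 * n, 1.0 / n)
c = -np.outer(u, y).ravel()                  # linprog minimizes, so negate
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
pi = res.x.reshape(n, n)
match = pi.argmax(axis=1)                    # support of the optimal coupling
print(y[match])                              # nondecreasing in u: the quantile map
```

Since $u$ is sorted increasing, the optimal coupling pairs the $i$-th grid point with the $i$-th smallest atom of $Y$, i.e., `y[match]` coincides with `np.sort(y)`, which is the discrete counterpart of the monotone rearrangement.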
First observe that (5.9) can be recast as a linear program by setting $\pi := \mathcal{L}(U,X,Y)$ and observing that $U$ solves (5.9) if and only if $\pi$ solves
\[ \max_{\pi \in \mathrm{MI}(\mu_d, \nu)} \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} u^\top y \, d\pi(u,x,y), \tag{5.10} \]
where $\mathrm{MI}(\mu_d, \nu)$ is the set of probability measures which satisfy the linear constraints:
• the first marginal of $\pi$ is $\mu_d$, i.e., for every $\varphi \in C([0,1]^d, \mathbb{R})$:
\[ \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} \varphi(u)\, d\pi(u,x,y) = \int_{[0,1]^d} \varphi(u)\, d\mu_d(u); \]
• the second marginal of $\pi$ is $\nu$, i.e., for every $\psi \in C_b(\mathbb{R}^N \times \mathbb{R}^d, \mathbb{R})$:
\[ \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} \psi(x,y)\, d\pi(u,x,y) = \int_{\mathbb{R}^N \times \mathbb{R}^d} \psi(x,y)\, d\nu(x,y) = \mathbb{E}(\psi(X,Y)); \]
• the conditional expectation of $x$ given $u$ is $0$, i.e., for every $b \in C([0,1]^d, \mathbb{R}^N)$:
\[ \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} b(u)^\top x \, d\pi(u,x,y) = 0. \]
The dual of the linear program (5.10) then reads
\[ \inf_{(\varphi, \psi, b)} \int_{[0,1]^d} \varphi \, d\mu_d + \int_{\mathbb{R}^N \times \mathbb{R}^d} \psi(x,y)\, d\nu(x,y) \tag{5.11} \]
subject to the pointwise constraint
\[ \varphi(u) + b(u)^\top x + \psi(x,y) \ge u^\top y; \]
given $b$ and $\varphi$, the lowest $\psi$ fitting this constraint is the (convex in $y$) function
\[ \psi(x,y) := \sup_{u \in [0,1]^d} \{ u^\top y - \varphi(u) - b(u)^\top x \}. \]
The existence of a solution $(\psi, \varphi, b)$ to (5.11) is established in Carlier et al. (2016) (under some assumptions on $\nu$), and optimality of $U$ in (5.9) is characterized by the pointwise complementary slackness condition
\[ \varphi(U) + b(U)^\top X + \psi(X,Y) = U^\top Y \quad \text{almost surely}. \]
If $\varphi$ and $b$ were smooth, we could deduce from the latter that
\[ Y = \nabla \varphi(U) + Db(U)^\top X = \nabla \varphi_X(U), \quad \text{for } \varphi_x(u) := \varphi(u) + b(u)^\top x, \]
which is exactly (5.8). So specification of vector quantile regression is essentially the same as assuming this smoothness and the convexity of $u \mapsto \varphi_x(u) := \varphi(u) + b(u)^\top x$.
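A small discrete instance of the linear program (5.10) can be solved directly with an LP solver; the sketch below (toy sizes and synthetic data, scipy's HiGHS backend) imposes the two marginal constraints and the mean-independence constraint, with $X$ centered so that the right-hand side of the latter is zero, and then inspects the optimizer:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
I, Jn = 4, 9
u = (np.arange(I) + 0.5) / I                 # d = 1 grid, weights mu_i = 1/I
x = rng.normal(size=Jn)
x -= x.mean()                                # center X so that E(X) = 0
y = 0.5 * x + rng.normal(size=Jn)
mu, nu = np.full(I, 1.0 / I), np.full(Jn, 1.0 / Jn)

nv = I * Jn                                  # pi flattened row-major, pi[i, j]
A_eq, b_eq = [], []
for i in range(I):                           # sum_j pi_ij = mu_i
    r = np.zeros(nv); r[i * Jn:(i + 1) * Jn] = 1.0
    A_eq.append(r); b_eq.append(mu[i])
for j in range(Jn):                          # sum_i pi_ij = nu_j
    r = np.zeros(nv); r[j::Jn] = 1.0
    A_eq.append(r); b_eq.append(nu[j])
for i in range(I):                           # mean independence: sum_j pi_ij x_j = 0
    r = np.zeros(nv); r[i * Jn:(i + 1) * Jn] = x
    A_eq.append(r); b_eq.append(0.0)
c = -np.outer(u, y).ravel()                  # maximize sum_ij u_i y_j pi_ij
res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
pi = res.x.reshape(I, Jn)
```

The program is feasible (the product coupling $\pi_{ij} = \mu_i \nu_j$ satisfies all three constraint families), and the solver's output can be checked against the constraints directly, e.g. `pi.sum(axis=1)` against `mu` and `pi @ x` against zero.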
In general, these properties cannot be taken for granted, and what can be deduced from complementary slackness is given by the weaker relations
\[ \varphi_X(U) = \varphi_X^{**}(U), \quad Y \in \partial \varphi_X^{**}(U) \quad \text{almost surely}, \]
where $\varphi_x^{**}$ is the convex envelope of $\varphi_x$ (i.e., the largest convex function below $\varphi_x$); we refer the reader to Carlier et al. (2017) for details.

6. Discretization, regularization, numerical minimization
6.1. Discrete optimal transport with a mean independence constraint. We now turn to a discrete setting for implementation purposes, and consider data $(X_j, Y_j)_{j=1,\dots,J}$ distributed according to the empirical measure $\nu = \sum_{j=1}^J \nu_j \delta_{(x_j, y_j)}$, and a $[0,1]^d$-uniform sample $(U_i)_{i=1,\dots,I}$ with empirical measure $\mu = \sum_{i=1}^I \mu_i \delta_{u_i}$. In this setting, the vector quantile regression primal (5.10) writes
\[ \max_{\pi \in \mathbb{R}_+^{I \times J}} \ \sum_{i=1}^I \sum_{j=1}^J u_i^\top y_j \, \pi_{ij} \]
subject to the marginal constraints $\forall j, \ \sum_i \pi_{ij} = \nu_j$ and $\forall i, \ \sum_j \pi_{ij} = \mu_i$, and the mean-independence constraint between $X$ and $U$: $\forall i, \ \sum_j x_j \pi_{ij} = 0$. Its dual formulation (5.11) reads
\[ \inf_{(\varphi_i)_i, (\psi_j)_j, (b_i)_i} \ \sum_{j=1}^J \psi_j \nu_j + \sum_{i=1}^I \varphi_i \mu_i \]
subject to the constraint $\forall i,j, \ \varphi_i + b_i^\top x_j + \psi_j \ge u_i^\top y_j$.

6.2. The Regularized Vector Quantile Regression (RVQR) problem.
Using the optimality condition $\varphi_i = \max_j \{ u_i^\top y_j - b_i^\top x_j - \psi_j \}$, we obtain the unconstrained formulation
\[ \inf_{(\psi_j)_j, (b_i)_i} \ \sum_j \psi_j \nu_j + \sum_i \mu_i \Big( \max_j \{ u_i^\top y_j - b_i^\top x_j - \psi_j \} \Big). \]
Replacing the maximum with its smoothed version, given a small regularization parameter $\varepsilon$, yields the following smooth convex minimization problem (see Cuturi and Peyré (2016) for more details in connection with entropic regularization of optimal transport), which we call the Regularized Vector Quantile Regression (RVQR) problem:
\[ \inf_{(\psi_j)_j, (b_i)_i} \ J(\psi, b) := \sum_j \psi_j \nu_j + \varepsilon \sum_i \mu_i \log \sum_j \exp \Big( \frac{u_i^\top y_j - b_i^\top x_j - \psi_j}{\varepsilon} \Big). \tag{6.1} \]
We then have the following duality result:

Theorem 6.1.
The RVQR problem
\[ \max_{\pi_{ij} \ge 0} \ \sum_{ij} \pi_{ij} \big( u_i^\top y_j \big) - \varepsilon \sum_{ij} \pi_{ij} \log \pi_{ij} \]
subject to
\[ \sum_j \pi_{ij} = \mu_i, \qquad \sum_i \pi_{ij} = \nu_j, \qquad \sum_j \pi_{ij}\, x_j = \mu_i \sum_j \nu_j\, x_j, \]
has dual (6.1), or equivalently
\[ \min_{(\varphi_i)_i, (\psi_j)_j, (b_i)_i} \ \sum_i \mu_i \varphi_i + \sum_j \psi_j \nu_j + \varepsilon \sum_{ij} \exp \Big( \frac{u_i^\top y_j - \varphi_i - b_i^\top x_j - \psi_j}{\varepsilon} \Big). \]
(Recall that the softmax with regularization parameter $\varepsilon > 0$ of $(\alpha_1, \dots, \alpha_J)$ is given by $\mathrm{Softmax}_\varepsilon(\alpha_1, \dots, \alpha_J) := \varepsilon \log \big( \sum_{j=1}^J e^{\alpha_j / \varepsilon} \big)$. The duality can be proved either by using the Fenchel-Rockafellar duality theorem or by hand: in the primal, there are only finitely many linear constraints, the nonnegativity constraints are not binding because of the entropy, and the existence of Lagrange multipliers for the equality constraints is then straightforward.)

Note that the objective $J$ in (6.1) remains invariant under the two transformations:
• $(b_i, \psi_j) \leftarrow (b_i + c, \psi_j - c^\top x_j)$, where $c \in \mathbb{R}^N$ is a constant translation vector;
• $\psi \leftarrow \psi + \lambda$, where $\lambda \in \mathbb{R}$ is a constant.
These two invariances enable us to normalize $b$ and, for instance, to choose $\lambda$ in such a way that $\sum_{i,j} \exp \big( \frac{u_i^\top y_j - b_i^\top x_j - \psi_j}{\varepsilon} \big) = 1$.

Remark.
This formulation is amenable to stochastic optimization techniques when the number of $(X,Y)$ observations is very large. Stochastic optimization with respect to $\psi$ can be performed using the stochastic averaged gradient algorithm (see Genevay et al. (2016)), considering the equivalent objective
\[ \inf_{\psi, \varphi, b} \ \sum_j h_\varepsilon(x_j, y_j, \psi, \varphi, b)\, \nu_j \]
with
\[ h_\varepsilon(x_j, y_j, \psi, \varphi, b) = \psi_j + \sum_i \mu_i \varphi_i + \varepsilon \sum_i \exp \Big( \frac{u_i^\top y_j - b_i^\top x_j - \psi_j - \varphi_i}{\varepsilon} \Big). \]
Such techniques are not needed to compute $b$, since the number of $U$ samples (i.e., the size of $b$) is set by the user.

6.3. Gradient descent.
As already noted, the objective $J$ in (6.1) is convex and smooth (it is even strictly convex once we have chosen normalizations which take into account the two invariances of $J$ explained above). Its gradient has the explicit form
\[ \frac{\partial J}{\partial \psi_j} = \nu_j - \sum_{i=1}^I \mu_i \frac{e^{\theta_{ij}}}{\sum_{k=1}^J e^{\theta_{ik}}}, \quad \text{where } \theta_{ij} = \frac{1}{\varepsilon} \big( u_i^\top y_j - b_i^\top x_j - \psi_j \big), \tag{6.2} \]
and
\[ \frac{\partial J}{\partial b_i} = -\mu_i \frac{\sum_{k=1}^J x_k\, e^{\theta_{ik}}}{\sum_{k=1}^J e^{\theta_{ik}}}. \tag{6.3} \]
To solve (6.1) numerically, we can therefore use a gradient descent method. An efficient way to do so is to use Nesterov's accelerated gradient algorithm; see Nesterov (1983) and Beck and Teboulle (2009). Note that if $(\psi, b)$ solves (6.1), the fact that the partial derivatives in (6.2)-(6.3) vanish implies that the coupling
\[ \alpha^\varepsilon_{ij} := \mu_i \frac{e^{\theta_{ij}}}{\sum_{k=1}^J e^{\theta_{ik}}} \]
satisfies the fixed-marginals and mean-independence constraints of the primal problem. Since the index $j$ corresponds to observations, it is convenient to introduce, for every $x \in \mathcal{X} := \{x_1, \dots, x_J\}$ and $y \in \mathcal{Y} := \{y_1, \dots, y_J\}$, the probability
\[ \pi^\varepsilon(x, y, u_i) := \sum_{j \,:\, x_j = x,\ y_j = y} \alpha^\varepsilon_{ij}. \]

7. Results
7.1. Quantiles computation. The discrete probability $\pi^\varepsilon$ is an approximation (because of the regularization $\varepsilon$) of $\mathcal{L}(U, X, Y)$, where $U$ solves (5.9). The corresponding approximate quantile $Q^\varepsilon_X(U)$ is given by $\mathbb{E}_{\pi^\varepsilon}[Y \mid X, U]$. In the above discrete setting, this yields
\[ Q^\varepsilon_x(u_i) := \mathbb{E}_{\pi^\varepsilon}[Y \mid X = x, U = u_i] = \sum_{y \in \mathcal{Y}} y\, \frac{\pi^\varepsilon(x, y, u_i)}{\sum_{y' \in \mathcal{Y}} \pi^\varepsilon(x, y', u_i)}. \]

Remark.
To estimate the conditional distribution of $Y$ given $U = u$ and $X = x$, we can use kernel methods. In the experiments, we compute approximate quantiles as means over neighborhoods of $X$ values, to make up for the lack of replicates. This amounts to considering $\mathbb{E}_{\pi^\varepsilon}[Y \mid X \in B_\eta(x), U = u_i]$, where $B_\eta(x)$ is a Euclidean ball of radius $\eta$ centered at $x$.

7.2. Empirical illustrations.
We demonstrate the use of this approach on a series of health-related experiments. We use the "ANSUR II" dataset (Anthropometric Survey of US Army Personnel), which can be found online. This dataset is one of the most comprehensive publicly available datasets on body size and shape, containing 93 measurements for 4,082 male adult US military personnel. It allows us to easily build multivariate dependent variables.

7.3. One-dimensional VQR.
We start with one-dimensional dependent variables ($d = 1$), namely Weight ($Y_1$) and Thigh circumference ($Y_2$), explained by $X = (1, \text{Height})$, to allow for comparison with the classical quantile regression of Koenker and Bassett (1978). Figure 1 displays the results of our method compared to the classical approach, for different height quantiles (10%, 30%, 60%, 90%). Figure 1 is computed with a "soft" potential $\varphi$, while Table 1 depicts the difference with its "hard" counterpart (see the beginning of Section 6.2). Figure 2 and Table 2 detail the impact of the regularization strength on these quantiles.
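The computations behind these experiments can be sketched end to end: minimize the RVQR objective (6.1) by gradient descent using the gradients (6.2)-(6.3), then inspect the resulting coupling. The snippet below is a self-contained sketch on synthetic data; the sizes, the data-generating process, the step size, and the iteration count are illustrative choices, not the paper's experimental settings (which use the ANSUR II sample and Nesterov acceleration):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
I, Jn, eps = 8, 20, 1.0
u = ((np.arange(I) + 0.5) / I)[:, None]           # quantile grid (d = 1), mu_i = 1/I
x = rng.uniform(-1.0, 1.0, size=(Jn, 1))
x -= x.mean(axis=0)                               # center X so that E(X) = 0
y = 0.5 + 0.5 * x + 0.1 * rng.normal(size=(Jn, 1))
mu, nu = np.full(I, 1.0 / I), np.full(Jn, 1.0 / Jn)

def coupling(psi, b):
    # alpha_ij = mu_i exp(theta_ij) / sum_k exp(theta_ik), theta as in (6.2)
    theta = (u @ y.T - b @ x.T - psi[None, :]) / eps
    return mu[:, None] * np.exp(theta - logsumexp(theta, axis=1, keepdims=True))

psi, b = np.zeros(Jn), np.zeros((I, 1))
lr = 0.2                                          # plain gradient descent
for _ in range(20000):
    alpha = coupling(psi, b)
    psi -= lr * (nu - alpha.sum(axis=0))          # gradient (6.2)
    b -= lr * (-(alpha @ x))                      # gradient (6.3)

alpha = coupling(psi, b)
print(np.abs(alpha.sum(axis=0) - nu).max())       # second marginal close to nu
print(np.abs(alpha @ x).max())                    # mean independence (X centered)
```

At a minimizer both printed residuals vanish, reflecting the fact that the stationarity of (6.2)-(6.3) is exactly the fixed-marginals and mean-independence constraints of the primal; by construction, `alpha.sum(axis=1)` equals `mu` at every iteration.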
Figure 1.
Comparison between one-dimensional VQR (regularized dual, in dashed red, computed with a "soft" $\varphi$) and the classical approach (green), with (i) $Y$ = Weight (left) or (ii) $Y$ = Thigh circumference, and $X = (1, \text{Height})$. Quantiles are plotted for different height quantiles (10%, 30%, 60%, 90%). The regularization strength is $\varepsilon = 0.1$. Chosen grid size is $n = 20$.

[Table 1 entries: $\|Q_{\mathrm{soft}} - Q_{\mathrm{hard}}\| / \|Q_{\mathrm{soft}}\|$ at height quantiles $X$ = 10%, 30%, 60%, 90%, across regularization strengths $\varepsilon$]

Table 1.
Relative error between one-dimensional VQR with a "soft" computation of $\varphi$ and its "hard" counterpart, with $Y$ = Weight and $X = (1, \text{Height})$, for different height quantiles (10%, 30%, 60%, 90%), depending on the regularization strength $\varepsilon$. Chosen grid size is $n = 20$.

7.4. Multi-dimensional VQR.
In contrast, multivariate quantile regression explains the joint dependence of $Y = (Y_1, Y_2)$ on $X = (1, \text{Height})$. Figures 4 and 5 (each corresponding to an explained component, either $Y_1$ or $Y_2$) depict how smoothing operates in higher dimension for different Height quantiles (10%, 50% and 90%), compared to a previous

[Figure 2 panels: increasing regularization strengths, up to $\varepsilon = 1$]

Figure 2.
Regularized one-dimensional VQR, dual (dashed red), compared to classical QR (green), with $Y$ = Weight regressed on $X = (1, \text{Height})$, for varying regularization strengths $\varepsilon$. Quantiles are plotted for different height quantiles (10%, 30%, 60%, 90%). Chosen grid size is $n = 20$.

[Table 2 entries: $\|Q_{\mathrm{QR}} - Q_{\mathrm{VQR}}\| / \|Q_{\mathrm{QR}}\|$ at height quantiles $X$ = 10%, 30%, 60%, 90%, across regularization strengths $\varepsilon$]

Table 2.
Relative error between one-dimensional VQR and the classical QR approach, with $Y$ = Weight and $X = (1, \text{Height})$, for different height quantiles (10%, 30%, 60%, 90%), depending on the regularization strength $\varepsilon$. Chosen grid size is $n = 20$.

unregularized approach of Carlier et al. (2016). Figure 3 details computational times in 2D, using an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz.
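The quantile curves and surfaces shown in these figures are read off the fitted coupling via the conditional-mean formula of Section 7.1. A toy sketch of that extraction step (the coupling below is a random array standing in for a fitted $\pi^\varepsilon$, and $X$ is discrete so that exact replicates exist; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
I, Jn = 6, 80
u = (np.arange(I) + 0.5) / I                 # quantile grid
x = rng.choice([0.0, 1.0], size=Jn)          # discrete X with replicates
y = x + 0.05 * rng.normal(size=Jn)           # Y shifts up with X
alpha = rng.random((I, Jn))
alpha /= alpha.sum()                         # toy stand-in for pi^eps

def Q_eps(xq):
    # Q^eps_x(u_i) = sum_y y pi^eps(x, y, u_i) / sum_y' pi^eps(x, y', u_i)
    m = (x == xq)
    w = alpha[:, m]                          # weights pi^eps(xq, y_j, u_i)
    return (w * y[m]).sum(axis=1) / w.sum(axis=1)

print(Q_eps(0.0))                            # entries near 0 for every u_i
print(Q_eps(1.0))                            # entries near 1 for every u_i
```

With a fitted coupling in place of the random `alpha`, `Q_eps(x)[i]` is the approximate conditional vector quantile $Q^\varepsilon_x(u_i)$; the ball-averaged variant of Section 7.1 replaces the mask `x == xq` by membership in $B_\eta(x)$.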
Figure 3.
Comparison of computational times between the unregularized case (using Gurobi's barrier logging) and the regularized case, for a varying number of predictors in 2D. In the latter case, the reported time is the time needed to reach a fixed error threshold (in $\|\cdot\|$) between two iterates of the transport plan, for $\varepsilon = 0.1$. Chosen grid size is $n = 10$ (per axis).

[Figure 4 panels: rows for Small, Medium, and Tall height; columns Unregularized and Regularized dual]
Figure 4.
Two-dimensional regularized quantile regression of $Y$ = (Weight, Thigh) explained by $X = (1, \text{Height})$. Quantiles of $Y_1$ = Weight are plotted for different height quantiles: 10% (bottom), 50% (middle) and 90% (top). Chosen grid size is $n = 10$ (per axis) and the regularization strength is $\varepsilon = 0.1$.

[Figure 5 panels: rows for Small, Medium, and Tall height; columns Unregularized and Regularized dual]
Figure 5.
Two-dimensional regularized quantile regression of $Y$ = (Weight, Thigh) explained by $X = (1, \text{Height})$. Quantiles of $Y_2$ = Thigh are plotted for different height quantiles: 10% (bottom), 50% (middle) and 90% (top). Chosen grid size is $n = 10$ (per axis) and the regularization strength is $\varepsilon = 0.1$.

Appendix
Proof of Lemma 4.3.
Since $\mathbf{1}_{[0,t]} \in \mathcal{C}$, one obviously first has
\[ \sup_{v \in \mathcal{C}} \int_0^1 v(s)\, q(s)\, ds \ \ge\ \max_{t \in [0,1]} \int_0^t q(s)\, ds \ =\ \max_{t \in [0,1]} Q(t). \]
Let us now prove the converse inequality, taking an arbitrary $v \in \mathcal{C}$. We first observe that $Q$ is absolutely continuous and that $v$ is of bounded variation, its derivative in the sense of distributions being a bounded nonpositive measure which we denote by $\eta$. Integrating by parts and using the definition of $\mathcal{C}$ then give:
\[ \int_0^1 v(s)\, q(s)\, ds = -\int_0^1 Q\, d\eta + v(1^-)\, Q(1) \ \le\ \Big( \max_{[0,1]} Q \Big) \big( -\eta([0,1]) \big) + v(1^-)\, Q(1) \]
\[ = \Big( \max_{[0,1]} Q \Big) \big( v(0^+) - v(1^-) \big) + v(1^-)\, Q(1) = \Big( \max_{[0,1]} Q \Big)\, v(0^+) + \Big( Q(1) - \max_{[0,1]} Q \Big)\, v(1^-) \ \le\ \max_{[0,1]} Q. \]

References

[1] Beck, A., and Teboulle, M. (2009). "A fast iterative shrinkage-thresholding algorithm for linear inverse problems." SIAM J. Imaging Sci., 2(1), pp. 183-202.
[2] Brenier, Y. (1991). "Polar factorization and monotone rearrangement of vector-valued functions." Comm. Pure Appl. Math., 44(4), pp. 375-417.
[3] Caffarelli, L. A. (1992). "The regularity of mappings with a convex potential." J. Amer. Math. Soc., 5(1), pp. 99-104.
[4] Carlier, G., Chernozhukov, V. and Galichon, A. (2016). "Vector quantile regression: an optimal transport approach." Ann. Statist., 44(3), pp. 1165-1192.
[5] Carlier, G., Chernozhukov, V. and Galichon, A. (2017). "Vector quantile regression beyond the specified case." J. Multivariate Anal., 161, pp. 96-102.
[6] Cuturi, M. and Peyré, G. (2016). "A smoothed dual approach for variational Wasserstein problems." SIAM J. Imaging Sci., 9(1), pp. 320-343.
[7] Figalli, A. (2017). The Monge-Ampère equation and its applications. Zurich Lectures in Advanced Mathematics. European Mathematical Society (EMS), Zürich.
[8] Genevay, A., Cuturi, M., Peyré, G., and Bach, F. (2016). "Stochastic optimization for large-scale optimal transport." In Advances in Neural Information Processing Systems, pp. 3440-3448.
[9] Koenker, R. and Bassett, Jr., G. (1978). "Regression quantiles." Econometrica, 46(1), pp. 33-50.
[10] McCann, R. J. (1995). "Existence and uniqueness of monotone measure-preserving maps." Duke Math. J., 80(2), pp. 309-323.
[11] Nesterov, Y. (1983). "A method of solving a convex programming problem with convergence rate O(1/k^2)." Dokl. Akad. Nauk SSSR, 269(3), pp. 543-547.