Vector quantile regression and optimal transport, from theory to numerics
Guillaume Carlier, Victor Chernozhukov, Gwendoline De Bie, Alfred Galichon
Abstract.
In this paper, we first revisit the Koenker and Bassett variational approach to (univariate) quantile regression, emphasizing its link with latent factor representations and correlation maximization problems. We then review the multivariate extension due to Carlier et al. (2016, 2017), which relates vector quantile regression to an optimal transport problem with mean independence constraints. We introduce an entropic regularization of this problem, implement a gradient descent numerical method and illustrate its feasibility on univariate and bivariate examples.
Keywords: vector quantile regression, optimal transport with mean independence constraints, latent factors, entropic regularization
JEL Classification:
C51, C60
Date: 7/31/2019 (First draft). This draft: 7/25/2020. This is a pre-print of an article published in Empirical Economics (2020). The final authenticated version is available online at: https://doi.org/10.1007/s00181-020-01919-y.

♭ CEREMADE, UMR CNRS 7534, PSL, Université Paris IX Dauphine, Pl. de Lattre de Tassigny, 75775 Paris Cedex 16, FRANCE, and MOKAPLAN Inria Paris; [email protected].
♦ Department of Economics, MIT, 50 Memorial Drive, E52-361B, Cambridge, MA 02142, USA; [email protected].
§ DMA, ENS Paris; [email protected]. Funding from a Région Ile-de-France grant is acknowledged.
† Economics and Mathematics Departments, New York University, 70 Washington Square South, New York, NY 10013, USA; [email protected]. Funding from NSF grant DMS-1716489 is acknowledged.

1. Introduction
Quantile regression, introduced by Koenker and Bassett (1978), has become a very popular tool for analyzing the response of the whole distribution of a dependent variable to a set of predictors. It is a far-reaching generalization of median regression, allowing for a prediction of any quantile of the distribution. We briefly recall classical quantile regression. For t ∈ [0,1], the t-quantile q_t(x) of Y given X = x minimizes the loss function E[t ε_+ + (1−t) ε_− | X], where ε = Y − q_t(x), or equivalently E[ε_+ + (t−1) ε | X]. As a result, if q_t(x) is specified under the parametric form q_t(x) = β_t^⊤ x + α_t, it is natural to estimate α_t and β_t by minimizing the loss

min_{α,β} E[(Y − β^⊤ X − α)_+ + (1−t)(β^⊤ X + α)].

While the previous optimization problem estimates α_t and β_t for pointwise values of t, if one would like to estimate the whole curve t ↦ (α_t, β_t), one simply should construct the loss function by integrating the previous loss functions over t ∈ [0,1], so that t ↦ (α_t, β_t) minimizes

min_{(α_t,β_t), t ∈ [0,1]} ∫_0^1 E[(Y − β_t^⊤ X − α_t)_+ + (1−t)(β_t^⊤ X + α_t)] dt.

As has been known since the original work by Koenker and Bassett, this problem has an (infinite-dimensional) linear programming formulation. Defining P_t = (Y − β_t^⊤ X − α_t)_+ as the positive deviations of Y with respect to the predicted quantiles β_t^⊤ X + α_t, we have P_t ≥ 0 and (Y − β_t^⊤ X − α_t)_− = P_t − Y + β_t^⊤ X + α_t ≥
0, so the problem reformulates as

min_{P_t ≥ 0, β_t, α_t} ∫_0^1 E[P_t + (1−t)(β_t^⊤ X + α_t)] dt  (1.1)
s.t. P_t − Y + β_t^⊤ X + α_t ≥ 0  [V_t]

[Footnote: Whenever we write a variable in brackets after a constraint, as [V_t] in (1.1), we mean that this variable plays the role of a multiplier.]

which we will call the “dual formulation” of the classical quantile regression problem. To the dual formulation corresponds a primal one (dual to the dual), which is formally obtained by a minimax formulation

min_{P_t ≥ 0, β_t, α_t} max_{V_t ≥ 0} ∫_0^1 E[P_t + (1−t)(β_t^⊤ X + α_t) + V_t Y − V_t P_t − V_t β_t^⊤ X − V_t α_t] dt,

thus

max_{V_t ≥ 0} ∫_0^1 E[V_t Y] dt + min_{P_t ≥ 0, β_t, α_t} ∫_0^1 E[(1 − V_t) P_t + β_t^⊤((1 − t − V_t) X) + (1 − t − V_t) α_t] dt,

hence we arrive at the primal formulation

max_{V_t ≥ 0} ∫_0^1 E[Y V_t] dt  (1.2)
s.t. V_t ≤ 1  [P_t ≥ 0]
E[V_t X] = (1−t) E[X]  [β_t]
E[V_t] = 1−t  [α_t]

If V_t and (α_t, β_t) are solutions to the above primal and dual programs, complementary slackness yields 1_{Y > β_t^⊤ X + α_t} ≤ V_t ≤ 1_{Y ≥ β_t^⊤ X + α_t}; hence if (X, Y) has a continuous distribution, then for any (α, β), P(Y − β^⊤ X − α = 0) = 0, and therefore one has almost surely V_t = 1_{Y ≥ β_t^⊤ X + α_t}.

Koenker and Ng (2005) impose a monotonicity constraint on the estimated quantile curves. Indeed, if β_t^⊤ x + α_t is the t-quantile of the conditional distribution of Y given X = x, the curve t ↦ β_t^⊤ x + α_t should be nondecreasing.
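For a fixed t, the finite-sample version of (1.1) is a small linear program. The sketch below (with simulated data and variable names of our own choosing, not taken from the paper) solves it with SciPy and checks that roughly a fraction t of the observations fall below the fitted line:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, t = 500, 0.25
X = rng.normal(size=N)
Y = 1.0 + 2.0 * X + rng.normal(size=N)        # toy location model

# Decision variables: [alpha, beta, P_1, ..., P_N]
# minimize (1/N) sum_i P_i + (1 - t) * (beta * xbar + alpha)
xbar = X.mean()
c = np.concatenate(([1.0 - t, (1.0 - t) * xbar], np.full(N, 1.0 / N)))

# P_i >= Y_i - beta * X_i - alpha   <=>   -alpha - X_i * beta - P_i <= -Y_i
A_ub = np.hstack([-np.ones((N, 1)), -X[:, None], -np.eye(N)])
b_ub = -Y
bounds = [(None, None), (None, None)] + [(0, None)] * N   # alpha, beta free; P >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
alpha, beta = res.x[0], res.x[1]
share_below = np.mean(Y <= alpha + beta * X)
print(round(alpha, 2), round(beta, 2), round(share_below, 3))
```

With Gaussian noise the population coefficients at t = 0.25 are β = 2 and α = 1 + Φ⁻¹(0.25) ≈ 0.33, and, by complementary slackness, the share of observations below the fitted line is approximately t.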
Hence, these authors impose a natural constraint on the dual, that is β_t^⊤ X + α_t ≥ β_{t′}^⊤ X + α_{t′} for t ≥ t′, and they incorporate this

[Footnote: It may seem awkward to start with the “dual” formulation before giving out the “primal” one, and, the “primal” being the dual to the “dual”, this choice of labeling is pretty arbitrary. However, our choice is motivated by consistency with optimal transport theory, introduced below.]
constraint into (1.1), yielding

min_{P_t ≥ 0, N_t ≥ 0, β_t, α_t} ∫_0^1 E[P_t + (1−t)(β_t^⊤ X + α_t)] dt
s.t. P_t − N_t = Y − β_t^⊤ X − α_t  [V_t]
t ≥ t′ ⇒ β_t^⊤ X + α_t ≥ β_{t′}^⊤ X + α_{t′}.

Note that if t ↦ β_t^⊤ x + α_t is nondecreasing, then t ↦ 1_{y ≥ β_t^⊤ x + α_t} should be nonincreasing. Therefore, in that case, V_t should be nonincreasing in t, which allows us to impose a monotonicity constraint on the primal variable V_t instead of a monotonicity constraint on the dual variables β_t and α_t. This is precisely the problem we look at. Consider

max_{V_t} ∫_0^1 E[Y V_t] dt  (1.3)
s.t. V_t ≥ 0  [N_t ≥ 0]
V_t ≤ 1  [P_t ≥ 0]
E[V_t X] = (1−t) E[X]  [β_t]
E[V_t] = 1−t  [α_t]
t ≥ t′ ⇒ V_t ≤ V_{t′}.

Let us now take a look at a sample version of this problem. Here, we observe a sample (X_i, Y_i) for i ∈ {1, ..., N}. We shall discretize the probability space [0,1] into T points, t_1 = 0 < t_2 < ... < t_T ≤ 1. Let x̄ be the 1 × K row vector whose k-th entry is Σ_{1≤i≤N} X_ik / N.
The sample analog of (1.3) is

max_{V_{τi} ≥ 0} (1/N) Σ_{1≤i≤N, 1≤τ≤T} V_{τi} Y_i
s.t. V_{τi} ≤ 1
(1/N) Σ_{1≤i≤N} V_{τi} X_ik = (1 − t_τ) x̄_k
(1/N) Σ_{1≤i≤N} V_{τi} = 1 − t_τ
1 ≥ V_{1i} ≥ V_{2i} ≥ ... ≥ V_{T−1,i} ≥ V_{Ti} ≥ 0.

Denoting by t the T × 1 vector of entries t_τ, and by D the T × T matrix with ones on the main diagonal and −1 just below it, the monotonicity constraints 1 ≥ V_{1i} ≥ ... ≥ V_{Ti} ≥ 0 rewrite as V^⊤ D ≥ 0 together with V^⊤ D 1_T = 1_N, and the program rewrites

max_V (1/N) Σ_{τ,i} V_{τi} Y_i
s.t. (1/N) V X = (1_T − t) x̄
(1/N) V 1_N = 1_T − t
V^⊤ D 1_T = 1_N
V^⊤ D ≥ 0.
Setting π = D^⊤ V / N, µ = D^⊤ (1_T − t) = (1/T, ..., 1/T)^⊤ (for the uniform grid t_τ = (τ−1)/T), U = D^{−1} µ = (1/T, 2/T, ..., 1)^⊤, and ν = 1_N / N, one can reformulate the problem as

max_{π ≥ 0} Σ_{1≤τ≤T, 1≤i≤N} π_τi U_τ Y_i
s.t. Σ_{1≤τ≤T} π_τi = ν_i
Σ_{1≤i≤N} π_τi = µ_τ
Σ_{1≤i≤N} π_τi X_ik = µ_τ x̄_k

which rewrites in the population as

max_{(U,X,Y) ∼ π} E_π[U Y]  (1.4)
s.t. U ∼ U([0,1]), (X, Y) ∼ ν, E[X | U] = E[X].

Note that this is a direct extension of the Monge-Kantorovich problem of optimal transport; in fact it boils down to it when the last constraint is absent. This should not be surprising, given the connection between quantiles and optimal transport recalled below. In the present paper, we introduce the Regularized Vector Quantile Regression (RVQR) problem, which consists of adding an entropic regularization term to the expression (1.4), which yields, for a given data distribution ν,

max_{(U,X,Y) ∼ π} E_π[U Y] − ε E_π[ln π(U, X, Y)]  (1.5)
s.t. U ∼ U([0,1]), (X, Y) ∼ ν, E[X | U] = E[X].

Due to smoothness and regularity, the regularized problem (1.5) enjoys computational and analytical properties that are missing from the original problem (1.4). In particular, the
dual to (1.5) is a smooth, unconstrained problem that can be solved by computational methods. While here, unlike in the context of standard optimal transport, the Kullback-Leibler divergence projection onto the mean-independence constraint is not in closed form, we can use Nesterov's gradient descent acceleration, which gives optimal convergence rates for first-order methods.

The present paper in part provides a survey of previous results, and in part conveys new results. In the vein of the previous papers on the topic (Carlier et al., 2016, 2017), this paper seeks to apply the optimal transport toolbox to quantile regression. In contrast with these papers, a particular focus of the present paper is to propose a regularized version of the problem as well as new computational methods. The two main new contributions of the paper are (1) a connection with shape-constrained classical quantile regression (section 4), and (2) the introduction of the regularized vector quantile regression problem (RVQR) along with a duality theorem for that problem (section 6).

The paper is organized as follows. Section 2 will offer reminders on the notion of quantile; section 3 will review the previous results of Carlier et al. (2016, 2017) on the “specified” case; section 4 offers a new result (theorem 4.1) on the comparison with the shape-constrained classical quantile regression; and section 5 will review results on the multivariate case. Section 6 introduces RVQR and presents results relevant for that problem, in particular a duality result in that case (theorem 6.1).
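To give a flavor of the computations involved, the sketch below sets up a small discrete analogue of the regularized dual. It is a simplified stand-in for the method described above, not the paper's implementation: the two marginal potentials are updated in closed form, Sinkhorn-style, while the mean-independence multiplier is updated by plain gradient steps (the Nesterov acceleration is omitted for brevity); all names and tuning constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, eps = 60, 8, 0.1
X = rng.normal(size=N); X -= X.mean()        # sample analogue of the normalization E[X] = 0
Y = 1.0 + X + 0.5 * rng.normal(size=N)
U = np.arange(1, T + 1) / T                  # grid of quantile levels
mu, nu = np.full(T, 1.0 / T), np.full(N, 1.0 / N)

phi, psi, b = np.zeros(T), np.zeros(N), np.zeros(T)

def plan(phi, psi, b):
    # pi_{tau,i} = exp((U_tau Y_i - phi_tau - psi_i - b_tau X_i) / eps)
    return np.exp((np.outer(U, Y) - phi[:, None] - psi[None, :] - np.outer(b, X)) / eps)

for _ in range(1000):
    # closed-form coordinate updates enforcing the two marginal constraints
    psi = eps * np.log(plan(phi, np.zeros(N), b).sum(axis=0) / nu)
    phi = eps * np.log(plan(np.zeros(T), psi, b).sum(axis=1) / mu)
    # gradient step on the mean-independence multiplier b
    b += 0.5 * (plan(phi, psi, b) @ X)

P = plan(phi, psi, b)
violation = np.abs(P @ X).max()
print(violation)   # conditional-mean constraint violation, driven toward zero
```

At convergence the plan P matches both marginals and (approximately) satisfies the mean-independence constraint, which is the discrete counterpart of the feasibility set of (1.5).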
2. Several characterizations of quantiles
Throughout the paper, (Ω, F, P) will be some fixed nonatomic probability space. Given a random vector Z with values in R^k defined on this space, we will denote by L(Z) the law of Z; given a probability measure θ on R^k, we shall often write Z ∼ θ to express that L(Z) = θ. Independence of two random variables Z_1 and Z_2 will be denoted as Z_1 ⊥⊥ Z_2.

[Footnote: One way to define the nonatomicity of (Ω, F, P) is by the existence of a uniformly distributed random variable on this space; this somehow ensures that the space is rich enough so that there exist random variables with prescribed law. If, on the contrary, the space is finite for instance, only finitely supported probability measures can be realized as the law of such random variables.]
2.1. Quantiles. Let Y be some univariate random variable defined on (Ω, F, P). Denoting by F_Y the distribution function of Y:

F_Y(α) := P(Y ≤ α), ∀α ∈ R,

the quantile function of Y, Q_Y = F_Y^{−1}, is the generalized inverse of F_Y given by the formula:

Q_Y(t) := inf{α ∈ R : F_Y(α) > t} for all t ∈ (0,1). (2.1)

Let us now recall two well-known facts about quantiles:
• α = Q_Y(t) is a solution of the convex minimization problem

min_α {E((Y − α)_+) + α(1−t)}, (2.2)

• there exists a uniformly distributed random variable U such that Y = Q_Y(U). Moreover, among uniformly distributed random variables, U is maximally correlated to Y in the sense that it solves

max {E(VY), V ∼ µ} (2.3)

where µ := U([0,1]). When L(Y) has no atom, i.e. when F_Y is continuous, U is unique and given by U = F_Y(Y). Problem (2.3) is the easiest example of optimal transport problem one can think of. The decomposition of a random variable Y as the composition of a monotone nondecreasing function and a uniformly distributed random variable is called a polar factorization of Y. The existence of such decompositions goes back to Ryff (1970) and the extension to the multivariate case (by optimal transport) is due to Brenier (1991).

We therefore see that there are basically two different approaches to study or estimate quantiles:
• the local or “t by t” approach, which consists, for a fixed probability level t, in using directly formula (2.1) or the minimization problem (2.2) (or some approximation of it); this can be done very efficiently in practice but has the disadvantage of forgetting

[Footnote: In fact, for (2.3) to make sense one needs some integrability of Y, i.e. E(|Y|) < +∞.]
the fundamental global property of the quantile function: it should be monotone in t,
• the global approach (or polar factorization approach), where quantiles of Y are defined as all nondecreasing functions Q for which one can write Y = Q(U) with U uniformly distributed. In this approach, one rather tries to recover directly the whole monotone function Q (or the uniform variable U that is maximally correlated to Y). Therefore this is a global approach for which one should rather use the optimal transport problem (2.3).

2.2. Conditional quantiles.
Let us assume now that, in addition to the random variable Y, we are also given a random vector X ∈ R^N which we may think of as being a list of explanatory variables for Y. We are primarily interested in the dependence between Y and X and in particular the conditional quantiles of Y given X = x. Let us denote by ν the joint law of (X, Y), by m the law of X, and by ν(·|x) the conditional law of Y given X = x:

ν := L(X, Y), m := L(X), ν(·|x) := L(Y | X = x), (2.4)

which in particular yields dν(x, y) = dν(y|x) dm(x). We then denote by F(x, y) = F_{Y|X=x}(y) the conditional cdf:

F(x, y) := P(Y ≤ y | X = x)

and by Q(x, t) the conditional quantile:

Q(x, t) := inf{α ∈ R : F(x, α) > t}, ∀t ∈ (0,1).

For the sake of simplicity, we shall assume that for m = L(X)-almost every x ∈ R^N (m-a.e. x for short),

t ↦ Q(x, t) is continuous and increasing, (2.5)

so that for m-a.e. x, F(x, Q(x, t)) = t for every t ∈ (0,
1) and Q(x, F(x, y)) = y for every y in the support of ν(·|x). Let us now define the random variable

U := F(X, Y), (2.6)

then by construction:

P(U < t | X = x) = P(F(x, Y) < t | X = x) = P(Y < Q(x, t) | X = x) = F(x, Q(x, t)) = t.

We deduce that U is uniformly distributed and independent from X (since its conditional cdf does not depend on x). Moreover, since U = F(X, Y) = F(X, Q(X, U)), it follows from (2.5) that one has the representation

Y = Q(X, U)

in which U can naturally be interpreted as a latent factor. This easy remark leads to a conditional polar factorization of Y through the pointwise relation Y = Q(X, U) with Q(X, ·) nondecreasing and U ∼ µ, U ⊥⊥ X. We would like to emphasize now that there is a variational principle behind this conditional decomposition. Let us indeed consider the variant of the optimal transport problem (2.3) where one further requires U to be independent from the vector of regressors X:

max {E(VY), L(V) = µ, V ⊥⊥ X}. (2.7)

Then we have

Proposition 2.1. If E(|Y|) < +∞ and (2.5) holds, the random variable U defined in (2.6) solves (2.7).

Proof. Let V be admissible for (2.7). Let us define for x ∈ spt(m) and t ∈ [0,1]:

ϕ(x, t) := ∫_0^t Q(x, s) ds.

We first claim that ϕ(X, U) is integrable; indeed we obviously have

|ϕ(X, U)| ≤ ∫_0^1 |Q(X, s)| ds,

hence

E(|ϕ(X, U)|) ≤ ∫_{R^N} ∫_0^1 |Q(x, s)| dµ(s) dm(x) = ∫_{R^N} ∫_R |y| dν(y|x) dm(x) = E(|Y|) < +∞,

where we have used in the second line the fact that the image of µ by Q(x, ·) is ν(·|x). Since ϕ(x, ·) is convex and Y = ∂ϕ/∂u(X, U), the pointwise inequality

ϕ(X, V) − ϕ(X, U) ≥ Y(V − U)

holds almost surely. But since L(X, V) = L(X, U), integrating the previous inequality yields

E(ϕ(X, V) − ϕ(X, U)) = 0 ≥ E(Y(V − U)).
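The construction (2.6) is easy to check by simulation. In the sketch below (a hypothetical heteroskedastic model chosen purely for illustration), U = F(X, Y) is available in closed form, and one verifies the representation Y = Q(X, U), the uniformity of U, and its lack of correlation with X:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=n)
sigma = 1.0 + 0.5 * np.abs(X)              # conditional scale: Y | X = x is N(x, sigma(x)^2)
Y = X + sigma * rng.normal(size=n)         # so Q(x, t) = x + sigma(x) * Phi^{-1}(t)

U = norm.cdf((Y - X) / sigma)              # U = F(X, Y) as in (2.6)

# the representation Y = Q(X, U) holds exactly
assert np.allclose(X + sigma * norm.ppf(U), Y, atol=1e-6)

# U is uniform (Kolmogorov-Smirnov-type distance to the uniform cdf) ...
ks = np.abs(np.sort(U) - (np.arange(1, n + 1) - 0.5) / n).max()
# ... and uncorrelated with X (a consequence of independence)
corr = np.corrcoef(U, X)[0, 1]
print(round(ks, 3), round(corr, 3))
```

Both diagnostics shrink at the usual 1/√n rate as the sample grows, consistent with U ∼ U([0,1]) and U ⊥⊥ X.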
3. Specified and quasi-specified quantile regression
3.1. Specified quantile regression.
Since the seminal work of Koenker and Bassett (1978), it has been widely accepted that a convenient way to estimate conditional quantiles is to stipulate an affine form with respect to x for the conditional quantile. Since a quantile function should be monotone in its second argument, this leads to the following definition:

Definition 3.1.
Quantile regression is specified if there exist (α, β) ∈ C([0,1], R) × C([0,1], R^N) such that for m-a.e. x,

t ↦ α(t) + β(t)^⊤ x is increasing on [0,1] (3.1)

and

Q(x, t) = α(t) + β(t)^⊤ x, (3.2)

for m-a.e. x and every t ∈ [0,1]. If (3.1)-(3.2) hold, quantile regression is specified with regression coefficients (α, β).

Specification of quantile regression can be characterized by the validity of an affine-in-X representation of Y with a latent factor:
Let (α, β) be continuous and satisfy (3.1). Quantile regression is specified with regression coefficients (α, β) if and only if there exists U such that

Y = α(U) + β(U)^⊤ X almost surely, L(U) = µ, U ⊥⊥ X. (3.3)
The fact that specification of quantile regression implies decomposition (3.3) has already been explained in paragraph 2.2. Let us assume (3.3), and compute

F(x, α(t) + β(t)^⊤ x) = P(Y ≤ α(t) + β(t)^⊤ x | X = x)
= P(α(U) + β(U)^⊤ x ≤ α(t) + β(t)^⊤ x | X = x)
= P(U ≤ t | X = x) = P(U ≤ t) = t,

so that Q(x, t) = α(t) + β(t)^⊤ x.

3.2. Quasi-specified quantile regression.
Let us now assume that both X and Y are integrable,

E(‖X‖ + |Y|) < +∞, (3.4)

and normalize, without loss of generality, X in such a way that

E(X) = 0. (3.5)

Koenker and Bassett showed that, for a fixed probability level t, the regression coefficients (α, β) can be estimated by quantile regression, i.e. the minimization problem

inf_{(α,β) ∈ R × R^N} E(ρ_t(Y − α − β^⊤ X)), (3.6)

where the penalty ρ_t is given by ρ_t(z) := t z_+ + (1−t) z_−, with z_+ and z_− denoting the positive and negative parts of z. For further use, note that (3.6) can conveniently be rewritten as

inf_{(α,β) ∈ R × R^N} {E((Y − α − β^⊤ X)_+) + (1−t) α}. (3.7)

As noticed by Koenker and Bassett, this convex program admits as dual formulation

sup {E(V_t Y) : V_t ∈ [0,1], E(V_t) = (1−t), E(V_t X) = 0}. (3.8)
An optimal (α, β) for (3.7) and an optimal V_t in (3.8) are related by the complementary slackness condition:

Y > α + β^⊤ X ⇒ V_t = 1, and Y < α + β^⊤ X ⇒ V_t = 0. (3.9)

Note that α appears naturally as a Lagrange multiplier associated to the constraint E(V_t) = (1−t) and β as a Lagrange multiplier associated to E(V_t X) = 0.

To avoid mixing, i.e. the possibility that V_t takes values in (0,1), we assume that ν = L(X, Y) gives zero mass to nonvertical hyperplanes, i.e.

P(Y = α + β^⊤ X) = 0, ∀(α, β) ∈ R × R^N. (3.10)

We shall also consider a nondegeneracy condition on the (centered) random vector X which says that its law is not supported by any hyperplane:

P(β^⊤ X = 0) < 1, ∀β ∈ R^N \ {0}. (3.11)

Thanks to (3.10), we may simply write

V_t = 1_{Y > α + β^⊤ X} (3.12)

and thus the constraints E(V_t) = (1−t), E(X V_t) = 0 read

E(1_{Y > α + β^⊤ X}) = P(Y > α + β^⊤ X) = (1−t), E(X 1_{Y > α + β^⊤ X}) = 0, (3.13)

which simply are the first-order conditions for (3.7). Any pair (α, β) which solves the optimality conditions (3.13) for the Koenker and Bassett approach will be denoted

α = α_QR(t), β = β_QR(t),

and the variable V_t solving (3.8) given by (3.12) will similarly be denoted V_t^QR:

V_t^QR := 1_{Y > α_QR(t) + β_QR(t)^⊤ X}. (3.14)

[Footnote: If E(‖X‖²) < +∞ then (3.11) amounts to the standard requirement that E(XX^⊤) is nonsingular.]
[Footnote: Uniqueness will be discussed later on.]
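The first-order conditions (3.13) can be verified directly in a simulated model where the regression coefficients are known. In the toy model below (our own choice, not from the paper), Y = U(1 + X) with X ⊥⊥ U, so that α_QR(t) = β_QR(t) = t and V_t^QR reduces to 1_{U > t}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 5000, 0.3
X = rng.uniform(-0.5, 0.5, size=n)          # centered regressor, E(X) = 0
U = rng.uniform(size=n)                     # latent uniform factor, independent of X
Y = U * (1.0 + X)                           # conditional t-quantile: t + t*x, monotone in t

alpha_t, beta_t = t, t                      # known coefficients in this toy model
V_t = (Y > alpha_t + beta_t * X).astype(float)   # V_t^QR of (3.14)

print(round(V_t.mean(), 3), round((X * V_t).mean(), 4))
# (3.13): E(V_t) should be close to 1 - t, and E(X V_t) close to 0
```

Since Y > t(1 + X) is equivalent to U > t here, the two sample moments match (1 − t) and 0 up to sampling noise of order 1/√n.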
Note that in the previous considerations the probability level t is fixed; this is what we called the “t by t” approach. For this approach to be consistent with conditional quantile estimation, if we allow t to vary, we should add an additional monotonicity requirement:
Quantile regression is quasi-specified if there exists, for each t, a solution (α_QR(t), β_QR(t)) of (3.13) (equivalently, of the minimization problem (3.6)) such that t ∈ [0,1] ↦ (α_QR(t), β_QR(t)) is continuous and, for m-a.e. x,

t ↦ α_QR(t) + β_QR(t)^⊤ x is increasing on [0,1]. (3.15)

A first consequence of quasi-specification is given by
Assume (2.5)-(3.4)-(3.5) and (3.10). If quantile regression is quasi-specified and if we define U^QR := ∫_0^1 V_t^QR dt (recall that V_t^QR is given by (3.14)), then:
• U^QR is uniformly distributed,
• X is mean-independent from U^QR, i.e. E(X | U^QR) = E(X) = 0,
• Y = α_QR(U^QR) + β_QR(U^QR)^⊤ X almost surely.

Moreover U^QR solves the correlation maximization problem with a mean-independence constraint:

max {E(VY), L(V) = µ, E(X | V) = 0}. (3.16)

Proof.
Obviously

V_t^QR = 1 ⇒ U^QR ≥ t, and U^QR > t ⇒ V_t^QR = 1,

hence P(U^QR ≥ t) ≥ P(V_t^QR = 1) = P(Y > α_QR(t) + β_QR(t)^⊤ X) = (1−t) and P(U^QR > t) ≤ P(V_t^QR = 1) = (1−t), which proves that U^QR is uniformly distributed and that {U^QR > t} coincides with {V_t^QR = 1} up to a set of null probability. We thus have E(X 1_{U^QR > t}) = E(X V_t^QR) = 0; by a standard approximation argument we deduce that E(X f(U^QR)) = 0 for every f ∈ C([0,1], R), which means that X is mean-independent from U^QR.

As already observed, U^QR > t implies that Y > α_QR(t) + β_QR(t)^⊤ X; in particular Y ≥ α_QR(U^QR − δ) + β_QR(U^QR − δ)^⊤ X for δ > 0. Letting δ → 0^+ and using the continuity of (α_QR, β_QR), we get Y ≥ α_QR(U^QR) + β_QR(U^QR)^⊤ X. The converse inequality is obtained similarly by remarking that U^QR < t implies that Y ≤ α_QR(t) + β_QR(t)^⊤ X.

Let us now prove that U^QR solves (3.16). Take V uniformly distributed, such that X is mean-independent from V, and set V_t := 1_{V > t}; we then have E(X V_t) = 0, E(V_t) = (1−t), but since V_t^QR solves (3.8) we have E(V_t Y) ≤ E(V_t^QR Y). Observing that V = ∫_0^1 V_t dt and integrating the previous inequality with respect to t gives E(VY) ≤ E(U^QR Y), so that U^QR solves (3.16).

[Footnote: If quantile regression is specified and the pair of functions (α, β) is as in definition 3.1, then for every t, (α(t), β(t)) solves the conditions (3.13). This shows that specification implies quasi-specification.]

Let us continue with a uniqueness argument for the mean-independent decomposition given in proposition 3.4:
Assume (2.5)-(3.4)-(3.5)-(3.10) and (3.11). Let us assume that

Y = α(U) + β(U)^⊤ X = ᾱ(Ū) + β̄(Ū)^⊤ X

with:
• both U and Ū uniformly distributed,
• X mean-independent from U and Ū: E(X | U) = E(X | Ū) = 0,
• α, β, ᾱ, β̄ continuous on [0,1],
• (α, β) and (ᾱ, β̄) satisfying the monotonicity condition (3.1),

then α = ᾱ, β = β̄, U = Ū.
Proof.
Let us define for every t ∈ [0,1]:

ϕ(t) := ∫_0^t α(s) ds, b(t) := ∫_0^t β(s) ds.

Let us also define for (x, y) in R^{N+1}:

ψ(x, y) := max_{t ∈ [0,1]} {ty − ϕ(t) − b(t)^⊤ x};

thanks to the monotonicity condition (3.1), the maximization program above is strictly concave in t for every y and m-a.e. x. We then remark that Y = α(U) + β(U)^⊤ X = ϕ′(U) + b′(U)^⊤ X exactly is the first-order condition for the above maximization problem when (x, y) = (X, Y). In other words, we have

ψ(x, y) + b(t)^⊤ x + ϕ(t) ≥ ty, ∀(t, x, y) ∈ [0,1] × R^N × R, (3.17)

with an equality for (x, y, t) = (X, Y, U), i.e.

ψ(X, Y) + b(U)^⊤ X + ϕ(U) = UY, almost surely. (3.18)

Using the fact that L(U) = L(Ū) and the fact that mean-independence gives E(b(U)^⊤ X) = E(b(Ū)^⊤ X) = 0, we have

E(UY) = E(ψ(X, Y) + b(U)^⊤ X + ϕ(U)) = E(ψ(X, Y) + b(Ū)^⊤ X + ϕ(Ū)) ≥ E(ŪY),

but, reversing the roles of U and Ū, we also have E(UY) ≤ E(ŪY), and then

E(ŪY) = E(ψ(X, Y) + b(Ū)^⊤ X + ϕ(Ū)),

so that, thanks to inequality (3.17),

ψ(X, Y) + b(Ū)^⊤ X + ϕ(Ū) = ŪY, almost surely,

which means that Ū solves max_{t ∈ [0,1]} {tY − ϕ(t) − b(t)^⊤ X}, which by strict concavity admits U as unique solution. This proves that U = Ū and thus

α(U) − ᾱ(U) = (β̄(U) − β(U))^⊤ X;

taking the conditional expectation with respect to U on both sides, we then obtain α = ᾱ and thus β(U)^⊤ X = β̄(U)^⊤ X almost surely. We then compute

F(x, α(t) + β(t)^⊤ x) = P(α(U) + β(U)^⊤ X ≤ α(t) + β(t)^⊤ x | X = x)
= P(α(U) + β(U)^⊤ x ≤ α(t) + β(t)^⊤ x | X = x)
= P(U ≤ t | X = x)

and similarly F(x, ᾱ(t) + β̄(t)^⊤ x) = P(U ≤ t | X = x) = F(x, α(t) + β(t)^⊤ x). Thanks to (2.5), we deduce that β(t)^⊤ x = β̄(t)^⊤ x for m-a.e. x and every t ∈ [0,1], and (3.11) then yields β = β̄.
Corollary 3.6.
Assume (2.5)-(3.4)-(3.5)-(3.10) and (3.11). If quantile regression is quasi-specified, the regression coefficients (α_QR, β_QR) are uniquely defined, and if Y can be written as

Y = α(U) + β(U)^⊤ X

for U uniformly distributed, X being mean-independent from U, and (α, β) continuous such that the monotonicity condition (3.1) holds, then necessarily α = α_QR, β = β_QR.

To sum up, we have shown that quasi-specification is equivalent to the validity of the factor linear model:

Y = α(U) + β(U)^⊤ X

for (α, β) continuous and satisfying the monotonicity condition (3.1), and U uniformly distributed and such that X is mean-independent from U. This has to be compared with the decomposition of paragraph 2.2, where U is required to be independent from X but the dependence of Y with respect to U, given X, is given by a nondecreasing function of U which is not necessarily affine in X.

4. Quantile regression without specification
Now we wish to address quantile regression in the case where neither specification nor quasi-specification can be taken for granted. In such a general situation, keeping in mind the remarks from the previous paragraphs, we can think of two natural approaches. The first one consists in studying directly the correlation maximization with a mean-independence constraint (3.16). The second one consists in getting back to the Koenker and Bassett t by t problem (3.8) but adding as an additional global consistency constraint that V_t should be nonincreasing (which we abbreviate as V_t ↓) with respect to t:

sup {E(∫_0^1 V_t Y dt) : V_t ↓, V_t ∈ [0,1], E(V_t) = (1−t), E(V_t X) = 0}. (4.1)

Our aim is to compare these two approaches (and in particular to show that the maximization problems (3.16) and (4.1) have the same value) as well as their dual formulations. Before going further, let us remark that (3.16) can directly be considered in the multivariate case, whereas the monotonicity-constrained problem (4.1) makes sense only in the univariate case.

As proven in Carlier et al. (2016), (3.16) is dual to

inf_{(ψ,ϕ,b)} {E(ψ(X, Y)) + E(ϕ(U)) : ψ(x, y) + ϕ(u) ≥ uy − b(u)^⊤ x}, (4.2)

which can be reformulated as:

inf_{(ϕ,b)} ∫ max_{t ∈ [0,1]} (ty − ϕ(t) − b(t)^⊤ x) ν(dx, dy) + ∫_0^1 ϕ(t) dt, (4.3)

in the sense that

sup(3.16) = inf(4.2) = inf(4.3). (4.4)

The existence of a solution to (4.2) is not straightforward and is established under appropriate assumptions in Carlier et al. (2017) in the multivariate case. The following result shows that there is a t-dependent reformulation of (3.16):

Lemma 4.1.
The value of (3.16) coincides with

sup {E(∫_0^1 V_t Y dt) : V_t ↓, V_t ∈ {0,1}, E(V_t) = (1−t), E(V_t X) = 0}. (4.5)

Proof. Let U be admissible for (3.16) and define V_t := 1_{U > t}; then U = ∫_0^1 V_t dt and obviously (V_t)_t is admissible for (4.5), so we have sup(3.16) ≤ sup(4.5). Conversely, let (V_t)_t be admissible for (4.5) and let V := ∫_0^1 V_t dt; we then have V > t ⇒ V_t = 1 ⇒ V ≥ t. Since E(V_t) = (1−t), this implies that V is uniformly distributed and V_t = 1_{V > t} almost surely, so that E(X 1_{V > t}) = 0, which implies that X is mean-independent from V and thus E(∫_0^1 V_t Y dt) ≤ sup(3.16), so sup(3.16) = sup(4.5).

[Footnote: With a little abuse of notations, when a reference number (A) refers to a maximization (minimization) problem, we will simply write sup(A) (inf(A)) to denote the value of this optimization problem.]

Let us define

C := {v : [0,1] → [0,1], v ↓}.
Let (V_t)_t be admissible for (4.1) and set

v_t(x, y) := E(V_t | X = x, Y = y), V̄_t := v_t(X, Y);

it is obvious that (V̄_t)_t is admissible for (4.1) and by construction E(V̄_t Y) = E(V_t Y). Moreover, the deterministic function (t, x, y) ↦ v_t(x, y) satisfies the following conditions:

for fixed (x, y), t ↦ v_t(x, y) belongs to C, (4.6)

and for a.e. t ∈ [0,1],

∫ v_t(x, y) ν(dx, dy) = (1−t), ∫ v_t(x, y) x ν(dx, dy) = 0. (4.7)

Conversely, if (t, x, y) ↦ v_t(x, y) satisfies (4.6)-(4.7), V_t := v_t(X, Y) is admissible for (4.1) and E(V_t Y) = ∫ v_t(x, y) y ν(dx, dy). All this proves that sup(4.1) coincides with

sup_{(t,x,y) ↦ v_t(x,y)} ∫∫ v_t(x, y) y ν(dx, dy) dt subject to (4.6)-(4.7). (4.8)
Theorem 4.2.
The shape-constrained quantile regression problem (4.1) is related to the correlation maximization with a mean-independence constraint (3.16) by:

sup(3.16) = sup(4.1).

Proof. We know from lemma 4.1 and the remarks above that

sup(3.16) = sup(4.5) ≤ sup(4.1) = sup(4.8).

We now get rid of constraints (4.7) by rewriting (4.8) in sup-inf form as

sup_{v_t satisfies (4.6)} inf_{(α,β)} ∫∫ v_t(x, y)(y − α(t) − β(t)^⊤ x) ν(dx, dy) dt + ∫_0^1 (1−t) α(t) dt.

Recall that one always has sup inf ≤ inf sup, so that sup(4.8) is less than

inf_{(α,β)} sup_{v_t satisf. (4.6)} ∫∫ v_t(x, y)(y − α(t) − β(t)^⊤ x) ν(dx, dy) dt + ∫_0^1 (1−t) α(t) dt
≤ inf_{(α,β)} ∫ (sup_{v ∈ C} ∫_0^1 v(t)(y − α(t) − β(t)^⊤ x) dt) ν(dx, dy) + ∫_0^1 (1−t) α(t) dt.

It follows from Lemma 4.3 below that, for q ∈ L^1(0,1), defining Q(t) := ∫_0^t q(s) ds, one has

sup_{v ∈ C} ∫_0^1 v(t) q(t) dt = max_{t ∈ [0,1]} Q(t).

So setting ϕ(t) := ∫_0^t α(s) ds, b(t) := ∫_0^t β(s) ds and remarking that integrating by parts immediately gives

∫_0^1 (1−t) α(t) dt = ∫_0^1 ϕ(t) dt,

we have

sup_{v ∈ C} ∫_0^1 v(t)(y − α(t) − β(t)^⊤ x) dt + ∫_0^1 (1−t) α(t) dt = max_{t ∈ [0,1]} {ty − ϕ(t) − b(t)^⊤ x} + ∫_0^1 ϕ(t) dt.

This yields

sup(4.8) ≤ inf_{(ϕ,b)} ∫ max_{t ∈ [0,1]} (ty − ϕ(t) − b(t)^⊤ x) ν(dx, dy) + ∫_0^1 ϕ(t) dt = inf(4.3) = sup(3.16),

which ends the proof.

In the previous proof, we have used the elementary result (proven in the appendix):
Lemma 4.3.
Let q ∈ L^1(0,1) and define Q(t) := ∫_0^t q(s) ds for every t ∈ [0,1]; one has

sup_{v ∈ C} ∫_0^1 v(t) q(t) dt = max_{t ∈ [0,1]} Q(t).

5. Vector quantiles, vector quantile regression and optimal transport
We now consider the case where $Y$ is a random vector with values in $\mathbb{R}^d$, $d \ge 2$. Quantiles can still be defined in dimension $d$ (provided $Y$ is integrable, say), and they are related to optimal transport theory. The aim of this section is to briefly summarize the optimal transport approach to quantile regression introduced in Carlier et al. (2016) and further analyzed in their follow-up 2017 paper.

5.1. Brenier's map as a vector quantile.
From now on, we fix as a reference measure the uniform measure on the unit cube $[0,1]^d$, i.e.
\[ \mu_d := \mathcal{U}([0,1]^d). \tag{5.1} \]
Given $Y$, an integrable $\mathbb{R}^d$-valued random variable on $(\Omega, \mathcal{F}, \mathbb{P})$, a remarkable theorem due to Brenier (1991), extended by McCann (1995), implies that there exist a unique $U \sim \mu_d$ and a unique (up to the addition of a constant) convex function $\varphi$ defined on $[0,1]^d$ such that
\[ Y = \nabla \varphi(U). \tag{5.2} \]
The map $\nabla \varphi$ is called the Brenier map between $\mu_d$ and $\mathcal{L}(Y)$.

The convex function $\varphi$ is not necessarily differentiable, but, being convex, it is differentiable at Lebesgue-a.e. point of $[0,1]^d$, so that $\nabla \varphi(U)$ is well defined almost surely. It is worth recalling at this point that the Legendre transform of $\varphi$ is the convex function
\[ \varphi^*(y) := \sup_{u \in [0,1]^d} \{ u^\top y - \varphi(u) \} \tag{5.3} \]
and that the subdifferentials of $\varphi$ and $\varphi^*$ are defined respectively by
\[ \partial \varphi(u) := \{ y \in \mathbb{R}^d : \varphi(u) + \varphi^*(y) = u^\top y \} \quad \text{and} \quad \partial \varphi^*(y) := \{ u \in [0,1]^d : \varphi(u) + \varphi^*(y) = u^\top y \}, \]
so that $\partial \varphi$ and $\partial \varphi^*$ are inverse to each other in the sense that $y \in \partial \varphi(u) \Leftrightarrow u \in \partial \varphi^*(y)$, which is often referred to in convex analysis as the Fenchel reciprocity formula. Note then that (5.2) implies that $U \in \partial \varphi^*(Y)$ almost surely. (Note the analogy with the fact that, in the univariate case, the cdf and the quantile function of $Y$ are generalized inverses of each other.)

If both $\varphi$ and $\varphi^*$ are differentiable, their subdifferentials reduce to the singleton formed by their gradient, and the Fenchel reciprocity formula simply gives $(\nabla \varphi)^{-1} = \nabla \varphi^*$. Recalling that the subdifferential of the convex function $\varphi$ is monotone, in the sense that whenever $y_1 \in \partial \varphi(u_1)$ and $y_2 \in \partial \varphi(u_2)$ one has
\[ (y_1 - y_2)^\top (u_1 - u_2) \ge 0, \]
we see that gradients of convex functions are a generalization to the multivariate case of monotone univariate maps. It is therefore natural, in view of (5.2), to define the vector quantile of $Y$ as:

Definition 5.1.
The vector quantile of $Y$ is the Brenier map between $\mu_d$ and $\mathcal{L}(Y)$.

Now, it is worth noting that the Brenier map (and the uniformly distributed random vector $U$ in (5.2)) are not abstract objects: they have a variational characterization related to optimal transport. Consider indeed
\[ \sup \{ \mathbb{E}(V^\top Y) : V \sim \mu_d \} \tag{5.4} \]
and its dual
\[ \inf_{f,g} \Big\{ \int_{[0,1]^d} f \, d\mu_d + \mathbb{E}(g(Y)) \ : \ f(u) + g(y) \ge u^\top y, \ \forall (u,y) \in [0,1]^d \times \mathbb{R}^d \Big\}; \tag{5.5} \]
then $U$ in (5.2) is the unique solution of (5.4), and any solution $(f,g)$ of the dual (5.5) satisfies $\nabla f = \nabla \varphi$ $\mu_d$-a.e.

5.2. Conditional vector quantiles.
Assume now, as in paragraph 2.2, that we are also given a random vector $X \in \mathbb{R}^N$. As in (2.4), we denote by $\nu$ the law of $(X,Y)$, by $m$ the law of $X$, and by $\nu(\cdot \mid x)$ the conditional law of $Y$ given $X = x$ (the only difference with (2.4) being that $Y$ is $\mathbb{R}^d$-valued). Conditional vector quantiles are then defined as:

Definition 5.2. For $m = \mathcal{L}(X)$-a.e. $x \in \mathbb{R}^N$, the vector conditional quantile of $Y$ given $X = x$ is the Brenier map between $\mu_d := \mathcal{U}([0,1]^d)$ and $\nu(\cdot \mid x) := \mathcal{L}(Y \mid X = x)$. We denote this well-defined map as $\nabla \varphi_x$, where $\varphi_x$ is a convex function on $[0,1]^d$.

(Recall also that, in the case where $\mathbb{E}(\|Y\|^2) < +\infty$, (5.4) is equivalent to minimizing $\mathbb{E}(\|V - Y\|^2)$ among uniformly distributed $V$'s.)

If both $\varphi_x$ and its Legendre transform
\[ \varphi_x^*(y) := \sup_{u \in [0,1]^d} \{ u^\top y - \varphi_x(u) \} \]
are differentiable, one can define the random vector
\[ U := \nabla \varphi_X^*(Y), \]
which is equivalent to
\[ Y = \nabla \varphi_X(U). \tag{5.6} \]
One can check, exactly as in the proof of Proposition 2.1 for the univariate case, that if $Y$ is integrable then $U \sim \mu_d$, $U \perp\!\!\!\perp X$, and $U$ solves
\[ \max \{ \mathbb{E}(V^\top Y) : V \sim \mu_d, \ V \perp\!\!\!\perp X \}. \tag{5.7} \]

5.3. Vector quantile regression.
When one assumes that the convex function $\varphi_x$ is affine with respect to the explanatory variables $x$ (specification):
\[ \varphi_x(u) = \varphi(u) + b(u)^\top x, \]
with $\varphi : [0,1]^d \to \mathbb{R}$ and $b : [0,1]^d \to \mathbb{R}^N$ smooth, the conditional quantile is itself affine and the relation (5.6) takes the form
\[ Y = \nabla \varphi_X(U) = \alpha(U) + \beta(U) X, \quad \text{for } \alpha = \nabla \varphi, \ \beta := Db^\top. \tag{5.8} \]
(A deep regularity theory, initiated by Caffarelli (1992) in the 1990s, gives conditions on $\nu(\cdot \mid x)$ under which the optimal transport map is in fact smooth and/or invertible; we refer the interested reader to the textbook of Figalli (2017) for a detailed and recent account of this regularity theory.)

This affine form moreover implies that $U$ maximizes the correlation with $Y$ not only among uniformly distributed random vectors independent from $X$, but in the larger class of uniformly distributed random vectors for which
\[ \mathbb{E}(X \mid U) = \mathbb{E}(X) = 0 \]
(here we assume that both $X$ and $Y$ are integrable, and that $X$ is centered). This is the reason why the study of
\[ \max \{ \mathbb{E}(V^\top Y) : V \sim \mu_d, \ \mathbb{E}(X \mid V) = 0 \} \tag{5.9} \]
is the main tool in the approach of Carlier et al. (2016, 2017) to vector quantile regression. Let us now briefly summarize the main findings of these two papers.
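Before turning to the linear-programming reformulation, the unconstrained problem (5.4) can be checked numerically in dimension $d = 1$: on a discrete grid, the correlation-maximizing coupling between a uniform grid and a sample is the monotone (quantile) coupling. A minimal sketch using scipy's LP solver (the grid size and the sample are illustrative choices, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

n = 6
u = (np.arange(n) + 0.5) / n                 # grid approximating U([0,1])
rng = np.random.default_rng(0)
y = rng.normal(size=n)                       # atoms of the law of Y

# maximize <pi, u y^T> over couplings with uniform marginals 1/n
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0         # row marginal of pi
    A_eq[n + i, i::n] = 1.0                  # column marginal of pi
b_eq = np.full(2 * n, 1.0 / n)
c = -np.outer(u, y).ravel()                  # linprog minimizes, so negate
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
pi = res.x.reshape(n, n)
match = pi.argmax(axis=1)                    # support of the optimal coupling
print(y[match])                              # nondecreasing in u: the quantile map
```

Since $u$ is sorted increasing, the optimal coupling pairs the $i$-th grid point with the $i$-th smallest atom of $Y$, i.e., `y[match]` coincides with `np.sort(y)`, which is the discrete counterpart of the monotone rearrangement.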
First observe that (5.9) can be recast as a linear program by setting $\pi := \mathcal{L}(U,X,Y)$ and observing that $U$ solves (5.9) if and only if $\pi$ solves
\[ \max_{\pi \in \mathrm{MI}(\mu_d, \nu)} \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} u^\top y \, d\pi(u,x,y), \tag{5.10} \]
where $\mathrm{MI}(\mu_d, \nu)$ is the set of probability measures which satisfy the linear constraints:
• the first marginal of $\pi$ is $\mu_d$, i.e., for every $\varphi \in C([0,1]^d, \mathbb{R})$:
\[ \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} \varphi(u)\, d\pi(u,x,y) = \int_{[0,1]^d} \varphi(u)\, d\mu_d(u); \]
• the second marginal of $\pi$ is $\nu$, i.e., for every $\psi \in C_b(\mathbb{R}^N \times \mathbb{R}^d, \mathbb{R})$:
\[ \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} \psi(x,y)\, d\pi(u,x,y) = \int_{\mathbb{R}^N \times \mathbb{R}^d} \psi(x,y)\, d\nu(x,y) = \mathbb{E}(\psi(X,Y)); \]
• the conditional expectation of $x$ given $u$ is $0$, i.e., for every $b \in C([0,1]^d, \mathbb{R}^N)$:
\[ \int_{[0,1]^d \times \mathbb{R}^N \times \mathbb{R}^d} b(u)^\top x \, d\pi(u,x,y) = 0. \]
The dual of the linear program (5.10) then reads
\[ \inf_{(\varphi, \psi, b)} \int_{[0,1]^d} \varphi \, d\mu_d + \int_{\mathbb{R}^N \times \mathbb{R}^d} \psi(x,y)\, d\nu(x,y) \tag{5.11} \]
subject to the pointwise constraint
\[ \varphi(u) + b(u)^\top x + \psi(x,y) \ge u^\top y; \]
given $b$ and $\varphi$, the lowest $\psi$ fitting this constraint is the (convex in $y$) function
\[ \psi(x,y) := \sup_{u \in [0,1]^d} \{ u^\top y - \varphi(u) - b(u)^\top x \}. \]
The existence of a solution $(\psi, \varphi, b)$ to (5.11) is established in Carlier et al. (2016) (under some assumptions on $\nu$), and optimality of $U$ in (5.9) is characterized by the pointwise complementary slackness condition
\[ \varphi(U) + b(U)^\top X + \psi(X,Y) = U^\top Y \quad \text{almost surely}. \]
If $\varphi$ and $b$ were smooth, we could deduce from the latter that
\[ Y = \nabla \varphi(U) + Db(U)^\top X = \nabla \varphi_X(U), \quad \text{for } \varphi_x(u) := \varphi(u) + b(u)^\top x, \]
which is exactly (5.8). So specification of vector quantile regression is essentially the same as assuming this smoothness and the convexity of $u \mapsto \varphi_x(u) := \varphi(u) + b(u)^\top x$.
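A small discrete instance of the linear program (5.10) can be solved directly with an LP solver; the sketch below (toy sizes and synthetic data, scipy's HiGHS backend) imposes the two marginal constraints and the mean-independence constraint, with $X$ centered so that the right-hand side of the latter is zero, and then inspects the optimizer:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
I, Jn = 4, 9
u = (np.arange(I) + 0.5) / I                 # d = 1 grid, weights mu_i = 1/I
x = rng.normal(size=Jn)
x -= x.mean()                                # center X so that E(X) = 0
y = 0.5 * x + rng.normal(size=Jn)
mu, nu = np.full(I, 1.0 / I), np.full(Jn, 1.0 / Jn)

nv = I * Jn                                  # pi flattened row-major, pi[i, j]
A_eq, b_eq = [], []
for i in range(I):                           # sum_j pi_ij = mu_i
    r = np.zeros(nv); r[i * Jn:(i + 1) * Jn] = 1.0
    A_eq.append(r); b_eq.append(mu[i])
for j in range(Jn):                          # sum_i pi_ij = nu_j
    r = np.zeros(nv); r[j::Jn] = 1.0
    A_eq.append(r); b_eq.append(nu[j])
for i in range(I):                           # mean independence: sum_j pi_ij x_j = 0
    r = np.zeros(nv); r[i * Jn:(i + 1) * Jn] = x
    A_eq.append(r); b_eq.append(0.0)
c = -np.outer(u, y).ravel()                  # maximize sum_ij u_i y_j pi_ij
res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
pi = res.x.reshape(I, Jn)
```

The program is feasible (the product coupling $\pi_{ij} = \mu_i \nu_j$ satisfies all three constraint families), and the solver's output can be checked against the constraints directly, e.g. `pi.sum(axis=1)` against `mu` and `pi @ x` against zero.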
In general, these properties cannot be taken for granted, and what can be deduced from complementary slackness is given by the weaker relations
\[ \varphi_X(U) = \varphi_X^{**}(U), \quad Y \in \partial \varphi_X^{**}(U) \quad \text{almost surely}, \]
where $\varphi_x^{**}$ is the convex envelope of $\varphi_x$ (i.e., the largest convex function below $\varphi_x$); we refer the reader to Carlier et al. (2017) for details.

6. Discretization, regularization, numerical minimization
6.1. Discrete optimal transport with a mean independence constraint. We now turn to a discrete setting for implementation purposes, and consider data $(X_j, Y_j)_{j=1,\dots,J}$ distributed according to the empirical measure $\nu = \sum_{j=1}^J \nu_j \delta_{(x_j, y_j)}$, and a $[0,1]^d$-uniform sample $(U_i)_{i=1,\dots,I}$ with empirical measure $\mu = \sum_{i=1}^I \mu_i \delta_{u_i}$. In this setting, the vector quantile regression primal (5.10) writes
\[ \max_{\pi \in \mathbb{R}_+^{I \times J}} \ \sum_{i=1}^I \sum_{j=1}^J u_i^\top y_j \, \pi_{ij} \]
subject to the marginal constraints $\forall j, \ \sum_i \pi_{ij} = \nu_j$ and $\forall i, \ \sum_j \pi_{ij} = \mu_i$, and the mean-independence constraint between $X$ and $U$: $\forall i, \ \sum_j x_j \pi_{ij} = 0$. Its dual formulation (5.11) reads
\[ \inf_{(\varphi_i)_i, (\psi_j)_j, (b_i)_i} \ \sum_{j=1}^J \psi_j \nu_j + \sum_{i=1}^I \varphi_i \mu_i \]
subject to the constraint $\forall i,j, \ \varphi_i + b_i^\top x_j + \psi_j \ge u_i^\top y_j$.

6.2. The Regularized Vector Quantile Regression (RVQR) problem.
Using the optimality condition $\varphi_i = \max_j \{ u_i^\top y_j - b_i^\top x_j - \psi_j \}$, we obtain the unconstrained formulation
\[ \inf_{(\psi_j)_j, (b_i)_i} \ \sum_j \psi_j \nu_j + \sum_i \mu_i \Big( \max_j \{ u_i^\top y_j - b_i^\top x_j - \psi_j \} \Big). \]
Replacing the maximum with its smoothed version, given a small regularization parameter $\varepsilon$, yields the following smooth convex minimization problem (see Cuturi and Peyré (2016) for more details in connection with entropic regularization of optimal transport), which we call the Regularized Vector Quantile Regression (RVQR) problem:
\[ \inf_{(\psi_j)_j, (b_i)_i} \ J(\psi, b) := \sum_j \psi_j \nu_j + \varepsilon \sum_i \mu_i \log \sum_j \exp \Big( \frac{u_i^\top y_j - b_i^\top x_j - \psi_j}{\varepsilon} \Big). \tag{6.1} \]
We then have the following duality result:

Theorem 6.1.
The RVQR problem
\[ \max_{\pi_{ij} \ge 0} \ \sum_{ij} \pi_{ij} \big( u_i^\top y_j \big) - \varepsilon \sum_{ij} \pi_{ij} \log \pi_{ij} \]
subject to
\[ \sum_j \pi_{ij} = \mu_i, \qquad \sum_i \pi_{ij} = \nu_j, \qquad \sum_j \pi_{ij}\, x_j = \mu_i \sum_j \nu_j\, x_j, \]
has dual (6.1), or equivalently
\[ \min_{(\varphi_i)_i, (\psi_j)_j, (b_i)_i} \ \sum_i \mu_i \varphi_i + \sum_j \psi_j \nu_j + \varepsilon \sum_{ij} \exp \Big( \frac{u_i^\top y_j - \varphi_i - b_i^\top x_j - \psi_j}{\varepsilon} \Big). \]
(Recall that the softmax with regularization parameter $\varepsilon > 0$ of $(\alpha_1, \dots, \alpha_J)$ is given by $\mathrm{Softmax}_\varepsilon(\alpha_1, \dots, \alpha_J) := \varepsilon \log \big( \sum_{j=1}^J e^{\alpha_j / \varepsilon} \big)$. The duality can be proved either by using the Fenchel-Rockafellar duality theorem or by hand: in the primal, there are only finitely many linear constraints, the nonnegativity constraints are not binding because of the entropy, and the existence of Lagrange multipliers for the equality constraints is then straightforward.)

Note that the objective $J$ in (6.1) remains invariant under the two transformations:
• $(b_i, \psi_j) \leftarrow (b_i + c, \psi_j - c^\top x_j)$, where $c \in \mathbb{R}^N$ is a constant translation vector;
• $\psi \leftarrow \psi + \lambda$, where $\lambda \in \mathbb{R}$ is a constant.
These two invariances enable us to normalize $b$ and, for instance, to choose $\lambda$ in such a way that $\sum_{i,j} \exp \big( \frac{u_i^\top y_j - b_i^\top x_j - \psi_j}{\varepsilon} \big) = 1$.

Remark.
This formulation is amenable to stochastic optimization techniques when the number of $(X,Y)$ observations is very large. Stochastic optimization with respect to $\psi$ can be performed using the stochastic averaged gradient algorithm (see Genevay et al. (2016)), considering the equivalent objective
\[ \inf_{\psi, \varphi, b} \ \sum_j h_\varepsilon(x_j, y_j, \psi, \varphi, b)\, \nu_j \]
with
\[ h_\varepsilon(x_j, y_j, \psi, \varphi, b) = \psi_j + \sum_i \mu_i \varphi_i + \varepsilon \sum_i \exp \Big( \frac{u_i^\top y_j - b_i^\top x_j - \psi_j - \varphi_i}{\varepsilon} \Big). \]
Such techniques are not needed to compute $b$, since the number of $U$ samples (i.e., the size of $b$) is set by the user.

6.3. Gradient descent.
As already noted, the objective $J$ in (6.1) is convex and smooth (it is even strictly convex once we have chosen normalizations which take into account the two invariances of $J$ explained above). Its gradient has the explicit form
\[ \frac{\partial J}{\partial \psi_j} = \nu_j - \sum_{i=1}^I \mu_i \frac{e^{\theta_{ij}}}{\sum_{k=1}^J e^{\theta_{ik}}}, \quad \text{where } \theta_{ij} = \frac{1}{\varepsilon} \big( u_i^\top y_j - b_i^\top x_j - \psi_j \big), \tag{6.2} \]
and
\[ \frac{\partial J}{\partial b_i} = -\mu_i \frac{\sum_{k=1}^J x_k\, e^{\theta_{ik}}}{\sum_{k=1}^J e^{\theta_{ik}}}. \tag{6.3} \]
To solve (6.1) numerically, we can therefore use a gradient descent method. An efficient way to do so is to use Nesterov's accelerated gradient algorithm; see Nesterov (1983) and Beck and Teboulle (2009). Note that if $(\psi, b)$ solves (6.1), the fact that the partial derivatives in (6.2)-(6.3) vanish implies that the coupling
\[ \alpha^\varepsilon_{ij} := \mu_i \frac{e^{\theta_{ij}}}{\sum_{k=1}^J e^{\theta_{ik}}} \]
satisfies the fixed-marginals and mean-independence constraints of the primal problem. Since the index $j$ corresponds to observations, it is convenient to introduce, for every $x \in \mathcal{X} := \{x_1, \dots, x_J\}$ and $y \in \mathcal{Y} := \{y_1, \dots, y_J\}$, the probability
\[ \pi^\varepsilon(x, y, u_i) := \sum_{j \,:\, x_j = x,\ y_j = y} \alpha^\varepsilon_{ij}. \]

7. Results
7.1. Quantiles computation. The discrete probability $\pi^\varepsilon$ is an approximation (because of the regularization $\varepsilon$) of $\mathcal{L}(U, X, Y)$, where $U$ solves (5.9). The corresponding approximate quantile $Q^\varepsilon_X(U)$ is given by $\mathbb{E}_{\pi^\varepsilon}[Y \mid X, U]$. In the above discrete setting, this yields
\[ Q^\varepsilon_x(u_i) := \mathbb{E}_{\pi^\varepsilon}[Y \mid X = x, U = u_i] = \sum_{y \in \mathcal{Y}} y\, \frac{\pi^\varepsilon(x, y, u_i)}{\sum_{y' \in \mathcal{Y}} \pi^\varepsilon(x, y', u_i)}. \]

Remark.
To estimate the conditional distribution of $Y$ given $U = u$ and $X = x$, we can use kernel methods. In the experiments, we compute approximate quantiles as means over neighborhoods of $X$ values, to make up for the lack of replicates. This amounts to considering $\mathbb{E}_{\pi^\varepsilon}[Y \mid X \in B_\eta(x), U = u_i]$, where $B_\eta(x)$ is a Euclidean ball of radius $\eta$ centered at $x$.

7.2. Empirical illustrations.
We demonstrate the use of this approach on a series of health-related experiments. We use the "ANSUR II" dataset (Anthropometric Survey of US Army Personnel), which can be found online. This dataset is one of the most comprehensive publicly available datasets on body size and shape, containing 93 measurements for 4,082 male adult US military personnel. It allows us to easily build multivariate dependent variables.

7.3. One-dimensional VQR.
We start with one-dimensional dependent variables ($d = 1$), namely Weight ($Y_1$) and Thigh circumference ($Y_2$), explained by $X = (1, \text{Height})$, to allow for comparison with the classical quantile regression of Koenker and Bassett (1978). Figure 1 displays the results of our method compared to the classical approach, for different height quantiles (10%, 30%, 60%, 90%). Figure 1 is computed with a "soft" potential $\varphi$, while Table 1 depicts the difference with its "hard" counterpart (see the beginning of Section 6.2). Figure 2 and Table 2 detail the impact of the regularization strength on these quantiles.
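The computations behind these experiments can be sketched end to end: minimize the RVQR objective (6.1) by gradient descent using the gradients (6.2)-(6.3), then inspect the resulting coupling. The snippet below is a self-contained sketch on synthetic data; the sizes, the data-generating process, the step size, and the iteration count are illustrative choices, not the paper's experimental settings (which use the ANSUR II sample and Nesterov acceleration):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
I, Jn, eps = 8, 20, 1.0
u = ((np.arange(I) + 0.5) / I)[:, None]           # quantile grid (d = 1), mu_i = 1/I
x = rng.uniform(-1.0, 1.0, size=(Jn, 1))
x -= x.mean(axis=0)                               # center X so that E(X) = 0
y = 0.5 + 0.5 * x + 0.1 * rng.normal(size=(Jn, 1))
mu, nu = np.full(I, 1.0 / I), np.full(Jn, 1.0 / Jn)

def coupling(psi, b):
    # alpha_ij = mu_i exp(theta_ij) / sum_k exp(theta_ik), theta as in (6.2)
    theta = (u @ y.T - b @ x.T - psi[None, :]) / eps
    return mu[:, None] * np.exp(theta - logsumexp(theta, axis=1, keepdims=True))

psi, b = np.zeros(Jn), np.zeros((I, 1))
lr = 0.2                                          # plain gradient descent
for _ in range(20000):
    alpha = coupling(psi, b)
    psi -= lr * (nu - alpha.sum(axis=0))          # gradient (6.2)
    b -= lr * (-(alpha @ x))                      # gradient (6.3)

alpha = coupling(psi, b)
print(np.abs(alpha.sum(axis=0) - nu).max())       # second marginal close to nu
print(np.abs(alpha @ x).max())                    # mean independence (X centered)
```

At a minimizer both printed residuals vanish, reflecting the fact that the stationarity of (6.2)-(6.3) is exactly the fixed-marginals and mean-independence constraints of the primal; by construction, `alpha.sum(axis=1)` equals `mu` at every iteration.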
Figure 1.
Comparison between one-dimensional VQR (regularized dual, in dashed red, computed with a "soft" $\varphi$) and the classical approach (green), with (i) $Y$ = Weight (left) or (ii) $Y$ = Thigh circumference, and $X = (1, \text{Height})$. Quantiles are plotted for different height quantiles (10%, 30%, 60%, 90%). The regularization strength is $\varepsilon = 0.1$. Chosen grid size is $n = 20$.

[Table 1 entries: $\|Q_{\mathrm{soft}} - Q_{\mathrm{hard}}\| / \|Q_{\mathrm{soft}}\|$ at height quantiles $X$ = 10%, 30%, 60%, 90%, across regularization strengths $\varepsilon$]

Table 1.
Relative error between one-dimensional VQR with a "soft" computation of $\varphi$ and its "hard" counterpart, with $Y$ = Weight and $X = (1, \text{Height})$, for different height quantiles (10%, 30%, 60%, 90%), depending on the regularization strength $\varepsilon$. Chosen grid size is $n = 20$.

7.4. Multi-dimensional VQR.
In contrast, multivariate quantile regression explains the joint dependence of $Y = (Y_1, Y_2)$ on $X = (1, \text{Height})$. Figures 4 and 5 (each corresponding to an explained component, either $Y_1$ or $Y_2$) depict how smoothing operates in higher dimension for different Height quantiles (10%, 50% and 90%), compared to a previous

[Figure 2 panels: increasing regularization strengths, up to $\varepsilon = 1$]

Figure 2.
Regularized one-dimensional VQR, dual (dashed red), compared to classical QR (green), with $Y$ = Weight regressed on $X = (1, \text{Height})$, for varying regularization strengths $\varepsilon$. Quantiles are plotted for different height quantiles (10%, 30%, 60%, 90%). Chosen grid size is $n = 20$.

[Table 2 entries: $\|Q_{\mathrm{QR}} - Q_{\mathrm{VQR}}\| / \|Q_{\mathrm{QR}}\|$ at height quantiles $X$ = 10%, 30%, 60%, 90%, across regularization strengths $\varepsilon$]

Table 2.
Relative error between one-dimensional VQR and the classical QR approach, with $Y$ = Weight and $X = (1, \text{Height})$, for different height quantiles (10%, 30%, 60%, 90%), depending on the regularization strength $\varepsilon$. Chosen grid size is $n = 20$.

unregularized approach of Carlier et al. (2016). Figure 3 details computational times in 2D, using an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz.
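The quantile curves and surfaces shown in these figures are read off the fitted coupling via the conditional-mean formula of Section 7.1. A toy sketch of that extraction step (the coupling below is a random array standing in for a fitted $\pi^\varepsilon$, and $X$ is discrete so that exact replicates exist; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
I, Jn = 6, 80
u = (np.arange(I) + 0.5) / I                 # quantile grid
x = rng.choice([0.0, 1.0], size=Jn)          # discrete X with replicates
y = x + 0.05 * rng.normal(size=Jn)           # Y shifts up with X
alpha = rng.random((I, Jn))
alpha /= alpha.sum()                         # toy stand-in for pi^eps

def Q_eps(xq):
    # Q^eps_x(u_i) = sum_y y pi^eps(x, y, u_i) / sum_y' pi^eps(x, y', u_i)
    m = (x == xq)
    w = alpha[:, m]                          # weights pi^eps(xq, y_j, u_i)
    return (w * y[m]).sum(axis=1) / w.sum(axis=1)

print(Q_eps(0.0))                            # entries near 0 for every u_i
print(Q_eps(1.0))                            # entries near 1 for every u_i
```

With a fitted coupling in place of the random `alpha`, `Q_eps(x)[i]` is the approximate conditional vector quantile $Q^\varepsilon_x(u_i)$; the ball-averaged variant of Section 7.1 replaces the mask `x == xq` by membership in $B_\eta(x)$.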
Figure 3.
Comparison of computational times between the unregularized case (using Gurobi's barrier logging) and the regularized case, for a varying number of predictors in 2D. In the latter case, the reported time is the time needed to reach a fixed error threshold (in $\|\cdot\|$) between two iterates of the transport plan, for $\varepsilon = 0.1$. Chosen grid size is $n = 10$ (per axis).

[Figure 4 panels: rows for Small, Medium, and Tall height; columns Unregularized and Regularized dual]
Figure 4.
Two-dimensional regularized quantile regression of $Y$ = (Weight, Thigh) explained by $X = (1, \text{Height})$. Quantiles of $Y_1$ = Weight are plotted for different height quantiles: 10% (bottom), 50% (middle) and 90% (top). Chosen grid size is $n = 10$ (per axis) and the regularization strength is $\varepsilon = 0.1$.

[Figure 5 panels: rows for Small, Medium, and Tall height; columns Unregularized and Regularized dual]
Figure 5.
Two-dimensional regularized quantile regression of $Y$ = (Weight, Thigh) explained by $X = (1, \text{Height})$. Quantiles of $Y_2$ = Thigh are plotted for different height quantiles: 10% (bottom), 50% (middle) and 90% (top). Chosen grid size is $n = 10$ (per axis) and the regularization strength is $\varepsilon = 0.1$.

Appendix
Proof of Lemma 4.3.
Since $\mathbf{1}_{[0,t]} \in \mathcal{C}$, one obviously first has
\[ \sup_{v \in \mathcal{C}} \int_0^1 v(s)\, q(s)\, ds \ \ge\ \max_{t \in [0,1]} \int_0^t q(s)\, ds \ =\ \max_{t \in [0,1]} Q(t). \]
Let us now prove the converse inequality, taking an arbitrary $v \in \mathcal{C}$. We first observe that $Q$ is absolutely continuous and that $v$ is of bounded variation, its derivative in the sense of distributions being a bounded nonpositive measure which we denote by $\eta$. Integrating by parts and using the definition of $\mathcal{C}$ then give:
\[ \int_0^1 v(s)\, q(s)\, ds = -\int_0^1 Q\, d\eta + v(1^-)\, Q(1) \ \le\ \Big( \max_{[0,1]} Q \Big) \big( -\eta([0,1]) \big) + v(1^-)\, Q(1) \]
\[ = \Big( \max_{[0,1]} Q \Big) \big( v(0^+) - v(1^-) \big) + v(1^-)\, Q(1) = \Big( \max_{[0,1]} Q \Big)\, v(0^+) + \Big( Q(1) - \max_{[0,1]} Q \Big)\, v(1^-) \ \le\ \max_{[0,1]} Q. \]

References

[1] Beck, A., and Teboulle, M. (2009). "A fast iterative shrinkage-thresholding algorithm for linear inverse problems." SIAM J. Imaging Sci., 2(1), pp. 183-202.
[2] Brenier, Y. (1991). "Polar factorization and monotone rearrangement of vector-valued functions." Comm. Pure Appl. Math., 44(4), pp. 375-417.
[3] Caffarelli, L. A. (1992). "The regularity of mappings with a convex potential." J. Amer. Math. Soc., 5(1), pp. 99-104.
[4] Carlier, G., Chernozhukov, V. and Galichon, A. (2016). "Vector quantile regression: an optimal transport approach." Ann. Statist., 44(3), pp. 1165-1192.
[5] Carlier, G., Chernozhukov, V. and Galichon, A. (2017). "Vector quantile regression beyond the specified case." J. Multivariate Anal., 161, pp. 96-102.
[6] Cuturi, M. and Peyré, G. (2016). "A smoothed dual approach for variational Wasserstein problems." SIAM J. Imaging Sci., 9(1), pp. 320-343.
[7] Figalli, A. (2017). The Monge-Ampère equation and its applications. Zurich Lectures in Advanced Mathematics. European Mathematical Society (EMS), Zürich.
[8] Genevay, A., Cuturi, M., Peyré, G., and Bach, F. (2016). "Stochastic optimization for large-scale optimal transport." In Advances in Neural Information Processing Systems, pp. 3440-3448.
[9] Koenker, R. and Bassett, Jr., G. (1978). "Regression quantiles." Econometrica, 46(1), pp. 33-50.
[10] McCann, R. J. (1995). "Existence and uniqueness of monotone measure-preserving maps." Duke Math. J., 80(2), pp. 309-323.
[11] Nesterov, Y. (1983). "A method of solving a convex programming problem with convergence rate O(1/k^2)." Dokl. Akad. Nauk SSSR, 269(3), pp. 543-547.