Gaussian Transforms Modeling and the Estimation of Distributional Regression Functions
GGAUSSIAN TRANSFORMS MODELING AND THE ESTIMATIONOF DISTRIBUTIONAL REGRESSION FUNCTIONS
RICHARD H. SPADY † AND SAMI STOULI § Abstract.
Conditional distribution functions are important statistical objects forthe analysis of a wide class of problems in econometrics and statistics. We proposeflexible Gaussian representations for conditional distribution functions and give aconcave likelihood formulation for their global estimation. We obtain solutions thatsatisfy the monotonicity property of conditional distribution functions, includingunder general misspecification and in finite samples. A Lasso-type penalized versionof the corresponding maximum likelihood estimator is given that expands the scopeof our estimation analysis to models with sparsity. Inference and estimation resultsfor conditional distribution, quantile and density functions implied by our represen-tations are provided and illustrated with an empirical example and simulations. Introduction
The modeling and estimation of conditional distribution functions are important forthe analysis of various econometric and statistical problems. Conditional distributionfunctions are core building blocks in the identification and estimation of nonseparablemodels with endogeneity (e.g., Imbens and Newey, 2009; Chernozhukov, Fernandez-Val, Newey, Stouli, and Vella, 2020), in counterfactual distributional analysis (e.g.,DiNardo, Fortin, and Lemieux, 1996; Chernozhukov, Fernandez-Val, and Melly, 2013),or in the construction of prediction intervals for a stationary time series (e.g., Hall,Wolff, and Yao, 1999; Chernozhukov, Wutrich, and Zhu, 2019). Conditional distribu-tion functions are also a fruitful starting point for the formulation of flexible estimationmethods for other objects of interest (Spady and Stouli, 2018a), such as conditionalquantile functions (CQF).
Date : November 13, 2020. † Nuffield College, Oxford, and Department of Economics, Johns Hopkins University,[email protected]. § Department of Economics, University of Bristol, [email protected] are grateful to Whitney Newey for his encouragements and useful comments, and to seminarparticipants at Bristol, UC San Diego, Oxford, Lehigh, LSE, Johns Hopkins University and theEconometric Society World Congress 2020. We thank Xiaoran Liang for excellent research assistance. a r X i v : . [ ec on . E M ] N ov or a continuous outcome variable Y and a vector of covariates X , three main diffi-culties arise in the formulation of a flexible model and in the choice of a loss functionfor the estimation of the conditional distribution and quantile functions of Y given X . A first main difficulty is the specification of a model that allows for the shape ofthe distribution of Y to vary across values of X , while being characterized by a lossfunction that preserves monotonicity in Y at each value of a potentially large num-ber of explanatory variables X in estimation. Because a valid maximum likelihood(ML) characterization would require this monotonicity property to hold, a second andrelated difficulty is the formulation of a loss function that characterizes an approxi-mate model with a clear information-theoretic interpretation under misspecification.A third difficulty is that nonconcave likelihoods naturally arise in the context of non-separable models, even in the simplest case of a Gaussian location-scale specification. One approach is to discard the monotonicity requirement in estimation and use lossfunctions that characterize quantile or distribution functions pointwise, while specify-ing a functional form that allows for the shape of the distribution of Y to vary acrossvalues of X . Quantile regression (Koenker and Basset, 1978) specifies each CQF as alinear combination of the components of X . The CQF is then estimated at each quan-tile by a sequence of linear programming problems. Distribution regression (Foresi andPerrachi, 1995; Chernozhukov, Fernandez-Val, and Melly, 2013) specifies each level ofthe cumulative distribution function (CDF) of Y conditional on X as a known CDFtransformation of a linear combination of the components of X . The conditional CDFis then estimated at each Y value by a sequence of binary outcome ML estimators.Another approach is to insist on the monotonicity requirement and use loss functionsthat characterize both quantile and distribution functions globally, but do not havean ML interpretation. Dual regression (Spady and Stouli, 2018a) specifies monotonerepresentations for Y given X as linear combinations of known functions of X and astochastic element. The conditional CDF is then estimated globally by the empiricaldistribution of the estimated sample values of the stochastic element.In this paper we take a different approach by formulating Gaussian representationsfor conditional CDFs, instead of modeling conditional CDFs or CQFs directly. Theserepresentations are specified as linear combinations of known functions of X and Y ,and the implied distributional regression models allow for the shape of the distributionof Y to vary across values of X . We give a concave likelihood characterization that Cf. Owen (2007) and Spady and Stouli (2018b) for a discussion in the context of simultaneousestimation of location and scale parameters in a linear regression model. ules out nonmonotone solutions. 
Under general misspecification, this formulation alsocharacterizes quasi-Gaussian representations that satisfy the monotonicity propertyof conditional CDFs by construction. The corresponding distributional models areoptimal approximations to the true data probability distribution according to theKullback-Leibler Information Criterion (KLIC) (White, 1982).For estimation we derive the properties of the corresponding ML estimator and extendour analysis to a two-step penalized ML estimation strategy, where the unpenalizedestimator is used as a first step for an adaptive Lasso (Zou, 2006) ML estimator whichpreserves the concavity of the objective function. We derive asymptotic properties ofthe corresponding estimators for conditional distribution, quantile and density func-tions. The penalized estimator is selection consistent, asymptotically normal andoracle, where the selection is based on the pseudo-true values of the parameter esti-mators. Under correct specification the estimator is also efficient. We also give thedual formulation of our estimators that we use for implementation.This paper makes five main contributions to the existing literature. First, we introducea new class of Gaussian representations in linear form for the flexible estimation ofdistributional regression models. Second, we demonstrate that our models and the cor-responding loss function characterize globally monotone conditional CDFs and CQFsunder general misspecification, both in finite samples with probability approachingone and in the population. Quantile and distribution regression can result in bothfinite sample estimates and population approximations under misspecification that donot satisfy the monotonicity property of conditional quantile and distribution func-tions. Third, we establish that the resulting approximations are KLIC optimal undergeneral misspecification. Compared to dual regression, we find that the monotonicityproperty can be obtained jointly with the KLIC optimality property, and we establishexistence and uniqueness of solutions under general misspecification. Fourth, we useduality theory to show that our formulation has considerable computational advan-tages. Compared to dual regression, we find in particular that the dual ML problemhas the important advantage of being a convex programming problem (Boyd andVandenberghe, 2004) with linear constraints. Fifth, our estimation analysis allows forsparsity, thereby giving an asymptotically valid characterization of sparse, globallymonotone, and KLIC optimal representations for conditional CDFs and CQFs. Cf. Chernozhukov, Fernandez-Val, and Galichon (2010) for a discussion in the context of quantileregression. n Section 2 we introduce Gaussian transforms modeling. In Section 3 we give resultsunder misspecification. Section 4 contains estimation and inference results, and du-ality theory is derived in Section 5. Section 6 illustrates our methods, and Section7 concludes. The proofs of all results are given in the Appendix. The online Ap-pendix Spady and Stouli (2020) contains supplemental material, including results ofnumerical simulations calibrated to the empirical illustration.2. Gaussian Transforms Modeling
Let Y be a continuous outcome variable and X a vector of explanatory variables. Atransformation to Gaussianity of the conditional CDF F Y | X ( Y | X ) of Y given X occursby application of the Gaussian quantile function Φ − ,(2.1) e = Φ − ( F Y | X ( Y | X )) ≡ g ( Y, X ) , where the resulting Gaussian transform (GT) e is a zero mean and unit varianceGaussian random variable and is independent from X , by construction. With y (cid:55)→ F Y | X ( y | X ) strictly increasing, the corresponding map y (cid:55)→ g ( y, X ) is also strictlyincreasing, with well-defined inverse denoted e (cid:55)→ h ( X, e ).Important statistical objects such as the conditional distribution, quantile and den-sity functions of Y given X can be expressed as known functionals of g ( Y, X ). Theconditional CDF of Y given X can be expressed as F Y | X ( Y | X ) = Φ( g ( Y, X )) , the CQF of Y given X as Q Y | X ( u | X ) = h ( X, Φ − ( u )) , u ∈ (0 , , and the conditional probability density function (PDF) of Y given X as f Y | X ( Y | X ) = φ ( g ( Y, X )) { ∂ y g ( Y, X ) } , ∂ y g ( Y, X ) ≡ ∂g ( Y, X ) ∂y , where e (cid:55)→ φ ( e ) is the Gaussian PDF and we denote partial derivatives as ∂ y g ( y, x ) ≡ ∂g ( y, x ) /∂y . The GT g ( Y, X ) thus constitutes a natural modeling target in the contextof distributional regression models for F Y | X ( Y | X ), Q Y | X ( u | X ), and f Y | X ( Y | X ). Werefer to these objects as the ‘distributional regression functions’.In this paper we consider the class of conditional CDFs with Gaussian representa-tion e = g ( Y, X ) in linear form, where g ( Y, X ) is specified as a linear combination of nown transformations of Y and X . The implied models for the distributional regres-sion functions are flexible, parsimonious, and able to capture complex features of theentire statistical relationship between Y and X . In particular, these models allow fornonlinearity and nonseparability of this relationship.2.1. Gaussian representations in linear form.
Let W ( X ) be a K × X and S ( Y ) a J × Y . Assume that W ( X ) includes an intercept, i.e., has first component 1, and that S ( Y ) has first twocomponents (1 , Y ) (cid:48) and derivative dS ( Y ) /dy = s ( Y ), a vector of functions continuouson R . We denote the marginal support of Y and X by Y and X , respectively, andtheir joint support by YX .Given a random vector ( Y, X (cid:48) ) (cid:48) with support YX = Y × X where Y = R , for some b ∈ R JK a GT regression model takes the form(2.2) e = b (cid:48) T ( X, Y ) , e | X ∼ N (0 , , T ( X, Y ) ≡ W ( X ) ⊗ S ( Y ) , with derivative function,(2.3) ∂ y { b (cid:48) T ( X, Y ) } = b (cid:48) t ( X, Y ) > , t ( X, Y ) ≡ W ( X ) ⊗ s ( Y ) , and where we use the Kronecker product ⊗ to define the dictionary formed with W ( X ), S ( Y ) and their interactions as T ( X, Y ), and the corresponding derivative vec-tor as t ( X, Y ). The GT g ( Y, X ) in (2.1) is specified as a linear combination of theknown functions T ( X, Y ), and hence of the components W ( X ), S ( Y ) and their inter-actions. The linear form of e is preserved by the derivative function b (cid:48) t ( X, Y ) whichis simultaneously specified as a linear combination of the known functions t ( X, Y ).This linear specification can be viewed as an approximation to the general Gaussiantransformation (2.1) when, for a specified dictionary T ( X, Y ), there is no b ∈ R JK such that (2.2)-(2.3) hold. We analyze this case in Section 3.An interpretation of model (2.2)-(2.3) as a varying coefficients model arises from speci-fying e and its derivative function as a linear combination of the known functions S ( Y )and s ( Y ), respectively,(2.4) e = β ( X ) (cid:48) S ( Y ) , ∂ y { β ( X ) (cid:48) S ( Y ) } = β ( X ) (cid:48) s ( Y ) > , with the vector of varying coefficients β ( X ) = ( β ( X ) , . . . , β J ( X )) (cid:48) specified as(2.5) β j ( X ) = b (cid:48) j W ( X ) , j ∈ { , . . . , J } , ith b j = ( b j , . . . , b jK ) (cid:48) , j ∈ { , . . . , J } . Together (2.4)-(2.5) give the linear form J (cid:88) j =1 β j ( X ) S j ( Y ) = J (cid:88) j =1 { b (cid:48) j W ( X ) } S j ( Y ) = b (cid:48) [ W ( X ) ⊗ S ( Y )] = b (cid:48) T ( X, Y ) , with derivative b (cid:48) t ( X, Y ) >
0, which has the form of (2.2)-(2.3). Since the derivativecondition requires β ( X ) (cid:48) s ( Y ) >
0, it is necessary to formulate β ( X ) and s ( Y ) so thatthis is at least possible. A sufficient condition is that both vectors be nonnegativewith probability one. This requirement will for instance be satisfied with b > W ( X ) and s ( Y ) are specified as nonnegative splinefunctions (Curry and Schoenberg, 1966; Ramsay, 1988). In that particular case, werefer to the resulting Gaussian representations as ‘Spline-Spline models’.With J = 2, the important special case of a Gaussian location-scale representationcan be expressed in terms of representation (2.4) as e = β ( X ) + β ( X ) Y, e | X ∼ N (0 , , β j ( X ) ≡ b (cid:48) j W ( X ) , j ∈ { , } , with derivative function β ( X ) >
0, which is of the form (2.2)-(2.3) with S ( Y ) =(1 , Y ) (cid:48) . With β ( X ) = b (cid:48) W ( X ) and β ( X ) ≡ b ∈ R , this specification specializesto the Gaussian location representation e = b (cid:48) W ( X ) + b Y , where b > Y given X implied by (2.2)-(2.3) are(2.6) F Y | X ( y | X ) = Φ( b (cid:48) T ( X, y )) , f Y | X ( y | X ) = φ ( b (cid:48) T ( X, y )) { b (cid:48) t ( X, y ) } , y ∈ R , respectively, and the CQF of Y given X is(2.7) Q Y | X ( u | X ) = h ( X, Φ − ( u )) , u ∈ (0 , , where e (cid:55)→ h ( X, e ) is the well-defined inverse of y (cid:55)→ Φ( b (cid:48) T ( X, y )). With J = 2, theconditional distribution of Y is restricted to Gaussianity for all values of X since theJacobian term b (cid:48) t ( X, y ) = W ( X ) (cid:48) b in (2.6) does not depend on Y . Theorem 1.
For model (2.2)-(2.3), the distributional regression functions take theform (2.6)-(2.7).
Theorem 1 demonstrates that model (2.2)-(2.3) corresponds to a well-defined probabil-ity distribution for Y given X with Gaussian representation in linear form. Therefore,model (2.2)-(2.3) gives a valid representation for the distributional regression func-tions (2.6)-(2.7). Upon setting W ( X ) = 1, Theorem 1 implies that model (2.2)-(2.3)admits distributional models for marginal distribution, quantile and density functions f Y as a particular case. We note that Theorem 1 also implies that the conditionallog density function of Y given X takes the form:log f Y | X ( Y | X ) = −
12 [log(2 π ) + { b (cid:48) T ( X, Y ) } ] + log( b (cid:48) t ( X, Y )) . We use this formulation to give an ML characterization of b , and hence of b (cid:48) T ( X, Y )and the corresponding distributional regression functions.
Remark . Our modeling framework also applies when Y is bounded since Y canalways be monotonically transformed to a random variable with support the real line,e.g., with e = Φ − ( F Y ( Y )) ≡ g ( Y ), where F Y ( Y ) is the marginal distribution of Y .For the GT regression model e = (cid:101) b (cid:48) T ( X, g ( Y )) ≡ (cid:101) g ( g ( Y ) , X ), e | X ∼ N (0 , ∂ y { (cid:101) g ( g ( Y ) , X ) } >
0, the corresponding conditional CDF of Y given X isPr[ Y ≤ y | X ] = Pr[ (cid:101) g ( g ( Y ) , X ) ≤ (cid:101) g ( g ( y ) , X ) | X ] = Φ( (cid:101) g ( g ( y ) , X )), y ∈ R . (cid:3) Remark . With multiple outcomes ( Y , . . . , Y M ) (cid:48) ≡ Y , M ≥
2, writing Y m ≡ ( Y , . . . , Y m ) (cid:48) , a compact generalization of (2.2)-(2.3) is the recursive formulation e m = T m ( X, Y m ) (cid:48) b ,m , e m | X, Y m − ∼ N (0 , , m ∈ { , . . . , M } ,e = T ( X, Y ) (cid:48) b , , e | X ∼ N (0 , , where T m ( X, Y m ) ≡ T m − ( X, Y m − ) ⊗ S m ( Y m ) and T ( X, Y ) ≡ W ( X ) ⊗ S ( Y ), withderivative functions, ∂ y m { T m ( X, Y m ) (cid:48) b ,m } = t m ( X, Y m ) (cid:48) b ,m > , m ∈ { , . . . , M } , where t m ( X, Y m ) ≡ t m − ( X, Y m − ) ⊗ s m ( Y m ), m ∈ { , . . . , M } , and t ( X, Y ) ≡ W ( X ) ⊗ s ( Y ). By construction, the Gaussian representations e , . . . , e M are jointlyGaussian and mutually independent, with variance-covariance the identity matrix,i.e., ( e , . . . , e M ) ∼ (cid:81) Mm =1 Φ( e m ). This is a Gaussian version of Rosenblatt (1952)’smultivariate probability transformation. By recursive application of Theorem 1, theimplied conditional CDF of Y given X is F Y | X ( y , . . . , y M | X ) = ˆ y −∞ . . . ˆ y M −∞ f Y | X ( t , . . . , t M | X ) dt . . . dt M where the PDF of Y given X takes the form f Y | X ( y , . . . , y M | X ) = M (cid:89) m =1 φ ( T m ( X, y m ) (cid:48) b ,m ) { t m ( X, y m ) (cid:48) b ,m } , y m ≡ ( y , . . . , y m ) , for all y , . . . , y M ∈ R . The implied distributional regression functions of Y m given( X, Y (cid:48) m − ) (cid:48) are defined analogously to (2.6)-(2.7), for each m ∈ { , . . . , M } . (cid:3) .2. Characterization and identification.
For the set of parameter values thatsatisfy the derivative condition (2.3),Θ = (cid:8) b ∈ R JK : Pr [ b (cid:48) t ( X, Y ) >
0] = 1 (cid:9) , we define the population objective function(2.8) Q ( b ) = E (cid:20) − (cid:0) log(2 π ) + { b (cid:48) T ( X, Y ) } (cid:1) + log ( b (cid:48) t ( X, Y )) (cid:21) , b ∈ Θ . This criterion introduces a natural logarithmic barrier function (e.g., Boyd and Van-denberghe, 2004) in the form of the log of the Jacobian term ∂ y { b (cid:48) T ( X, Y ) } . Thisis important because the derivative function b (cid:48) t ( X, Y ) enters the log term and themonotonicity requirement for the conditional CDF and CQF is thus imposed directlyby the objective in the definition of the effective domain of Q ( b ), i.e., the region in R JK where Q ( b ) > −∞ . An equivalent interpretation is that the effective domain of Q ( b ) contains the set of parameter values that are admissible for GT regression mod-els with strictly positive conditional PDF, by virtue of the presence and properties ofboth the Gaussian density function and the logarithmic barrier function in (2.8).We characterize the shape and properties of Q ( b ) under the following main assumption. Assumption 1. E [ || T ( X, Y ) || ] < ∞ , E [ || t ( X, Y ) || ] < ∞ , and the smallest eigen-value of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] is bounded away from zero.These conditions restrict the set of dictionaries we allow for, as well as the probabil-ity distribution of Y conditional on X . In particular, because T ( X, Y ) includes Y ,Assumption 1 requires Y to have finite second moment. The moment conditions inAssumption 1 are also sufficient for the second-derivative matrix of Q ( b ),(2.9)Γ( b ) ≡ E [ γ ( Y, X, b )] , γ ( Y, X, b ) ≡ − T ( X, Y ) T ( X, Y ) (cid:48) − t ( X, Y ) t ( X, Y ) (cid:48) { b (cid:48) t ( X, Y ) } , b ∈ Θ , to exist. Nonsingularity of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] then guarantees that Γ( b ) is negativedefinite, and hence that Q ( b ) is strictly concave and admits a unique maximum. Non-singularity of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] is thus sufficient for identification of b , and the GT g ( Y, X ) is identified as a known linear combination of the known functions T ( X, Y ),and hence the distributional regression functions also are identified.
Theorem 2.
For model (2.2)-(2.3), if Assumption 1 holds then Q ( b ) has a uniquemaximum in Θ at b . Consequently, b , the GT g ( Y, X ) and the distributional regres-sion functions are identified. y Theorem 2, b is the only solution to the first-order conditions(2.10) E [ ψ ( Y, X, b )] = 0 , ψ ( Y, X, b ) ≡ − T ( X, Y )( b (cid:48) T ( X, Y )) + t ( X, Y ) b (cid:48) t ( X, Y ) , b ∈ Θ . For the baseline case where Y has a zero mean and unit variance Gaussian distributionand is independent from X , we have that b (cid:48) T ( X, Y ) = Y and b (cid:48) t ( X, Y ) = 1 satisfythe conditions of model (2.2)-(2.3). Theorem 2 then implies that conditions (2.10) areuniquely satisfied by b = (0 , , JK − ) (cid:48) . This fact can be directly verified: E [ ψ ( Y, X, b )] = E [ − T ( X, Y ) Y + t ( X, Y )] = E [ W ( X ) ⊗ {− S ( Y ) Y + s ( Y ) } ]= E [ W ( X )] ⊗ E [ − S ( Y ) Y + s ( Y )] = 0 , since E [ − S ( Y ) Y + s ( Y )] = 0 has the form of the Stein equation for a standardGaussian random variable (e.g., Lemma 2.1 in Chen, Goldstein, and Shao, 2010),and hence holds for any vector of continuously differentiable functions S ( Y ) with E [ | s j ( Y ) | ] < ∞ , j ∈ { , . . . , J } . In contrast, conditions (2.10) holding with b (cid:54) =(0 , , JK − ) (cid:48) will indicate deviations of Y from Gaussianity and independence from X , thereby characterizing a transformation to Gaussianity of Y for almost everyvalue of X since b satisfies (2.2)-(2.3). Hence, we have the following direct testableimplications of Theorem 2. Corollary 1.
If there exists b such that model (2.2)-(2.3) holds then, for any vec-tors of functions (cid:102) W ( X ) and of continuously differentiable functions (cid:101) S ( e ) such that (cid:101) T ( X, e ) ≡ (cid:102) W ( X ) ⊗ (cid:101) S ( e ) and (cid:101) t ( X, e ) ≡ ∂ e (cid:101) T ( X, e ) satisfy Assumption 1 with T = (cid:101) T , t = (cid:101) t and Y = e , the following hold: (i) (0 , , JK − ) (cid:48) is the unique solution to max b ∈ (cid:101) Θ E [ − (log(2 π ) + { b (cid:48) (cid:101) T ( X, e ) } ) / b (cid:48) (cid:101) t ( X, e ))] , where (cid:101) Θ ≡ { b ∈ R JK : Pr[ b (cid:48) (cid:101) t ( X, e ) >
0] = 1 } , and (ii) the ‘Stein score’ conditions E [ − (cid:101) T ( X, e ) e + (cid:101) t ( X, e )] = 0 hold.
Discussion.
The general modeling of F Y | X ( Y | X ) can be done indirectly by spec-ifying a representation for Y given X ,(2.11) Y = H ( X, e ) , e | X ∼ F e , where the function H ( X, e ) is strictly increasing in its second argument e , a scalarrandom variable with distribution F e and independent of X . The specification of both he function H and the distribution F e then determines the form of F Y | X ( Y | X ):(2.12) F Y | X ( y | X ) = F e ( H − ( y, X )) , y ∈ R , where y (cid:55)→ H − ( y, X ) denotes the inverse function of e (cid:55)→ H ( X, e ). In this approach,while in our context the statistical target of the analysis is F Y | X ( Y | X ), for a specifieddistribution F e the object of modeling is the function H ( X, e ).In Econometrics relation (2.11) is often characterized as ‘nonlinear and nonseparable’in order to draw attention to the potentially complex X – Y structure at constant e and the lack of additive structure in e (e.g., Chesher, 2003; Matzkin, 2003). Theseare essential features of H that allow for the shape of the conditional distribution of Y to vary across values of X . An alternative approach to (2.11)-(2.12) that preservesnonlinearity and nonseparability is to model F Y | X ( Y | X ) directly as(2.13) F Y | X ( Y | X ) = F e ( g ( Y, X )) , for some strictly increasing function y (cid:55)→ g ( y, X ). In the approach we propose in thispaper, with F − e denoting the inverse function of F e , for a specified distribution F e the object of modeling is the quantile transform g ( X, Y ) = F − e ( F Y | X ( Y | X )), whichby construction has distribution F e and is independent of X .The modeling of the statistical relationship between X and Y through representation(2.12) or representation (2.13) is not innocuous. In particular, with f e denoting thePDF of e , the definition of the conditional PDF of Y given X according to the indirectapproach (2.12),(2.14) f Y | X ( y | X ) = f e ( H − ( y, X )) { ∂ y H − ( y, X ) } , y ∈ R , involves the inverse function of the modeling object H . In general this inverse functiondoes not have a closed-form expression, except for some simple cases like the locationmodel H ( X, e ) ≡ X (cid:48) β + σe with σ >
0, and the location-scale model H ( X, e ) ≡ X (cid:48) β +( X (cid:48) β ) e with X (cid:48) β >
0. Furthermore, expression (2.14) gives rise to a nonconcavelikelihood for even the simplest specifications of H and F e , including the locationand location-scale models with Gaussian e (Owen, 2007; Spady and Stouli, 2018b).In contrast, a major advantage of representation (2.13) is that the correspondingexpression for f Y | X ( Y | X ) circumvents the inversion step since f Y | X ( Y | X ) = f e ( g ( Y, X )) { ∂ y g ( Y, X ) } . his formulation allows for the direct specification of flexible models for g ( Y, X ) thatare characterized by a concave likelihood. Hence, considerable computational advan-tages accrue in estimation when e = g ( Y, X ) can be computed in closed-form, asfurther demonstrated by the duality analysis in Section 6. Moreover, we show inthe next section that this formulation allows for the characterization of well-definedrepresentations for F Y | X ( Y | X ) under misspecification.3. Quasi-Gaussian Representations under Misspecification
In this section we study the properties of quasi-Gaussian representations for F Y | X ( Y | X )that are generated by maximization of the objective Q ( b ) under general misspecifica-tion, i.e., when there is no representation of the form (2.2)-(2.3) that satisfies eitherthe Gaussianity or the independence properties, or both. We establish existence anduniqueness of such quasi-Gaussian representations and we find that the implied rep-resentations for distributional regression functions are well-defined and KLIC optimalapproximations for the true distributional regression functions.3.1. Existence and uniqueness.
Assumption 1 is sufficient for characterizing thesmoothness properties and the shape of Q ( b ) on Θ. The objective function is con-tinuous and strictly concave over the parameter space, and hence admits at mostone maximizer. Existence of a maximizer, on the other hand, requires an additionalregularity condition. Assumption 2.
The joint density function f Y X ( Y, X ) of Y and X is bounded awayfrom zero with probability one.Assumptions 1 and 2 allow for the characterization of the behavior of Q ( b ) on theboundary of Θ. Under these assumptions, the level sets of Q ( b ) are compact. Com-pactness of the level sets is a sufficient condition for existence of a maximizer, and is aconsequence of the explosive behavior of the objective function at the boundary of Θ.By the quadratic term −{ b (cid:48) T ( X, Y ) } being negative, as b approaches the boundaryof Θ the log Jacobian term diverges to −∞ , and hence so does −{ b (cid:48) T ( X, Y ) } / { b (cid:48) t ( X, Y ) } ) on a set with positive probability. Under Assumption 2, this is suf-ficient to conclude that the objective function Q ( b ) diverges to −∞ , and hence thatthere exists at least one maximizer to Q ( b ) in Θ, denoted b ∗ . nder misspecification, to the maximizer b ∗ corresponds the quasi-Gaussian repre-sentation e ∗ = T ( X, Y ) (cid:48) b ∗ ≡ g ∗ ( Y, X ), where g ∗ ( Y, X ) is an element of the set offinite-dimensional representations
E ≡ { m : Pr[ m ( Y, X ) = b (cid:48) T ( X, Y )] = 1 } with b ∈ Θ. By definition of Θ, y (cid:55)→ b (cid:48) T ( X, y ) is strictly increasing for each b ∈ Θwith probability one, and hence each m ∈ E has a well-defined inverse function. Wenote that nonsingularity of E [ T ( X, Y ) T ( X, Y ) (cid:48) ] implies that g ∗ ( Y, X ) is unique in E ,i.e., there is no m = g ∗ in E with m ( Y, X ) = b (cid:48) T ( Y, X ) and b (cid:54) = b ∗ .Define the range of y (cid:55)→ Φ( m ( y, x )) as U x ( m ) ≡ { u ∈ (0 ,
1) : Φ( m ( y, x )) = u for some y ∈ R } , for m ∈ E and x ∈ X . To the quasi-Gaussian representation g ∗ ( Y, X ) correspond flexible approximations for the conditional CDF and CQF of Y given X , defined as F ∗ ( Y, X ) ≡ Φ( g ∗ ( Y, X )) , Q ∗ ( u, X ) ≡ h ∗ ( X, Φ − ( u )) , u ∈ U X ( g ∗ ) , where e (cid:55)→ h ∗ ( X, e ) denotes the inverse of y (cid:55)→ g ∗ ( y, X ), and for the conditional PDFof Y given X , defined as(3.1) f ∗ ( Y, X ) ≡ φ ( g ∗ ( Y, X )) { ∂ y g ∗ ( Y, X ) } . These representations are unique in, respectively, the following spaces
F ≡ { F : Pr[ F ( Y, X ) = Φ( m ( Y, X ))] = 1 }Q ≡ (cid:8) Q : Pr[ Q ( u, X ) = q ( X, Φ − ( u )) for all u ∈ U X ( m )] = 1 (cid:9) D ≡ { f : Pr[ f ( Y, X ) = φ ( m ( Y, X )) { ∂ y m ( Y, X ) } ] = 1 } with m ∈ E , and where e (cid:55)→ q ( X, e ) denotes the inverse of y (cid:55)→ m ( y, X ). Therefore,the approximations for the distributional regression functions are well-defined, andthe conditional CDF and CQF approximations satisfy global monotonicity. Theorem 3.
If Assumptions 1-2 hold then there exists a unique maximum b ∗ to Q ( b ) in Θ . Consequently, the quasi-Gaussian representation g ∗ ( Y, X ) and the correspondingapproximations for the distributional regression functions are unique. KLIC optimality.
When the elements of D are proper conditional probabilitydistributions that integrate to one, a further motivation for the use of the proposedloss function Q ( b ) is the information-theoretic optimality of the implied distributionalregression functions under misspecification (White, 1982). ince each f ∈ D satisfies f > f ∈ D is a properconditional PDF if it satisfies ´ R f ( y, X ) dy = 1 with probability one. A necessary andsufficient condition for this to hold is that the boundary conditions(3.2) lim y →−∞ b (cid:48) T ( X, y ) = −∞ , lim y →∞ b (cid:48) T ( X, y ) = ∞ , hold with probability one, for all b ∈ Θ. Given a specified dictionary such that(3.2) holds, Theorem 3 implies that the approximation f ∗ ( Y, X ) in (3.1) is the uniquemaximum selected by the population criterion in D , i.e., f ∗ = arg max f ∈D E [log f ( Y, X )] , and hence that f ∗ ( Y, X ) is the KLIC closest probability distribution to f Y | X ( Y | X ).The corresponding F ∗ and Q ∗ are then the KLIC optimal conditional CDF and CQFapproximations for F Y | X ( Y | X ) and Q Y | X ( u | X ), respectively. Theorem 4. If E [ | log f Y | X ( Y | X ) | ] < ∞ and the boundary conditions (3.2) hold withprobability one for all b ∈ Θ , then f ∗ is the KLIC closest probability distribution to f Y | X ( Y | X ) in D , i.e., f ∗ = arg min f ∈D E (cid:20) log (cid:18) f Y | X ( Y | X ) f ( Y, X ) (cid:19)(cid:21) , where each f ∈ D is a proper conditional PDF. Moreover, f ∗ is related to the KLICoptimal conditional CDF F ∗ in F by F ∗ ( y, X ) = ˆ y −∞ f ∗ ( t, X ) dt, y ∈ R , and to the well-defined inverse of y (cid:55)→ F ∗ ( y, X ) , the KLIC optimal CQF u (cid:55)→ Q ∗ ( X, u ) in Q with derivative ∂Q ∗ ( X, u ) ∂u = 1 f ∗ ( Q ∗ ( X, u ) , X ) > , u ∈ (0 , , with probability one. Under the boundary conditions (3.2), the set F is the space of conditional CDFs withGaussian representation in linear form, and the set Q is the space of correspondingwell-defined CQFs. A necessary and sufficient condition for (3.2) is obtained, forinstance, if the limits lim y →±∞ | S j ( y ) | are finite, j ∈ { , . . . , J } . Under this maintained ondition, the varying coefficients representation e = β ( X ) (cid:48) S ( Y ) in (2.4), written as e = β ( X ) (cid:48) S ( Y ) = β ( X ) + β ( X ) Y + J (cid:88) j =3 β j ( X ) S j ( Y ) , implies that β ( X ) > y →∞ β ( X ) (cid:48) S ( y ) would be finite or −∞ , and lim y →−∞ β ( X ) (cid:48) S ( y ) would befinite or ∞ . The support of Y being the entire real line, β ( X ) > β ( X ) > β ( X ) (cid:48) s ( Y ) = β ( X ) + (cid:80) Jj =3 β j ( X ) s j ( Y ) > s j ( Y ), j ∈ { , . . . , J } , are specified to be zero outside some compact region of R , since thederivative then reduces to β ( X ) outside this region. The boundary conditions (3.2)then effectively hold under a location-scale restriction in the tails of the distributionof Y given X . We also note that (3.2) always holds for J = 2 since the derivativecondition is β ( X ) > Remark . Another interpretation arises for the quasi-Gaussian representation e ∗ = g ∗ ( Y, X ) by writing e ∗ = [ W ( X ) ⊗ S ( Y )] (cid:48) b ∗ = K (cid:88) k =1 W k ( X ) { S ( Y ) (cid:48) b ∗ k } = K (cid:88) k =1 W k ( X ) β ∗ k ( Y ) = W ( X ) (cid:48) β ∗ ( Y ) , with β ∗ ( Y ) = ( β ∗ ( Y ) , . . . , β ∗ K ( Y )) (cid:48) a vector of varying coefficients specified as β ∗ k ( Y ) ≡ S ( Y ) (cid:48) b ∗ k where b ∗ k = ( b ∗ k , . . . , b ∗ kJ ) (cid:48) , k ∈ { , . . . , K } . 
Under the conditions of Theorem4, F ∗ ( Y, X ) = Φ( W ( X ) (cid:48) β ∗ ( Y )) , is the KLIC optimal conditional CDF in F for a distribution regression model ofthe form F Y | X ( Y | X ) = Φ( W ( X ) (cid:48) β ( Y )) (Foresi and Perrachi, 1995; Chernozhukov,Fernandez-Val, and Melly, 2013), where β ( Y ) is a vector of unknown functions. (cid:3) Remark . If some component x (cid:55)→ W k ( x ) of W ( X ) has range the entire real line,then the corresponding varying coefficient β ∗ k ( Y ) must be zero with probability onesince b ∗ ∈ Θ and there is no b ∗ ∈ Θ such that b ∗ k (cid:54) = 0 if x (cid:55)→ W k ( x ) has range R . (cid:3) This and the maintained assumption that lim y →±∞ | S j ( y ) | < ∞ are satisfied for instance if, for each j ∈ { , . . . , J } , the transformations S j ( Y ) are defined as S j ( y ) ≡ ´ y −∞ s j ( t ) dt , for nonnegative splinefunctions s j ( Y ) (cid:54) = 0 on a compact subset of R , as s j ( Y ) = 0 outside this region and S j ( Y ) is then aCDF over the entire real line (Curry and Schoenberg, 1966; Ramsay, 1988). . Estimation, Inference, and Model Specification
Our characterization of GT regression models and of KLIC optimal approximationshas a natural finite sample counterpart. We use the sample analog of the popula-tion objective function (2.8) to propose an ML estimator for GT regression models,which is also asymptotically valid for quasi-Gaussian representations under misspeci-fication. We establish the asymptotic properties of the estimator, and extend the MLformulation in order to allow for potentially sparse Gaussian representations by usingthe ML estimator as a first step for an adaptive Lasso (Zou, 2006) ML estimator.This formulation serves as a model selection procedure, and we derive the asymptoticdistribution of the corresponding estimators for the selected distributional regressionmodel.4.1.
Maximum Likelihood estimation.
We assume that we observe a sample of n independent and identically distributed realizations { ( y i , x i ) } ni =1 of the random vector( Y, X (cid:48) ) (cid:48) . The sample analog of Q ( b ) defines the GT regression empirical loss function: Q n ( b ) ≡ n − n (cid:88) i =1 (cid:26) −
12 [log(2 π ) + { b (cid:48) T ( x i , y i ) } ] + log( b (cid:48) t ( x i , y i )) (cid:27) , b ∈ Θ . The GT regression estimator is(4.1) (cid:98) b ≡ arg max b ∈ Θ Q n ( b ) . We derive the asymptotic properties of ˆ b under the following assumptions. Assumption 3. (i) { ( y i , x i ) } ni =1 are identically and independently distributed, and(ii) E [ || T ( X, Y ) || ] < ∞ .Assumption 3(i) can be replaced with the condition that { ( y i , x i ) } ni =1 is stationaryand ergodic (Newey and McFadden, 1994). Assumption 3(ii) is needed for consistentestimation of the asymptotic variance-covariance matrix of (cid:98) b .Recalling the definitions of γ ( Y, X, b ) and Γ( b ) in (2.9) and ψ ( Y, X, b ) in (2.10),the variance-covariance matrix of ˆ b is Γ − ΨΓ − /n , where Γ ≡ Γ( b ∗ ) and Ψ ≡ E [ ψ ( Y, X, b ∗ ) ψ ( Y, X, b ∗ ) (cid:48) ]. The corresponding estimators of Γ and Ψ are defined as (cid:98) Γ = n − (cid:80) ni =1 γ ( y i , x i , ˆ b ) and (cid:98) Ψ = n − (cid:80) ni =1 ψ ( y i , x i , ˆ b ) ψ ( y i , x i , ˆ b ) (cid:48) , respectively. Thenext theorem states the asymptotic properties of the GT regression estimator. heorem 5. If Assumptions 1-3 hold, then (i) there exists ˆ b in Θ with probabilityapproaching one; (ii) ˆ b → p b ∗ ; and (iii) n / (ˆ b − b ∗ ) → d N (0 , Γ − ΨΓ − ) . Moreover, (cid:98) Γ − (cid:98) Ψ (cid:98) Γ − → p Γ − ΨΓ − . Theorem 5(i) demonstrates existence of a globally monotone representation ˆ b (cid:48) T ( Y, X )with ˆ b (cid:48) t ( Y, X ) > b ∗ such that e ∗ = T ( X, Y ) (cid:48) b ∗ is either not Gaussian or not independent from X , or both. Un-der correct specification, the information matrix equality (e.g., Newey and McFad-den, 1994) implies that Γ = − Ψ and that the estimator is efficient, with asymptoticvariance-covariance matrix − Γ − . The information matrix equality provides a testableimplication of the validity of model (2.2)-(2.3) and forms the basis of a specificationtest in finite samples (White, 1982; Chesher and Spady, 1991). Penalized estimation.
In general the components of a specified dictionary T ( X, Y ) that are sufficient for g ∗ ( Y, X ) to be Gaussian and independent from X are not known. The components of T ( X, Y ) that do not improve the quality of theGT approximation, as measured by the KLIC, have zero coefficients. For selection ofcomponents with nonzero coefficients, we use a penalized ML procedure based on theadaptive Lasso (Lu, Goldberg, and Fine, 2012; Horowitz and Nesheim, 2020) that pre-serves ML KLIC optimality and strict concavity of the objective function. Horowitzand Nesheim (2020) also find that ML adaptive Lasso leads to asymptotic mean-squareerror improvements for nonzero coefficients. Under misspecification, adaptive LassoGT regression selects the KLIC optimal sparse approximation for g ( Y, X ). We notethat we do not assume that the true or pseudo-true parameter vector is sparse.The adaptive Lasso GT regression estimator is defined as(4.2) ˆ b AL ≡ arg max b ∈ Θ n Q n ( b ) − λ n JK (cid:88) l =1 (cid:98) w l | b l | , (cid:98) w l ≡ (cid:40) | ˆ b l | if ˆ b l (cid:54) = 00 if ˆ b l = 0 , where λ n > (cid:98) w l are obtained from afirst-step estimate (4.1). Alternatively, a bootstrap-based specification test can be formulated such as the conditional Kol-mogorov specification test of Andrews (1997) where critical values are obtained using a parametricbootstrap procedure. e write b ∗ = ( b ∗ (cid:48) A , b ∗ (cid:48) A c ) (cid:48) , where b ∗A is a p -dimensional vector of nonzero parametersand b ∗A c is a ( J K − p )-dimensional vector of zero parameters, with p ≤ J K . Thevector (cid:98) b AL = ( (cid:98) b (cid:48)A , (cid:98) b (cid:48)A c ) (cid:48) is written similarly. We state the asymptotic properties of (cid:98) b AL . Theorem 6.
Suppose that Assumptions 1-3 hold, and that λ n → ∞ and n − / λ n → as n → ∞ . Then (i) Pr[ (cid:98) b A c = 0] → , and (ii) n / (ˆ b A − b ∗A ) → d N (0 , Γ − A Ψ A Γ − A ) , where Γ A and Ψ A are the upper left p × p blocks of Γ and Ψ , respectively. Estimation of distributional regression functions.
Estimators of the dis-tributional regression functions are formed as known functionals of an estimator for b ∗ . Let T A ( X, Y ) denote the subvector of T ( X, Y ) corresponding to the componentsof (cid:98) b A , and define T A c ( X, Y ) analogously. Let (cid:98) b † denote either the ML estimator (cid:98) b orthe penalized ML estimator ( (cid:98) b (cid:48)A , JK − p ) (cid:48) , and let T † ( X, Y ) = T ( X, Y ) if (cid:98) b † = (cid:98) b , and T † ( X, Y ) = ( T A ( X, Y ) (cid:48) , T A c ( X, Y ) (cid:48) ) (cid:48) otherwise. The estimators for the GT g ∗ ( y, x )are formed as (cid:98) g ∗ ( y, x ) ≡ T † ( x, y ) (cid:48) (cid:98) b † , ( y, x ) ∈ YX . The corresponding estimators forthe distributional regression functions are defined as (cid:98) F ∗ ( y, x ) ≡ Φ( (cid:98) g ∗ ( y, x )) , (cid:98) f ∗ ( y, x ) ≡ φ ( (cid:98) g ∗ ( y, x )) { ∂ y (cid:98) g ∗ ( y, x ) } , ( y, x ) ∈ YX , and (cid:98) Q ∗ ( x, u ) ≡ { y ∈ R : Φ( (cid:98) g ∗ ( y, x )) = u, ∂ y (cid:98) g ∗ ( y, x ) > } , x ∈ X , u ∈ U x ( g ∗ ) . The asymptotic distribution of both ML and adaptive Lasso estimators for distribu-tional regression functions follows by application of the Delta method.
Theorem 7.
Suppose that Ξ ≡ Γ †− Ψ † Γ †− is positive definite. Under Assumptions1-3 we have: (i) for ( y, x ) ∈ YX , n ( (cid:98) F ∗ ( y, x ) − F ∗ ( y, x )) → d N (cid:0) , φ ( g ∗ ( y, x )) T ( x, y ) (cid:48) Ξ T ( x, y ) (cid:1) , and n ( (cid:98) f ∗ ( y, x ) − f ∗ ( y, x )) → d N (cid:0) , φ ( g ∗ ( y, x )) ∆( x, y ) (cid:48) Ξ∆( x, y ) (cid:1) , where ∆( x, y ) ≡ − g ∗ ( y, x ) { ∂ y g ∗ ( y, x ) } T ( x, y ) + t ( x, y ) ; (ii) for x ∈ X and u ∈ U x ( g ∗ ) , n ( (cid:98) Q ∗ ( x, u ) − Q ∗ ( x, u )) → d N (cid:18) , { ∂ y g ∗ ( y , x ) } T ( x, y ) (cid:48) Ξ T ( x, y ) (cid:19) , where y = Q ∗ ( u, x ) . he asymptotic variance of both the unpenalized and the penalized estimators dependson the asymptotic variance-covariance matrix Ξ of (cid:98) b † , and is computed by substitutingthe corresponding estimator according to Theorems 5 or 6, respectively. Remark . For implementation of the unpenalized estimator (4.1) we expand theoriginal parameter space Θ to Θ n = { b ∈ R JK : b (cid:48) t ( x i , y i ) > , i ∈ { , . . . , n }} , theeffective domain of Q n ( b ). This implies that there exists b ∈ Θ n such that b (cid:48) t ( X, Y ) ≤ (cid:98) b ∈ Θ holds after estimation by checking thequasi-global monotonicity (QGM) property (cid:98) b (cid:48) t ( x, (cid:98) Q ∗ ( x, u )) > X , for each quantile level u of interest. If QGM is violated for some ( x, u )in this grid, then (cid:98) b is reestimated repeatedly by adding an increasing number of linearinequality constraints of the form b (cid:48) T ( x, y ) ≥ (cid:15) on a coarse grid covering Y × X , forsome small constant (cid:15) >
0, until QGM is satisfied. (cid:3)
Remark . For implementation of the penalized estimator (4.2) we also expand theoriginal parameter space Θ to Θ n but do not consider adding monotonicity constraints.Instead, we rule out penalization parameter values λ n for which the QGM propertydoes not hold. (cid:3) Duality Theory
Considerable computational advantages accrue from the concave likelihood formula-tion we propose, where the GT is expressed in closed-form. To the GT regressionproblem (4.1) corresponds a dual formulation that can be cast into the modern con-vex programming framework (Boyd and Vandenberghe, 2004). We derive this dualformulation and establish the properties of the corresponding dual solutions.
Theorem 8.
If Assumptions 1-3 are satisfied then the following hold.(i) The dual of (4.1) is min ( u,v ) ∈ R n × ( −∞ , n − n (cid:18)
12 log(2 π ) + 1 (cid:19) + n (cid:88) i =1 (cid:26) u i − log ( − v i ) (cid:27) (5.1) subject to − n (cid:88) i =1 { T ( x i , y i ) u i + t ( x i , y i ) v i } = 0(5.2) the dual GT regression problem, with solution (cid:98) α = ( (cid:98) u (cid:48) , (cid:98) v (cid:48) ) (cid:48) . ii) The dual GT regression program (5.1)-(5.2) admits the method-of-moments rep-resentation n (cid:88) i =1 (cid:26) − T ( x i , y i ) { b (cid:48) T ( x i , y i ) } + t ( x i , y i ) b (cid:48) t ( x i , y i ) (cid:27) = 0 , the first-order conditions of (4.1).(iii) With probability approaching one we have: (a) existence and uniqueness, i.e.,there exists a unique pair ( (cid:98) b (cid:48) , (cid:98) α (cid:48) ) (cid:48) that solves (4.1) and (5.1)-(5.2), and (5.3) (cid:98) u i = (cid:98) b (cid:48) T ( x i , y i ) , ˆ v i = − (cid:98) b (cid:48) t ( x i , y i ) , i ∈ { , . . . , n } ; (b) strong duality, i.e., the value of (4.1) equals the value of (5.1)-(5.2). The dual formulation established in Theorem 8 demonstrates important computationalproperties of GT regression. The Hessian matrix of the dual problem (5.1)-(5.2) is (cid:34) I n n × n n × n diag(1 /v i ) (cid:35) , a positive definite diagonal matrix for all v ∈ ( −∞ , n , with I n denoting the n × n iden-tity matrix and diag(1 /v i ) the n × n diagonal matrix with elements (1 /v , . . . , /v n ).Thus the dual problem is a strictly convex mathematical program with sparse Hessianmatrix and J K linear constraints. This computationally convenient formulation isexploited by state-of-the-art convex programming solvers like ECOS (Domahidi, Chu,and Boyd, 2013) and SCS (O’Donoghue, Chu, Parikh, and Boyd, 2016) that we usein our implementation.In addition to KLIC optimality of the solution and the presence of a logarithmic barrierfor global monotonicity in the objective, linearity of the constraints is an importantadvantage of the dual formulation (5.1)-(5.2) relative to the alternative generalizeddual regression characterization of CQFs and conditional CDFs (Spady and Stouli,2018a) for which the mathematical program is of the form(5.4) max e ∈ R n (cid:40) y (cid:48) e : n (cid:88) i =1 T ( x i , e i ) = 0 (cid:41) , where T ( x i , e i ) is a specified vector of known functions of x i and e i including e i and( e i − /
2, so that the parameter vector e enters nonlinearly into the constraints. Thefirst-order conditions of (5.4) are(5.5) y i = (cid:98) b (cid:48) { ∂ e i T ( x i , e i ) } , i ∈ { , . . . , n } , here (cid:98) b is the Lagrange multiplier vector for the constraints in (5.4), but where thesolution is now determined by a system of n nonlinear equations instead of havinga closed-form expression as in (5.3). This is a further illustration of the importantbenefits accruing from closed-form modeling of the GT e = g ( Y, X ) and its derivativefunction, compared to direct modeling of the outcome y i in (5.5).The dual formulation extends to the penalized estimator (4.2). Theorem 9.
The dual of (4.2) is min ( u,v ) ∈ R n × ( −∞ , n − n (cid:18)
12 log(2 π ) + 1 (cid:19) + n (cid:88) i =1 (cid:26) u i − log( − v i ) (cid:27) , subject to (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) i =1 { T i,l u i + t i,l v i } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ λ n (cid:98) w l , l ∈ { , . . . , J K } . the dual adaptive Lasso GT regression problem.Remark . For (cid:98) w l = 1 for each l ∈ { , . . . , J K } , the dual adaptive Lasso GT re-gression problem reduces to the dual Lasso GT regression problem, with constraints (cid:107) (cid:80) ni =1 { T i u i + t i v i }(cid:107) ∞ ≤ λ n . (cid:3) An Illustrative Example
In this section we illustrate our framework with the estimation of a distributionalAR(1) model for daily temperatures in Melbourne, Australia. The dataset consists of3,650 consecutive daily maximum temperatures, and was originally analyzed by Hyn-dman, Bashtannyk, and Grunwald (1996). The estimation of distributional regressionfunctions for this dataset is challenging because the shape of the outcome distribution,today’s temperatures Y t , given yesterday’s temperature, Y t − , varies across values of Y t − . Applying quantile regression to this data set, Koenker (2000) finds that tem-peratures following very hot days are bimodally distributed, with the lower modecorresponding to a break in the temperature, that is, a much cooler temperature,whereas temperatures of days following cool days are unimodally distributed. Com-pared to Koenker (2000), we obtain CQFs that are well-behaved across the entiresupport of the data, we estimate the corresponding conditional PDFs and CDFs, andwe provide confidence bands for all distributional regression functions. e illustrate the main features of the GT regression methodology by implementingboth unpenalized and penalized estimation for four different classes of model specifi-cations for e ∗ = g ∗ ( Y t , Y t − ) = [ W ( Y t − ) ⊗ S ( Y t )] (cid:48) b ∗ and its derivative function:(1) Linear-Linear: we set s ( Y t ) = (0 , (cid:48) , S ( Y t ) = (1 , Y t ) (cid:48) and W ( Y t − ) = (1 , Y t − ) (cid:48) .(2) Linear- Y and Spline- X : we set s ( Y t ) = (0 , (cid:48) , S ( Y t ) = (1 , Y t ) (cid:48) and W ( Y t − ) =(1 , (cid:102) W ( Y t − ) (cid:48) ) (cid:48) , with (cid:102) W ( Y t − ) a vector of K − Y and Linear- X : we set s ( Y t ) = (0 , , (cid:101) s ( Y t ) (cid:48) ) (cid:48) , with (cid:101) s ( Y t ) a vector of J − S ( Y t ) = (1 , Y t , (cid:101) S ( Y t ) (cid:48) ) (cid:48) where (cid:101) S j ( y t ) = ´ y t −∞ (cid:101) s ( r ) dr , j ∈ { , . . . , J − } , and W ( Y t − ) = (1 , Y t − ) (cid:48) .(4) Spline-Spline: we set s ( Y t ) = (0 , , (cid:101) s ( Y t ) (cid:48) ) (cid:48) , S ( Y t ) = (1 , Y t , (cid:101) S ( Y t ) (cid:48) ) (cid:48) and W ( Y t − ) = (1 , (cid:102) W ( Y t − ) (cid:48) ) (cid:48) .For specification classes 2 and 4, we consider a set of models including cubic B-splinetransformations in W ( Y t − ) with K ∈ { , . . . , } and equispaced knots. For classes3 and 4 we consider a set of models including quadratic B-spline transformationsin s ( Y t ) with J ∈ { , } and of models including cubic B-splines with J ∈ { , } ,with equispaced knots. In total, we thus consider 50 different model specifications.Spline functions satisfy the conditions of our modeling framework and have beendemonstrated to be remarkably effective when applied to the related problems of logdensity estimation (Kooperberg and Stone, 2001) or monotone regression functionestimation (Ramsay, 1988).For each model specification, we implement three steps. First, we run the penalizedestimator for each of 5 λ n values in a small logarithmically spaced grid in [0 . , . Y coefficients, i.e., we set (cid:98) w = (cid:98) w J +1 = 0. Second,following the literature on adaptive Lasso ML (Lu, Goldberg, and Fine, 2012; Horowitzand Nesheim, 2020), we select the value of λ n that minimizes the Bayes informationcriterion (BIC) among penalized estimates that satisfy QGM (cf. Remark 5). Third,we record the BIC value of the corresponding selected estimate. 
In the SupplementaryMaterial we describe in detail the implementation of the QGM property, and allcomputational procedures can be implemented in the software R (R DevelopmentCore Team, 2020) using open source software packages for convex optimization suchas CVX, and its R implementation CVXR (Fu, Narasimhan, and Boyd, 2017).Figure 6.1 shows CQFs for the models with smallest recorded BIC within each ofthe specification classes 1-3, illustrating the different features of the data that each Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (a) Spec. 1.
Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (b) Spec. 2.
Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (c) Spec. 3.
Figure 6.1.
CQF with scatterplot, for u ∈ { . , . , . . . , . } .specification class captures, as well as the corresponding restrictions on the implieddistribution of Y t given Y t − . For both classes 1 and 2, this implied distribution isrestricted to Gaussianity across all values of Y t − . Figure 6.1(A) shows that specifica-tion class 1 also strongly restricts the shape of the CQFs across values of Y t − , but isable to capture some nonlinearity in Y t − . Figure 6.1(B) shows that specification class2 further allows for nonmonotonicity of the CQFs in Y t − , while capturing substantialheteroskedasticity in the data, a reflection of the more flexible functional forms for theconditional first and second moments of Y t given Y t − . In contrast with specificationclasses 1-2, for class 3 the GT g ∗ ( Y t , Y t − ) is nonlinear in Y t which allows for devia-tions of the conditional distribution of Y t given Y t − from Gaussianity, through thedependence of the derivative function on both Y t and Y t − . Figure 6.1(C) illustratesthe ability of specification class 3 to capture asymmetry of the distribution of Y t given Y t − , as well as changes in the mode location of this distribution across values of Y t − ,in addition to allowing for nonlinearity of the CQF and heteroskedasticity.Figures 6.2-6.3 show distributional regression functions for the model specificationwith smallest BIC within the Spline-Spline specification class 4. The selected modelhas smallest BIC among all specification classes 1-4, and features quadratic splines in s ( Y t ) with J = 5, and cubic splines in W ( Y t − ) with K = 7. In total, this parametriza-tion includes 35 parameters, of which 25 are estimated to be nonzero after penalization.This parsimonious Spline-Spline model is able to simultaneously capture all important Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (a) CQF with scatterplot, for u ∈ { . , . , . . . , . } . Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on Yesterday's Temperature C ond i t i ona l Q uan t il e F un c t i on (b) CQF with confidence bands, for u ∈ { . , . , . , . . , . } . Figure 6.2.
Unpenalized (left) and penalized (right) CQF.data features described above, including nonlinearity in Y t − and the varying shape ofthe conditional distribution of Y t . In particular, for the unpenalized CQF in Figure6.2, the uneven spacing of the quantiles at higher values of lagged temperature sug-gests that the conditional PDF of current temperature is bimodal at such values. Thetwo modes are especially apparent from the unpenalized PDFs displayed in 6.3(B),and are also reflected by the two inflection points in the corresponding CDF in Fig-ure 6.3(A). The right panels of Figures 6.2-6.3 show the penalized versions of the .000.250.500.751.00 10 20 30 40 Today's Temperature C ond i t i ona l C u m u l a t i v e D i s t r i bu t i on F un c t i on Today's Temperature C ond i t i ona l C u m u l a t i v e D i s t r i bu t i on F un c t i on (a) Conditional CDF.
Today's Temperature C ond i t i ona l D en s i t y F un c t i on Today's Temperature C ond i t i ona l D en s i t y F un c t i on (b) Conditional PDF.
Figure 6.3.
Unpenalized (left) and penalized (right) conditional PDFand CDF with confidence bands, for y t − ∈ { . , . , . , . , . } .distributional regression functions. In this example, the penalized estimator yieldsvery similar conclusions, the main differences being the somewhat less pronouncedbimodality for days with high temperatures, as well as the tighter confidence bandsat the CQF boundaries.Overall, we find that parsimonious Gaussian representations are able to capture com-plex features of the data, such as nonmonotonicity and conditional distributions with arying shapes, while providing complete estimates of distributional regression func-tions and their confidence bands. Importantly, the corresponding CQF estimates areendowed with the no-crossing property of quantiles over the full data support. In theSupplementary Material we assess the robustness of the selected Spline-Spline modeland find that its main features are well-preserved across specifications with similar BICvalues. Thus, although establishing the BIC properties in our context is an impor-tant topic for future research, we find that GT regression estimates of distributionalregression functions exhibit reassuring stability within a given specification class.7. Conclusion
The formulation of distributional regression models through the specification of aGT e = g ( Y, X ) leads to a unifying framework for the global estimation of statisti-cal objects of general interest, such as conditional quantile, density and distributionfunctions. The implied convex programming formulation is easy to implement andallows for estimation of sparse models. The linear form of the proposed GT regres-sion models also constitutes a good starting point for nonparametric estimation ofdistributional regression functions. In this paper we have considered a few extensionsto our original formulation such as misspecification, multiple outcomes and penalizedestimation. Our framework can also be extended to allow for outcomes with discreteor mixed discrete-continuous distributions by appropriately modifying the form of thelog-likelihood function. An important further extension we will consider in future workis the generalization of our results to distributional regression models with endogenousregressors.
Appendix A. Proof of Theorem 1
For the conditional CDF of Y given X , for all y ∈ R ,(A.1) Φ( b (cid:48) T ( X, y )) = Pr[ b (cid:48) T ( X, Y ) ≤ b (cid:48) T ( X, y ) | X ] = Pr[ Y ≤ y | X ] = F Y | X ( y | X ) , where the first equality follows from e = b (cid:48) T ( X, Y ) and e | X ∼ N (0 , y (cid:55)→ b (cid:48) T ( X, y ) strictly increasing with probability one by Lemma 1below, and the last equality is by definition of F Y | X ( y | X ). For the conditional PDF,upon differentiating y (cid:55)→ Φ( b (cid:48) T ( X, y )) and y (cid:55)→ F Y | X ( y | X ) in (A.1), we obtain φ ( b (cid:48) T ( X, y )) { b (cid:48) t ( X, y ) } = f Y | X ( y | X ) , y ∈ R , ith probability one . For the CQF, the result in (A.1) and strict monotonicity of both y (cid:55)→ b (cid:48) T ( X, y ) and e (cid:55)→ Φ( e ) together imply Φ( b (cid:48) T ( X, Q Y | X ( u | X ))) = u . Therefore,recalling that e (cid:55)→ h ( X, e ) is the inverse of y (cid:55)→ b (cid:48) T ( X, y ), we obtain Q Y | X ( u | X ) = h ( X, Φ − ( u )) , u ∈ (0 , , with probability one . (cid:3) Lemma 1.
Lemma 1. For each $b \in \Theta$, the mapping $y \mapsto \Phi(b'T(X,y))$ is strictly increasing in $y \in \mathbb{R}$ with probability one.

Proof. We note that $\partial_y \Phi(b'T(X,y)) = \phi(b'T(X,y))\{b't(X,y)\}$ for all $y \in \mathbb{R}$, with $y \mapsto \phi(b'T(X,y))\{b't(X,y)\}$ continuous, with probability one. Hence, for any $\alpha, \beta \in \mathbb{R}$, $\alpha < \beta$, by the Fundamental Theorem of Calculus,

$\Phi(b'T(X,\beta)) - \Phi(b'T(X,\alpha)) = \int_{\alpha}^{\beta} \phi(b'T(X,y))\{b't(X,y)\}\,dy > 0, \quad b \in \Theta,$

with probability one, since $b't(X,Y) > 0$ and $\phi(e) > 0$ for all $e \in \mathbb{R}$, which implies that $y \mapsto b'T(X,y)$ is strictly increasing on $\mathbb{R}$, with probability one. □
B.1. Definitions and notation.
Define

$L(Y,X,b) \equiv -\frac{1}{2}\left[\log(2\pi) + (b'T(X,Y))^2\right] + \log(b't(X,Y)), \quad b \in \Theta,$

and $f(Y,X,b) \equiv \phi(b'T(X,Y))\{b't(X,Y)\}$, $b \in \Theta$, and note that $Q(b) = E[L(Y,X,b)] = E[\log f(Y,X,b)]$, $b \in \Theta$. In Appendix B.2 we establish the main properties of $Q(b)$ used in Appendices B.3 and B.5 for the proofs of Theorems 2 and 3, respectively.
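Since the sample analogue $Q_n(b) = n^{-1}\sum_{i=1}^n L(y_i,x_i,b)$ is concave in $b$, the GT regression problem can be handed to a disciplined convex programming tool; the CVXR package of Fu, Narasimhan and Boyd (2017), cited in the references, is one option. A minimal sketch, assuming design matrices Tmat and tmat whose rows hold the basis evaluations $T(x_i,y_i)'$ and $t(x_i,y_i)'$ (both names are ours, not the paper's):

library(CVXR)  # disciplined convex optimization (Fu, Narasimhan and Boyd, 2017)

## Maximize n*Q_n(b) = sum_i { -0.5*log(2*pi) - 0.5*(b'T_i)^2 + log(b't_i) }.
gt_fit <- function(Tmat, tmat) {
  n <- nrow(Tmat)
  b <- Variable(ncol(Tmat))
  e <- Tmat %*% b     # e_i = b'T_i
  eta <- tmat %*% b   # eta_i = b't_i; log() keeps eta > 0 implicitly
  obj <- -0.5 * n * log(2 * pi) - 0.5 * sum_squares(e) + sum_entries(log(eta))
  res <- solve(Problem(Maximize(obj)))   # concave objective: global maximum
  as.numeric(res$getValue(b))
}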
B.2. Auxiliary lemmas.

Lemma 2. If Assumption 1 holds then $E[|L(Y,X,b)|] < \infty$ and $Q(b)$ is continuous over $\Theta$.

Proof. By the triangle inequality,

$E[|L(Y,X,b)|] \le \frac{1}{2}E[(b'T(X,Y))^2] + E[|\log(b't(X,Y))|] + \frac{1}{2}\log(2\pi).$

The first term $E[(b'T(X,Y))^2]/2$ is finite by the Cauchy-Schwarz inequality and by $E[\|T(X,Y)\|^2] < \infty$. For the second term, applying a mean-value expansion around $\bar{b} = (b_1, 0_{JK-1}')'$, $b_1 > 0$, gives for some intermediate value $\tilde{b}$,

$|\log(b't(X,Y))| = |\log(b_1) + (\tilde{b}'t(X,Y))^{-1}[(b-\bar{b})'t(X,Y)]| \le |\log(b_1)| + |(\tilde{b}'t(X,Y))^{-1}|\,\|b-\bar{b}\|\,\|t(X,Y)\|.$

Thus $E[|\log(b't(X,Y))|] < \infty$, since we have that $\tilde{b}'t(X,Y) > 0$ and $E[\|t(X,Y)\|] < \infty$. Therefore $E[|L(Y,X,b)|] < \infty$. Continuity of $Q(b)$ then follows from continuity of $b \mapsto L(Y,X,b)$ and dominated convergence. □
Lemma 3.
If Assumption 1 holds then $Q(b)$ is twice continuously differentiable over any compact subset $\Theta_0 \subset \Theta$, and $\nabla_{bb}E[L(Y,X,b)] = E[\nabla_{bb}L(Y,X,b)]$.

Proof. By Lemma 2, $E[|L(Y,X,b)|] < \infty$. Moreover, for $b \in \Theta_0$,

(B.1) $\|\nabla_b L(Y,X,b)\| = \|-T(X,Y)(b'T(X,Y)) + (b't(X,Y))^{-1}t(X,Y)\| \le \|T(X,Y)(b'T(X,Y))\| + |(b't(X,Y))^{-1}|\,\|t(X,Y)\| \le C\{\|T(X,Y)\|^2 + \|t(X,Y)\|\},$

for some finite constant $C > 0$. Therefore, $E[\|T(X,Y)\|^2] < \infty$ and $E[\|t(X,Y)\|] < \infty$ imply that $E[\sup_{b\in\Theta_0}\|\nabla_b L(Y,X,b)\|] < \infty$ under Assumption 1. Lemma 3.6 in Newey and McFadden (1994) then implies that $Q(b)$ is continuously differentiable in $b$, and that the order of differentiation and integration can be interchanged. Continuous differentiability of $\nabla_b Q(b)$ in $b \in \Theta_0$ follows from applying steps similar to (B.1). By $\|\nabla_{bb}L(Y,X,b)\| \le \|T(X,Y)\|^2 + C\|t(X,Y)\|^2$, for some finite constant $C > 0$, we have that $E[\|T(X,Y)\|^2] < \infty$ and $E[\|t(X,Y)\|^2] < \infty$ imply that $E[\sup_{b\in\Theta_0}\|\nabla_{bb}L(Y,X,b)\|] < \infty$ under Assumption 1. Lemma 3.6 in Newey and McFadden (1994) then implies that $\nabla_b Q(b)$ is continuously differentiable in $b$, and that the order of differentiation and integration can be interchanged. □

Lemma 4. If Assumption 1 holds then, for any compact subset $\Theta_0 \subset \Theta$, we have that $-\nabla_{bb}Q(b)$ exists for $b \in \Theta_0$, with smallest eigenvalue bounded away from zero uniformly in $b \in \Theta_0$.

Proof. By Lemma 3, $Q(b)$ is twice continuously differentiable over $\Theta_0$ and the order of differentiation and integration can be interchanged. Therefore,

$\nabla_{bb}\{-Q(b)\} = \Gamma_1 + \Gamma_2(b), \quad \Gamma_1 \equiv E[T(X,Y)T(X,Y)'], \quad \Gamma_2(b) \equiv E\left[\frac{t(X,Y)t(X,Y)'}{(b't(X,Y))^2}\right],$

exists for all $b \in \Theta_0$ under Assumption 1. Denoting the smallest eigenvalue of a matrix $A$ by $\lambda_{\min}(A)$, the result then follows from Weyl's Monotonicity Theorem (e.g., Corollary 4.3.12 in Horn and Johnson, 2012), which implies

$\lambda_{\min}(\Gamma_1 + \Gamma_2(b)) \ge \lambda_{\min}(\Gamma_1) \ge B, \quad b \in \Theta_0,$

for some constant $B > 0$, by $\Gamma_2(b)$ being positive semidefinite for all $b \in \Theta_0$ and the smallest eigenvalue of $E[T(X,Y)T(X,Y)']$ being bounded away from zero. □
B.3. Proof of Theorem 2.
B.3.1. Uniqueness.
We show that $b_0$ is a point of maximum of $Q(b)$ in $\Theta$. For $b \ne b_0$, $b \in \Theta$, by $E[\log f(Y,X,b_0)] = E[\log f_{Y|X}(Y|X)]$ and Jensen's inequality, we obtain

$E\left[\log\left(\frac{f(Y,X,b_0)}{f(Y,X,b)}\right)\right] = E\left[-\log\left(\frac{f(Y,X,b)}{f_{Y|X}(Y|X)}\right)\right] \ge -\log E\left[\frac{f(Y,X,b)}{f_{Y|X}(Y|X)}\right] = -\log E\left[\int_{\mathbb{R}} f(y,X,b)\,dy\right] \ge 0,$

since $\int_{\mathbb{R}} f(y,X,b)\,dy = \lim_{y\to\infty}\Phi(b'T(X,y)) - \lim_{y\to-\infty}\Phi(b'T(X,y)) \in (0,1]$, by $y \mapsto \Phi(b'T(X,y))$ being strictly increasing by Lemma 1. Therefore, $b_0$ is a point of maximum. Strict concavity in Lemma 4 then implies that $Q(b)$ admits at most one maximizer in every compact subset of $\Theta$, and in particular every compact subset that contains $b_0$. Hence there is no $\tilde{b} \ne b_0$ that maximizes $Q(b)$ in $\Theta$, and $b_0$ uniquely maximizes $Q(b)$ in $\Theta$. □
B.3.2. Identification.
By uniqueness of the point of maximum, for $b \ne b_0$, $b \in \Theta$, we have $E[\log f(Y,X,b_0)] - E[\log f(Y,X,b)] > 0$, which implies that $f(Y,X,b) \ne f(Y,X,b_0) = f_{Y|X}(Y|X)$, and hence that $b_0$ is identified. Identification of $g(Y,X)$ and the distributional regression functions then follows by the fact that they are known functions of $b_0$, by Theorem 1. □
B.4. Proof of Corollary 1.
The proof follows by application of Theorem 2 and by the argument in the main text, using that $e = b_0'T(X,Y) \mid X \sim N(0,1)$. □

B.5. Proof of Theorem 3.
B.5.1. Proof of existence of $b^*$. We first show that the level sets $B_\alpha = \{b \in \Theta : -Q(b) \le \alpha\}$, $\alpha \in \mathbb{R}$, of $-Q(b)$ are closed and bounded, hence compact, and then use the fact that $-Q(b)$ is continuous over $\Theta$, which implies existence of a minimizer.

Step 1. This step shows that $B_\alpha$ is bounded.

Given $b_1, b_2 \in B_\alpha$, let $t = \|b_2 - b_1\|$ and $u = (b_2 - b_1)/\|b_2 - b_1\|$, so that $\|u\| = 1$ and $b_2 = b_1 + tu$. By Lemma 3, $Q(b)$ is twice continuously differentiable for $b \in B_\alpha$. Thus, by definition of $B_\alpha$, a second-order Taylor expansion of $t \mapsto -Q(b_1 + tu)$ around $t = 0$ yields, for some $\bar{b}$ on the line connecting $b_1$ and $b_2$ and some constant $B > 0$,

$\alpha \ge -Q(b_2) = -Q(b_1 + tu) = -Q(b_1) - t\nabla_bQ(b_1)'u - \frac{t^2}{2}u'\nabla_{bb}Q(\bar{b})u \ge -Q(b_1) - t\nabla_bQ(b_1)'u + \frac{Bt^2}{2} \ge -Q(b_1) - t\|\nabla_bQ(b_1)\| + \frac{Bt^2}{2},$

where the penultimate inequality follows by Lemma 4. Fixing $b_1 \in B_\alpha$, the above inequality implies that $t$ is bounded and therefore $B_\alpha$ is bounded.

Step 2. This step shows that $B_\alpha$ is closed.

Define the boundary $\partial\Theta$ of $\Theta$ as $\partial\Theta = \{b \in \mathbb{R}^{JK} : \Pr[b't(X,Y) = 0] > 0\}$. For $b \in \partial\Theta$ with $b't(X,Y) \le 0$ on a set of positive probability, $\log(b't(X,Y))$ takes on the value $-\infty$ on that set (e.g., Section 11.2.1 in Boyd and Vandenberghe, 2004). Consider a sequence $(b_n)$ in $B_\alpha$ such that $b_n \to \check{b} \in \partial\Theta$ as $n \to \infty$. Steps 2.1 and 2.2 below show that $-Q(b_n) = E[-L(Y,X,b_n)] \to \infty$ as $n \to \infty$, and hence that $B_\alpha$ is closed.

Step 2.1. This step shows that $E[\lim_{n\to\infty}-L(Y,X,b_n)] \le \lim_{n\to\infty}E[-L(Y,X,b_n)]$.

By $B_\alpha$ being bounded, there exists a constant $C > 0$ such that $\log(b't(X,Y)) \le C\|t(X,Y)\|$ with probability one for all $b \in B_\alpha$, and hence such that

$-L(Y,X,b) = \frac{1}{2}[\log(2\pi) + (b'T(X,Y))^2] - \log(b't(X,Y)) \ge -C\|t(X,Y)\|, \quad b \in B_\alpha,$

with probability one. Therefore,

$\varphi(Y,X,b) \equiv -L(Y,X,b) + \delta(Y,X) \ge 0, \quad \delta(Y,X) \equiv C\|t(X,Y)\|, \quad b \in B_\alpha,$

with probability one, and where $E[|\delta(Y,X)|] < \infty$ under Assumption 1. Moreover, by definition of $\partial\Theta$, we have that $\lim_{n\to\infty}\log(b_n't(X,Y)) = -\infty$ on a subset $\widetilde{\mathcal{YX}}$ of the joint support of $(Y,X)$ with positive probability, and hence

(B.2) $\lim_{n\to\infty}-L(Y,X,b_n) = \frac{1}{2}[\log(2\pi) + \lim_{n\to\infty}\{b_n'T(X,Y)\}^2] - \lim_{n\to\infty}\log(b_n't(X,Y)) = \infty,$

on $\widetilde{\mathcal{YX}}$, by $\{b'T(X,Y)\}^2/2 \ge 0$ for all $b \in \mathbb{R}^{JK}$. Letting $\chi_{\widetilde{\mathcal{YX}}}(Y,X) \equiv 1\{(Y,X) \in \widetilde{\mathcal{YX}}\}$ and $\chi_{\widetilde{\mathcal{YX}}^c}(Y,X) \equiv 1\{(Y,X) \in \widetilde{\mathcal{YX}}^c\}$, with $\widetilde{\mathcal{YX}}^c$ denoting the complement of $\widetilde{\mathcal{YX}}$, we have

(B.3) $E[\lim_{n\to\infty}\varphi(Y,X,b_n)] = E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] + E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] = E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] + E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\delta(Y,X)] + E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] + E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\delta(Y,X)] = E[\lim_{n\to\infty}-L(Y,X,b_n)] + E[\delta(Y,X)],$

where the second equality follows from $\lim_{n\to\infty}-L(Y,X,b_n)$ and $\delta(Y,X)$ being nonnegative functions on $\widetilde{\mathcal{YX}}$ (e.g., Proposition 5.2.6(ii) in Rana, 2002), and $\delta(Y,X)$ and $\lim_{n\to\infty}-L(Y,X,b_n)$ having finite expectation on $\widetilde{\mathcal{YX}}^c$, since $\lim_{n\to\infty}b_n't(X,Y) > 0$ on $\widetilde{\mathcal{YX}}^c$ and $E[|L(Y,X,b)|] < \infty$ for all $b \in \Theta$ by Lemma 2. By $\varphi(Y,X,b_n)$ being nonnegative, Fatou's lemma implies that

(B.4) $E[\lim_{n\to\infty}\varphi(Y,X,b_n)] \le \lim_{n\to\infty}E[\varphi(Y,X,b_n)],$

with

(B.5) $\lim_{n\to\infty}E[\varphi(Y,X,b_n)] = \lim_{n\to\infty}E[-L(Y,X,b_n)] + E[\delta(Y,X)],$

by $E[|\delta(Y,X)|] < \infty$ and $E[|L(Y,X,b_n)|] < \infty$ for $b_n \in \Theta$ by Lemma 2. Therefore,

(B.6) $E[\lim_{n\to\infty}-L(Y,X,b_n)] \le \lim_{n\to\infty}E[-L(Y,X,b_n)]$

follows by (B.3), (B.4) and (B.5).

Step 2.2. This step shows that $E[\lim_{n\to\infty}-L(Y,X,b_n)] = \infty$, and hence $B_\alpha$ is closed.

The limit in (B.2) and the fact that $f_{YX}(Y,X)$ is bounded away from 0 with probability one together imply that $E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] = \infty$, and hence that $E[\chi_{\widetilde{\mathcal{YX}}}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] = \infty$ by $E[|\delta(Y,X)|] < \infty$. Moreover, $E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}-L(Y,X,b_n)] < \infty$, and hence $E[\chi_{\widetilde{\mathcal{YX}}^c}(Y,X)\lim_{n\to\infty}\varphi(Y,X,b_n)] < \infty$ by $E[|\delta(Y,X)|] < \infty$. Therefore, $E[\lim_{n\to\infty}\varphi(Y,X,b_n)] = \infty$, and (B.3) now implies that $E[\lim_{n\to\infty}-L(Y,X,b_n)] = \infty$. This fact and the bound (B.6) together imply that $E[-L(Y,X,b_n)] = -Q(b_n) \to \infty$ as $n \to \infty$.

We have established that the limit $\check{b}$ of a convergent sequence $(b_n)$ in $B_\alpha$ is in $\Theta$. By continuity of $-Q(b)$ over $\Theta$, we then have that $-Q(\check{b}) = \lim_{n\to\infty}-Q(b_n) \le \alpha$, and hence $\check{b} \in B_\alpha$ and $B_\alpha$ is closed.

Step 3. This step concludes.

Pick $\alpha \in \mathbb{R}$ such that $B_\alpha$ is nonempty. From Steps 1-2, $B_\alpha$ is compact by the Heine-Borel theorem. Since $-Q(b)$ is continuous over $B_\alpha$, there is at least one minimizer of $-Q(b)$ in $B_\alpha$ by the Weierstrass theorem. The existence result follows. □
B.5.2. Proof of uniqueness of $b^*$. The uniqueness result follows by strict concavity of $Q(b)$ in Lemma 4. □
B.5.3. Proof of uniqueness of $g^*(Y,X)$, $F^*(Y,X)$, $Q^*(u,X)$ and $f^*(Y,X)$. For $\tilde{b} \ne b^*$, by nonsingularity of $E[T(X,Y)T(X,Y)']$ we have that

$E[\{(\tilde{b}-b^*)'T(X,Y)\}^2] = (\tilde{b}-b^*)'E[T(X,Y)T(X,Y)'](\tilde{b}-b^*) > 0,$

which implies $(\tilde{b}-b^*)'T(X,Y) \ne 0$. Therefore, $g^*(Y,X) \ne \tilde{m}(Y,X)$ for $\tilde{m} \in \mathcal{E}$ with $\tilde{b} \ne b^*$, by definition of $\mathcal{E}$. By strict monotonicity of $e \mapsto \Phi(e)$, this also implies that $\Phi(g^*(Y,X)) \ne \Phi(\tilde{m}(Y,X))$, and hence $F^*(Y,X) \ne \tilde{F}(Y,X)$ for $\tilde{F} \in \mathcal{F}$ with $\tilde{m} \ne g^*$, by definition of $\mathcal{F}$. For $m \in \mathcal{E}$, let $\widetilde{\mathcal{Y}}_x(m) \equiv \{y \in \mathcal{Y}_x : F^*(y,x) \ne \Phi(m(y,x))\}$, where $\mathcal{Y}_x$ denotes the conditional support of $Y$ given $X = x$, and $\widetilde{\mathcal{U}}_x(m) \equiv \{u \in (0,1) : F^*(y,x) = u$ for some $y \in \widetilde{\mathcal{Y}}_x(m)\}$. With probability one, by strict monotonicity of $y \mapsto m(y,X)$ for all $m \in \mathcal{E}$, the composition $y \mapsto \Phi(\tilde{m}(y,X))$ is also strictly monotone, and hence $Q^*(\Phi^{-1}(u),X) \ne \tilde{q}(X,\Phi^{-1}(u))$, $u \in \widetilde{\mathcal{U}}_X(\tilde{m})$, for $\tilde{q} \in \mathcal{Q}$ with $\tilde{m} \ne g^*$, by definition of $\mathcal{Q}$. Finally, by $b^*$ being the unique maximizer of $Q(b)$ in $\Theta$, we have that $E[\log f^*(Y,X)] > E[\log(\phi(\tilde{m}(Y,X))\{\partial_y\tilde{m}(Y,X)\})]$ for $\tilde{m} \in \mathcal{E}$ with $\tilde{b} \ne b^*$, and hence $f^*(Y,X) \ne \tilde{f}(Y,X)$ for $\tilde{f} \in \mathcal{D}$ with $\tilde{m} \ne g^*$, by definition of $\mathcal{D}$. □

Appendix C. Proof of Theorem 4
C.1. Auxiliary lemma.

Lemma 5.
If the boundary conditions (3.2) hold for all $b \in \Theta$ with probability one, then the sets $\Theta$ and $\mathcal{D}$ are equivalent.

Proof. Recall that two sets $A$ and $B$ are equivalent if there is a one-to-one correspondence between them, i.e., if there exists some function $\varphi: A \to B$ that is both one-to-one and onto. The two sets then have the same cardinality (Dudley, 2002). We note that by nonsingularity of $E[T(X,Y)T(X,Y)']$ the two sets $\Theta$ and $\mathcal{E}$ are equivalent. Hence it suffices to show that $\mathcal{E}$ and $\mathcal{D}$ are equivalent. For each $f \in \mathcal{D}$, $m \in \mathcal{E}$, and $(y,x) \in \mathcal{YX}$, we define

$(\varphi(f))(y,x) \equiv \Phi^{-1}\left(\int_{-\infty}^{y} f(t,x)\,dt\right), \qquad (\psi(m))(y,x) \equiv \partial_y\Phi(m(y,x)).$

We first verify that $\varphi: \mathcal{D} \to \mathcal{E}$ and $\psi: \mathcal{E} \to \mathcal{D}$, and then establish that $\varphi$ is one-to-one and onto by showing that $\varphi$ and $\psi$ are inverse functions of each other. By definition of $f \in \mathcal{D}$, the Fundamental Theorem of Calculus, and the boundary conditions (3.2), we have

$(\varphi(f))(y,X) = \Phi^{-1}\left(\int_{-\infty}^{y} \phi(T(X,v)'b)\{b't(X,v)\}\,dv\right) = \Phi^{-1}\left(\Phi(T(X,y)'b) - \lim_{\alpha\to-\infty}\Phi(T(X,\alpha)'b)\right) = T(X,y)'b$

for some $b \in \Theta$ and all $y \in \mathcal{Y}$, and hence $T(X,Y)'b \in \mathcal{E}$. Therefore $\varphi: \mathcal{D} \to \mathcal{E}$. By definition of $m \in \mathcal{E}$ we have

$(\psi(m))(y,X) = \partial_y\Phi(T(X,y)'b) = \phi(T(X,y)'b)\{t(X,y)'b\}, \quad y \in \mathcal{Y},$

for some $b \in \Theta$, and hence $\phi(T(X,Y)'b)\{t(X,Y)'b\} \in \mathcal{D}$. Therefore $\psi: \mathcal{E} \to \mathcal{D}$. The conclusion then follows from $\psi$ being both the left-inverse of $\varphi$, since

$(\psi(\varphi(f)))(y,X) = \partial_y\left\{\Phi\left(\Phi^{-1}\left(\int_{-\infty}^{y} f(t,X)\,dt\right)\right)\right\} = \partial_y\left\{\int_{-\infty}^{y} f(t,X)\,dt\right\} = f(y,X)$

for all $y \in \mathcal{Y}$, and the right-inverse of $\varphi$, since

$(\varphi(\psi(m)))(y,X) = \Phi^{-1}\left(\int_{-\infty}^{y} \partial_y\Phi(m(t,X))\,dt\right) = m(y,X), \quad y \in \mathcal{Y}.$

Therefore, $\psi$ is the inverse function of $\varphi$ and the result follows. □
C.2. Proof of Theorem 4.
By Theorem 3, $b^* = \arg\max_{b\in\Theta} E[\log(\phi(T(X,Y)'b)\{t(X,Y)'b\})]$. Thus, $f(Y,X) = \phi(T(X,Y)'b)\{t(X,Y)'b\} \in \mathcal{D}$ for each $b \in \Theta$, and the fact that $\Theta$ and $\mathcal{D}$ are equivalent by Lemma 5, together imply that $f^*$ is the well-defined point of maximum of $E[\log f(Y,X)]$ in $\mathcal{D}$, and hence

(C.1) $f^* = \arg\max_{f\in\mathcal{D}} E[\log f(Y,X)] = \arg\min_{f\in\mathcal{D}} -E[\log f(Y,X)] = \arg\min_{f\in\mathcal{D}} E\left[\log\left(\frac{f_{Y|X}(Y|X)}{f(Y,X)}\right)\right].$

Moreover, by the boundary conditions (3.2), each $f \in \mathcal{D}$ satisfies

(C.2) $\int_{\mathbb{R}} f(y,X)\,dy = \lim_{y\to\infty}\Phi(b'T(X,y)) - \lim_{y\to-\infty}\Phi(b'T(X,y)) = 1$

for some $b \in \Theta$ with probability one. Therefore, (C.1) implies that $f^*(Y,X)$ is the KLIC-closest probability distribution to $f_{Y|X}(Y|X)$ in $\mathcal{D}$.

By $F^*(Y,X) = \Phi(g^*(Y,X))$ and $f^*(Y,X) = \phi(g^*(Y,X))\{\partial_y g^*(Y,X)\}$, we have

$\partial_y F^*(Y,X) = \phi(g^*(Y,X))\{\partial_y g^*(Y,X)\} = f^*(Y,X).$

Since $y \mapsto f^*(y,X)$ is continuous, we obtain $F^*(y,X) = \int_{-\infty}^{y} f^*(t,X)\,dt$ for all $y \in \mathbb{R}$ by the Fundamental Theorem of Calculus, with $\lim_{y\to-\infty}F^*(y,X) = 0$ and $\lim_{y\to\infty}F^*(y,X) = 1$ by definition of $F^*(y,X)$ and (C.2).

By $f^* \in \mathcal{D}$ we have that $f^*(Y,X) > 0$, and by Lemma 1 that $y \mapsto F^*(y,X)$ is strictly increasing, with probability one. Hence, the inverse function of $y \mapsto F^*(y,X)$ is well-defined, denoted $u \mapsto Q^*(X,u)$, with

$\frac{\partial Q^*(X,u)}{\partial u} = \frac{1}{f^*(Q^*(X,u),X)} > 0, \quad u \in (0,1),$

with probability one, by continuous differentiability of $y \mapsto F^*(y,X)$ and the Inverse Function Theorem. □
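In practice, $Q^*(X,u)$ can be computed from the estimated CDF by one-dimensional root-finding, exploiting the strict monotonicity just established. A sketch in R, reusing the hypothetical Tb() basis function and coefficient from the sketch following Theorem 1 (the bracketing interval is an assumption on the support):

## Quantile by inverting y -> pnorm(b'T(x, y)) with uniroot().
Q_star <- function(u, x, b, lower = -10, upper = 10) {
  uniroot(function(y) pnorm(Tb(b, x, y)) - u, lower = lower, upper = upper)$root
}
## Example: Q_star(0.9, x = 1, b = b) returns the 0.9-quantile of Y given X = 1.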
Appendix D. Asymptotic Theory

D.1. Proof of Theorem 5.
Part (i).
We verify the conditions of Theorem 2.7 in Newey and McFadden (1994). By Theorem 3, $b^* \in \Theta$ is the unique minimizer of $-Q(b)$, and their Condition (i) is verified. By $\Theta$ convex and open, existence of $b^* \in \Theta$ established in Theorem 3 and concavity of $Q_n(b)$ together imply that their Condition (ii) is satisfied. Finally, since the sample is i.i.d. by Assumption 3(i), pointwise convergence of $Q_n(b)$ to $Q(b)$ follows from $Q(b)$ bounded (established in the proof of Theorem 2) and application of Khinchine's law of large numbers. Hence, all conditions of Newey and McFadden's Theorem 2.7 are satisfied. Therefore, there exists $\hat{b} \in \Theta$ with probability approaching one, and $\hat{b} \to_p b^*$. □

Part (ii).
The asymptotic normality result $n^{1/2}(\hat{b} - b^*) \to_d N(0,\Gamma^{-1}\Psi(\Gamma^{-1})')$ follows from verifying the assumptions of Theorem 3.1 in Newey and McFadden (1994), for instance. Symmetry and nonsingularity of $\Gamma$ then implies that $V = \Gamma^{-1}\Psi\Gamma^{-1}$. By Theorem 3, $b^*$ is in the interior of $\Theta$ so that their Condition (i) is satisfied. Condition (ii) holds by inspection. Condition (iii) holds by $E[\psi(Y,X,b^*)] = 0$, existence of $\Gamma$ and the Lindeberg-Levy central limit theorem. For their Condition (iv), we apply Lemma 2.4 in Newey and McFadden (1994) with $a(Y,X,b) \equiv \nabla_{bb}L(Y,X,b)$. Let $\Theta_0$ denote a compact subset of $\Theta$ containing $b^*$ in its interior. By the proof of Lemma 3 we have that $E[\sup_{b\in\Theta_0}\|\nabla_{bb}L(Y,X,b)\|] < \infty$. In addition, by Assumption 3(i) the data are i.i.d., and $\nabla_{bb}L(Y,X,b)$ is continuous at each $b \in \Theta_0$ by inspection. The conditions of Lemma 2.4 in Newey and McFadden (1994) are verified, and therefore their Condition (iv) in Theorem 3.1 also is. Finally, $\Gamma$ is nonsingular by Lemma 4, which verifies their Condition (v). The result follows.

In order to show that $\hat{\Gamma}^{-1}\hat{\Psi}\hat{\Gamma}^{-1} \to_p \Gamma^{-1}\Psi\Gamma^{-1}$, we verify the conditions given in the discussion of Theorem 4.4 in Newey and McFadden (1994, bottom of page 2158). First, by Theorem 5 we have $\hat{b} \to_p b^*$. Second, with probability one, by inspection $\log f(Y,X,b)$ is twice continuously differentiable and $f(Y,X,b) > 0$ for $b \in \Theta$. Moreover, $\Gamma$ exists and is nonsingular by Lemma 4. Thus Conditions (ii) and (iv) of Theorem 3.3 in Newey and McFadden (1994) are verified. Third,

$\|\psi(Y,X,b)\| = \|-T(X,Y)(b'T(X,Y)) + (b't(X,Y))^{-1}t(X,Y)\| \le \|T(X,Y)(b'T(X,Y))\| + 2|(b't(X,Y))^{-1}|\,\|t(X,Y)\| \le C\{\|T(X,Y)\|^2 + \|t(X,Y)\|\},$

so that $E[\sup_{b\in\Theta_0}\|\psi(Y,X,b)\|] < \infty$, by Assumptions 1 and 3(ii). Hence, for a neighborhood $\mathcal{N}$ of $b^*$, we have that $E[\sup_{b\in\mathcal{N}}\|\psi(Y,X,b)\|] < \infty$. Moreover, $b \mapsto \psi(Y,X,b)$ is continuous at $b^*$ with probability one. The result follows. □
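The plug-in estimator $\hat{\Gamma}^{-1}\hat{\Psi}\hat{\Gamma}^{-1}$ is straightforward to compute once the per-observation log-likelihood is available. A sketch using numerical derivatives from the numDeriv package, where loglik_i is an assumed user-supplied wrapper returning $L(y_i,x_i,b)$:

library(numDeriv)  # numerical gradients and Hessians

## Sandwich estimator of V = Gamma^{-1} Psi Gamma^{-1}; the sign convention
## for Gamma_hat is immaterial since it enters the sandwich twice.
sandwich_var <- function(b_hat, loglik_i, n) {
  scores <- t(sapply(seq_len(n), function(i) grad(function(b) loglik_i(b, i), b_hat)))
  Psi_hat <- crossprod(scores) / n   # sample analogue of E[psi psi']
  Gamma_hat <- -Reduce(`+`, lapply(seq_len(n), function(i)
    hessian(function(b) loglik_i(b, i), b_hat))) / n
  solve(Gamma_hat) %*% Psi_hat %*% solve(Gamma_hat)
}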
D.2. Proof of Theorem 6.
The proof builds on the proof strategy in Zou (2006) and Lu, Goldberg, and Fine (2012). Define

$D_n(u) \equiv Q_n(b^* + n^{-1/2}u) - Q_n(b^*),$

where $u$ is defined by $b = b^* + n^{-1/2}u$. Also let $\hat{u}_n = \arg\max_u D_n(u)$, so that $\hat{u}_n = \sqrt{n}(\hat{b}_{AL} - b^*)$. By a mean-value expansion,

$D_n(u) = n^{-1/2}\sum_{i=1}^n \psi(y_i,x_i,b^*)'u + (2n)^{-1}u'\left\{\sum_{i=1}^n \nabla_b\psi(y_i,x_i,\bar{b})\right\}u + n^{-1/2}\lambda_n\sum_{l=1}^{JK}\hat{w}_l\,n^{1/2}(|b_l^*| - |b_l^* + n^{-1/2}u_l|) \equiv D_n^{(1)}(u) + D_n^{(2)}(u) + D_n^{(3)}(u),$

for some intermediate values $\bar{b}$. Under Assumptions 1-3, $D_n^{(1)}(u) \to_d N(0, u'\Psi u)$ and $D_n^{(2)}(u) \to_p \frac{1}{2}u'\Gamma u$ by the results in Theorem 5 and the Law of Large Numbers. For $D_n^{(3)}(u)$, Zou (2006, proof of Theorem 2) shows

$n^{-1/2}\lambda_n\hat{w}_l\,n^{1/2}(|b_l^*| - |b_l^* + n^{-1/2}u_l|) \to_p \begin{cases} 0, & b_l^* \ne 0 \\ 0, & b_l^* = 0,\ u_l = 0 \\ -\infty, & b_l^* = 0,\ u_l \ne 0. \end{cases}$

Therefore $D_n(u) \to_d D(u)$ for every $u$, where

$D(u) = \begin{cases} \frac{1}{2}u_{\mathcal{A}}'\Gamma_{\mathcal{A}}u_{\mathcal{A}} + u_{\mathcal{A}}'W, & u_l = 0,\ l \notin \mathcal{A}, \\ -\infty & \text{otherwise}, \end{cases}$

with $W \sim N(0,\Psi_{\mathcal{A}})$. Moreover, steps similar to those of Lu, Goldberg, and Fine (2012, proof of Theorem 2) show that $\hat{u}_n \to_d \hat{u} = \arg\max_u D(u)$, upon using that $\hat{w}_l \to_p 1/|b_l^*|$ when $b_l^* \ne 0$ and $n^{1/2}\hat{b}_l = O_p(1)$ when $b_l^* = 0$ by Theorem 5(ii), and the fact that the Hessian matrix $\Gamma$ is negative definite by Lemma 4. This yields part (ii), i.e., $n^{1/2}(\hat{b}_{\mathcal{A}} - b_{\mathcal{A}}^*) \to_d N(0,\Gamma_{\mathcal{A}}^{-1}\Psi_{\mathcal{A}}\Gamma_{\mathcal{A}}^{-1})$. Steps similar to Lu, Goldberg, and Fine (2012, proof of Theorem 2) also show that $\Pr[\hat{b}_{\mathcal{A}^c} = 0] \to 1$, upon substituting $Q_n(b)$ for their objective function, which establishes part (i). □
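The adaptive Lasso GT objective remains concave, since a nonnegatively weighted L1 penalty is convex, so penalized estimation fits the same convex programming framework. A sketch extending the earlier gt_fit(); Tmat, tmat and the first-step weights are assumptions carried over from the previous sketches:

## Adaptive Lasso GT regression: Q_n(b) minus a weighted L1 penalty.
gt_fit_lasso <- function(Tmat, tmat, w_hat, lambda) {
  b <- Variable(ncol(Tmat))
  obj <- -0.5 * sum_squares(Tmat %*% b) + sum_entries(log(tmat %*% b)) -
    lambda * sum_entries(multiply(w_hat, abs(b)))  # requires w_hat >= 0
  res <- solve(Problem(Maximize(obj)))
  as.numeric(res$getValue(b))
}
## First-step weights, e.g.: b_init <- gt_fit(Tmat, tmat); w_hat <- 1 / abs(b_init)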
D.3. Proof of Theorem 7.
By Theorems 5 and 6 we have $n^{1/2}(\hat{b}^{\dagger} - b^*) \to_d N(0,\Xi)$ with $\Xi = \Gamma^{\dagger-1}\Psi^{\dagger}\Gamma^{\dagger-1}$ positive definite by assumption. Moreover, for $(y,x) \in \mathcal{YX}$, $b \mapsto \Phi(b'T(x,y)) \equiv F(y,x,b)$ and $b \mapsto f(y,x,b)$ are continuously differentiable, with derivative functions $\nabla_bF(y,x,b) = \phi(b'T(x,y))T(x,y)$ and

$\nabla_bf(y,x,b) = -\{b'T(x,y)\}\phi(b'T(x,y))\{b't(x,y)\}T(x,y) + \phi(b'T(x,y))t(x,y) = \phi(b'T(x,y))[-\{b'T(x,y)\}\{b't(x,y)\}T(x,y) + t(x,y)],$

respectively, by the properties of the normal PDF. For all $(y,x) \in \mathcal{YX}$ with $f(y,x,b) > 0$, $b \in \Theta$, we have that $y \mapsto F(y,x,b)$ is invertible, and its inverse function $u \mapsto F^{-1}(u,x,b)$ is continuously differentiable with derivative $1/f(F^{-1}(u,x,b),x,b)$ for all $x \in \mathcal{X}$ and $u \in \mathcal{U}_x(m)$, $m(y,x,b) \equiv b'T(x,y)$, by the Inverse Function Theorem. Hence, by $F^{-1}(\Phi(b'T(x,y)),x,b) = y$, we have for $(y,x) \in \mathcal{YX}$,

$\nabla_bF^{-1}(\Phi(b'T(x,y)),x,b) = \frac{\phi(b'T(x,y))}{f(F^{-1}(u_0,x,b),x,b)}T(x,y) + \nabla_bF^{-1}(u_0,x,b) = 0,$

with $u_0 = \Phi(b'T(x,y))$, and hence, for $x \in \mathcal{X}$ and $u \in \mathcal{U}_x(m)$,

$\nabla_bF^{-1}(u,x,b) = -\frac{\phi(b'T(x,y_0))}{f(y_0,x,b)}T(x,y_0) = -\frac{1}{b't(x,y_0)}T(x,y_0),$

where $y_0 = F^{-1}(u,x,b)$, which is continuous in $b$ on $\Theta$, so that $b \mapsto F^{-1}(u,x,b)$ is continuously differentiable on $\Theta$. Parts (i) and (ii) in the statement of Theorem 7 then follow by application of the Delta method (e.g., Lemma 3.9 in Wooldridge, 2010). □
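For a concrete use of these derivative formulas, pointwise confidence bands for the conditional CDF follow from the Delta method: the asymptotic variance of $F(y,x,\hat{b}^{\dagger})$ is $\nabla_bF'(\Xi/n)\nabla_bF$ with $\nabla_bF = \phi(b'T(x,y))T(x,y)$. A sketch, where Tvec (the basis vector $T(x,y)$) and Xi_hat (a variance estimate) are assumed inputs:

## Delta-method band for F(y, x, b_hat) = pnorm(b'T(x, y)).
cdf_band <- function(Tvec, b_hat, Xi_hat, n, level = 0.95) {
  index <- sum(b_hat * Tvec)          # b'T(x, y)
  gr <- dnorm(index) * Tvec           # gradient of b -> pnorm(b'T)
  se <- sqrt(drop(t(gr) %*% Xi_hat %*% gr) / n)
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = pnorm(index) - z * se, upper = pnorm(index) + z * se)
}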
Appendix E. Duality Theory

E.1. Auxiliary lemma.
In this section, we write $T_i \equiv T(y_i,x_i)$ and $t_i \equiv t(y_i,x_i)$, for $i \in \{1,\ldots,n\}$. We first show the following result used in the proof of Theorem 8.

Lemma 6. If $\{(y_i,x_i)\}_{i=1}^n$ is i.i.d. and $E[T(X,Y)T(X,Y)']$ is nonsingular then $\sum_{i=1}^n T_iT_i'$ is nonsingular with probability approaching one.

Proof. We note that $\sum_{i=1}^n T_iT_i'$ is nonsingular if for all $\lambda \ne 0$ we have $\lambda'(\sum_{i=1}^n T_iT_i')\lambda = \sum_{i=1}^n (\lambda'T_i)^2 > 0$, and hence if, for some $i \in \{1,\ldots,n\}$, we have $\lambda'T_i \ne 0$ for all $\lambda \ne 0$. By nonsingularity of $E[T(X,Y)T(X,Y)']$, for all $\lambda \ne 0$ we have $\lambda'T(X,Y) \ne 0$ on a set $\widetilde{\mathcal{YX}}$ with $\Pr[\widetilde{\mathcal{YX}}] > 0$. Hence for $\{(y_i,x_i)\}_{i=1}^n$ i.i.d.,

$\Pr[\cap_{i\in\{1,\ldots,n\}}\{(y_i,x_i) \notin \widetilde{\mathcal{YX}}\}] = \prod_{i=1}^n \Pr[(y_i,x_i) \notin \widetilde{\mathcal{YX}}] = \prod_{i=1}^n (1 - \Pr[\widetilde{\mathcal{YX}}]) = (1 - \Pr[\widetilde{\mathcal{YX}}])^n \to 0,$

as $n \to \infty$. Since the complement of the event $\cap_{i\in\{1,\ldots,n\}}\{(y_i,x_i) \notin \widetilde{\mathcal{YX}}\}$ is the event $\{(y_i,x_i) \in \widetilde{\mathcal{YX}}$ for some $i \in \{1,\ldots,n\}\}$, we obtain

$\Pr[(y_i,x_i) \in \widetilde{\mathcal{YX}}$ for some $i \in \{1,\ldots,n\}] = 1 - (1 - \Pr[\widetilde{\mathcal{YX}}])^n \to 1,$

as $n \to \infty$. The result now follows from the definition of $\widetilde{\mathcal{YX}}$. □
Proof of Theorem 8.
Part (i).
Let R − ≡ ( −∞ , R + ≡ (0 , + ∞ ). Introducing the variables e i = b (cid:48) T i , η i = b (cid:48) t i , an equivalent formulation for the GT regression problem ismax ( b,e,η ) ∈ Θ × R n × R n + nκ − n (cid:88) i =1 (cid:26) e i − log( η i ) (cid:27) , κ ≡ −
12 log(2 π ) , subject to e i = b (cid:48) T i , η i = b (cid:48) t i , i ∈ { , . . . , n } . For all ( u, v ) ∈ R n × R n − , define the Lagrange function for this problem as L ( b, e, η, u, v ) = nκ − n (cid:88) i =1 (cid:26) e i − log( η i ) (cid:27) + n (cid:88) i =1 u i { e i − b (cid:48) T i } + n (cid:88) i =1 v i { η i − b (cid:48) t i } , and the Lagrange dual function (Boyd and Vandenberghe (2004), Chapter 5) as g ( u, v ) ≡ sup ( b,e,η ) ∈ Θ × R n × R n + L ( b, e, η, u, v )= sup ( e,η ) ∈ R n × R n + n (cid:88) i =1 (cid:26) u i e i + v i η i − (cid:20) − nκ + e i − log( η i ) (cid:21)(cid:27) + sup b ∈ Θ (cid:40) − n (cid:88) i =1 u i ( b (cid:48) T i ) − n (cid:88) i =1 v i ( b (cid:48) t i ) (cid:41) ≡ I + I . In order to derive g ( u, v ) we first show that for all ( u, v ) ∈ R n × R n − the maximum ofthe mapping ( b, e, η ) (cid:55)→ L ( b, e, η, u, v ) is attained and is unique, and we then evaluate( b, e, η ) (cid:55)→ L ( b, e, η, u, v ) at this value. he first term I in the dual function g ( u, v ) is the convex conjugate of the negativelog-likelihood function, defined as a function of the n -vectors e and η . Define D ( e, η, u, v ) ≡ n (cid:88) i =1 { u i e i + v i η i } − n (cid:88) i =1 L ( e i , η i ) , L ( e i , η i ) ≡ − nκ + e i − log( η i ) . We first show that, for all ( u, v ) ∈ R n × R n − , the map ( e, η ) (cid:55)→ D ( e, η, u, v ) admits atleast one maximum in R n × R n + . For i ∈ { , . . . , n } , the first-order conditions are ∂ e i D ( e, η, u, v ) = u i − n (cid:88) i =1 ∂ e i L ( e i , η i ) = u i − e i = 0 ∂ η i D ( e, η, u, v ) = v i − n (cid:88) i =1 ∂ η i L ( e i , η i ) = v i + 1 η i = 0 , and upon solving for e i and η i , we obtain e i = u i , η i = − v i , i ∈ { , . . . , n } . Clearly, for all ( u, v ) ∈ R n × R n − there exists ( e, η ) ∈ R n × R n + such that the n first-orderconditions hold.We now show that, for all ( u, v ) ∈ R n × R n − , the map ( e, η ) (cid:55)→ D ( e, η, u, v ) admits atmost one maximum in R n × R n + . For i ∈ { , . . . , n } , the second-order conditions are ∂ e i ,e i D ( e, η, u, v ) = − , ∂ e i ,η i D ( e, η, u, v ) = 0 ∂ η i ,e i D ( e, η, u, v ) = 0 , ∂ η i ,η i D ( e, η, u, v ) = − η i . Therefore the Hessian matrix of ( e, η ) (cid:55)→ D ( e, η, u, v ) is negative definite for all ( u, v ) ∈ R n × R n − . Hence, ( e, η ) (cid:55)→ D ( e, η, u, v ) is strictly concave with unique maximum( e i , η i ) = ( u i , − /v i ), i ∈ { , . . . , n } , for all ( u, v ) ∈ R n × R n − . Evaluating ( e, η ) (cid:55)→D ( e, η, u, v ) at the maximum yields, for all ( u, v ) ∈ R n × R n − ,sup ( e,η ) ∈ R n × R n + D ( e, η, u, v ) = n (cid:88) i =1 u i + n (cid:88) i =1 v i (cid:18) − v i (cid:19) − n (cid:88) i =1 (cid:26) − κ + u i − log (cid:18) − v i (cid:19)(cid:27) = − n (1 − κ ) + n (cid:88) i =1 (cid:26) u i − log ( − v i ) (cid:27) , (E.1)the conjugate function of the negative log-likelihood. e now consider the second term I in the definition of the dual function g ( u, v ). Forall ( b, u, v ) ∈ Θ × R n × R n − , define the penalty function P ( b, u, v ) = n (cid:88) i =1 {− u i ( b (cid:48) T i ) − v i ( b (cid:48) t i ) } . The map b (cid:55)→ P ( b, u, v ) is linear with partial derivative − (cid:80) ni =1 { u i T i + v i t i } . Thevalue of sup b ∈ Θ P ( b, u, v ) is thus determined by the set of all ( u, v ) ∈ R n × R n − suchthat the first-order conditions,(E.2) ∇ b P ( b, u, v ) = − n (cid:88) i =1 { u i T i + v i t i } = 0 , hold. 
For all such $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ and any solution $b \in \Theta$, we have that

$\sup_{b\in\Theta}P(b,u,v) = \sum_{i=1}^n\{-u_i(b'T_i) - v_i(b't_i)\} = -b'\sum_{i=1}^n\{u_iT_i + v_it_i\} = 0.$

Therefore, for all $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ such that $\nabla_bP(b,u,v) = 0$, the optimal value of $P(b,u,v)$ is 0. Combining the expression for the likelihood conjugate (E.1) and the first-order conditions (E.2) gives the Lagrange dual function $g(u,v)$ for all $(u,v)$ such that $\nabla_bP(b,u,v) = 0$. The form of the dual GT regression problem (5.2) follows.
Part (ii). The Lagrangian for (5.2) is

$L(u,v,b) = -n(1-\kappa) + \sum_{i=1}^n\left\{\frac{u_i^2}{2} - \log(-v_i)\right\} - b'\sum_{i=1}^n\{T_iu_i + t_iv_i\},$

with first-order conditions

$\partial_{u_i}L(u,v,b) = u_i - b'T_i = 0, \qquad \partial_{v_i}L(u,v,b) = -\frac{1}{v_i} - b't_i = 0, \quad i \in \{1,\ldots,n\},$

equivalently, upon solving for $u_i$ and $v_i$,

(E.3) $u_i = b'T_i, \qquad v_i = -\frac{1}{b't_i}, \quad i \in \{1,\ldots,n\}.$

Upon substituting in the constraints of (5.2) we obtain the method-of-moments representation of (5.2).

Part (iii).
Existence of a solution $\hat{b} \in \Theta$ is shown in the proof of Theorem 5(i). The sample Hessian matrix is $-\sum_{i=1}^n\{T_iT_i' + t_it_i'/(b't_i)^2\}$, which is negative definite with probability approaching one by Lemma 6. Therefore there exists a unique solution $\hat{b}$ to the GT regression problem (4.1), with probability approaching one.

Existence of a solution $(\hat{u}',\hat{v}')'$ to program (5.2) follows from existence of a solution $\hat{b} \in \Theta$ to the first-order conditions of the ML problem (4.1) and the method-of-moments representation of the dual problem (5.2), upon setting $\hat{u}_i = \hat{b}'T_i$, $\hat{v}_i = -1/(\hat{b}'t_i)$, for $i \in \{1,\ldots,n\}$. We now show that, for all $b \in \Theta$, the map $(u,v) \mapsto L(u,v,b)$ admits at most one minimum in $\mathbb{R}^n\times\mathbb{R}^n_-$. For all $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ and $i \in \{1,\ldots,n\}$, the second-order conditions for the dual problem (5.2) are

$\partial^2_{u_iu_i}L(u,v,b) = 1, \quad \partial^2_{u_iv_i}L(u,v,b) = 0, \quad \partial^2_{v_iu_i}L(u,v,b) = 0, \quad \partial^2_{v_iv_i}L(u,v,b) = \frac{1}{v_i^2}.$

Therefore, the Hessian matrix of $(u,v) \mapsto L(u,v,b)$ is positive definite for all $b \in \Theta$. Hence, the map $(u,v) \mapsto L(u,v,b)$ is strictly convex with unique solution $(\hat{u}',\hat{v}')'$.

Part (iv).
Upon using (E.3) and with $\hat{e}_i = \hat{b}'T_i$ and $\hat{\eta}_i = \hat{b}'t_i$, $i \in \{1,\ldots,n\}$, the value of program (5.2) is

$L(\hat{u},\hat{v},\hat{b}) = -n(1-\kappa) + \sum_{i=1}^n\left\{\frac{\hat{e}_i^2}{2} + \log(\hat{\eta}_i)\right\} - \sum_{i=1}^n\{\hat{e}_i^2 - 1\} = n\kappa - \sum_{i=1}^n\left\{\frac{\hat{e}_i^2}{2} - \log(\hat{\eta}_i)\right\},$

the value of the ML problem (4.1) at a solution. □
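Theorem 8 suggests a simple numerical check of the primal-dual relationship: recover $(\hat{u},\hat{v})$ from the primal fit, then verify the moment condition (E.2) and the equality of primal and dual values. A sketch reusing gt_fit(), Tmat and tmat from the earlier sketches:

b_hat <- gt_fit(Tmat, tmat)
u_hat <- drop(Tmat %*% b_hat)       # u_hat_i = b'T_i
eta_hat <- drop(tmat %*% b_hat)
v_hat <- -1 / eta_hat               # v_hat_i = -1/(b't_i)
moment <- colSums(u_hat * Tmat + v_hat * tmat)  # should vanish, cf. (E.2)
n <- nrow(Tmat); kappa <- -0.5 * log(2 * pi)
primal <- n * kappa - sum(u_hat^2 / 2 - log(eta_hat))
dual <- -n * (1 - kappa) + sum(u_hat^2 / 2 - log(-v_hat))
## max(abs(moment)) and abs(primal - dual) should be near zero (solver tolerance)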
E.3. Proof of Theorem 9.

Let $\|b\|_{1,\hat{w}} = \sum_{l=1}^{JK}\hat{w}_l|b_l|$. Analogously to the proof of Theorem 8(i), an equivalent formulation for the adaptive Lasso GT regression problem is

$\max_{(b,e,\eta)\in\Theta\times\mathbb{R}^n\times\mathbb{R}^n_+} n\kappa - \sum_{i=1}^n\left\{\frac{e_i^2}{2} - \log(\eta_i)\right\} - \lambda_n\|b\|_{1,\hat{w}}$

subject to $e_i = b'T_i$, $\eta_i = b't_i$, $i \in \{1,\ldots,n\}$, and, letting $(u,v) \in \mathbb{R}^n\times\mathbb{R}^n_-$ denote Lagrange multiplier vectors, the corresponding Lagrange dual function can be written as

$g(u,v) = \sup_{(e,\eta)\in\mathbb{R}^n\times\mathbb{R}^n_+}\sum_{i=1}^n\left\{u_ie_i + v_i\eta_i - \left[-\kappa + \frac{e_i^2}{2} - \log(\eta_i)\right]\right\} + \sup_{b\in\Theta}\left\{-\sum_{i=1}^n u_i(b'T_i) - \sum_{i=1}^n v_i(b't_i) - \lambda_n\|b\|_{1,\hat{w}}\right\},$

where the first term is the convex conjugate of $\sum_{i=1}^n\{-\kappa + e_i^2/2 - \log(\eta_i)\}$ derived in the proof of Theorem 8(i).

For the second term, define

$F(b) = \sum_{i=1}^n u_i(b'T_i) + \sum_{i=1}^n v_i(b't_i) + \lambda_n\|b\|_{1,\hat{w}}, \quad b \in \mathbb{R}^{JK},$

which is convex in $b$ but not smooth. In order to compute the subgradients of $F(b)$, we first compute the subgradients of $\|b\|_{1,\hat{w}}$. Recalling that the weights satisfy $\hat{w}_l > 0$ if $\hat{b}_l \ne 0$ and $\hat{w}_l = 0$ otherwise, the weighted norm $\|b\|_{1,\hat{w}}$ can be written as the maximum of $2^{JK}$ linear functions:

$\|b\|_{1,\hat{w}} = \max\{s'b : s_l \in \{-\hat{w}_l,\hat{w}_l\}\}.$

The functions $s'b$ are differentiable and have a unique subgradient $s$. The subdifferential of $\|b\|_{1,\hat{w}}$ is given by all convex combinations of gradients of the active functions at $b$ (Boyd and Vandenberghe, 2008). We first identify an active function $s'b$, by finding an $s = (s_1,\ldots,s_{JK})'$, $s_l \in \{-\hat{w}_l,\hat{w}_l\}$, such that $s'b = \|b\|_{1,\hat{w}}$. Choose $s_l = \hat{w}_l$ if $b_l > 0$ and $s_l = -\hat{w}_l$ if $b_l < 0$, for each $l$. If $b_l = 0$, choose either $s_l = -\hat{w}_l$ or $s_l = \hat{w}_l$. We can therefore take

$z_l = \begin{cases} \hat{w}_l & \text{if } b_l > 0 \\ -\hat{w}_l & \text{if } b_l < 0 \\ -\hat{w}_l \text{ or } \hat{w}_l & \text{if } b_l = 0, \end{cases} \quad l \in \{1,\ldots,JK\}.$

The subdifferential of $\|b\|_{1,\hat{w}}$ is:

$\partial\|b\|_{1,\hat{w}} = \{z : |z_l| \le \hat{w}_l,\ l \in \{1,\ldots,JK\},\ z'b = \|b\|_{1,\hat{w}}\}.$

Therefore, the subgradient of $F(b)$ is:

$\partial F(b) = \left\{\sum_{i=1}^n u_iT_{i,l} + \sum_{i=1}^n v_it_{i,l} + \lambda_nz_l,\ l \in \{1,\ldots,JK\}\right\},$

where $|z_l| \le \hat{w}_l$, $l \in \{1,\ldots,JK\}$, and $z'b = \|b\|_{1,\hat{w}}$, i.e., $z$ is the subgradient of $\|b\|_{1,\hat{w}}$. The subgradient optimality condition is that there exists $b$ such that $0 \in \partial F(b)$. Thus $b, z$ should satisfy

$z_l = -\sum_{i=1}^n\{u_iT_{i,l} + v_it_{i,l}\}/\lambda_n, \quad |z_l| \le \hat{w}_l, \quad z'b = \|b\|_{1,\hat{w}}, \quad l \in \{1,\ldots,JK\},$

which is equivalent to

(E.4) $\left|\sum_{i=1}^n\{T_{i,l}u_i + t_{i,l}v_i\}\right| \le \lambda_n\hat{w}_l, \quad l \in \{1,\ldots,JK\}.$

Upon substituting into $F(b)$, this gives

$F(b) = \inf_{b\in\Theta}F(b) = \sum_{i=1}^n u_i(b'T_i) + \sum_{i=1}^n v_i(b't_i) + \lambda_n\sum_{l=1}^{JK}\frac{-\sum_{i=1}^n\{u_iT_{i,l} + v_it_{i,l}\}}{\lambda_n}b_l = \sum_{i=1}^n\{u_i(b'T_i) + v_i(b't_i)\} - \sum_{i=1}^n\left\{u_i\sum_{l=1}^{JK}T_{i,l}b_l + v_i\sum_{l=1}^{JK}t_{i,l}b_l\right\} = 0.$

Hence the optimal value of $F(b)$ is 0, and combining the expression for the likelihood conjugate (E.1) and the optimality conditions (E.4) gives the Lagrange dual function $g(u,v)$ for all $(u,v)$ such that (E.4) holds. The form of (4.2) follows. □
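The optimality conditions (E.4) can likewise be verified at a penalized solution, which is a useful diagnostic for solver output. A sketch reusing gt_fit_lasso() and the assumed inputs from the earlier sketches:

b_al <- gt_fit_lasso(Tmat, tmat, w_hat, lambda)
u_al <- drop(Tmat %*% b_al)
v_al <- -1 / drop(tmat %*% b_al)
## (E.4): |sum_i {T_il u_i + t_il v_i}| <= lambda * w_hat_l for every l,
## with equality (up to tolerance) at coordinates where b_al[l] != 0.
slack <- lambda * w_hat - abs(colSums(u_al * Tmat + v_al * tmat))
## all(slack >= -1e-6) should hold at an optimum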
References

Andrews, D. (1997). A conditional Kolmogorov test. Econometrica, 65, pp. 1097–1128.

Boyd, S. P. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Boyd, S. P. and Vandenberghe, L. (2008). Subgradients. Notes for EE364b, Stanford University, Winter 2006-07, pp. 1–7.

Chen, L. H., Goldstein, L. and Shao, Q. M. (2010). Normal Approximation by Stein's Method. Springer Science & Business Media.

Chernozhukov, V., Fernandez-Val, I., and Galichon, A. (2010). Quantile and probability curves without crossing. Econometrica.

Chernozhukov, V., Fernandez-Val, I., and Melly, B. (2013). Inference on counterfactual distributions. Econometrica.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. (2019). Distributional conformal prediction. eprint arXiv:1909.07889.

Chernozhukov, V., Fernandez-Val, I., Newey, W., Stouli, S., and Vella, F. (2020). Semiparametric estimation of structural functions in nonseparable triangular models. Quantitative Economics, 11, pp. 503–533.

Chesher, A. (2003). Identification in nonseparable models. Econometrica, 71, pp. 1405–1441.

Chesher, A. and Spady, R. H. (1991). Asymptotic expansions of the information matrix test statistic. Econometrica, 59, pp. 787–815.

Curry, H. B. and Schoenberg, I. J. (1966). On Polya frequency functions IV: The fundamental spline functions and their limits. J. Analyse Math., 17, pp. 71–107.

DiNardo, J., Fortin, N. M. and Lemieux, T. (1996). Labor market institutions and the distribution of wages, 1973-1992: A semiparametric approach. Econometrica.

Domahidi, A., Chu, E., and Boyd, S. (2013). ECOS: an SOCP solver for embedded systems. In Proceedings of the European Control Conference, pp. 3071–3076.

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, 2nd Edition.

Foresi, S. and Peracchi, F. (1995). The conditional distribution of excess returns: An empirical analysis. Journal of the American Statistical Association, 90(430), pp. 451–466.

Fu, A., Narasimhan, B. and Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582.

Hall, P., Wolff, R. C. and Yao, Q. (1999). Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94(445), pp. 154–163.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. 2nd ed., Cambridge University Press.

Horowitz, J. and Nesheim, L. (2020). Using penalized likelihood to select parameters in a random coefficients multinomial logit model. Journal of Econometrics, forthcoming.

Hyndman, R. J., Bashtannyk, D. M. and Grunwald, G. K. (1996). Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics, 5(4), pp. 315–336.

Imbens, G. and Newey, W. K. (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica.

Koenker, R. (2000). Galton, Edgeworth, Frisch, and prospects for quantile regression in econometrics. Journal of Econometrics.

Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46, pp. 33–50.

Kooperberg, C. and Stone, C. J. (1991). A study of logspline density estimation. Computational Statistics & Data Analysis.

Lu, W., Goldberg, Y., and Fine, J. P. (2012). On the robustness of the adaptive lasso to model misspecification. Biometrika, 99, pp. 717–731.

Matzkin, R. (2003). Nonparametric estimation of nonadditive random functions. Econometrica, 71, pp. 1339–1375.

Newey, W. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of Econometrics, vol. 4, ch. 36, 1st ed., pp. 2111–2245. Amsterdam: Elsevier.

O'Donoghue, B., Chu, E., Parikh, N. and Boyd, S. (2016). Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3), pp. 1042–1068.

Owen, A. B. (2007). A robust hybrid of lasso and ridge regression. Contemporary Mathematics.

R Development Core Team (2020). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Ramsay, J. O. (1988). Monotone regression splines in action. Statistical Science.

Rana, I. K. (2002). An Introduction to Measure and Integration. Vol. 45. American Mathematical Society.

Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics.

Spady, R. H. and Stouli, S. (2018a). Dual regression. Biometrika.

Spady, R. H. and Stouli, S. (2018b). Simultaneous mean-variance regression. eprint arXiv:1804.01631.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, pp. 1–25.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), pp. 1418–1429.